Pandas Cheatsheet

Installation
| Platform | Command |
|---|---|
| pip (All platforms) | `pip install pandas` |
| conda | `conda install -c conda-forge pandas` |
| Ubuntu/Debian | `sudo apt install python3-pandas` |
| macOS (Homebrew) | `brew install python && pip3 install pandas` |
| With performance libs | `pip install "pandas[performance]"` |
| With all dependencies | `pip install "pandas[all]"` |
| Excel support | `pip install "pandas[excel]"` |
| SQL support | `pip install pandas sqlalchemy` |
| Verify installation | `python -c "import pandas as pd; print(pd.__version__)"` |
Basic Commands - DataFrame Creation & I/O
| Command | Description |
|---|---|
| `pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})` | Create DataFrame from dictionary |
| `pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])` | Create DataFrame from list of lists |
| `pd.Series([1, 2, 3], name='col')` | Create a Series (single column) |
| `pd.read_csv('file.csv')` | Read CSV file into DataFrame |
| `pd.read_csv('file.csv', sep=';', encoding='latin-1')` | Read CSV with custom delimiter and encoding |
| `pd.read_excel('file.xlsx', sheet_name='Sheet1')` | Read Excel file |
| `pd.read_json('file.json')` | Read JSON file |
| `pd.read_sql('SELECT * FROM table_name', conn)` | Read from SQL database |
| `pd.read_parquet('file.parquet')` | Read Parquet file |
| `pd.read_clipboard()` | Read data from clipboard |
| `df.to_csv('out.csv', index=False)` | Write DataFrame to CSV |
| `df.to_excel('out.xlsx', index=False)` | Write DataFrame to Excel |
| `df.to_json('out.json')` | Write DataFrame to JSON |
| `df.to_sql('table_name', conn, if_exists='replace')` | Write DataFrame to SQL database |
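Putting a few of these together, a minimal round trip (the file name `scores.csv` is a placeholder):

```python
import pandas as pd

# Build a small DataFrame from a dictionary
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})

# Write it to CSV without the index, then read it back
df.to_csv('scores.csv', index=False)
df2 = pd.read_csv('scores.csv')
print(df2)
```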
Basic Commands - Inspection & Information
| Command | Description |
|---|---|
| `df.head()` | Display first 5 rows |
| `df.head(10)` | Display first 10 rows |
| `df.tail()` | Display last 5 rows |
| `df.shape` | Get dimensions (rows, columns) |
| `df.columns` | Get column names |
| `df.dtypes` | Get data types of all columns |
| `df.index` | Get index information |
| `df.info()` | Get concise summary of DataFrame |
| `df.describe()` | Get statistical summary of numeric columns |
| `df.describe(include='all')` | Get statistical summary of all columns |
| `df.count()` | Count non-null values per column |
| `df.nunique()` | Count unique values per column |
| `df.memory_usage(deep=True)` | Get memory usage of DataFrame |
| `df.sample(5)` | Get random sample of 5 rows |
| `df.columns.tolist()` | Get column names as list |
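A typical first look at an unfamiliar DataFrame chains a few of these; a sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, None], 'b': ['x', 'y', 'x', 'z']})

print(df.shape)      # (4, 2)
print(df.dtypes)     # data type of each column
print(df.count())    # non-null counts: a has 3, b has 4
print(df.nunique())  # unique values per column
df.info()            # concise summary including memory usage
```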
Basic Commands - Selection & Filtering
| Command | Description |
|---|---|
| `df['col']` | Select single column (returns Series) |
| `df[['col1', 'col2']]` | Select multiple columns (returns DataFrame) |
| `df.iloc[0]` | Select first row by position |
| `df.iloc[:5]` | Select first 5 rows by position |
| `df.iloc[:, :2]` | Select all rows, first 2 columns |
| `df.loc['label']` | Select row by label/index |
| `df.loc[:, 'col1':'col3']` | Select columns by label range |
| `df[df['col'] > 5]` | Filter rows where column > 5 |
| `df[df['col'].isin([1, 2, 3])]` | Filter rows by list of values |
| `df[(df['a'] > 5) & (df['b'] < 10)]` | Filter with multiple conditions (AND) |
| `df[(df['a'] > 5) \| (df['b'] < 10)]` | Filter with multiple conditions (OR) |
| `df[~df['col'].isin([1, 2])]` | Filter with NOT condition |
| `df.query('col > 5 and other < 10')` | Filter using SQL-like query string |
| `df.nlargest(5, 'col')` | Get 5 rows with largest values in column |
| `df.nsmallest(5, 'col')` | Get 5 rows with smallest values in column |
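Boolean filters compose with `&`, `|`, and `~`; note that each condition needs its own parentheses. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Rome', 'Oslo'],
                   'temp': [7, 9, 14, 2]})

# Boolean mask: rows where temp > 5
warm = df[df['temp'] > 5]

# Combined conditions need parentheses around each clause
mild = df[(df['temp'] > 5) & (df['temp'] < 10)]

# The same filter expressed as a query string
mild_q = df.query('5 < temp < 10')

print(warm, mild, mild_q, sep='\n')
```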
Basic Commands - Data Manipulation
| Command | Description |
|---|---|
| `df['new'] = df['a'] + df['b']` | Add new column with calculation |
| `df.drop('col', axis=1)` | Delete column |
| `df.drop([0, 1], axis=0)` | Delete rows by index |
| `df.rename(columns={'old': 'new'})` | Rename columns |
| `df.sort_values('col', ascending=False)` | Sort by column values descending |
| `df.sort_values(['a', 'b'], ascending=[True, False])` | Sort by multiple columns |
| `df.sort_index()` | Sort by index |
| `df.reset_index(drop=True)` | Reset index to default integer index |
| `df.set_index('col')` | Set column as index |
| `df['col'].unique()` | Get unique values in column |
| `df['col'].value_counts()` | Count occurrences of each value |
| `df.drop_duplicates()` | Remove duplicate rows |
| `df.drop_duplicates(subset=['col1', 'col2'])` | Remove duplicates based on specific columns |
| `df.fillna(0)` | Fill missing values with 0 |
| `df.dropna()` | Remove rows with any missing values |
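These steps chain naturally since each returns a new DataFrame; a minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2, 2], 'b': [10, 20, 30, 30]})

# Derived column, then drop duplicates and sort
df['total'] = df['a'] + df['b']
df = (df.drop_duplicates()
        .sort_values('total', ascending=False)
        .reset_index(drop=True))
print(df)
```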
Basic Commands - Aggregation
| Command | Description |
|---|---|
| `df['col'].sum()` | Sum of column values |
| `df['col'].mean()` | Mean of column values |
| `df['col'].median()` | Median of column values |
| `df['col'].std()` | Standard deviation |
| `df['col'].var()` | Variance |
| `df['col'].min()` | Minimum value |
| `df['col'].max()` | Maximum value |
| `df['col'].quantile(0.75)` | Get 75th percentile |
| `df.groupby('col').sum()` | Group by column and sum |
| `df.groupby('col')['val'].mean()` | Group by and calculate mean of specific column |
| `df.groupby('col').size()` | Count rows per group |
| `df.groupby('col').agg(['sum', 'mean'])` | Multiple aggregations |
| `df.corr(numeric_only=True)` | Correlation matrix of numeric columns |
| `df['a'].corr(df['b'])` | Correlation between two columns |
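A small sketch combining a grouped aggregation with a plain column statistic, using toy data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'points': [10, 12, 8, 15]})

# Per-group sum and mean in one pass
print(df.groupby('team')['points'].agg(['sum', 'mean']))

# 75th percentile across the whole column
print(df['points'].quantile(0.75))
```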
Advanced Usage - Complex Selection & Filtering
| Command | Description |
|---|---|
| `df.select_dtypes(include='number')` | Select columns by data type |
| `df.select_dtypes(exclude='object')` | Exclude columns by data type |
| `df[df['col'].str.contains('pattern')]` | Filter by string pattern |
| `df[df['col'].str.startswith('prefix')]` | Filter by string prefix |
| `df[df['col'].str.endswith('suffix')]` | Filter by string suffix |
| `df[df['col'].str.match(r'^A\d+')]` | Filter by regex pattern |
| `df.loc[lambda d: d['col'] > 5]` | Filter using callable/lambda |
| `df.at[0, 'col']` | Fast scalar value access |
| `df.iat[0, 0]` | Fast scalar value access by position |
| `df.xs('key', level='level_name')` | Cross-section selection in MultiIndex |
| `df.filter(like='pattern')` | Filter columns by name pattern |
| `df.filter(regex='^col_')` | Filter columns by regex |
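A sketch of dtype-based selection, string filtering, and fast scalar access, using toy data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Ben', 'Anton'],
                   'age': [34, 28, 41],
                   'note': ['x1', 'y2', 'x3']})

# Keep only numeric columns
print(df.select_dtypes(include='number'))

# Rows whose name starts with 'An'
print(df[df['name'].str.startswith('An')])

# Fast scalar access by label and by position
print(df.at[0, 'age'], df.iat[0, 1])
```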
Advanced Usage - Data Transformation
| Command | Description |
|---|---|
| `df['col'].apply(lambda x: x * 2)` | Apply function to each element |
| `df.apply(lambda col: col.max() - col.min())` | Apply function to each column |
| `df.applymap(func)` | Apply function element-wise (deprecated, use `df.map()`) |
| `df.map(func)` | Apply function element-wise to DataFrame (pandas 2.1+) |
| `df['col'].map({'old': 'new'})` | Map values using dictionary |
| `df.replace('old', 'new')` | Replace values |
| `df.replace([np.inf, -np.inf], np.nan)` | Replace infinite values with NaN |
| `df.astype({'col': 'int32'})` | Change data types |
| `pd.to_datetime(df['col'])` | Convert to datetime |
| `pd.to_numeric(df['col'], errors='coerce')` | Convert to numeric, invalid → NaN |
| `df['col'].str.lower()` | Convert strings to lowercase |
| `df['col'].str.strip()` | Remove leading/trailing whitespace |
| `df['col'].str.split(',', expand=True)` | Split strings into columns |
| `pd.cut(df['col'], bins=5)` | Bin continuous values into intervals |
| `pd.qcut(df['col'], q=4)` | Bin values into quantile-based intervals |
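A short cleaning-and-binning sketch, assuming a toy frame with one messy string column and one mixed numeric column:

```python
import pandas as pd

df = pd.DataFrame({'raw': [' A ', 'b', ' C'], 'val': ['1', '2', 'oops']})

# Clean strings: strip whitespace, lowercase
df['raw'] = df['raw'].str.strip().str.lower()

# Convert to numeric; the invalid entry becomes NaN
df['val'] = pd.to_numeric(df['val'], errors='coerce')

# Bin the numeric column into two equal-width intervals
df['bucket'] = pd.cut(df['val'], bins=2)
print(df)
```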
Advanced Usage - Merging & Joining
| Command | Description |
|---|---|
| `pd.merge(df1, df2, on='key')` | Inner join on key column |
| `pd.merge(df1, df2, on='key', how='left')` | Left join |
| `pd.merge(df1, df2, on='key', how='right')` | Right join |
| `pd.merge(df1, df2, on='key', how='outer')` | Full outer join |
| `pd.merge(df1, df2, left_on='a', right_on='b')` | Join on different column names |
| `pd.merge(df1, df2, on=['key1', 'key2'])` | Join on multiple keys |
| `pd.merge(df1, df2, on='key', indicator=True)` | Add merge indicator column |
| `df1.join(df2, lsuffix='_l', rsuffix='_r')` | Index-based join with suffixes |
| `pd.concat([df1, df2])` | Concatenate vertically |
| `pd.concat([df1, df2], axis=1)` | Concatenate horizontally |
| `pd.concat([df1, df2], keys=['a', 'b'])` | Concatenate with hierarchical index |
| `pd.merge_asof(df1, df2, on='time')` | Merge on nearest key (time series) |
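A sketch of a left join plus a vertical concat, using hypothetical `orders` and `customers` frames:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 2], 'amount': [50, 20, 70]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ada', 'Bo', 'Cy']})

# Left join keeps every order and adds customer details where they match;
# indicator=True appends a '_merge' column showing each row's origin
merged = pd.merge(orders, customers, on='customer_id', how='left', indicator=True)
print(merged)

# Stack the two frames vertically (columns differ, so NaN fills the gaps)
print(pd.concat([orders, customers], ignore_index=True))
```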
Advanced Usage - Reshaping & Pivoting
| Command | Description |
|---|---|
| `df.pivot(index='date', columns='city', values='temp')` | Reshape data (pivot) |
| `df.pivot_table(index='a', columns='b', values='c', aggfunc='mean')` | Create pivot table with aggregation |
| `pd.melt(df, id_vars=['id'], value_vars=['a', 'b'])` | Unpivot DataFrame (wide to long) |
| `df.stack()` | Pivot columns to rows (wide to long) |
| `df.unstack()` | Pivot rows to columns (long to wide) |
| `pd.crosstab(df['a'], df['b'])` | Create cross-tabulation |
| `pd.crosstab(df['a'], df['b'], margins=True)` | Cross-tabulation with totals |
| `df.T` or `df.transpose()` | Transpose DataFrame (swap rows/columns) |
| `df.explode('col')` | Transform list-like values into rows |
| `pd.get_dummies(df['col'])` | One-hot encode categorical variable |
| `pd.get_dummies(df, columns=['a', 'b'])` | One-hot encode multiple columns |
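A round trip between long and wide form, sketched with a toy temperature table:

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                   'city': ['Berlin', 'Paris', 'Berlin', 'Paris'],
                   'temp': [7, 9, 6, 11]})

# Long to wide: one row per date, one column per city
wide = df.pivot(index='date', columns='city', values='temp')
print(wide)

# And back from wide to long
long = wide.reset_index().melt(id_vars='date', value_name='temp')
print(long)
```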
Advanced Usage - GroupBy Operations
| Command | Description |
|---|---|
| `df.groupby('col').agg(['sum', 'mean', 'count'])` | Multiple aggregations |
| `df.groupby('col').agg({'a': 'sum', 'b': 'mean'})` | Different aggregations per column |
| `df.groupby('col').agg(total=('a', 'sum'), avg=('b', 'mean'))` | Named aggregations |
| `df.groupby('col')['val'].transform('mean')` | Transform with group-wise operation |
| `df.groupby('col').filter(lambda g: len(g) > 2)` | Filter groups by condition |
| `df.groupby('col').first()` | Get first row of each group |
| `df.groupby('col')['val'].cumsum()` | Cumulative sum within groups |
| `df.groupby('col')['val'].rank()` | Rank within groups |
| `df.groupby('col')['val'].shift(1)` | Shift values within groups |
| `df.groupby('col')['val'].expanding().mean()` | Expanding window within groups |
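The key distinction here is that `transform` returns a result aligned to the original rows, while `filter` drops whole groups. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B'],
                   'points': [10, 12, 8, 15, 9]})

# Broadcast the group mean back onto every row
df['team_avg'] = df.groupby('team')['points'].transform('mean')

# Keep only teams with at least 3 rows
big_teams = df.groupby('team').filter(lambda g: len(g) >= 3)

# Running total within each team
df['running'] = df.groupby('team')['points'].cumsum()
print(df, big_teams, sep='\n')
```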
Advanced Usage - Time Series
| Command | Description |
|---|---|
| `df.set_index(pd.to_datetime(df['date']))` | Set datetime column as index |
| `df.resample('D').mean()` | Resample to daily frequency with mean |
| `df.resample('M').sum()` | Resample to monthly frequency with sum |
| `df.resample('W').agg(['min', 'max'])` | Resample with multiple aggregations |
| `df['col'].rolling(window=7).mean()` | 7-day rolling average |
| `df['col'].rolling(window=7).sum()` | 7-day rolling sum |
| `df['col'].expanding().mean()` | Expanding window mean |
| `df['col'].ewm(span=12).mean()` | Exponentially weighted moving average |
| `df['col'].shift(1)` | Shift values down by 1 period |
| `df['col'].diff()` | Calculate difference with previous row |
| `df['col'].pct_change()` | Calculate percentage change |
| `pd.date_range('2024-01-01', periods=30, freq='D')` | Generate date range |
| `df['date'].dt.year` | Extract year from datetime |
| `df['date'].dt.month` | Extract month from datetime |
| `df['date'].dt.dayofweek` | Extract day of week (0=Monday) |
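Rolling windows, returns, and resampling together, sketched on a hypothetical 30-day price series:

```python
import pandas as pd
import numpy as np

# Hypothetical daily series over 30 days
idx = pd.date_range('2024-01-01', periods=30, freq='D')
df = pd.DataFrame({'price': np.linspace(100, 130, 30)}, index=idx)

# 7-day rolling mean and day-over-day percentage change
df['ma7'] = df['price'].rolling(window=7).mean()
df['ret'] = df['price'].pct_change()

# Downsample to weekly totals
print(df['price'].resample('W').sum())
```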
Configuration

Display Options
```python
import pandas as pd

# Set maximum rows to display
pd.set_option('display.max_rows', 100)

# Set maximum columns to display
pd.set_option('display.max_columns', 50)

# Set column width
pd.set_option('display.max_colwidth', 100)

# Set float display precision
pd.set_option('display.precision', 2)

# Print wide frames on one line instead of wrapping
pd.set_option('display.expand_frame_repr', False)

# Show full dataframe without truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Reset all options to defaults
pd.reset_option('all')

# Get current option value
pd.get_option('display.max_rows')
```
Performance Options
```python
# Use copy-on-write mode (pandas 2.0+)
pd.options.mode.copy_on_write = True

# Enable accelerated computation backends if installed
pd.set_option('compute.use_bottleneck', True)
pd.set_option('compute.use_numexpr', True)

# Control the chained assignment warning
pd.set_option('mode.chained_assignment', 'warn')  # or 'raise' or None
```
I/O Options

```python
# Set the default engine for writing .xlsx files
pd.set_option('io.excel.xlsx.writer', 'xlsxwriter')
```
Common Use Cases

Use Case 1: Data Cleaning Pipeline
```python
import pandas as pd
import numpy as np

# Read data
df = pd.read_csv('raw_data.csv')

# Remove duplicates
df = df.drop_duplicates()

# Handle missing values (assignment instead of chained inplace fillna,
# which is deprecated in pandas 2.x)
df['column1'] = df['column1'].fillna(df['column1'].mean())
df = df.dropna(subset=['critical_column'])

# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Remove outliers (values beyond 3 standard deviations)
df = df[np.abs(df['value'] - df['value'].mean()) <= (3 * df['value'].std())]

# Standardize text
df['name'] = df['name'].str.strip().str.lower()

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
```
Use Case 2: Exploratory Data Analysis
```python
import pandas as pd

# Load data
df = pd.read_csv('sales_data.csv')

# Basic statistics
print(df.describe())
df.info()

# Check for missing values
print(df.isnull().sum())

# Analyze categorical variables
print(df['category'].value_counts())
print(df['region'].value_counts(normalize=True))

# Correlation analysis (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)

# Group analysis
sales_by_region = df.groupby('region').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': 'sum'
})
print(sales_by_region)

# Time-based analysis
df['date'] = pd.to_datetime(df['date'])
monthly_sales = df.set_index('date').resample('M')['sales'].sum()
print(monthly_sales)
```
Use Case 3: Data Integration from Multiple Sources
```python
import pandas as pd

# Read from different sources
customers = pd.read_csv('customers.csv')
orders = pd.read_excel('orders.xlsx')
products = pd.read_json('products.json')

# Merge datasets
df = pd.merge(orders, customers, on='customer_id', how='left')
df = pd.merge(df, products, on='product_id', how='left')

# Calculate derived metrics
df['total_price'] = df['quantity'] * df['unit_price']
df['profit'] = df['total_price'] - df['cost']

# Aggregate by customer
customer_summary = df.groupby('customer_id').agg({
    'order_id': 'count',
    'total_price': 'sum',
    'profit': 'sum'
}).rename(columns={'order_id': 'order_count'})

# Export results
customer_summary.to_excel('customer_summary.xlsx')
```
Use Case 4: Time Series Analysis
```python
import pandas as pd

# Read time series data
df = pd.read_csv('stock_prices.csv', parse_dates=['date'])
df = df.set_index('date')

# Calculate returns
df['daily_return'] = df['close'].pct_change()

# Calculate moving averages
df['MA_7'] = df['close'].rolling(window=7).mean()
df['MA_30'] = df['close'].rolling(window=30).mean()

# Calculate exponential moving average
df['EMA_12'] = df['close'].ewm(span=12).mean()

# Resample to weekly data
weekly_df = df.resample('W').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum'
})

# Calculate volatility (rolling standard deviation)
df['volatility'] = df['daily_return'].rolling(window=30).std()

# Export analysis
df.to_csv('stock_analysis.csv')
```
Use Case 5: Data Transformation for Machine Learning
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('ml_dataset.csv')

# Handle missing values
df = df.fillna(df.median(numeric_only=True))

# Encode categorical variables
df = pd.get_dummies(df, columns=['category', 'region'], drop_first=True)

# Create features
df['feature_ratio'] = df['feature1'] / (df['feature2'] + 1)
df['feature_interaction'] = df['feature1'] * df['feature2']

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)
```
Best Practices

- **Use `inplace=False` by default**: Avoid `inplace=True`; keeping operations immutable preserves the original data and makes debugging easier. Assign results to new variables instead.
- **Prefer vectorized operations**: Use pandas' built-in vectorized operations instead of Python loops. Operations like `df['col'] * 2` are much faster than `df['col'].apply(lambda x: x * 2)`.
- **Chain methods for readable code**: Use method chaining with sensible formatting: `df.query('col > 5').groupby('category').sum()`.
- **Specify dtypes when reading data**: Use the `dtype` parameter of `read_csv()` to reduce memory usage and improve performance: `pd.read_csv('file.csv', dtype={'col1': 'int32'})`.
- **Use the categorical dtype for low-cardinality columns**: Convert string columns with few unique values to categorical to save memory: `df['category'] = df['category'].astype('category')`.
- **Handle missing values explicitly**: Always check for missing values deliberately instead of letting them propagate: run `df.isnull().sum()` before processing.
- **Use `.copy()` when needed**: When creating a subset you intend to modify, take an explicit `.copy()` so changes don't affect (or warn about) the original frame; see the sketch below.
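A minimal sketch of the inplace-free, chained style and the explicit-copy habit recommended above, using toy data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'b', 'a', 'b'], 'col': [3, 7, 9, 6]})

# One readable chain instead of repeated inplace mutations
result = (
    df.query('col > 5')
      .groupby('category', as_index=False)
      .sum()
)

# Take an explicit copy of a subset you intend to modify
subset = df[df['col'] > 5].copy()
subset['col'] = subset['col'] * 2
print(result, subset, sep='\n')
```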