Scikit-learn Cheatsheet

Installation
| Platform | Command |
|---|---|
| pip (all platforms) | INLINE_CODE_5 |
| conda | INLINE_CODE_6 |
| Ubuntu/Debian | INLINE_CODE_7 |
| macOS (Homebrew) | INLINE_CODE_8 |
| Specific version | INLINE_CODE_9 |
| With dependencies | INLINE_CODE_10 |
| Virtual environment | INLINE_CODE_11 |
| Verify installation | INLINE_CODE_12 |
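Once installed, a quick sanity check from Python (the version printed will vary by environment):

```python
# Minimal post-install check
import sklearn
print(sklearn.__version__)  # prints the installed version
sklearn.show_versions()     # detailed report of sklearn and its dependencies
```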
Basic Commands - Loading & Splitting Data
| Command | Description |
|---|---|
| INLINE_CODE_13 | Import datasets module |
| INLINE_CODE_14 | Load iris dataset |
| INLINE_CODE_15 | Extract features and labels |
| INLINE_CODE_16 | Load handwritten digits dataset |
| INLINE_CODE_17 | Load wine classification dataset |
| INLINE_CODE_18 | Load breast cancer dataset |
| INLINE_CODE_19 | Import train/test split function |
| INLINE_CODE_20 | Split data 80/20 train/test |
| INLINE_CODE_21 | Split with fixed random seed |
| INLINE_CODE_22 | Stratified split (preserves class distribution) |
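Putting these together, a minimal load-and-split sketch using only the calls listed above:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load iris and extract features/labels
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 80/20 split, reproducible seed, class distribution preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```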
Basic Commands - Preprocessing
| Command | Description |
|---|---|
| INLINE_CODE_23 | Import standardization scaler |
| INLINE_CODE_24 | Create scaler instance |
| INLINE_CODE_25 | Compute mean and std from training data |
| INLINE_CODE_26 | Apply scaling transformation |
| INLINE_CODE_27 | Fit and transform in one step |
| INLINE_CODE_28 | Import min-max normalizer |
| INLINE_CODE_29 | Create normalizer for 0-1 range |
| INLINE_CODE_30 | Import label encoder |
| INLINE_CODE_31 | Create encoder instance |
| INLINE_CODE_32 | Encode categorical labels to integers |
| INLINE_CODE_33 | Decode integers back to original labels |
| INLINE_CODE_34 | Import one-hot encoder |
| INLINE_CODE_35 | Create dummy variables from categories |
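The key pattern behind these commands is fit-on-train, transform-on-both; a short sketch with toy data assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

# Learn mean/std from training data only, then apply to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # no refitting on test data

# Encode string labels to integers and decode back
le = LabelEncoder()
y_enc = le.fit_transform(['cat', 'dog', 'cat', 'bird'])
print(y_enc)                        # [1 2 1 0] (classes sorted alphabetically)
print(le.inverse_transform(y_enc))  # ['cat' 'dog' 'cat' 'bird']
```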
Basic Commands - Classification
| Command | Description |
|---|---|
| INLINE_CODE_36 | Import logistic regression |
| INLINE_CODE_37 | Create logistic regression model |
| INLINE_CODE_38 | Train model on training data |
| INLINE_CODE_39 | Make predictions on test data |
| INLINE_CODE_40 | Get prediction probabilities |
| INLINE_CODE_41 | Import decision tree classifier |
| INLINE_CODE_42 | Create decision tree with max depth |
| INLINE_CODE_43 | Import K-nearest neighbors |
| INLINE_CODE_44 | Create KNN with 5 neighbors |
| INLINE_CODE_45 | Import Gaussian Naive Bayes |
| INLINE_CODE_46 | Create Naive Bayes classifier |
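A minimal train/predict sketch for these classifiers (LogisticRegression shown; the tree, KNN, and Naive Bayes estimators are drop-in replacements):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)             # train on the training split
print(clf.predict(X_test[:3]))        # hard class labels
print(clf.predict_proba(X_test[:3]))  # per-class probabilities
```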
Basic Commands - Regression
| Command | Description |
|---|---|
| INLINE_CODE_47 | Import linear regression |
| INLINE_CODE_48 | Create linear regression model |
| INLINE_CODE_49 | Train regression model |
| INLINE_CODE_50 | Predict continuous values |
| INLINE_CODE_51 | Access model coefficients |
| INLINE_CODE_52 | Access model intercept |
| INLINE_CODE_53 | Import Ridge regression (L2) |
| INLINE_CODE_54 | Create Ridge with regularization strength |
| INLINE_CODE_55 | Import Lasso regression (L1) |
| INLINE_CODE_56 | Create Lasso with regularization |
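A small sketch on synthetic data (y = 2x + 1 plus noise) showing fitted coefficients and the regularized variants:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=100)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)  # recovered slope ~2.0, intercept ~1.0

# Ridge (L2) and Lasso (L1); alpha sets regularization strength
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.predict([[5.0]]))     # prediction near 11.0
```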
Basic Commands - Model Evaluation
| Command | Description |
|---|---|
| INLINE_CODE_57 | Import accuracy metric |
| INLINE_CODE_58 | Calculate accuracy percentage |
| INLINE_CODE_59 | Import detailed classification metrics |
| INLINE_CODE_60 | Get precision, recall, F1-score |
| INLINE_CODE_61 | Import confusion matrix |
| INLINE_CODE_62 | Create confusion matrix |
| INLINE_CODE_63 | Import MSE metric |
| INLINE_CODE_64 | Calculate mean squared error |
| INLINE_CODE_65 | Calculate RMSE |
| INLINE_CODE_66 | Import R² metric |
| INLINE_CODE_67 | Calculate R² score |
| INLINE_CODE_68 | Get default score (accuracy or R²) |
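The classification and regression metrics side by side, on toy predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error, r2_score)

# Classification metrics
y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))         # 0.8
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class

# Regression metrics; RMSE is the square root of MSE
y_true_r, y_pred_r = [3.0, -0.5, 2.0], [2.5, 0.0, 2.0]
mse = mean_squared_error(y_true_r, y_pred_r)
print(np.sqrt(mse), r2_score(y_true_r, y_pred_r))
```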
Basic Commands - Cross-Validation
| Command | Description |
|---|---|
| INLINE_CODE_69 | Import cross-validation |
| INLINE_CODE_70 | Perform 5-fold cross-validation |
| INLINE_CODE_71 | 10-fold cross-validation |
| INLINE_CODE_72 | Get mean cross-validation score |
| INLINE_CODE_73 | Get standard deviation of scores |
| INLINE_CODE_74 | Import multi-metric CV |
| INLINE_CODE_75 | CV with multiple metrics |
| INLINE_CODE_76 | Import K-fold splitter |
| INLINE_CODE_77 | Create K-fold object |
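A compact cross-validation sketch combining the calls above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the mean gives the estimate, the std the spread across folds
scores = cross_val_score(model, X, y, cv=5)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")

# An explicit splitter adds control over shuffling and seeding
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
```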
Advanced Usage - Ensemble Methods
| Command | Description |
|---|---|
| INLINE_CODE_78 | Import random forest |
| INLINE_CODE_79 | Create random forest with 100 trees |
| INLINE_CODE_80 | Get feature importance scores |
| INLINE_CODE_81 | Import gradient boosting |
| INLINE_CODE_82 | Create gradient boosting model |
| INLINE_CODE_83 | Import AdaBoost |
| INLINE_CODE_84 | Create AdaBoost classifier |
| INLINE_CODE_85 | Import voting ensemble |
| INLINE_CODE_86 | Combine multiple models |
| INLINE_CODE_87 | Import bagging ensemble |
| INLINE_CODE_88 | Create bagging ensemble |
| INLINE_CODE_89 | Import stacking ensemble |
| INLINE_CODE_90 | Stack models with meta-learner |
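A short sketch of a random forest plus a soft-voting ensemble (iris used as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forest of 100 trees; importances sum to 1 across features
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.feature_importances_)

# Soft voting averages predicted probabilities across models
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50))],
    voting='soft'
).fit(X, y)
```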
Advanced Usage - Support Vector Machines
| Command | Description |
|---|---|
| INLINE_CODE_91 | Import SVM classifier |
| INLINE_CODE_92 | Create RBF kernel SVM |
| INLINE_CODE_93 | Linear kernel SVM |
| INLINE_CODE_94 | Polynomial kernel SVM |
| INLINE_CODE_95 | Enable probability estimates |
| INLINE_CODE_96 | Import SVM regressor |
| INLINE_CODE_97 | Create SVM regressor |
| INLINE_CODE_98 | Import linear SVM (faster) |
| INLINE_CODE_99 | Linear SVM for large datasets |
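A minimal SVM sketch; note that probability=True adds an internal cross-validation step and slows training:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# RBF-kernel SVM with probability estimates enabled
svc = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True).fit(X, y)
print(svc.predict_proba(X[:2]))

# LinearSVC uses a different solver and scales better to large datasets
lin = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
```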
Advanced Usage - Clustering
| Command | Description |
|---|---|
| INLINE_CODE_100 | Import K-means clustering |
| INLINE_CODE_101 | Create K-means with 3 clusters |
| INLINE_CODE_102 | Fit and get cluster labels |
| INLINE_CODE_103 | Get cluster centroids |
| INLINE_CODE_104 | Get within-cluster sum of squares |
| INLINE_CODE_105 | Import DBSCAN clustering |
| INLINE_CODE_106 | Density-based clustering |
| INLINE_CODE_107 | Import hierarchical clustering |
| INLINE_CODE_108 | Hierarchical clustering |
| INLINE_CODE_109 | Import mean shift clustering |
| INLINE_CODE_110 | Mean shift clustering |
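A clustering sketch on synthetic blobs (the eps and min_samples values below are illustrative, not recommendations):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-means: labels, centroids, and inertia (within-cluster sum of squares)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.cluster_centers_.shape, km.inertia_)

# DBSCAN infers the number of clusters; label -1 marks noise points
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```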
Advanced Usage - Dimensionality Reduction
| Command | Description |
|---|---|
| INLINE_CODE_111 | Import PCA |
| INLINE_CODE_112 | Create PCA with 2 components |
| INLINE_CODE_113 | Transform to principal components |
| INLINE_CODE_114 | Get variance explained by each component |
| INLINE_CODE_115 | Get principal component vectors |
| INLINE_CODE_116 | Keep components explaining 95% variance |
| INLINE_CODE_117 | Import t-SNE |
| INLINE_CODE_118 | t-SNE dimensionality reduction |
| INLINE_CODE_119 | Import truncated SVD |
| INLINE_CODE_120 | SVD for sparse matrices |
| INLINE_CODE_121 | Import LDA |
| INLINE_CODE_122 | Supervised dimensionality reduction |
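A PCA sketch covering both a fixed component count and variance-based selection:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured per component

# A float in (0, 1) keeps as many components as needed for that variance
pca95 = PCA(n_components=0.95).fit(X)
print(pca95.n_components_)
```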
Advanced Usage - Pipeline Building
| Command | Description |
|---|---|
| INLINE_CODE_123 | Import pipeline |
| INLINE_CODE_124 | Create processing pipeline |
| INLINE_CODE_125 | Fit entire pipeline |
| INLINE_CODE_126 | Predict using pipeline |
| INLINE_CODE_127 | Import pipeline maker |
| INLINE_CODE_128 | Auto-name pipeline steps |
| INLINE_CODE_129 | Access specific pipeline step |
| INLINE_CODE_130 | Import column transformer |
| INLINE_CODE_131 | Different preprocessing per column |
| INLINE_CODE_132 | Get all pipeline parameters |
| INLINE_CODE_133 | Set nested pipeline parameters |
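A minimal make_pipeline sketch; step names are derived from the lowercased class names:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Scaling and classification fitted as one object
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# Access a fitted step by its auto-generated name
print(pipe.named_steps['standardscaler'].mean_)
```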
Advanced Usage - Hyperparameter Tuning
| Command | Description |
|---|---|
| INLINE_CODE_134 | Import grid search |
| INLINE_CODE_135 | Exhaustive parameter search |
| INLINE_CODE_136 | Run grid search |
| INLINE_CODE_137 | Get best parameters found |
| INLINE_CODE_138 | Get best cross-validation score |
| INLINE_CODE_139 | Get best model |
| INLINE_CODE_140 | Get detailed CV results |
| INLINE_CODE_141 | Import randomized search |
| INLINE_CODE_142 | Random parameter sampling |
| INLINE_CODE_143 | Import distributions for random search |
| INLINE_CODE_144 | Define parameter distribution |
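A randomized-search sketch; scipy's randint supplies the distributions to sample from:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample 20 parameter combinations instead of exhausting a grid
param_dist = {'n_estimators': randint(50, 300), 'max_depth': randint(2, 12)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```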
Advanced Usage - Feature Selection
| Command | Description |
|---|---|
| INLINE_CODE_145 | Import K-best selector |
| INLINE_CODE_146 | Select top 10 features |
| INLINE_CODE_147 | Import scoring functions |
| INLINE_CODE_148 | Chi-squared feature selection |
| INLINE_CODE_149 | Import recursive feature elimination |
| INLINE_CODE_150 | Recursive feature elimination |
| INLINE_CODE_151 | Import model-based selection |
| INLINE_CODE_152 | Select features by importance |
| INLINE_CODE_153 | Get boolean mask of selected features |
| INLINE_CODE_154 | Transform to selected features only |
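A SelectKBest sketch (k=2 chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores against y
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the kept features
print(X_selected.shape)        # (150, 2)
```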
Advanced Usage - Model Persistence
| Command | Description |
|---|---|
| INLINE_CODE_155 | Import joblib for model saving |
| INLINE_CODE_156 | Save model to file |
| INLINE_CODE_157 | Load model from file |
| INLINE_CODE_158 | Import pickle module |
| INLINE_CODE_159 | Save with pickle |
| INLINE_CODE_160 | Load with pickle |
| INLINE_CODE_161 | Save with compression |
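A save/reload round trip with joblib ('model.joblib' is an arbitrary filename). As with pickle, only load model files from sources you trust:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# compress=3 trades write speed for a smaller file
joblib.dump(model, 'model.joblib', compress=3)
restored = joblib.load('model.joblib')
print(restored.predict(X[:3]))
```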
Configuration
Model Parameter Configuration
```python
# Logistic Regression parameters
LogisticRegression(
    penalty='l2',        # Regularization type: 'l1', 'l2', 'elasticnet', or None
    C=1.0,               # Inverse regularization strength (smaller = stronger)
    solver='lbfgs',      # Algorithm: 'lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'
    max_iter=100,        # Maximum iterations
    random_state=42,     # Random seed for reproducibility
    n_jobs=-1            # Use all CPU cores
)

# Random Forest parameters
RandomForestClassifier(
    n_estimators=100,        # Number of trees
    criterion='gini',        # Split quality: 'gini' or 'entropy'
    max_depth=None,          # Maximum tree depth (None = unlimited)
    min_samples_split=2,     # Minimum samples to split a node
    min_samples_leaf=1,      # Minimum samples in a leaf
    max_features='sqrt',     # Features per split: 'sqrt', 'log2', int, float
    bootstrap=True,          # Bootstrap sampling
    oob_score=False,         # Out-of-bag score estimation
    n_jobs=-1,               # Parallel jobs
    random_state=42,
    class_weight='balanced'  # Handle imbalanced classes
)

# Support Vector Machine parameters
SVC(
    C=1.0,              # Regularization parameter
    kernel='rbf',       # Kernel: 'linear', 'poly', 'rbf', 'sigmoid'
    degree=3,           # Polynomial degree (for 'poly' kernel)
    gamma='scale',      # Kernel coefficient: 'scale', 'auto', or float
    coef0=0.0,          # Independent term in kernel
    probability=False,  # Enable probability estimates (slower)
    cache_size=200,     # Kernel cache size (MB)
    class_weight=None,  # Class weights
    max_iter=-1         # Iteration limit (-1 = no limit)
)

# Gradient Boosting parameters
GradientBoostingClassifier(
    loss='log_loss',           # Loss function
    learning_rate=0.1,         # Shrinks the contribution of each tree
    n_estimators=100,          # Number of boosting stages
    subsample=1.0,             # Fraction of samples used for fitting
    criterion='friedman_mse',  # Split quality measure
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,               # Maximum tree depth
    max_features=None,         # Features per split
    validation_fraction=0.1,   # Fraction held out for early stopping
    n_iter_no_change=None,     # Early stopping rounds
    random_state=42
)
```
Pipeline Configuration
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Complete pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    ))
])

# Pipeline with column-specific preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'score']),
        ('cat', OneHotEncoder(drop='first'), ['category', 'region'])
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
```
Cross-Validation Configuration
```python
from sklearn.model_selection import cross_validate

# Multi-metric cross-validation
cv_results = cross_validate(
    estimator=model,
    X=X,
    y=y,
    cv=5,                     # Number of folds
    scoring={
        'accuracy': 'accuracy',
        'precision': 'precision_weighted',
        'recall': 'recall_weighted',
        'f1': 'f1_weighted',
        'roc_auc': 'roc_auc_ovr'
    },
    return_train_score=True,  # Include training scores
    return_estimator=True,    # Return fitted estimators
    n_jobs=-1,                # Parallel processing
    verbose=1                 # Progress messages
)
```
Grid Search Configuration
```python
from sklearn.model_selection import GridSearchCV

# Comprehensive parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='f1_weighted',    # Scoring metric
    cv=5,                     # Cross-validation folds
    n_jobs=-1,                # Use all cores
    verbose=2,                # Verbosity level
    refit=True,               # Refit best model on full data
    return_train_score=True,  # Return training scores
    error_score='raise'       # Raise on fit errors instead of ignoring them
)
```
Common Use Cases
Use Case 1: Binary Classification with Imbalanced Data
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with class balancing
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)
model.fit(X_train_scaled, y_train)

# Evaluate
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, predictions))
print(f"ROC-AUC Score: {roc_auc_score(y_test, probabilities):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))
```
Use Case 2: Multi-Class Text Classification Pipeline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Sample text data
texts = [
    "Python is great for machine learning",
    "Java is used for enterprise applications",
    "JavaScript powers web development",
    # ... more texts
]
labels = ['tech', 'tech', 'tech']  # ... more category labels

# Build the text classification pipeline
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        stop_words='english'
    )),
    ('clf', LogisticRegression(max_iter=1000))
])

# Evaluate with cross-validation
# (assumes enough samples per class for 5 folds)
scores = cross_val_score(text_clf, texts, labels, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```