Scikit-learn Cheatsheet

Installation

| Platform | Command |
| --- | --- |
| pip (all platforms) | `pip install scikit-learn` |
| conda | `conda install scikit-learn` |
| Ubuntu/Debian | `sudo apt-get install python3-sklearn` |
| macOS (Homebrew) | `brew install scikit-learn` |
| Specific version | `pip install scikit-learn==<version>` |
| With dependencies | `pip install scikit-learn numpy scipy` |
| Virtual environment | `python -m venv env && source env/bin/activate && pip install scikit-learn` |
| Verify installation | `python -c "import sklearn; print(sklearn.__version__)"` |
Basic Commands - Loading Data & Splitting

| Command | Description |
| --- | --- |
| `from sklearn import datasets` | Import datasets module |
| `iris = datasets.load_iris()` | Load iris dataset |
| `X, y = iris.data, iris.target` | Extract features and labels |
| `digits = datasets.load_digits()` | Load handwritten digits dataset |
| `wine = datasets.load_wine()` | Load wine classification dataset |
| `cancer = datasets.load_breast_cancer()` | Load breast cancer dataset |
| `from sklearn.model_selection import train_test_split` | Import train/test split function |
| `train_test_split(X, y, test_size=0.2)` | Split data 80/20 train/test |
| `train_test_split(X, y, random_state=42)` | Split with fixed random seed |
| `train_test_split(X, y, stratify=y)` | Stratified split (preserves class distribution) |
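
Putting these together, a minimal sketch of loading a bundled dataset and making a reproducible, stratified split (the dataset and variable names are illustrative):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Stratified 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```
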
Basic Commands - Preprocessing

| Command | Description |
| --- | --- |
| `from sklearn.preprocessing import StandardScaler` | Import standardization scaler |
| `scaler = StandardScaler()` | Create scaler instance |
| `scaler.fit(X_train)` | Compute mean and std from training data |
| `X_scaled = scaler.transform(X_train)` | Apply scaling transformation |
| `X_scaled = scaler.fit_transform(X_train)` | Fit and transform in one step |
| `from sklearn.preprocessing import MinMaxScaler` | Import min-max normalizer |
| `scaler = MinMaxScaler()` | Create normalizer for 0-1 range |
| `from sklearn.preprocessing import LabelEncoder` | Import label encoder |
| `le = LabelEncoder()` | Create encoder instance |
| `y_encoded = le.fit_transform(y)` | Encode categorical labels to integers |
| `le.inverse_transform(y_encoded)` | Decode integers back to original labels |
| `from sklearn.preprocessing import OneHotEncoder` | Import one-hot encoder |
| `OneHotEncoder().fit_transform(X_cat)` | Create dummy variables from categories |
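
A quick sketch of the standard scaling workflow on toy data; the key point is that the scaler is fit on training data only and reused on the test set to avoid leakage:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

le = LabelEncoder()
y_enc = le.fit_transform(['cat', 'dog', 'cat'])  # -> [0, 1, 0]
print(le.inverse_transform(y_enc))               # -> ['cat' 'dog' 'cat']
```
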

Basic Commands - Classification

| Command | Description |
| --- | --- |
| `from sklearn.linear_model import LogisticRegression` | Import logistic regression |
| `model = LogisticRegression()` | Create logistic regression model |
| `model.fit(X_train, y_train)` | Train model on training data |
| `y_pred = model.predict(X_test)` | Make predictions on test data |
| `model.predict_proba(X_test)` | Get prediction probabilities |
| `from sklearn.tree import DecisionTreeClassifier` | Import decision tree classifier |
| `DecisionTreeClassifier(max_depth=5)` | Create decision tree with max depth |
| `from sklearn.neighbors import KNeighborsClassifier` | Import K-nearest neighbors |
| `KNeighborsClassifier(n_neighbors=5)` | Create KNN with 5 neighbors |
| `from sklearn.naive_bayes import GaussianNB` | Import Gaussian Naive Bayes |
| `model = GaussianNB()` | Create Naive Bayes classifier |
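
For context, a small end-to-end classification example; the choice of iris and `max_iter=1000` are illustrative, not prescriptive:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000)  # raised max_iter so the solver converges
model.fit(X_train, y_train)
print(model.predict(X_test)[:5])           # predicted class labels
print(model.predict_proba(X_test)[:2])     # per-class probabilities
```
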

Basic Commands - Regression

| Command | Description |
| --- | --- |
| `from sklearn.linear_model import LinearRegression` | Import linear regression |
| `model = LinearRegression()` | Create linear regression model |
| `model.fit(X_train, y_train)` | Train regression model |
| `y_pred = model.predict(X_test)` | Predict continuous values |
| `model.coef_` | Access model coefficients |
| `model.intercept_` | Access model intercept |
| `from sklearn.linear_model import Ridge` | Import Ridge regression (L2) |
| `Ridge(alpha=1.0)` | Create Ridge with regularization strength |
| `from sklearn.linear_model import Lasso` | Import Lasso regression (L1) |
| `Lasso(alpha=0.1)` | Create Lasso with regularization |
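
A compact regression sketch on the bundled diabetes dataset (chosen here only for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge

X, y = load_diabetes(return_X_y=True)

reg = LinearRegression().fit(X, y)
print(reg.coef_[:3], reg.intercept_)  # first coefficients and the intercept

ridge = Ridge(alpha=1.0).fit(X, y)    # L2-regularized variant
print(ridge.predict(X[:2]))           # predicted continuous values
```
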

Basic Commands - Model Evaluation

| Command | Description |
| --- | --- |
| `from sklearn.metrics import accuracy_score` | Import accuracy metric |
| `accuracy_score(y_test, y_pred)` | Calculate accuracy percentage |
| `from sklearn.metrics import classification_report` | Import detailed classification metrics |
| `classification_report(y_test, y_pred)` | Get precision, recall, F1-score |
| `from sklearn.metrics import confusion_matrix` | Import confusion matrix |
| `confusion_matrix(y_test, y_pred)` | Create confusion matrix |
| `from sklearn.metrics import mean_squared_error` | Import MSE metric |
| `mean_squared_error(y_test, y_pred)` | Calculate mean squared error |
| `np.sqrt(mean_squared_error(y_test, y_pred))` | Calculate RMSE |
| `from sklearn.metrics import r2_score` | Import R² metric |
| `r2_score(y_test, y_pred)` | Calculate R² score |
| `model.score(X_test, y_test)` | Get default score (accuracy or R²) |
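
A toy illustration of the classification metrics; the label vectors are made up to keep the output easy to verify by hand:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))         # 0.8 (4 of 5 correct)
print(confusion_matrix(y_true, y_pred))       # rows: true class, cols: predicted
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class
```
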

Basic Commands - Cross-Validation

| Command | Description |
| --- | --- |
| `from sklearn.model_selection import cross_val_score` | Import cross-validation |
| `scores = cross_val_score(model, X, y, cv=5)` | Perform 5-fold cross-validation |
| `cross_val_score(model, X, y, cv=10)` | 10-fold cross-validation |
| `scores.mean()` | Get mean cross-validation score |
| `scores.std()` | Get standard deviation of scores |
| `from sklearn.model_selection import cross_validate` | Import multi-metric CV |
| `cross_validate(model, X, y, scoring=['accuracy', 'f1_weighted'])` | CV with multiple metrics |
| `from sklearn.model_selection import KFold` | Import K-fold splitter |
| `KFold(n_splits=5, shuffle=True)` | Create K-fold object |
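
A short sketch combining an explicit `KFold` splitter with `cross_val_score`; the estimator choice is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")  # mean and spread over folds
```
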
Advanced Usage - Ensemble Methods

| Command | Description |
| --- | --- |
| `from sklearn.ensemble import RandomForestClassifier` | Import random forest |
| `RandomForestClassifier(n_estimators=100)` | Create random forest with 100 trees |
| `model.feature_importances_` | Get feature importance scores |
| `from sklearn.ensemble import GradientBoostingClassifier` | Import gradient boosting |
| `GradientBoostingClassifier()` | Create gradient boosting model |
| `from sklearn.ensemble import AdaBoostClassifier` | Import AdaBoost |
| `AdaBoostClassifier(n_estimators=50)` | Create AdaBoost classifier |
| `from sklearn.ensemble import VotingClassifier` | Import voting ensemble |
| `VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)])` | Combine multiple models |
| `from sklearn.ensemble import BaggingClassifier` | Import bagging ensemble |
| `BaggingClassifier(n_estimators=10)` | Create bagging ensemble |
| `from sklearn.ensemble import StackingClassifier` | Import stacking ensemble |
| `StackingClassifier(estimators=[...], final_estimator=LogisticRegression())` | Stack models with meta-learner |
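
One way these pieces combine, sketched on iris; soft voting averages the base models' predicted probabilities (the setup here is an assumption for illustration, not from the original table):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.feature_importances_)  # one importance score per feature

# Soft voting averages predicted probabilities across the base models
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('rf', rf)],
    voting='soft'
).fit(X, y)
print(voting.predict(X[:3]))
```
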
Advanced Usage - Support Vector Machines

| Command | Description |
| --- | --- |
| `from sklearn.svm import SVC` | Import SVM classifier |
| `SVC(kernel='rbf')` | Create RBF kernel SVM |
| `SVC(kernel='linear')` | Linear kernel SVM |
| `SVC(kernel='poly', degree=3)` | Polynomial kernel SVM |
| `SVC(probability=True)` | Enable probability estimates |
| `from sklearn.svm import SVR` | Import SVM regressor |
| `SVR(kernel='rbf')` | Create SVM regressor |
| `from sklearn.svm import LinearSVC` | Import linear SVM (faster) |
| `LinearSVC()` | Linear SVM for large datasets |
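
A minimal SVM sketch; note that `probability=True` must be set at construction time for `predict_proba` to work, at extra training cost:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svm = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True).fit(X, y)
print(svm.predict(X[:3]))        # class predictions
print(svm.predict_proba(X[:3]))  # only available with probability=True
```
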
Advanced Usage - Clustering

| Command | Description |
| --- | --- |
| `from sklearn.cluster import KMeans` | Import K-means clustering |
| `kmeans = KMeans(n_clusters=3)` | Create K-means with 3 clusters |
| `labels = kmeans.fit_predict(X)` | Fit and get cluster labels |
| `kmeans.cluster_centers_` | Get cluster centroids |
| `kmeans.inertia_` | Get within-cluster sum of squares |
| `from sklearn.cluster import DBSCAN` | Import DBSCAN clustering |
| `DBSCAN(eps=0.5, min_samples=5)` | Density-based clustering |
| `from sklearn.cluster import AgglomerativeClustering` | Import hierarchical clustering |
| `AgglomerativeClustering(n_clusters=3)` | Hierarchical clustering |
| `from sklearn.cluster import MeanShift` | Import mean shift clustering |
| `MeanShift()` | Mean shift clustering |
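
A small K-means sketch on synthetic data (the data and `n_init=10` are illustrative); the fitted attributes listed above are then available:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # toy 2-D data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index per sample
print(kmeans.cluster_centers_)   # centroid coordinates
print(kmeans.inertia_)           # within-cluster sum of squares
```
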
Advanced Usage - Dimensionality Reduction

| Command | Description |
| --- | --- |
| `from sklearn.decomposition import PCA` | Import PCA |
| `pca = PCA(n_components=2)` | Create PCA with 2 components |
| `X_pca = pca.fit_transform(X)` | Transform to principal components |
| `pca.explained_variance_ratio_` | Get variance explained by each component |
| `pca.components_` | Get principal component vectors |
| `PCA(n_components=0.95)` | Keep components explaining 95% variance |
| `from sklearn.manifold import TSNE` | Import t-SNE |
| `TSNE(n_components=2).fit_transform(X)` | t-SNE dimensionality reduction |
| `from sklearn.decomposition import TruncatedSVD` | Import truncated SVD |
| `TruncatedSVD(n_components=100)` | SVD for sparse matrices |
| `from sklearn.discriminant_analysis import LinearDiscriminantAnalysis` | Import LDA |
| `LinearDiscriminantAnalysis(n_components=2)` | Supervised dimensionality reduction |
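
A brief PCA sketch showing both a fixed component count and the keep-95%-variance form:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project onto 2 components
print(pca.explained_variance_ratio_)   # variance captured per component

pca95 = PCA(n_components=0.95).fit(X)  # keep 95% of total variance
print(pca95.n_components_)             # number of components retained
```
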
Advanced Usage - Pipeline Building

| Command | Description |
| --- | --- |
| `from sklearn.pipeline import Pipeline` | Import pipeline |
| `Pipeline([('scaler', StandardScaler()), ('clf', SVC())])` | Create processing pipeline |
| `pipeline.fit(X_train, y_train)` | Fit entire pipeline |
| `pipeline.predict(X_test)` | Predict using pipeline |
| `from sklearn.pipeline import make_pipeline` | Import pipeline maker |
| `make_pipeline(StandardScaler(), SVC())` | Auto-name pipeline steps |
| `pipeline.named_steps['scaler']` | Access specific pipeline step |
| `from sklearn.compose import ColumnTransformer` | Import column transformer |
| `ColumnTransformer([('num', StandardScaler(), num_cols)])` | Different preprocessing per column |
| `pipeline.get_params()` | Get all pipeline parameters |
| `pipeline.set_params(clf__C=10)` | Set nested pipeline parameters |
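
A short pipeline sketch using `make_pipeline`; note that auto-generated step names are the lowercased class names:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())    # steps named after their class
pipe.fit(X, y)
print(pipe.named_steps['standardscaler'].mean_)  # access a fitted step
pipe.set_params(svc__C=10)                       # nested step__param syntax
```
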
Advanced Usage - Hyperparameter Tuning

| Command | Description |
| --- | --- |
| `from sklearn.model_selection import GridSearchCV` | Import grid search |
| `GridSearchCV(model, param_grid, cv=5)` | Exhaustive parameter search |
| `grid_search.fit(X_train, y_train)` | Run grid search |
| `grid_search.best_params_` | Get best parameters found |
| `grid_search.best_score_` | Get best cross-validation score |
| `grid_search.best_estimator_` | Get best model |
| `grid_search.cv_results_` | Get detailed CV results |
| `from sklearn.model_selection import RandomizedSearchCV` | Import randomized search |
| `RandomizedSearchCV(model, param_distributions, n_iter=100)` | Random parameter sampling |
| `from scipy.stats import uniform, randint` | Import distributions for random search |
| `param_dist = {'C': uniform(0.1, 10)}` | Define parameter distribution |
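
A minimal grid-search sketch; the parameter grid here is deliberately tiny and purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
best_model = grid.best_estimator_  # refit on the full data (refit=True by default)
```
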

Advanced Usage - Feature Selection

| Command | Description |
| --- | --- |
| `from sklearn.feature_selection import SelectKBest` | Import K-best selector |
| `SelectKBest(k=10)` | Select top 10 features |
| `from sklearn.feature_selection import chi2, f_classif` | Import scoring functions |
| `SelectKBest(chi2, k=10)` | Chi-squared feature selection |
| `from sklearn.feature_selection import RFE` | Import recursive feature elimination |
| `RFE(estimator, n_features_to_select=10)` | Recursive feature elimination |
| `from sklearn.feature_selection import SelectFromModel` | Import model-based selection |
| `SelectFromModel(RandomForestClassifier())` | Select features by importance |
| `selector.get_support()` | Get boolean mask of selected features |
| `selector.transform(X)` | Transform to selected features only |
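
A small feature-selection sketch using `SelectKBest` with the ANOVA F-score (`f_classif`); `k=2` is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.get_support())   # boolean mask of selected features
X_selected = selector.transform(X)
print(X_selected.shape)         # (150, 2)
```
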

Advanced Usage - Model Persistence

| Command | Description |
| --- | --- |
| `import joblib` | Import joblib for model saving |
| `joblib.dump(model, 'model.joblib')` | Save model to file |
| `model = joblib.load('model.joblib')` | Load model from file |
| `import pickle` | Import pickle module |
| `pickle.dump(model, open('model.pkl', 'wb'))` | Save with pickle |
| `model = pickle.load(open('model.pkl', 'rb'))` | Load with pickle |
| `joblib.dump(model, 'model.joblib', compress=3)` | Save with compression |
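
A round-trip persistence sketch with joblib; the filename is arbitrary:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, 'model.joblib')      # persist to disk
restored = joblib.load('model.joblib')  # reload later
print(restored.predict(X[:3]))
```
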

Configuration

Model Parameter Configuration

# Logistic Regression parameters
LogisticRegression(
    penalty='l2',           # Regularization type: 'l1', 'l2', 'elasticnet', or None
    C=1.0,                  # Inverse regularization strength (smaller = stronger)
    solver='lbfgs',         # Algorithm: 'lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'
    max_iter=100,           # Maximum iterations
    random_state=42,        # Random seed for reproducibility
    n_jobs=-1               # Use all CPU cores
)

# Random Forest parameters
RandomForestClassifier(
    n_estimators=100,       # Number of trees
    criterion='gini',       # Split quality: 'gini' or 'entropy'
    max_depth=None,         # Maximum tree depth (None = unlimited)
    min_samples_split=2,    # Minimum samples to split node
    min_samples_leaf=1,     # Minimum samples in leaf
    max_features='sqrt',    # Features per split: 'sqrt', 'log2', int, float
    bootstrap=True,         # Bootstrap sampling
    oob_score=False,        # Out-of-bag score estimation
    n_jobs=-1,              # Parallel jobs
    random_state=42,
    class_weight='balanced' # Handle imbalanced classes
)

# Support Vector Machine parameters
SVC(
    C=1.0,                  # Regularization parameter
    kernel='rbf',           # Kernel: 'linear', 'poly', 'rbf', 'sigmoid'
    degree=3,               # Polynomial degree (for 'poly' kernel)
    gamma='scale',          # Kernel coefficient: 'scale', 'auto', or float
    coef0=0.0,              # Independent term in kernel
    probability=False,      # Enable probability estimates (slower)
    cache_size=200,         # Kernel cache size (MB)
    class_weight=None,      # Class weights
    max_iter=-1             # Iteration limit (-1 = no limit)
)

# Gradient Boosting parameters
GradientBoostingClassifier(
    loss='log_loss',        # Loss function
    learning_rate=0.1,      # Shrinks contribution of each tree
    n_estimators=100,       # Number of boosting stages
    subsample=1.0,          # Fraction of samples for fitting
    criterion='friedman_mse', # Split quality measure
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,            # Maximum tree depth
    max_features=None,      # Features per split
    validation_fraction=0.1, # Fraction for early stopping
    n_iter_no_change=None,  # Early stopping rounds
    random_state=42
)

Pipeline Configuration

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Complete pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    ))
])

# Pipeline with column-specific preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'score']),
        ('cat', OneHotEncoder(drop='first'), ['category', 'region'])
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

Cross-Validation Configuration

from sklearn.model_selection import cross_validate

# Multi-metric cross-validation
cv_results = cross_validate(
    estimator=model,
    X=X,
    y=y,
    cv=5,                           # Number of folds
    scoring={
        'accuracy': 'accuracy',
        'precision': 'precision_weighted',
        'recall': 'recall_weighted',
        'f1': 'f1_weighted',
        'roc_auc': 'roc_auc_ovr'
    },
    return_train_score=True,        # Include training scores
    return_estimator=True,          # Return fitted estimators
    n_jobs=-1,                      # Parallel processing
    verbose=1                       # Progress messages
)

Grid Search Configuration

from sklearn.model_selection import GridSearchCV

# Comprehensive parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='f1_weighted',          # Scoring metric
    cv=5,                           # Cross-validation folds
    n_jobs=-1,                      # Use all cores
    verbose=2,                      # Verbosity level
    refit=True,                     # Refit best model on full data
    return_train_score=True,        # Return training scores
    error_score='raise'             # How to handle errors
)

Common Use Cases

Use Case 1: Binary Classification with Imbalanced Data

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import numpy as np

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                          n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with class balancing
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)
model.fit(X_train_scaled, y_train)

# Evaluate
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]

print(classification_report(y_test, predictions))
print(f"ROC-AUC Score: {roc_auc_score(y_test, probabilities):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))

Use Case 2: Multi-Class Text Classification Pipeline

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Sample text data
texts = [
    "Python is great for machine learning",
    "Java is used for enterprise applications",
    "JavaScript powers web development",
    # ... more texts
]
labels = ['tech', 'tech', 'tech']  # ... categories

# Build the text classification pipeline
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        stop_words='english'
    )),
    ('clf', LogisticRegression(max_iter=1000))
])
```