Scikit-learn Cheatsheet
Installation
| Platform | Command |
|---|---|
| pip (all platforms) | pip install scikit-learn |
| conda | conda install scikit-learn |
| Ubuntu/Debian | sudo apt-get install python3-sklearn |
| macOS (Homebrew) | brew install python && pip3 install scikit-learn |
| Specific version | pip install scikit-learn==1.3.0 |
| With dependencies | pip install scikit-learn numpy scipy matplotlib pandas |
| Virtual environment | python -m venv env && source env/bin/activate && pip install scikit-learn |
| Verify installation | python -c "import sklearn; print(sklearn.__version__)" |
Basic Commands - Data Loading & Splitting
| Command | Description |
|---|---|
| from sklearn import datasets | Import datasets module |
| iris = datasets.load_iris() | Load iris dataset |
| X, y = iris.data, iris.target | Extract features and labels |
| datasets.load_digits() | Load handwritten digits dataset |
| datasets.load_wine() | Load wine classification dataset |
| datasets.load_breast_cancer() | Load breast cancer dataset |
| from sklearn.model_selection import train_test_split | Import train/test split function |
| X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) | Split data 80/20 train/test |
| train_test_split(X, y, test_size=0.3, random_state=42) | Split with fixed random seed |
| train_test_split(X, y, stratify=y) | Stratified split (preserves class distribution) |
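A minimal example combining the calls above (iris data, stratified 80/20 split):
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X, y = iris.data, iris.target                     # features (150 x 4) and labels (3 classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)                # (120, 4) (30, 4)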
Basic Commands - Preprocessing
| Command | Description |
|---|---|
| from sklearn.preprocessing import StandardScaler | Import standardization scaler |
| scaler = StandardScaler() | Create scaler instance |
| scaler.fit(X_train) | Compute mean and std from training data |
| X_scaled = scaler.transform(X_train) | Apply scaling transformation |
| X_scaled = scaler.fit_transform(X_train) | Fit and transform in one step |
| from sklearn.preprocessing import MinMaxScaler | Import min-max normalizer |
| normalizer = MinMaxScaler(feature_range=(0, 1)) | Create normalizer for 0-1 range |
| from sklearn.preprocessing import LabelEncoder | Import label encoder |
| encoder = LabelEncoder() | Create encoder instance |
| y_encoded = encoder.fit_transform(y) | Encode categorical labels to integers |
| encoder.inverse_transform(y_encoded) | Decode integers back to original labels |
| from sklearn.preprocessing import OneHotEncoder | Import one-hot encoder |
| OneHotEncoder(sparse_output=False).fit_transform(X) | Create dummy variables from categories (the parameter is sparse in scikit-learn < 1.2) |
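A short sketch of the fit-on-train, transform-on-test pattern, assuming X_train and X_test from the split above; the LabelEncoder lines are standalone:
from sklearn.preprocessing import StandardScaler, LabelEncoder
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)    # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)          # reuse the same statistics on the test data
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(['cat', 'dog', 'dog', 'cat'])   # -> [0, 1, 1, 0]
print(encoder.inverse_transform(y_encoded))       # -> ['cat' 'dog' 'dog' 'cat']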
Basic Commands - Classification
| Command | Description |
|---|---|
| from sklearn.linear_model import LogisticRegression | Import logistic regression |
| model = LogisticRegression() | Create logistic regression model |
| model.fit(X_train, y_train) | Train model on training data |
| predictions = model.predict(X_test) | Make predictions on test data |
| probabilities = model.predict_proba(X_test) | Get prediction probabilities |
| from sklearn.tree import DecisionTreeClassifier | Import decision tree classifier |
| tree = DecisionTreeClassifier(max_depth=5) | Create decision tree with max depth |
| from sklearn.neighbors import KNeighborsClassifier | Import K-nearest neighbors |
| knn = KNeighborsClassifier(n_neighbors=5) | Create KNN with 5 neighbors |
| from sklearn.naive_bayes import GaussianNB | Import Gaussian Naive Bayes |
| nb = GaussianNB() | Create Naive Bayes classifier |
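A minimal classifier fit, assuming the scaled iris split from the preprocessing example above:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)         # generous max_iter avoids convergence warnings
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)        # hard class labels
probabilities = model.predict_proba(X_test_scaled)   # one probability column per class
print(predictions[:5])
print(probabilities[:1])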
Basic Commands - Regression
| Command | Description |
|---|---|
| from sklearn.linear_model import LinearRegression | Import linear regression |
| model = LinearRegression() | Create linear regression model |
| model.fit(X_train, y_train) | Train regression model |
| predictions = model.predict(X_test) | Predict continuous values |
| model.coef_ | Access model coefficients |
| model.intercept_ | Access model intercept |
| from sklearn.linear_model import Ridge | Import Ridge regression (L2) |
| ridge = Ridge(alpha=1.0) | Create Ridge with regularization strength |
| from sklearn.linear_model import Lasso | Import Lasso regression (L1) |
| lasso = Lasso(alpha=0.1) | Create Lasso with regularization |
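A quick regression sketch; it uses a synthetic dataset so it runs on its own:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)
print(reg.coef_, reg.intercept_)                  # fitted coefficients and intercept
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)          # L2-regularized alternative
print(ridge.score(X_te, y_te))                    # R² on held-out data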
Basic Commands - Model Evaluation
| Command | Description |
|---|---|
| from sklearn.metrics import accuracy_score | Import accuracy metric |
| accuracy_score(y_test, predictions) | Calculate accuracy (fraction of correct predictions) |
| from sklearn.metrics import classification_report | Import detailed classification metrics |
| classification_report(y_test, predictions) | Get precision, recall, F1-score |
| from sklearn.metrics import confusion_matrix | Import confusion matrix |
| confusion_matrix(y_test, predictions) | Create confusion matrix |
| from sklearn.metrics import mean_squared_error | Import MSE metric |
| mean_squared_error(y_test, predictions) | Calculate mean squared error |
| mean_squared_error(y_test, predictions, squared=False) | Calculate RMSE (scikit-learn ≥ 1.4 also provides root_mean_squared_error) |
| from sklearn.metrics import r2_score | Import R² metric |
| r2_score(y_test, predictions) | Calculate R² score |
| model.score(X_test, y_test) | Get default score (accuracy for classifiers, R² for regressors) |
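Putting the classification metrics together, assuming model, X_test_scaled, and y_test from the sections above:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
predictions = model.predict(X_test_scaled)
print(accuracy_score(y_test, predictions))        # fraction of correct predictions
print(confusion_matrix(y_test, predictions))      # rows = true classes, columns = predicted classes
print(classification_report(y_test, predictions)) # per-class precision, recall, F1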
Basic Commands - Cross-Validation
| Command | Description |
|---|---|
| from sklearn.model_selection import cross_val_score | Import cross-validation |
| cross_val_score(model, X, y, cv=5) | Perform 5-fold cross-validation |
| scores = cross_val_score(model, X, y, cv=10) | 10-fold cross-validation |
| scores.mean() | Get mean cross-validation score |
| scores.std() | Get standard deviation of scores |
| from sklearn.model_selection import cross_validate | Import multi-metric CV |
| cross_validate(model, X, y, cv=5, scoring=['accuracy', 'precision']) | CV with multiple metrics |
| from sklearn.model_selection import KFold | Import K-fold splitter |
| kfold = KFold(n_splits=5, shuffle=True, random_state=42) | Create K-fold object |
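A brief cross-validation example on the full iris data (X, y) with a shuffled K-fold split:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores)                                     # one accuracy value per fold
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")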
Advanced Usage - Ensemble Methods
| Command | Description |
|---|---|
| from sklearn.ensemble import RandomForestClassifier | Import random forest |
| rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1) | Create random forest with 100 trees |
| rf.feature_importances_ | Get feature importance scores |
| from sklearn.ensemble import GradientBoostingClassifier | Import gradient boosting |
| gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1) | Create gradient boosting model |
| from sklearn.ensemble import AdaBoostClassifier | Import AdaBoost |
| ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0) | Create AdaBoost classifier |
| from sklearn.ensemble import VotingClassifier | Import voting ensemble |
| VotingClassifier(estimators=[('lr', model1), ('rf', model2)], voting='soft') | Combine multiple models |
| from sklearn.ensemble import BaggingClassifier | Import bagging ensemble |
| BaggingClassifier(estimator=tree, n_estimators=10) | Create bagging ensemble (the parameter is base_estimator in scikit-learn < 1.2) |
| from sklearn.ensemble import StackingClassifier | Import stacking ensemble |
| StackingClassifier(estimators=[...], final_estimator=LogisticRegression()) | Stack models with meta-learner |
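A random forest sketch on the iris split from above; note that feature_importances_ is only available after fitting:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)                          # tree ensembles do not require feature scaling
print(rf.score(X_test, y_test))                   # accuracy on held-out data
print(rf.feature_importances_)                    # one importance per feature, sums to 1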
Advanced Usage - Support Vector Machines
| Command | Description |
|---|---|
| from sklearn.svm import SVC | Import SVM classifier |
| svm = SVC(kernel='rbf', C=1.0, gamma='scale') | Create RBF kernel SVM |
| SVC(kernel='linear') | Linear kernel SVM |
| SVC(kernel='poly', degree=3) | Polynomial kernel SVM |
| SVC(probability=True) | Enable probability estimates |
| from sklearn.svm import SVR | Import SVM regressor |
| svr = SVR(kernel='rbf', C=1.0, epsilon=0.1) | Create SVM regressor |
| from sklearn.svm import LinearSVC | Import linear SVM (faster) |
| LinearSVC(C=1.0, max_iter=1000) | Linear SVM for large datasets |
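SVMs are sensitive to feature scale, so a common pattern is to wrap them in a pipeline with a scaler (again assuming the iris split from above):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))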
Advanced Usage - Clustering
| Command | Description |
|---|---|
| from sklearn.cluster import KMeans | Import K-means clustering |
| kmeans = KMeans(n_clusters=3, random_state=42) | Create K-means with 3 clusters |
| clusters = kmeans.fit_predict(X) | Fit and get cluster labels |
| kmeans.cluster_centers_ | Get cluster centroids |
| kmeans.inertia_ | Get within-cluster sum of squares |
| from sklearn.cluster import DBSCAN | Import DBSCAN clustering |
| DBSCAN(eps=0.5, min_samples=5).fit_predict(X) | Density-based clustering |
| from sklearn.cluster import AgglomerativeClustering | Import hierarchical clustering |
| AgglomerativeClustering(n_clusters=3, linkage='ward') | Hierarchical clustering |
| from sklearn.cluster import MeanShift | Import mean shift clustering |
| MeanShift(bandwidth=2.0).fit_predict(X) | Mean shift clustering |
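A K-means example on the unlabeled iris features; n_init is set explicitly here because its default changed in recent scikit-learn releases:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)                  # one cluster label per sample
print(kmeans.cluster_centers_.shape)              # (3, n_features)
print(kmeans.inertia_)                            # within-cluster sum of squares (lower = tighter clusters)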
Advanced Usage - Dimensionality Reduction
| Command | Description |
|---|---|
| from sklearn.decomposition import PCA | Import PCA |
| pca = PCA(n_components=2) | Create PCA with 2 components |
| X_pca = pca.fit_transform(X) | Transform to principal components |
| pca.explained_variance_ratio_ | Get variance explained by each component |
| pca.components_ | Get principal component vectors |
| PCA(n_components=0.95) | Keep components explaining 95% variance |
| from sklearn.manifold import TSNE | Import t-SNE |
| TSNE(n_components=2, perplexity=30).fit_transform(X) | t-SNE dimensionality reduction |
| from sklearn.decomposition import TruncatedSVD | Import truncated SVD |
| TruncatedSVD(n_components=100).fit_transform(X) | SVD for sparse matrices |
| from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA | Import LDA |
| LDA(n_components=2).fit_transform(X, y) | Supervised dimensionality reduction |
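A PCA sketch on the iris features; standardizing first is usually recommended so every feature contributes on the same scale:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)                  # shape (n_samples, 2)
print(pca.explained_variance_ratio_)              # variance explained by each component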
Advanced Usage - Pipeline Construction
| Command | Description |
|---|---|
| from sklearn.pipeline import Pipeline | Import pipeline |
| Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) | Create processing pipeline |
| pipeline.fit(X_train, y_train) | Fit entire pipeline |
| pipeline.predict(X_test) | Predict using pipeline |
| from sklearn.pipeline import make_pipeline | Import pipeline maker |
| make_pipeline(StandardScaler(), PCA(10), LogisticRegression()) | Auto-name pipeline steps |
| pipeline.named_steps['scaler'] | Access specific pipeline step |
| from sklearn.compose import ColumnTransformer | Import column transformer |
| ColumnTransformer([('num', StandardScaler(), [0,1]), ('cat', OneHotEncoder(), [2])]) | Different preprocessing per column |
| pipeline.get_params() | Get all pipeline parameters |
| pipeline.set_params(clf__C=0.1) | Set nested pipeline parameters |
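The step__parameter naming used by set_params is the same double-underscore convention GridSearchCV expects; a small sketch on the iris split from above:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipeline.set_params(clf__C=0.1)                   # <step name>__<parameter name>
pipeline.fit(X_train, y_train)
print(pipeline.named_steps['scaler'].mean_)       # inspect a fitted step
print(pipeline.score(X_test, y_test))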
Advanced Usage - Hyperparameter Tuning
| Command | Description |
|---|---|
| from sklearn.model_selection import GridSearchCV | Import grid search |
| GridSearchCV(model, param_grid, cv=5, n_jobs=-1) | Exhaustive parameter search |
| grid_search.fit(X_train, y_train) | Run grid search |
| grid_search.best_params_ | Get best parameters found |
| grid_search.best_score_ | Get best cross-validation score |
| grid_search.best_estimator_ | Get best model |
| grid_search.cv_results_ | Get detailed CV results |
| from sklearn.model_selection import RandomizedSearchCV | Import randomized search |
| RandomizedSearchCV(model, param_distributions, n_iter=100, cv=5) | Random parameter sampling |
| from scipy.stats import randint, uniform | Import distributions for random search |
| param_distributions = {'n_estimators': randint(50, 200)} | Define parameter distribution |
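A compact grid search over a random forest on the iris split from above; the grid is kept deliberately small so the search stays fast:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, None]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)                   # best combination found
print(grid_search.best_score_)                    # its mean cross-validation score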
Advanced Usage - Feature Selection
| Command | Description |
|---|---|
| from sklearn.feature_selection import SelectKBest | Import K-best selector |
| SelectKBest(k=10).fit_transform(X, y) | Select top 10 features |
| from sklearn.feature_selection import chi2, f_classif | Import scoring functions |
| SelectKBest(score_func=chi2, k=5) | Chi-squared feature selection |
| from sklearn.feature_selection import RFE | Import recursive feature elimination |
| RFE(estimator=model, n_features_to_select=10).fit(X, y) | Recursive feature elimination |
| from sklearn.feature_selection import SelectFromModel | Import model-based selection |
| SelectFromModel(RandomForestClassifier()).fit(X, y) | Select features by importance |
| selector.get_support() | Get boolean mask of selected features |
| selector.transform(X) | Transform to selected features only |
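A SelectKBest sketch on the iris data that keeps the selector object around so get_support() can be inspected:
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)         # keep the 2 highest-scoring features
print(selector.get_support())                     # boolean mask over the original columns
print(X_selected.shape)                           # (n_samples, 2)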
Advanced Usage - Model Persistence
| Command | Description |
|---|---|
| import joblib | Import joblib for model saving |
| joblib.dump(model, 'model.pkl') | Save model to file |
| model = joblib.load('model.pkl') | Load model from file |
| import pickle | Import pickle module |
| pickle.dump(model, open('model.pkl', 'wb')) | Save with pickle |
| model = pickle.load(open('model.pkl', 'rb')) | Load with pickle |
| joblib.dump(model, 'model.pkl', compress=3) | Save with compression |
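Saving a fitted pipeline rather than a bare model keeps preprocessing and estimator together; joblib is generally preferred over plain pickle for objects holding large NumPy arrays. A sketch assuming the fitted pipeline from the pipeline example above:
import joblib
joblib.dump(pipeline, 'model.pkl', compress=3)    # persist the fitted pipeline
restored = joblib.load('model.pkl')
print(restored.predict(X_test[:5]))               # the reloaded object predicts directly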
Configuration
Model Parameters Configuration
# Logistic Regression parameters
LogisticRegression(
penalty='l2', # Regularization type: 'l1', 'l2', 'elasticnet', or None (use None instead of 'none' in scikit-learn ≥ 1.2)
C=1.0, # Inverse regularization strength (smaller = stronger)
solver='lbfgs', # Algorithm: 'lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'
max_iter=100, # Maximum iterations
random_state=42, # Random seed for reproducibility
n_jobs=-1 # Use all CPU cores
)
# Random Forest parameters
RandomForestClassifier(
n_estimators=100, # Number of trees
criterion='gini', # Split quality: 'gini' or 'entropy'
max_depth=None, # Maximum tree depth (None = unlimited)
min_samples_split=2, # Minimum samples to split node
min_samples_leaf=1, # Minimum samples in leaf
max_features='sqrt', # Features per split: 'sqrt', 'log2', int, float
bootstrap=True, # Bootstrap sampling
oob_score=False, # Out-of-bag score estimation
n_jobs=-1, # Parallel jobs
random_state=42,
class_weight='balanced' # Handle imbalanced classes
)
# Support Vector Machine parameters
SVC(
C=1.0, # Regularization parameter
kernel='rbf', # Kernel: 'linear', 'poly', 'rbf', 'sigmoid'
degree=3, # Polynomial degree (for 'poly' kernel)
gamma='scale', # Kernel coefficient: 'scale', 'auto', or float
coef0=0.0, # Independent term in kernel
probability=False, # Enable probability estimates (slower)
cache_size=200, # Kernel cache size (MB)
class_weight=None, # Class weights
max_iter=-1 # Iteration limit (-1 = no limit)
)
# Gradient Boosting parameters
GradientBoostingClassifier(
loss='log_loss', # Loss function
learning_rate=0.1, # Shrinks contribution of each tree
n_estimators=100, # Number of boosting stages
subsample=1.0, # Fraction of samples for fitting
criterion='friedman_mse', # Split quality measure
min_samples_split=2,
min_samples_leaf=1,
max_depth=3, # Maximum tree depth
max_features=None, # Features per split
validation_fraction=0.1, # Fraction for early stopping
n_iter_no_change=None, # Early stopping rounds
random_state=42
)
Pipeline Configuration
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
# Complete pipeline with preprocessing
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95)),
('classifier', RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
))
])
# Pipeline with column-specific preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['age', 'income', 'score']),
('cat', OneHotEncoder(drop='first'), ['category', 'region'])
],
remainder='passthrough' # Keep other columns unchanged
)
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
Cross-Validation Configuration
from sklearn.model_selection import cross_validate
# Multi-metric cross-validation
cv_results = cross_validate(
estimator=model,
X=X,
y=y,
cv=5, # Number of folds
scoring={
'accuracy': 'accuracy',
'precision': 'precision_weighted',
'recall': 'recall_weighted',
'f1': 'f1_weighted',
'roc_auc': 'roc_auc_ovr'
},
return_train_score=True, # Include training scores
return_estimator=True, # Return fitted estimators
n_jobs=-1, # Parallel processing
verbose=1 # Progress messages
)
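The returned dictionary uses test_/train_ prefixed keys, one entry per metric, for example:
print(cv_results['test_accuracy'].mean())         # mean validation accuracy across the folds
print(cv_results['train_f1'].mean())              # mean training F1 (available because return_train_score=True)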
Grid Search Configuration
from sklearn.model_selection import GridSearchCV
# Comprehensive parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2'],
'bootstrap': [True, False]
}
grid_search = GridSearchCV(
estimator=RandomForestClassifier(),
param_grid=param_grid,
scoring='f1_weighted', # Scoring metric
cv=5, # Cross-validation folds
n_jobs=-1, # Use all cores
verbose=2, # Verbosity level
refit=True, # Refit best model on full data
return_train_score=True, # Return training scores
error_score='raise' # How to handle errors
)
Common Use Cases
Use Case 1: Binary Classification with Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import numpy as np
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, weights=[0.9, 0.1], random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model with class balancing
model = RandomForestClassifier(
n_estimators=100,
class_weight='balanced', # Handle imbalance
random_state=42
)
model.fit(X_train_scaled, y_train)
# Evaluate
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, predictions))
print(f"ROC-AUC Score: {roc_auc_score(y_test, probabilities):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))
Use Case 2: Multi-Class Text Classification Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Sample text data
texts = [
"Python is great for machine learning",
"Java is used for enterprise applications",
"JavaScript powers web development",
# ... more texts
]
labels = ['tech', 'tech', 'tech'] # Categories
# Create text classification pipeline
text_clf = Pipeline([
('tfidf', TfidfVectorizer(
max_features=5000,
ngram_range=(1, 2),
stop_words='english'