Scikit-learn Cheatsheet

Installation

| Platform | Command |
| --- | --- |
| pip (all platforms) | `pip install scikit-learn` |
| conda | `conda install scikit-learn` |
| Ubuntu/Debian | `sudo apt-get install python3-sklearn` |
| macOS (Homebrew) | `brew install python && pip3 install scikit-learn` |
| Specific version | `pip install scikit-learn==1.3.0` |
| With dependencies | `pip install scikit-learn numpy scipy matplotlib pandas` |
| Virtual environment | `python -m venv env && source env/bin/activate && pip install scikit-learn` |
| Verify installation | `python -c "import sklearn; print(sklearn.__version__)"` |

Basic Commands - Data Loading & Splitting

| Command | Description |
| --- | --- |
| `from sklearn import datasets` | Import the datasets module |
| `iris = datasets.load_iris()` | Load the iris dataset |
| `X, y = iris.data, iris.target` | Extract features and labels |
| `datasets.load_digits()` | Load the handwritten digits dataset |
| `datasets.load_wine()` | Load the wine classification dataset |
| `datasets.load_breast_cancer()` | Load the breast cancer dataset |
| `from sklearn.model_selection import train_test_split` | Import the train/test split function |
| `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)` | Split data 80/20 into train/test |
| `train_test_split(X, y, test_size=0.3, random_state=42)` | Split with a fixed random seed |
| `train_test_split(X, y, stratify=y)` | Stratified split (preserves class distribution) |
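A minimal sketch tying these commands together, using the bundled iris dataset:

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load features and labels
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Stratified 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)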

Basic Commands - Preprocessing

| Command | Description |
| --- | --- |
| `from sklearn.preprocessing import StandardScaler` | Import the standardization scaler |
| `scaler = StandardScaler()` | Create a scaler instance |
| `scaler.fit(X_train)` | Compute mean and std from training data |
| `X_scaled = scaler.transform(X_train)` | Apply the scaling transformation |
| `X_scaled = scaler.fit_transform(X_train)` | Fit and transform in one step |
| `from sklearn.preprocessing import MinMaxScaler` | Import the min-max normalizer |
| `normalizer = MinMaxScaler(feature_range=(0, 1))` | Create a normalizer for the 0-1 range |
| `from sklearn.preprocessing import LabelEncoder` | Import the label encoder |
| `encoder = LabelEncoder()` | Create an encoder instance |
| `y_encoded = encoder.fit_transform(y)` | Encode categorical labels to integers |
| `encoder.inverse_transform(y_encoded)` | Decode integers back to the original labels |
| `from sklearn.preprocessing import OneHotEncoder` | Import the one-hot encoder |
| `OneHotEncoder(sparse_output=False).fit_transform(X)` | Create dummy variables from categories (the `sparse` keyword was renamed `sparse_output` in scikit-learn 1.2) |
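A short sketch of the scaling and encoding steps, using made-up toy data:

import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
y = np.array(['cat', 'dog', 'cat'])

# Fit the scaler on training data only, then reuse it for test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # zero mean, unit variance per column

# Map string labels to integers and back
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)               # array([0, 1, 0])
y_restored = encoder.inverse_transform(y_encoded)  # back to 'cat'/'dog'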

Basic Commands - Classification

| Command | Description |
| --- | --- |
| `from sklearn.linear_model import LogisticRegression` | Import logistic regression |
| `model = LogisticRegression()` | Create a logistic regression model |
| `model.fit(X_train, y_train)` | Train the model on training data |
| `predictions = model.predict(X_test)` | Make predictions on test data |
| `probabilities = model.predict_proba(X_test)` | Get class probabilities |
| `from sklearn.tree import DecisionTreeClassifier` | Import the decision tree classifier |
| `tree = DecisionTreeClassifier(max_depth=5)` | Create a depth-limited decision tree |
| `from sklearn.neighbors import KNeighborsClassifier` | Import K-nearest neighbors |
| `knn = KNeighborsClassifier(n_neighbors=5)` | Create a KNN classifier with 5 neighbors |
| `from sklearn.naive_bayes import GaussianNB` | Import Gaussian Naive Bayes |
| `nb = GaussianNB()` | Create a Naive Bayes classifier |
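Putting the classification workflow together (a sketch on iris; any classifier above can be swapped in for LogisticRegression):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_iter raised so the default lbfgs solver converges cleanly
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

predictions = model.predict(X_test)          # hard class labels
probabilities = model.predict_proba(X_test)  # one probability column per class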

Basic Commands - Regression

| Command | Description |
| --- | --- |
| `from sklearn.linear_model import LinearRegression` | Import linear regression |
| `model = LinearRegression()` | Create a linear regression model |
| `model.fit(X_train, y_train)` | Train the regression model |
| `predictions = model.predict(X_test)` | Predict continuous values |
| `model.coef_` | Access the model coefficients |
| `model.intercept_` | Access the model intercept |
| `from sklearn.linear_model import Ridge` | Import Ridge regression (L2 penalty) |
| `ridge = Ridge(alpha=1.0)` | Create Ridge with regularization strength `alpha` |
| `from sklearn.linear_model import Lasso` | Import Lasso regression (L1 penalty) |
| `lasso = Lasso(alpha=0.1)` | Create Lasso with regularization strength `alpha` |
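A regression sketch on synthetic data (make_regression is used here only to have something to fit):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)

model = LinearRegression().fit(X, y)
print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # bias term

# Ridge adds an L2 penalty; larger alpha shrinks coefficients harder
ridge = Ridge(alpha=1.0).fit(X, y)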

Basic Commands - Model Evaluation

| Command | Description |
| --- | --- |
| `from sklearn.metrics import accuracy_score` | Import the accuracy metric |
| `accuracy_score(y_test, predictions)` | Calculate accuracy as a fraction between 0 and 1 |
| `from sklearn.metrics import classification_report` | Import detailed classification metrics |
| `classification_report(y_test, predictions)` | Get precision, recall, and F1-score per class |
| `from sklearn.metrics import confusion_matrix` | Import the confusion matrix |
| `confusion_matrix(y_test, predictions)` | Create a confusion matrix |
| `from sklearn.metrics import mean_squared_error` | Import the MSE metric |
| `mean_squared_error(y_test, predictions)` | Calculate mean squared error |
| `mean_squared_error(y_test, predictions, squared=False)` | Calculate RMSE (`squared=False` is deprecated since 1.4; newer releases provide `root_mean_squared_error`) |
| `from sklearn.metrics import r2_score` | Import the R² metric |
| `r2_score(y_test, predictions)` | Calculate the R² score |
| `model.score(X_test, y_test)` | Default score: accuracy for classifiers, R² for regressors |
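A quick sketch with hand-made predictions to show the metric calls:

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error)

y_test = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]

print(accuracy_score(y_test, predictions))         # 0.8 (a fraction, not a percent)
print(confusion_matrix(y_test, predictions))       # rows: true class, columns: predicted
print(classification_report(y_test, predictions))  # precision/recall/F1 per class

# Regression metrics take continuous values
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))  # 0.25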

Basic Commands - Cross-Validation

| Command | Description |
| --- | --- |
| `from sklearn.model_selection import cross_val_score` | Import cross-validation |
| `cross_val_score(model, X, y, cv=5)` | Perform 5-fold cross-validation |
| `scores = cross_val_score(model, X, y, cv=10)` | 10-fold cross-validation |
| `scores.mean()` | Mean cross-validation score |
| `scores.std()` | Standard deviation of the scores |
| `from sklearn.model_selection import cross_validate` | Import multi-metric CV |
| `cross_validate(model, X, y, cv=5, scoring=['accuracy', 'precision'])` | CV with multiple metrics |
| `from sklearn.model_selection import KFold` | Import the K-fold splitter |
| `kfold = KFold(n_splits=5, shuffle=True, random_state=42)` | Create a K-fold object |
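A cross-validation sketch on iris:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

scores = cross_val_score(model, X, y, cv=5)           # one score per fold
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")

# An explicit splitter gives control over shuffling and the seed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)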

Advanced Usage - Ensemble Methods

| Command | Description |
| --- | --- |
| `from sklearn.ensemble import RandomForestClassifier` | Import random forest |
| `rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)` | Create a random forest with 100 trees |
| `rf.feature_importances_` | Get feature importance scores |
| `from sklearn.ensemble import GradientBoostingClassifier` | Import gradient boosting |
| `gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)` | Create a gradient boosting model |
| `from sklearn.ensemble import AdaBoostClassifier` | Import AdaBoost |
| `ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)` | Create an AdaBoost classifier |
| `from sklearn.ensemble import VotingClassifier` | Import the voting ensemble |
| `VotingClassifier(estimators=[('lr', model1), ('rf', model2)], voting='soft')` | Combine multiple models |
| `from sklearn.ensemble import BaggingClassifier` | Import the bagging ensemble |
| `BaggingClassifier(estimator=tree, n_estimators=10)` | Create a bagging ensemble (`base_estimator` was renamed `estimator` in scikit-learn 1.2) |
| `from sklearn.ensemble import StackingClassifier` | Import the stacking ensemble |
| `StackingClassifier(estimators=[...], final_estimator=LogisticRegression())` | Stack models with a meta-learner |
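A small ensemble sketch on iris (the estimator names 'lr' and 'rf' are arbitrary labels):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.feature_importances_)  # one importance value per feature, summing to 1

# Soft voting averages the predicted class probabilities of the members
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=200)),
                ('rf', RandomForestClassifier(n_estimators=100))],
    voting='soft',
)
voting.fit(X, y)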

Advanced Usage - Support Vector Machines

| Command | Description |
| --- | --- |
| `from sklearn.svm import SVC` | Import the SVM classifier |
| `svm = SVC(kernel='rbf', C=1.0, gamma='scale')` | Create an RBF-kernel SVM |
| `SVC(kernel='linear')` | Linear-kernel SVM |
| `SVC(kernel='poly', degree=3)` | Polynomial-kernel SVM |
| `SVC(probability=True)` | Enable probability estimates |
| `from sklearn.svm import SVR` | Import the SVM regressor |
| `svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)` | Create an SVM regressor |
| `from sklearn.svm import LinearSVC` | Import the linear SVM (faster) |
| `LinearSVC(C=1.0, max_iter=1000)` | Linear SVM for large datasets |
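SVMs are sensitive to feature scale, so scaling usually comes first; a sketch on iris:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# probability=True enables predict_proba at the cost of slower training
svm = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)
svm.fit(X_scaled, y)
print(svm.predict(X_scaled[:2]))
print(svm.predict_proba(X_scaled[:2]))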

Advanced Usage - Clustering

| Command | Description |
| --- | --- |
| `from sklearn.cluster import KMeans` | Import K-means clustering |
| `kmeans = KMeans(n_clusters=3, random_state=42)` | Create K-means with 3 clusters |
| `clusters = kmeans.fit_predict(X)` | Fit and get cluster labels |
| `kmeans.cluster_centers_` | Get the cluster centroids |
| `kmeans.inertia_` | Get the within-cluster sum of squares |
| `from sklearn.cluster import DBSCAN` | Import DBSCAN clustering |
| `DBSCAN(eps=0.5, min_samples=5).fit_predict(X)` | Density-based clustering |
| `from sklearn.cluster import AgglomerativeClustering` | Import hierarchical clustering |
| `AgglomerativeClustering(n_clusters=3, linkage='ward')` | Hierarchical clustering |
| `from sklearn.cluster import MeanShift` | Import mean-shift clustering |
| `MeanShift(bandwidth=2.0).fit_predict(X)` | Mean-shift clustering |
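A clustering sketch on synthetic blobs (make_blobs exists only to supply clusterable data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Explicit n_init avoids version-dependent defaults (it became 'auto' in 1.4)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)   # one cluster label per sample

print(kmeans.cluster_centers_)     # centroid coordinates
print(kmeans.inertia_)             # within-cluster sum of squares (lower is tighter)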

Advanced Usage - Dimensionality Reduction

| Command | Description |
| --- | --- |
| `from sklearn.decomposition import PCA` | Import PCA |
| `pca = PCA(n_components=2)` | Create PCA with 2 components |
| `X_pca = pca.fit_transform(X)` | Transform to principal components |
| `pca.explained_variance_ratio_` | Variance explained by each component |
| `pca.components_` | Principal component vectors |
| `PCA(n_components=0.95)` | Keep components explaining 95% of variance |
| `from sklearn.manifold import TSNE` | Import t-SNE |
| `TSNE(n_components=2, perplexity=30).fit_transform(X)` | t-SNE dimensionality reduction |
| `from sklearn.decomposition import TruncatedSVD` | Import truncated SVD |
| `TruncatedSVD(n_components=100).fit_transform(X)` | SVD for sparse matrices |
| `from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA` | Import LDA (aliased so the next line works) |
| `LDA(n_components=2).fit_transform(X, y)` | Supervised dimensionality reduction |
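A PCA sketch on iris:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)                # shape (150, 2)

print(pca.explained_variance_ratio_)        # variance captured by each component
print(pca.explained_variance_ratio_.sum())  # roughly 0.98 for iris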

Advanced Usage - Pipeline Construction

| Command | Description |
| --- | --- |
| `from sklearn.pipeline import Pipeline` | Import Pipeline |
| `Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])` | Create a processing pipeline |
| `pipeline.fit(X_train, y_train)` | Fit the entire pipeline |
| `pipeline.predict(X_test)` | Predict using the pipeline |
| `from sklearn.pipeline import make_pipeline` | Import the pipeline helper |
| `make_pipeline(StandardScaler(), PCA(10), LogisticRegression())` | Auto-name pipeline steps |
| `pipeline.named_steps['scaler']` | Access a specific pipeline step |
| `from sklearn.compose import ColumnTransformer` | Import the column transformer |
| `ColumnTransformer([('num', StandardScaler(), [0, 1]), ('cat', OneHotEncoder(), [2])])` | Different preprocessing per column |
| `pipeline.get_params()` | Get all pipeline parameters |
| `pipeline.set_params(clf__C=0.1)` | Set nested pipeline parameters |
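A short make_pipeline sketch; note that auto-generated step names are the lowercased class names:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X, y)

print(pipe.named_steps['standardscaler'])  # access a step by its auto name
pipe.set_params(logisticregression__C=0.1) # nested params use '<step>__<param>'

Fuller Pipeline and ColumnTransformer examples appear in the Configuration section below.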

Advanced Usage - Hyperparameter Tuning

| Command | Description |
| --- | --- |
| `from sklearn.model_selection import GridSearchCV` | Import grid search |
| `grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)` | Exhaustive parameter search |
| `grid_search.fit(X_train, y_train)` | Run the grid search |
| `grid_search.best_params_` | Best parameters found |
| `grid_search.best_score_` | Best cross-validation score |
| `grid_search.best_estimator_` | Best model |
| `grid_search.cv_results_` | Detailed CV results |
| `from sklearn.model_selection import RandomizedSearchCV` | Import randomized search |
| `RandomizedSearchCV(model, param_distributions, n_iter=100, cv=5)` | Random parameter sampling |
| `from scipy.stats import randint, uniform` | Import distributions for random search |
| `param_distributions = {'n_estimators': randint(50, 200)}` | Define a parameter distribution |
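A randomized-search sketch on iris (the parameter ranges here are illustrative, not recommendations):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(3, 15),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20, cv=5, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)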

Advanced Usage - Feature Selection

| Command | Description |
| --- | --- |
| `from sklearn.feature_selection import SelectKBest` | Import the K-best selector |
| `SelectKBest(k=10).fit_transform(X, y)` | Select the top 10 features |
| `from sklearn.feature_selection import chi2, f_classif` | Import scoring functions |
| `SelectKBest(score_func=chi2, k=5)` | Chi-squared feature selection |
| `from sklearn.feature_selection import RFE` | Import recursive feature elimination |
| `RFE(estimator=model, n_features_to_select=10).fit(X, y)` | Recursive feature elimination |
| `from sklearn.feature_selection import SelectFromModel` | Import model-based selection |
| `selector = SelectFromModel(RandomForestClassifier()).fit(X, y)` | Select features by importance |
| `selector.get_support()` | Boolean mask of selected features |
| `selector.transform(X)` | Reduce `X` to the selected features |
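A feature-selection sketch on iris (k=2 keeps the two most discriminative of its four features):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())      # boolean mask over the original features

X_selected = selector.transform(X)
print(X_selected.shape)            # (150, 2)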

Advanced Usage - Model Persistence

| Command | Description |
| --- | --- |
| `import joblib` | Import joblib for model saving |
| `joblib.dump(model, 'model.pkl')` | Save the model to a file |
| `model = joblib.load('model.pkl')` | Load the model from a file |
| `import pickle` | Import the pickle module |
| `pickle.dump(model, open('model.pkl', 'wb'))` | Save with pickle |
| `model = pickle.load(open('model.pkl', 'rb'))` | Load with pickle |
| `joblib.dump(model, 'model.pkl', compress=3)` | Save with compression |
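A save/load round trip with joblib (the filename 'model.pkl' is arbitrary). Note that unpickling runs arbitrary code, so only load files you trust:

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

joblib.dump(model, 'model.pkl', compress=3)  # write the fitted model to disk
restored = joblib.load('model.pkl')          # e.g. in a separate serving process

assert (restored.predict(X) == model.predict(X)).all()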

Configuration

Model Parameters Configuration

# Logistic Regression parameters
LogisticRegression(
    penalty='l2',           # Regularization: 'l1', 'l2', 'elasticnet', or None (the string 'none' was removed in 1.4)
    C=1.0,                  # Inverse regularization strength (smaller = stronger)
    solver='lbfgs',         # Algorithm: 'lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'
    max_iter=100,           # Maximum iterations
    random_state=42,        # Random seed for reproducibility
    n_jobs=-1               # Use all CPU cores
)

# Random Forest parameters
RandomForestClassifier(
    n_estimators=100,       # Number of trees
    criterion='gini',       # Split quality: 'gini' or 'entropy'
    max_depth=None,         # Maximum tree depth (None = unlimited)
    min_samples_split=2,    # Minimum samples to split node
    min_samples_leaf=1,     # Minimum samples in leaf
    max_features='sqrt',    # Features per split: 'sqrt', 'log2', int, float
    bootstrap=True,         # Bootstrap sampling
    oob_score=False,        # Out-of-bag score estimation
    n_jobs=-1,              # Parallel jobs
    random_state=42,
    class_weight='balanced' # Handle imbalanced classes
)

# Support Vector Machine parameters
SVC(
    C=1.0,                  # Regularization parameter
    kernel='rbf',           # Kernel: 'linear', 'poly', 'rbf', 'sigmoid'
    degree=3,               # Polynomial degree (for 'poly' kernel)
    gamma='scale',          # Kernel coefficient: 'scale', 'auto', or float
    coef0=0.0,              # Independent term in kernel
    probability=False,      # Enable probability estimates (slower)
    cache_size=200,         # Kernel cache size (MB)
    class_weight=None,      # Class weights
    max_iter=-1             # Iteration limit (-1 = no limit)
)

# Gradient Boosting parameters
GradientBoostingClassifier(
    loss='log_loss',        # Loss function
    learning_rate=0.1,      # Shrinks contribution of each tree
    n_estimators=100,       # Number of boosting stages
    subsample=1.0,          # Fraction of samples for fitting
    criterion='friedman_mse', # Split quality measure
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,            # Maximum tree depth
    max_features=None,      # Features per split
    validation_fraction=0.1, # Fraction for early stopping
    n_iter_no_change=None,  # Early stopping rounds
    random_state=42
)

Pipeline Configuration

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Complete pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    ))
])

# Pipeline with column-specific preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income', 'score']),
        ('cat', OneHotEncoder(drop='first'), ['category', 'region'])
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

Cross-Validation Configuration

from sklearn.model_selection import cross_validate

# Multi-metric cross-validation
cv_results = cross_validate(
    estimator=model,
    X=X,
    y=y,
    cv=5,                           # Number of folds
    scoring={
        'accuracy': 'accuracy',
        'precision': 'precision_weighted',
        'recall': 'recall_weighted',
        'f1': 'f1_weighted',
        'roc_auc': 'roc_auc_ovr'
    },
    return_train_score=True,        # Include training scores
    return_estimator=True,          # Return fitted estimators
    n_jobs=-1,                      # Parallel processing
    verbose=1                       # Progress messages
)

Grid Search Configuration

from sklearn.model_selection import GridSearchCV

# Comprehensive parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='f1_weighted',          # Scoring metric
    cv=5,                           # Cross-validation folds
    n_jobs=-1,                      # Use all cores
    verbose=2,                      # Verbosity level
    refit=True,                     # Refit best model on full data
    return_train_score=True,        # Return training scores
    error_score='raise'             # How to handle errors
)

Common Use Cases

Use Case 1: Binary Classification with Imbalanced Data

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import numpy as np

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                          n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with class balancing
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)
model.fit(X_train_scaled, y_train)

# Evaluate
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]

print(classification_report(y_test, predictions))
print(f"ROC-AUC Score: {roc_auc_score(y_test, probabilities):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))

Use Case 2: Multi-Class Text Classification Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Sample text data
texts = [
    "Python is great for machine learning",
    "Java is used for enterprise applications",
    "JavaScript powers web development",
    # ... more texts
]
labels = ['ml', 'enterprise', 'web']  # One category per text; identical labels would leave a single class

# Create text classification pipeline
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        stop_words='english'
    )),
    ('clf', LogisticRegression(max_iter=1000))
])

# With a real corpus (several texts per category), evaluate the pipeline:
# scores = cross_val_score(text_clf, texts, labels, cv=5)
# print(f"Mean CV accuracy: {scores.mean():.3f}")