Statsmodels Cheatsheet

Installation

Platform	Command
Ubuntu/Debian	INLINE_CODE_8 or INLINE_CODE_9
macOS	INLINE_CODE_10 or INLINE_CODE_11
Windows	INLINE_CODE_12 or INLINE_CODE_13
With all dependencies	INLINE_CODE_14
With plotting support	INLINE_CODE_15
From source (latest)	INLINE_CODE_16
Docker	INLINE_CODE_17

**Requirements**: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2

Basic Commands

Command	Description
INLINE_CODE_18	Import main statsmodels API
INLINE_CODE_19	Import formula API (R-style syntax)
INLINE_CODE_20	Add intercept column to feature matrix
INLINE_CODE_21	Fit Ordinary Least Squares regression model
INLINE_CODE_22	Fit OLS using formula notation
INLINE_CODE_23	Display comprehensive model summary
INLINE_CODE_24	Get model coefficients/parameters
INLINE_CODE_25	Get p-values for coefficients
INLINE_CODE_26	Get R-squared value
INLINE_CODE_27	Get Akaike Information Criterion
INLINE_CODE_28	Get Bayesian Information Criterion
INLINE_CODE_29	Get model residuals
INLINE_CODE_30	Get fitted/predicted values
INLINE_CODE_31	Make predictions on new data
INLINE_CODE_32	Get confidence intervals for parameters
INLINE_CODE_33	Fit logistic regression model
INLINE_CODE_34	Load example dataset
INLINE_CODE_35	Get predictions with confidence intervals

Advanced Usage - Regression Models

Command	Description
INLINE_CODE_36	Weighted Least Squares regression
INLINE_CODE_37	Generalized Least Squares regression
INLINE_CODE_38	Robust Linear Model (resistant to outliers)
INLINE_CODE_39	Quantile regression (median regression)
INLINE_CODE_40	Mixed Linear Model (random effects)
INLINE_CODE_41	Panel data fixed effects model
INLINE_CODE_42	Poisson regression (GLM)
INLINE_CODE_43	Gamma regression (GLM)
INLINE_CODE_44	Negative binomial regression
INLINE_CODE_45	Multinomial logistic regression
INLINE_CODE_46	Probit regression model

Advanced Usage - Time Series Analysis

Command	Description
INLINE_CODE_47	Fit ARIMA model with specified order
INLINE_CODE_48	Seasonal ARIMA with exogenous variables
INLINE_CODE_49	Augmented Dickey-Fuller stationarity test
INLINE_CODE_50	Calculate autocorrelation function
INLINE_CODE_51	Calculate partial autocorrelation function
INLINE_CODE_52	Plot autocorrelation function
INLINE_CODE_53	Plot partial autocorrelation function
INLINE_CODE_54	Fit Vector Autoregression model
INLINE_CODE_55	Plot impulse response functions
INLINE_CODE_56	Forecast error variance decomposition
INLINE_CODE_57	Structural time series model
INLINE_CODE_58	Exponential smoothing (Holt-Winters)
INLINE_CODE_59	Generate forecasts for future periods
INLINE_CODE_60	Plot diagnostic plots for time series model

Advanced Usage - Statistical Tests

Command	Description
INLINE_CODE_61	Jarque-Bera normality test
INLINE_CODE_62	Durbin-Watson autocorrelation test
INLINE_CODE_63	Breusch-Pagan heteroscedasticity test
INLINE_CODE_64	Ljung-Box autocorrelation test
INLINE_CODE_65	Omnibus normality test
INLINE_CODE_66	Tukey's HSD multiple comparison test
INLINE_CODE_67	ANOVA for model comparison
INLINE_CODE_68	Granger causality test (VAR models)
INLINE_CODE_69	Z-test for proportions
INLINE_CODE_70	Independent samples t-test

Advanced Usage - Survival and Nonparametric

Command	Description
INLINE_CODE_71	Kaplan-Meier survival curve
INLINE_CODE_72	Cox Proportional Hazards model
INLINE_CODE_73	Kernel density estimation
INLINE_CODE_74	Kernel regression (nonparametric)
INLINE_CODE_75	LOWESS smoothing
INLINE_CODE_76	Principal Component Analysis
INLINE_CODE_77	Factor Analysis

Configuration

Model Formula Syntax (R-style)

# Basic formula syntax
'y ~ x1 + x2'                    # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                    # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'                # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms
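
To see how these formula strings turn into model terms, the sketch below fits two of them with `smf.ols` on synthetic data. The DataFrame, its column names (`x1`, `x2`, `group`, `y`), and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x1": rng.uniform(1, 5, size=n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
df["y"] = 0.5 + 1.5 * np.log(df["x1"]) + 0.8 * df["x2"] + rng.normal(scale=0.3, size=n)

# The formula API adds the intercept automatically;
# C(group) is dummy-coded with the first level ("a") as the reference
results = smf.ols("y ~ np.log(x1) + x2 + C(group)", data=df).fit()
print(results.params)

# 'x1 * x2' expands to x1 + x2 + x1:x2
interaction = smf.ols("y ~ x1 * x2", data=df).fit()
print(list(interaction.params.index))
```

Functions such as `np.log` are looked up in the calling scope, so `numpy` has to be imported under the same name that the formula uses.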

Model Fitting Options

# Common fitting parameters (likelihood-based models such as Logit or Poisson)
results = model.fit(
    method='lbfgs',              # Optimization method
    maxiter=1000,                # Maximum iterations
    disp=True,                   # Display convergence messages
    cov_type='HC3'               # Robust covariance type
)

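# Time series specific (state space models such as SARIMAX)
results = model.fit(
    start_params=None,           # Initial parameter values
    method='lbfgs',              # Optimization method
    maxiter=500,                 # Maximum iterations
    full_output=True,            # Return additional information
    disp=False                   # Suppress convergence messages
)
# Note: the trend is set on the model itself, e.g. SARIMAX(..., trend='c'),
# not in fit(); method='css-mle' and solver= belonged to the older,
# since-removed ARIMA class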

Display Options

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,                  # Significance level
    title='Model Results',       # Custom title
    xname=['Var1', 'Var2']      # Custom variable names
)

Common Use Cases

Use Case 1: Linear Regression Analysis

import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

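# Make predictions (X_new needs the same columns as X, including the constant;
# the first five rows stand in for new data here)
X_new = X.iloc[:5]
predictions = results.predict(X_new)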

Use Case 2: Time Series Forecasting with ARIMA

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()

Use Case 3: Logistic Regression for Classification

import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios
conf = results.conf_int()  # columns 0 and 1 hold the lower and upper bounds
odds_ratios = pd.DataFrame({
    'OR': np.exp(results.params),
    'CI_lower': np.exp(conf[0]),
    'CI_upper': np.exp(conf[1])
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")

Use Case 4: Panel Data Analysis

import pandas as pd
# PanelOLS is provided by the separate linearmodels package
# (pip install linearmodels), not by statsmodels
from linearmodels.panel import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)

# Extract fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())

Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series

import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)

Best Practices

**Always add a constant term**: Use `sm.add_constant(X)` with the array-based API so that regression models include an intercept
**Check model assumptions**: Validate residuals for normality, homoscedasticity, and autocorrelation with diagnostic tests
**Use the formula API for readability**: Prefer `smf.ols('y ~ x1 + x2', data=df)` over the array-based API for clearer, more maintainable code
**Test for stationarity in time series**: Always run `adfuller()` before fitting ARIMA models; difference the data if it is not stationary
**Examine ACF/PACF plots**: Use `plot_acf()` and `plot_pacf()` to identify appropriate ARIMA order parameters
**Compare models with information criteria**: Use AIC/BIC for model selection; lower values indicate a better balance of fit and parsimony
**Validate out-of-sample**: Split data into train/test sets and evaluate prediction accuracy on holdout data
**Handle multicollinearity**: Check the VIF (Variance Inflation Factor) for highly correlated predictors in regression models
**Use robust standard errors**: Pass `cov_type='HC3'` to `.fit()` for heteroscedasticity-robust inference
**Document model specifications**: Keep clear records of model orders, transformations, and selection criteria for reproducibility
**Visualize diagnostics**: Always run `results.plot_diagnostics()` for time series models to check for residual patterns

Troubleshooting

Issue	Solution
LinAlgError: Singular matrix	Check for perfect multicollinearity; remove redundant variables or use INLINE_CODE_86 only once
Convergence not achieved	Increase INLINE_CODE_87 parameter, try different optimization method (INLINE_CODE_88), or scale/standardize features
Perfect separation in logistic regression	Use penalized regression (INLINE_CODE_89), remove problematic predictors, or collect more diverse data
ARIMA model won't fit	Verify data is stationary with INLINE_CODE_90, try different order parameters, or check for missing values
ValueError: endog and exog matrices are different sizes	Ensure X and y have same number of observations; check for missing values and align indices
Non-stationary time series warnings	Difference the series (INLINE_CODE_91), set INLINE_CODE_92, or transform data (log, Box-Cox)
Memory error with large datasets	Use chunking, reduce lag order in VAR/ARIMA, or consider INLINE_CODE_93 for state space models
Negative R-squared values	Model is worse than mean baseline; check model specification, add relevant features, or try different model type
Heteroscedasticity detected	Use WLS with appropriate weights, apply robust standard errors (INLINE_CODE_94), or transform dependent variable
High VIF values (>10)	Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression
Residuals show patterns	Add polynomial terms, interaction effects, or use nonparametric methods like kernel regression
ImportError for optional dependencies	Install missing packages: `pip install matplotlib scipy patsy` or use `pip install statsmodels[all]`