Statsmodels Cheatsheet
Installation
| Platform | Command |
|---|---|
| Ubuntu/Debian | `pip install statsmodels` or `sudo apt-get install python3-statsmodels` |
| macOS | `pip install statsmodels` or `conda install -c conda-forge statsmodels` |
| Windows | `pip install statsmodels` or `conda install statsmodels` |
| With all dependencies | `pip install statsmodels[all]` |
| With plotting support | `pip install statsmodels matplotlib` |
| From source (latest) | `pip install git+https://github.com/statsmodels/statsmodels` |
| Docker | `docker run -it jupyter/scipy-notebook` (image ships with statsmodels) |
**Requirements**: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2
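To verify the installation, a quick import check (not part of the original table):

```python
# Confirm that statsmodels is importable and report its version
import statsmodels
print(statsmodels.__version__)
```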
Basic Commands
| Command | Description |
|---|---|
| `import statsmodels.api as sm` | Import main statsmodels API |
| `import statsmodels.formula.api as smf` | Import formula API (R-style syntax) |
| `sm.add_constant(X)` | Add intercept column to feature matrix |
| `sm.OLS(y, X).fit()` | Fit Ordinary Least Squares regression model |
| `smf.ols('y ~ x1 + x2', data=df).fit()` | Fit OLS using formula notation |
| `results.summary()` | Display comprehensive model summary |
| `results.params` | Get model coefficients/parameters |
| `results.pvalues` | Get p-values for coefficients |
| `results.rsquared` | Get R-squared value |
| `results.aic` | Get Akaike Information Criterion |
| `results.bic` | Get Bayesian Information Criterion |
| `results.resid` | Get model residuals |
| `results.fittedvalues` | Get fitted/predicted values |
| `results.predict(X_new)` | Make predictions on new data |
| `results.conf_int()` | Get confidence intervals for parameters |
| `sm.Logit(y, X).fit()` | Fit logistic regression model |
| `sm.datasets.get_rdataset('mtcars').data` | Load example dataset |
| `results.get_prediction(X_new).summary_frame()` | Get predictions with confidence intervals |
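A minimal end-to-end sketch tying these basic commands together; the random data and variable names are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data: y depends linearly on x1 and x2 plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
df['y'] = 2.0 + 1.5 * df['x1'] - 0.5 * df['x2'] + rng.normal(size=100)

X = sm.add_constant(df[['x1', 'x2']])  # add intercept column
results = sm.OLS(df['y'], X).fit()     # fit OLS

print(results.params)      # coefficients
print(results.pvalues)     # p-values
print(results.rsquared)    # R-squared
print(results.conf_int())  # 95% confidence intervals

# Predictions with confidence intervals on the training design matrix
pred = results.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())
```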
Advanced Usage - Regression Models
| Command | Description |
|---|---|
| `sm.WLS(y, X, weights=w).fit()` | Weighted Least Squares regression |
| `sm.GLS(y, X, sigma=sigma).fit()` | Generalized Least Squares regression |
| `sm.RLM(y, X).fit()` | Robust Linear Model (resistant to outliers) |
| `sm.QuantReg(y, X).fit(q=0.5)` | Quantile regression (median regression) |
| `sm.MixedLM(y, X, groups=groups).fit()` | Mixed Linear Model (random effects) |
| `PanelOLS(y, X, entity_effects=True).fit()` (from the companion `linearmodels` package) | Panel data fixed effects model |
| `sm.GLM(y, X, family=sm.families.Poisson()).fit()` | Poisson regression (GLM) |
| `sm.GLM(y, X, family=sm.families.Gamma()).fit()` | Gamma regression (GLM) |
| `sm.NegativeBinomial(y, X).fit()` | Negative binomial regression |
| `sm.MNLogit(y, X).fit()` | Multinomial logistic regression |
| `sm.Probit(y, X).fit()` | Probit regression model |
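As an illustration of the GLM rows above, a minimal Poisson regression on synthetic count data, using the default log link (the data-generating coefficients are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 predictors
mu = np.exp(X @ np.array([0.3, 0.8, -0.4]))     # true mean under the log link
y = rng.poisson(mu)                              # Poisson counts

# Poisson GLM with the canonical log link (the default for Poisson)
results = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(results.summary())
```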
Advanced Usage - Time Series Analysis
| Command | Description |
|---|---|
| `ARIMA(y, order=(p, d, q)).fit()` | Fit ARIMA model with specified order |
| `SARIMAX(y, exog=X, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()` | Seasonal ARIMA with exogenous variables |
| `adfuller(y)` | Augmented Dickey-Fuller stationarity test |
| `acf(y, nlags=40)` | Calculate autocorrelation function |
| `pacf(y, nlags=40)` | Calculate partial autocorrelation function |
| `plot_acf(y)` | Plot autocorrelation function |
| `plot_pacf(y)` | Plot partial autocorrelation function |
| `VAR(df).fit(maxlags=5)` | Fit Vector Autoregression model |
| `results.irf(10).plot()` | Plot impulse response functions |
| `results.fevd(10)` | Forecast error variance decomposition |
| `UnobservedComponents(y, level='local level').fit()` | Structural time series model |
| `ExponentialSmoothing(y, trend='add', seasonal='add').fit()` | Exponential smoothing (Holt-Winters) |
| `results.forecast(steps=12)` | Generate forecasts for future periods |
| `results.plot_diagnostics()` | Plot diagnostic plots for time series model |
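A short Holt-Winters sketch matching the `ExponentialSmoothing` entry above, on a synthetic monthly series with trend and yearly seasonality (the series itself is fabricated for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series: linear trend + 12-month seasonal cycle + noise
rng = np.random.default_rng(2)
idx = pd.date_range('2015-01-01', periods=96, freq='MS')
y = pd.Series(
    10 + 0.1 * np.arange(96) + 2 * np.sin(2 * np.pi * np.arange(96) / 12)
    + rng.normal(scale=0.5, size=96),
    index=idx,
)

# Additive Holt-Winters with a 12-period season
results = ExponentialSmoothing(y, trend='add', seasonal='add',
                               seasonal_periods=12).fit()
print(results.forecast(12))  # forecasts for the next 12 months
```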
Advanced Usage - Statistical Tests
| Command | Description |
|---|---|
| `sm.stats.jarque_bera(resid)` | Jarque-Bera normality test |
| `sm.stats.durbin_watson(resid)` | Durbin-Watson autocorrelation test |
| `het_breuschpagan(resid, X)` | Breusch-Pagan heteroscedasticity test |
| `acorr_ljungbox(resid, lags=[10])` | Ljung-Box autocorrelation test |
| `sm.stats.omni_normtest(resid)` | Omnibus normality test |
| `pairwise_tukeyhsd(values, groups)` | Tukey's HSD multiple comparison test |
| `sm.stats.anova_lm(model1, model2)` | ANOVA for model comparison |
| `results.test_causality('var1', ['var2'])` | Granger causality test (VAR models) |
| `proportions_ztest(count, nobs)` | Z-test for proportions |
| `ttest_ind(x1, x2)` | Independent samples t-test |
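A sketch running three of these diagnostics on the residuals of a synthetic OLS fit (the data and true coefficients are made up):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
resid = sm.OLS(y, X).fit().resid

# Breusch-Pagan: H0 = homoscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(f'Breusch-Pagan p-value: {lm_pvalue:.4f}')

# Durbin-Watson: values near 2 suggest no first-order autocorrelation
print(f'Durbin-Watson: {durbin_watson(resid):.4f}')

# Ljung-Box: H0 = no autocorrelation up to the given lags
print(acorr_ljungbox(resid, lags=[10]))
```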
Advanced Usage - Survival and Nonparametric
| Command | Description |
|---|---|
| `sm.SurvfuncRight(time, status)` | Kaplan-Meier survival curve |
| `sm.PHReg(time, X, status=status).fit()` | Cox Proportional Hazards model |
| `sm.nonparametric.KDEUnivariate(data).fit()` | Kernel density estimation |
| `sm.nonparametric.KernelReg(y, X, var_type='c')` | Kernel regression (nonparametric) |
| `sm.nonparametric.lowess(y, x)` | LOWESS smoothing |
| `sm.PCA(data)` | Principal Component Analysis |
| `Factor(X, n_factor=2).fit()` | Factor Analysis |
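A brief sketch of two nonparametric entries on synthetic data: a univariate KDE and a LOWESS smooth (`frac=0.3` is an illustrative bandwidth choice, not a recommendation):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)

# Kernel density estimate of y (default Gaussian kernel, automatic bandwidth)
kde = sm.nonparametric.KDEUnivariate(y)
kde.fit()
print(kde.support[:5], kde.density[:5])

# LOWESS smoother: returns sorted (x, smoothed y) pairs
smoothed = sm.nonparametric.lowess(y, x, frac=0.3)
print(smoothed[:5])
```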
Configuration
Model Formula Syntax (R-style)
```python
# Basic formula syntax
'y ~ x1 + x2'                   # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                   # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'               # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms
```
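A small worked example of the formula syntax, with a categorical term and a polynomial term on synthetic data (all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    'x1': rng.normal(size=120),
    'group': rng.choice(['a', 'b', 'c'], size=120),
})
df['y'] = 1 + 2 * df['x1'] + (df['group'] == 'b').astype(int) + rng.normal(size=120)

# Categorical term via C() plus a squared term via I()
results = smf.ols('y ~ x1 + I(x1**2) + C(group)', data=df).fit()
print(results.params)
```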
Model Fitting Options
```python
# Common fitting parameters
results = model.fit(
    method='lbfgs',     # Optimization method
    maxiter=1000,       # Maximum iterations
    disp=True,          # Display convergence messages
    cov_type='HC3',     # Robust covariance type
)

# Time series specific (legacy ARMA/ARIMA fit interface)
results = model.fit(
    start_params=None,  # Initial parameter values
    method='css-mle',   # Estimation method
    trend='c',          # Trend component
    solver='lbfgs',     # Optimization solver
    maxiter=500,        # Maximum iterations
    full_output=True,   # Return additional information
)
```
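A sketch of these fitting options in practice on a synthetic Logit model; `mle_retvals` holds the optimizer diagnostics, including the convergence flag (the data-generating coefficients are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(500, 2)))
p = 1 / (1 + np.exp(-(X @ np.array([0.2, 1.0, -0.7]))))
y = rng.binomial(1, p)

# Fit with an explicit optimizer, iteration cap, and convergence messages
results = sm.Logit(y, X).fit(method='lbfgs', maxiter=1000, disp=True)
print(results.mle_retvals)  # optimizer diagnostics, including 'converged'
```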
Display Options
```python
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,              # Significance level
    title='Model Results',   # Custom title
    xname=['Var1', 'Var2'],  # Custom variable names
)
```
Common Use Cases
Use Case 1: Linear Regression Analysis
```python
import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

# Make predictions (X_new must have the same columns as X, including the constant)
predictions = results.predict(X_new)
```
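The best practices below also recommend checking multicollinearity; continuing with `X` from this snippet, a short VIF check (the ~10 threshold is a common rule of thumb):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix (including the constant)
vif = pd.DataFrame({
    'variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)  # values above ~10 suggest problematic multicollinearity
```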
Use Case 2: Time Series Forecasting with ARIMA
```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()
```
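For interval rather than point forecasts, the fitted ARIMA results also expose `get_forecast`, which returns predicted means together with confidence intervals (continuing with `results` from above):

```python
# Forecast with 95% confidence intervals
fc = results.get_forecast(steps=12)
print(fc.predicted_mean)
print(fc.conf_int(alpha=0.05))
```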
Use Case 3: Logistic Regression for Classification
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios by exponentiating coefficients and their confidence bounds
odds_ratios = pd.DataFrame({
    'OR': np.exp(results.params),
    'CI_lower': np.exp(results.conf_int()[0]),
    'CI_upper': np.exp(results.conf_int()[1]),
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")
```
Use Case 4: Panel Data Analysis
```python
import pandas as pd
# PanelOLS lives in the companion linearmodels package (pip install linearmodels),
# not in statsmodels itself
from linearmodels.panel import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)  # summary is a property in linearmodels, not a method

# Extract fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())
```
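statsmodels itself covers panel-style data through random effects with `MixedLM`; a minimal sketch of an entity-level random-intercept model, reusing the illustrative column names from this use case:

```python
import statsmodels.formula.api as smf

# Random intercept per entity as a statsmodels-native alternative to PanelOLS
panel_df = df.reset_index()
md = smf.mixedlm('dependent_var ~ var1 + var2 + var3',
                 data=panel_df, groups=panel_df['entity_id'])
print(md.fit().summary())
```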
Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series
```python
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)
```
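The fitted VAR also supports forecast error variance decomposition, as listed in the time series table above (continuing with `results`):

```python
# Forecast error variance decomposition over 10 periods
fevd = results.fevd(10)
print(fevd.summary())
fevd.plot()
```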
Best Practices
- **Always add a constant term**: use `sm.add_constant(X)` with the array-based API to include an intercept in regression models
- **Check model assumptions**: validate residuals for normality, homoscedasticity, and autocorrelation with diagnostic tests
- **Use the formula API for readability**: prefer `smf.ols('y ~ x1 + x2', data=df)` over the array-based API for clearer, more maintainable code
- **Test for stationarity in time series**: always run `adfuller()` before fitting ARIMA models; difference the data if it is not stationary
- **Examine ACF/PACF plots**: use `plot_acf()` and `plot_pacf()` to determine appropriate ARIMA order parameters
- **Compare models with information criteria**: use AIC/BIC for model selection; lower values indicate a better fit with parsimony
- **Validate out-of-sample**: split data into train/test sets and evaluate predictive accuracy on holdout data
- **Handle multicollinearity**: check the VIF (Variance Inflation Factor) for highly correlated predictors in regression models
- **Use robust standard errors**: apply `cov_type='HC3'` in `.fit()` for heteroscedasticity-robust inference
- **Document model specifications**: keep clear records of model orders, transformations, and selection criteria for reproducibility
- **Visualize diagnostics**: always run `results.plot_diagnostics()` for time series models to check residual patterns
Troubleshooting
| Issue | Solution |
|---|---|
| LinAlgError: Singular matrix | Check for perfect multicollinearity; remove redundant variables or use `sm.add_constant()` only once |
| Convergence not achieved | Increase the `maxiter` parameter, try a different optimization method (`method='bfgs'`), or scale/standardize features |
| Perfect separation in logistic regression | Use penalized regression (`fit_regularized()`), remove problematic predictors, or collect more diverse data |
| ARIMA model won't fit | Verify data is stationary with `adfuller()`, try different order parameters, or check for missing values |
| ValueError: endog and exog matrices are different sizes | Ensure X and y have the same number of observations; check for missing values and align indices |
| Non-stationary time series warnings | Difference the series (`.diff()`), set `enforce_stationarity=False`, or transform data (log, Box-Cox) |
| Memory error with large datasets | Use chunking, reduce the lag order in VAR/ARIMA, or consider `fit(low_memory=True)` for state space models |
| Negative R-squared values | The model is worse than a mean baseline; check the model specification, add relevant features, or try a different model type |
| Heteroscedasticity detected | Use WLS with appropriate weights, apply robust standard errors (`cov_type='HC3'`), or transform the dependent variable |
| High VIF values (>10) | Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression |
| Residuals show patterns | Add polynomial terms, interaction effects, or use nonparametric methods like kernel regression |
| ImportError for optional dependencies | Install missing packages: `pip install matplotlib scipy patsy`, or use `pip install statsmodels[all]` |