Statsmodels Cheatsheet
Installation
| Platform | Command |
|---|
| Ubuntu/Debian | pip install statsmodels or sudo apt-get install python3-statsmodels |
| macOS | pip install statsmodels or conda install -c conda-forge statsmodels |
| Windows | pip install statsmodels or conda install -c conda-forge statsmodels |
| With all dependencies | pip install statsmodels[all] |
| With plotting support | pip install statsmodels[plotting] |
| From source (latest) | git clone https://github.com/statsmodels/statsmodels.git && cd statsmodels && pip install . |
| Docker | docker run -it python:3.9-slim bash -c "pip install statsmodels pandas numpy" |
| Requirements: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2 | |
Basic Commands
| Command | Description |
|---|
import statsmodels.api as sm | Import main statsmodels API |
import statsmodels.formula.api as smf | Import formula API (R-style syntax) |
sm.add_constant(X) | Add intercept column to feature matrix |
sm.OLS(y, X).fit() | Fit Ordinary Least Squares regression model |
smf.ols('y ~ x1 + x2', data=df).fit() | Fit OLS using formula notation |
results.summary() | Display comprehensive model summary |
results.params | Get model coefficients/parameters |
results.pvalues | Get p-values for coefficients |
results.rsquared | Get R-squared value |
results.aic | Get Akaike Information Criterion |
results.bic | Get Bayesian Information Criterion |
results.resid | Get model residuals |
results.fittedvalues | Get fitted/predicted values |
results.predict(X_new) | Make predictions on new data |
results.conf_int() | Get confidence intervals for parameters |
smf.logit('y ~ x1 + x2', data=df).fit() | Fit logistic regression model |
sm.datasets.get_rdataset('mtcars') | Load example dataset |
results.get_prediction(X_new).summary_frame() | Get predictions with confidence intervals |
Advanced Usage - Regression Models
| Command | Description |
|---|
sm.WLS(y, X, weights=w).fit() | Weighted Least Squares regression |
sm.GLS(y, X, sigma=sigma).fit() | Generalized Least Squares regression |
sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit() | Robust Linear Model (resistant to outliers) |
smf.quantreg('y ~ x', data=df).fit(q=0.5) | Quantile regression (median regression) |
smf.mixedlm('y ~ x', data=df, groups=groups).fit() | Mixed Linear Model (random effects) |
sm.PanelOLS(y, X, entity_effects=True).fit() | Panel data fixed effects model |
smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit() | Poisson regression (GLM) |
smf.glm('y ~ x', data=df, family=sm.families.Gamma()).fit() | Gamma regression (GLM) |
smf.glm('y ~ x', data=df, family=sm.families.NegativeBinomial()).fit() | Negative binomial regression |
sm.MNLogit(y, X).fit() | Multinomial logistic regression |
sm.Probit(y, X).fit() | Probit regression model |
Advanced Usage - Time Series Analysis
| Command | Description |
|---|
ARIMA(data, order=(p,d,q)).fit() | Fit ARIMA model with specified order |
SARIMAX(data, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit() | Seasonal ARIMA with exogenous variables |
adfuller(timeseries) | Augmented Dickey-Fuller stationarity test |
acf(timeseries, nlags=40) | Calculate autocorrelation function |
pacf(timeseries, nlags=40) | Calculate partial autocorrelation function |
plot_acf(timeseries, lags=40) | Plot autocorrelation function |
plot_pacf(timeseries, lags=40) | Plot partial autocorrelation function |
VAR(data).fit(maxlags=5) | Fit Vector Autoregression model |
results.irf(10).plot() | Plot impulse response functions |
results.fevd(10) | Forecast error variance decomposition |
UnobservedComponents(data, level='local linear trend').fit() | Structural time series model |
ExponentialSmoothing(data, seasonal='add', seasonal_periods=12).fit() | Exponential smoothing (Holt-Winters) |
results.forecast(steps=10) | Generate forecasts for future periods |
results.plot_diagnostics() | Plot diagnostic plots for time series model |
Advanced Usage - Statistical Tests
| Command | Description |
|---|
jarque_bera(residuals) | Jarque-Bera normality test |
durbin_watson(residuals) | Durbin-Watson autocorrelation test |
het_breuschpagan(residuals, X) | Breusch-Pagan heteroscedasticity test |
acorr_ljungbox(residuals, lags=10) | Ljung-Box autocorrelation test |
omni_normtest(residuals) | Omnibus normality test |
pairwise_tukeyhsd(data, groups) | Tukey’s HSD multiple comparison test |
anova_lm(model1, model2) | ANOVA for model comparison |
results.test_causality('var1', ['var2']) | Granger causality test (VAR models) |
proportions_ztest(counts, nobs) | Z-test for proportions |
ttest_ind(sample1, sample2) | Independent samples t-test |
Advanced Usage - Survival & Nonparametric
| Command | Description |
|---|
SurvfuncRight(durations, status).plot() | Kaplan-Meier survival curve |
PHReg(durations, X, status=status).fit() | Cox Proportional Hazards model |
KDEUnivariate(data).fit() | Kernel density estimation |
KernelReg(y, X, var_type='c').fit() | Kernel regression (nonparametric) |
lowess(y, x, frac=0.1) | LOWESS smoothing |
PCA(data, ncomp=3, standardize=True) | Principal Component Analysis |
Factor(data, n_factor=2).fit() | Factor Analysis |
Configuration
# Basic formula syntax
'y ~ x1 + x2' # Multiple predictors
'y ~ x1 + x2 + x1:x2' # With interaction term
'y ~ x1 * x2' # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)' # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)' # Transformations
'y ~ x1 + I(x1**2)' # Polynomial terms
Model Fitting Options
# Common fitting parameters
results = model.fit(
method='lbfgs', # Optimization method
maxiter=1000, # Maximum iterations
disp=True, # Display convergence messages
cov_type='HC3' # Robust covariance type
)
# Time series specific
results = model.fit(
start_params=None, # Initial parameter values
method='css-mle', # Estimation method
trend='c', # Trend component
solver='lbfgs', # Optimization solver
maxiter=500, # Maximum iterations
full_output=True # Return additional information
)
Display Options
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)
# Statsmodels summary options
results.summary(
alpha=0.05, # Significance level
title='Model Results', # Custom title
xname=['Var1', 'Var2'] # Custom variable names
)
Common Use Cases
Use Case 1: Linear Regression Analysis
import statsmodels.api as sm
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X) # Add intercept
# Fit model
model = sm.OLS(y, X)
results = model.fit()
# Display results
print(results.summary())
# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")
# Make predictions
predictions = results.predict(X_new)
Use Case 2: Time Series Forecasting with ARIMA
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)
# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')
# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()
# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=12)
print(forecast)
# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()
Use Case 3: Logistic Regression for Classification
import statsmodels.formula.api as smf
import pandas as pd
# Load data
df = pd.read_csv('classification_data.csv')
# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()
# Display results
print(results.summary())
# Get odds ratios
odds_ratios = pd.DataFrame({
'OR': results.params.apply(lambda x: np.exp(x)),
'CI_lower': results.conf_int()[0].apply(lambda x: np.exp(x)),
'CI_upper': results.conf_int()[1].apply(lambda x: np.exp(x))
})
print(odds_ratios)
# Predict probabilities
df['predicted_prob'] = results.predict(df)
# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")
Use Case 4: Panel Data Analysis
import pandas as pd
from statsmodels.regression.linear_model import PanelOLS
# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])
# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]
# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)
# Extract fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())
Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller
# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)
# Check stationarity for all variables
for col in df.columns:
result = adfuller(df[col])
print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')
# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())
# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)
# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)
# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)
Best Practices
- Always add constant term: Use
sm.add_constant(X) when using array-based API to include intercept in regression models
- Check model assumptions: Validate residuals for normality, homoscedasticity, and autocorrelation using diagnostic tests
- Use formula API for readability: Prefer
smf.ols('y ~ x1 + x2', data=df) over array-based API for clearer, more maintainable code
- Test for stationarity in time series: Always run
adfuller() test before fitting ARIMA models; difference data if non-stationary
- Examine ACF/PACF plots: Use
plot_acf() and plot_pacf() to determine appropriate ARIMA order parameters
- Compare models with information criteria: Use AIC/BIC for model selection; lower values indicate better fit with parsimony
- Validate out-of-sample: Split data into train/test sets and evaluate prediction accuracy on holdout data
- Handle multicollinearity: Check VIF (Variance Inflation Factor) for highly correlated predictors in regression models
- Use robust standard errors: Apply
cov_type='HC3' in .fit() for heteroscedasticity-robust inference
- Document model specifications: Keep clear records of model orders, transformations, and selection criteria for reproducibility
- Visualize diagnostics: Always run
results.plot_diagnostics() for time series models to check residual patterns
Troubleshooting
| Issue | Solution |
|---|
| LinAlgError: Singular matrix | Check for perfect multicollinearity; remove redundant variables or use sm.add_constant() only once |
| Convergence not achieved | Increase maxiter parameter, try different optimization method (method='bfgs'), or scale/standardize features |
| Perfect separation in logistic regression | Use penalized regression (method='l1'), remove problematic predictors, or collect more diverse data |
| ARIMA model won’t fit | Verify data is stationary with adfuller(), try different order parameters, or check for missing values |
| ValueError: endog and exog matrices are different sizes | Ensure X and y have same number of observations; check for missing values and align indices |
| Non-stationary time series warnings | Difference the series (df.diff().dropna()), set enforce_stationarity=False, or transform data (log, Box-Cox) |
| Memory error with large datasets | Use chunking, reduce lag order in VAR/ARIMA, or consider statsmodels.tsa.statespace for state space models |
| Negative R-squared values | Model is worse than mean baseline; check model specification, add relevant features, or try different model type |
| Heteroscedasticity detected | Use WLS with appropriate weights, apply robust standard errors (cov_type='HC3'), or transform dependent variable |
| High VIF values (>10) | Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression |
| Residuals show patterns | Add polynomial terms, interaction effects, or use nonparametric methods like kernel regression |