# Statsmodels Cheatsheet
## Installation

| Platform | Command |
| --- | --- |
| Ubuntu/Debian | `pip install statsmodels` or `sudo apt-get install python3-statsmodels` |
| macOS | `pip install statsmodels` or `conda install -c conda-forge statsmodels` |
| Windows | `pip install statsmodels` or `conda install -c conda-forge statsmodels` |
| With all dependencies | `pip install statsmodels[all]` |
| With plotting support | `pip install statsmodels[plotting]` |
| From source (latest) | `git clone https://github.com/statsmodels/statsmodels.git && cd statsmodels && pip install .` |
| Docker | `docker run -it python:3.9-slim bash -c "pip install statsmodels pandas numpy"` |

Requirements: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2
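A quick way to confirm the install worked is to import the package and print its version (a minimal check; the version shown will vary):

```python
# Confirm statsmodels is importable and show which version was installed
import statsmodels
print(statsmodels.__version__)
```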
## Basic Commands

| Command | Description |
| --- | --- |
| `import statsmodels.api as sm` | Import main statsmodels API |
| `import statsmodels.formula.api as smf` | Import formula API (R-style syntax) |
| `sm.add_constant(X)` | Add intercept column to feature matrix |
| `sm.OLS(y, X).fit()` | Fit Ordinary Least Squares regression model |
| `smf.ols('y ~ x1 + x2', data=df).fit()` | Fit OLS using formula notation |
| `results.summary()` | Display comprehensive model summary |
| `results.params` | Get model coefficients/parameters |
| `results.pvalues` | Get p-values for coefficients |
| `results.rsquared` | Get R-squared value |
| `results.aic` | Get Akaike Information Criterion |
| `results.bic` | Get Bayesian Information Criterion |
| `results.resid` | Get model residuals |
| `results.fittedvalues` | Get fitted/predicted values |
| `results.predict(X_new)` | Make predictions on new data |
| `results.conf_int()` | Get confidence intervals for parameters |
| `smf.logit('y ~ x1 + x2', data=df).fit()` | Fit logistic regression model |
| `sm.datasets.get_rdataset('mtcars')` | Load example dataset |
| `results.get_prediction(X_new).summary_frame()` | Get predictions with confidence intervals |
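Tying a few of these commands together: a minimal sketch that loads the `mtcars` example dataset (fetched over the network by `get_rdataset`) and fits an OLS model via the formula API; the predictors `wt` and `hp` are illustrative choices, not the only sensible ones:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the mtcars R dataset; the .data attribute is a pandas DataFrame
mtcars = sm.datasets.get_rdataset('mtcars').data

# Regress fuel economy on weight and horsepower
results = smf.ols('mpg ~ wt + hp', data=mtcars).fit()
print(results.summary())   # full fit report
print(results.params)      # coefficients
print(results.rsquared)    # R-squared
```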
## Advanced Usage - Regression Models

| Command | Description |
| --- | --- |
| `sm.WLS(y, X, weights=w).fit()` | Weighted Least Squares regression |
| `sm.GLS(y, X, sigma=sigma).fit()` | Generalized Least Squares regression |
| `sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()` | Robust Linear Model (resistant to outliers) |
| `smf.quantreg('y ~ x', data=df).fit(q=0.5)` | Quantile regression (median regression) |
| `smf.mixedlm('y ~ x', data=df, groups=groups).fit()` | Mixed Linear Model (random effects) |
| `PanelOLS(y, X, entity_effects=True).fit()` | Panel data fixed effects model (from the companion `linearmodels` package, not statsmodels itself) |
| `smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()` | Poisson regression (GLM) |
| `smf.glm('y ~ x', data=df, family=sm.families.Gamma()).fit()` | Gamma regression (GLM) |
| `smf.glm('y ~ x', data=df, family=sm.families.NegativeBinomial()).fit()` | Negative binomial regression |
| `sm.MNLogit(y, X).fit()` | Multinomial logistic regression |
| `sm.Probit(y, X).fit()` | Probit regression model |
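As a worked example of the GLM rows above, a minimal Poisson regression on simulated count data (the coefficients 0.5 and 0.8 are arbitrary illustrative values):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulate count data from a Poisson model with a log link
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.5 + 0.8 * x))
df = pd.DataFrame({'x': x, 'y': y})

# Fit the Poisson GLM; the log link is the family default
results = smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()
print(results.summary())
print(np.exp(results.params))  # exponentiated coefficients = rate ratios
```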
## Advanced Usage - Time Series Analysis

| Command | Description |
| --- | --- |
| `ARIMA(data, order=(p,d,q)).fit()` | Fit ARIMA model with specified order |
| `SARIMAX(data, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit()` | Seasonal ARIMA with exogenous variables |
| `adfuller(timeseries)` | Augmented Dickey-Fuller stationarity test |
| `acf(timeseries, nlags=40)` | Calculate autocorrelation function |
| `pacf(timeseries, nlags=40)` | Calculate partial autocorrelation function |
| `plot_acf(timeseries, lags=40)` | Plot autocorrelation function |
| `plot_pacf(timeseries, lags=40)` | Plot partial autocorrelation function |
| `VAR(data).fit(maxlags=5)` | Fit Vector Autoregression model |
| `results.irf(10).plot()` | Plot impulse response functions |
| `results.fevd(10)` | Forecast error variance decomposition |
| `UnobservedComponents(data, level='local linear trend').fit()` | Structural time series model |
| `ExponentialSmoothing(data, seasonal='add', seasonal_periods=12).fit()` | Exponential smoothing (Holt-Winters) |
| `results.forecast(steps=10)` | Generate forecasts for future periods |
| `results.plot_diagnostics()` | Plot diagnostic plots for time series model |
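These names are not attributes of `sm`; they live in the `tsa` and `graphics` submodules. For reference, the import paths:

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.statespace.structural import UnobservedComponents
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```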
## Advanced Usage - Statistical Tests

| Command | Description |
| --- | --- |
| `jarque_bera(residuals)` | Jarque-Bera normality test |
| `durbin_watson(residuals)` | Durbin-Watson autocorrelation test |
| `het_breuschpagan(residuals, X)` | Breusch-Pagan heteroscedasticity test |
| `acorr_ljungbox(residuals, lags=10)` | Ljung-Box autocorrelation test |
| `omni_normtest(residuals)` | Omnibus normality test |
| `pairwise_tukeyhsd(data, groups)` | Tukey's HSD multiple comparison test |
| `anova_lm(model1, model2)` | ANOVA for model comparison |
| `results.test_causality('var1', ['var2'])` | Granger causality test (VAR models) |
| `proportions_ztest(counts, nobs)` | Z-test for proportions |
| `ttest_ind(sample1, sample2)` | Independent samples t-test |
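As with the time series tools, these test functions must be imported from the `stats` submodules:

```python
from statsmodels.stats.stattools import jarque_bera, durbin_watson, omni_normtest
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.weightstats import ttest_ind
```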
## Advanced Usage - Survival & Nonparametric

| Command | Description |
| --- | --- |
| `SurvfuncRight(durations, status).plot()` | Kaplan-Meier survival curve |
| `PHReg(durations, X, status=status).fit()` | Cox Proportional Hazards model |
| `KDEUnivariate(data).fit()` | Kernel density estimation |
| `KernelReg(y, X, var_type='c').fit()` | Kernel regression (nonparametric) |
| `lowess(y, x, frac=0.1)` | LOWESS smoothing |
| `PCA(data, ncomp=3, standardize=True)` | Principal Component Analysis |
| `Factor(data, n_factor=2).fit()` | Factor Analysis |
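Again, the import paths for this table:

```python
from statsmodels.duration.survfunc import SurvfuncRight
from statsmodels.duration.hazard_regression import PHReg
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_regression import KernelReg
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.multivariate.pca import PCA
from statsmodels.multivariate.factor import Factor
```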
## Configuration

### Formula Syntax

```python
# Basic formula syntax (Patsy)
'y ~ x1 + x2'                   # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                   # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'               # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms
```
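A quick sketch exercising a couple of these formula features in one fit (simulated data; all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with one numeric and one categorical predictor
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'x1': rng.normal(size=100),
    'category': rng.choice(['a', 'b', 'c'], size=100),
})
df['y'] = 1 + 2 * df['x1'] + df['x1'] ** 2 + rng.normal(size=100)

# Categorical dummy coding plus a polynomial term in one formula
results = smf.ols('y ~ C(category) + x1 + I(x1**2)', data=df).fit()
print(results.params)
```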
### Model Fitting Options

```python
# Common fitting parameters
results = model.fit(
    method='lbfgs',    # Optimization method
    maxiter=1000,      # Maximum iterations
    disp=True,         # Display convergence messages
    cov_type='HC3',    # Robust covariance type
)

# Time series specific (legacy ARMA/ARIMA fit options; in the current
# statsmodels.tsa.arima.model.ARIMA, trend is a model argument, not a fit argument)
results = model.fit(
    start_params=None,   # Initial parameter values
    method='css-mle',    # Estimation method
    trend='c',           # Trend component
    solver='lbfgs',      # Optimization solver
    maxiter=500,         # Maximum iterations
    full_output=True,    # Return additional information
)
```
### Display Options

```python
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,               # Significance level
    title='Model Results',    # Custom title
    xname=['Var1', 'Var2'],   # Custom variable names
)
```
## Common Use Cases

### Use Case 1: Linear Regression Analysis

```python
import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

# Make predictions (X_new must have the same columns as X, constant included)
predictions = results.predict(X_new)
```
### Use Case 2: Time Series Forecasting with ARIMA

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()
```
### Use Case 3: Logistic Regression for Classification

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios with confidence intervals
conf = results.conf_int()
odds_ratios = pd.DataFrame({
    'OR': np.exp(results.params),
    'CI_lower': np.exp(conf[0]),
    'CI_upper': np.exp(conf[1]),
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")
```
### Use Case 4: Panel Data Analysis

Note: `PanelOLS` is not part of statsmodels itself; it comes from the companion `linearmodels` package (`pip install linearmodels`).

```python
import pandas as pd
from linearmodels.panel import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)

# Extract estimated fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())
```
### Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series

```python
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)
```
## Best Practices

- **Always add a constant term**: Use `sm.add_constant(X)` when using the array-based API to include an intercept in regression models
- **Check model assumptions**: Validate residuals for normality, homoscedasticity, and autocorrelation using diagnostic tests
- **Use the formula API for readability**: Prefer `smf.ols('y ~ x1 + x2', data=df)` over the array-based API for clearer, more maintainable code
- **Test for stationarity in time series**: Always run `adfuller()` before fitting ARIMA models; difference the data if it is non-stationary
- **Examine ACF/PACF plots**: Use `plot_acf()` and `plot_pacf()` to determine appropriate ARIMA order parameters
- **Compare models with information criteria**: Use AIC/BIC for model selection; lower values indicate better fit with parsimony
- **Validate out-of-sample**: Split data into train/test sets and evaluate prediction accuracy on holdout data
- **Handle multicollinearity**: Check the VIF (Variance Inflation Factor) for highly correlated predictors, as shown in the sketch after this list
- **Use robust standard errors**: Apply `cov_type='HC3'` in `.fit()` for heteroscedasticity-robust inference
- **Document model specifications**: Keep clear records of model orders, transformations, and selection criteria for reproducibility
- **Visualize diagnostics**: Always run `results.plot_diagnostics()` for time series models to check residual patterns
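A minimal VIF check with statsmodels' `variance_inflation_factor`, on simulated predictors where two columns are deliberately collinear (all names illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors; x2 is built to be highly correlated with x1
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200)})
df['x2'] = df['x1'] * 0.9 + rng.normal(scale=0.1, size=200)
df['x3'] = rng.normal(size=200)

# Include the constant, then compute one VIF per column
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 should show high VIFs
```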
## Troubleshooting

| Issue | Solution |
| --- | --- |
| `LinAlgError: Singular matrix` | Check for perfect multicollinearity; remove redundant variables or use `sm.add_constant()` only once |
| Convergence not achieved | Increase the `maxiter` parameter, try a different optimization method (`method='bfgs'`), or scale/standardize features |
| Perfect separation in logistic regression | Use penalized estimation (`fit_regularized(method='l1')`), remove problematic predictors, or collect more diverse data |
| ARIMA model won't fit | Verify the data is stationary with `adfuller()`, try different order parameters, or check for missing values |
| `ValueError: endog and exog matrices are different sizes` | Ensure X and y have the same number of observations; check for missing values and align indices |
| Non-stationary time series warnings | Difference the series (`df.diff().dropna()`), set `enforce_stationarity=False`, or transform the data (log, Box-Cox) |
| Memory error with large datasets | Use chunking, reduce the lag order in VAR/ARIMA, or consider `statsmodels.tsa.statespace` for state space models |
| Negative R-squared values | The model is worse than a mean baseline; check the model specification, add relevant features, or try a different model type |
| Heteroscedasticity detected | Use WLS with appropriate weights, apply robust standard errors (`cov_type='HC3'`), or transform the dependent variable |
| High VIF values (>10) | Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression |
| Residuals show patterns | Add polynomial terms or interaction effects, or use nonparametric methods such as kernel regression |
| `ImportError` for optional dependencies | Install missing packages: `pip install matplotlib scipy patsy`, or use `pip install statsmodels[all]` |
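For the perfect-separation row above, a minimal sketch of L1-penalized logistic regression via `fit_regularized` (the data are simulated to be perfectly separable; `alpha=1.0` is an arbitrary illustrative penalty):

```python
import numpy as np
import statsmodels.api as sm

# Simulated perfectly separable data: plain Logit.fit() would diverge here
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty keeps coefficients finite under separation
model = sm.Logit(y, sm.add_constant(X))
results = model.fit_regularized(method='l1', alpha=1.0)
print(results.params)
```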