# Statsmodels Cheatsheet
## Installation

| Platform | Command |
| --- | --- |
| Ubuntu/Debian | `pip install statsmodels` or `sudo apt-get install python3-statsmodels` |
| macOS | `pip install statsmodels` or `conda install -c conda-forge statsmodels` |
| Windows | `pip install statsmodels` or `conda install -c conda-forge statsmodels` |
| With all dependencies | `pip install statsmodels[all]` |
| With plotting support | `pip install statsmodels[plotting]` |
| From source (latest) | `git clone https://github.com/statsmodels/statsmodels.git && cd statsmodels && pip install .` |
| Docker | `docker run -it python:3.9-slim bash -c "pip install statsmodels pandas numpy"` |

Requirements: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2
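A quick way to confirm the install worked is to import the package and print its version (a minimal check; the version shown will vary):

```python
# Confirm statsmodels is importable and show which version was installed
import statsmodels
print(statsmodels.__version__)
```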
## Basic Commands

| Command | Description |
| --- | --- |
| `import statsmodels.api as sm` | Import main statsmodels API |
| `import statsmodels.formula.api as smf` | Import formula API (R-style syntax) |
| `sm.add_constant(X)` | Add intercept column to feature matrix |
| `sm.OLS(y, X).fit()` | Fit Ordinary Least Squares regression model |
| `smf.ols('y ~ x1 + x2', data=df).fit()` | Fit OLS using formula notation |
| `results.summary()` | Display comprehensive model summary |
| `results.params` | Get model coefficients/parameters |
| `results.pvalues` | Get p-values for coefficients |
| `results.rsquared` | Get R-squared value |
| `results.aic` | Get Akaike Information Criterion |
| `results.bic` | Get Bayesian Information Criterion |
| `results.resid` | Get model residuals |
| `results.fittedvalues` | Get fitted/predicted values |
| `results.predict(X_new)` | Make predictions on new data |
| `results.conf_int()` | Get confidence intervals for parameters |
| `smf.logit('y ~ x1 + x2', data=df).fit()` | Fit logistic regression model |
| `sm.datasets.get_rdataset('mtcars')` | Load example dataset |
| `results.get_prediction(X_new).summary_frame()` | Get predictions with confidence intervals |
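Tying a few of these commands together: a minimal sketch that loads the `mtcars` example dataset (fetched over the network by `get_rdataset`) and fits an OLS model via the formula API; the predictors `wt` and `hp` are illustrative choices, not the only sensible ones:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the mtcars R dataset; the .data attribute is a pandas DataFrame
mtcars = sm.datasets.get_rdataset('mtcars').data

# Regress fuel economy on weight and horsepower
results = smf.ols('mpg ~ wt + hp', data=mtcars).fit()
print(results.summary())   # full fit report
print(results.params)      # coefficients
print(results.rsquared)    # R-squared
```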
## Advanced Usage - Regression Models

| Command | Description |
| --- | --- |
| `sm.WLS(y, X, weights=w).fit()` | Weighted Least Squares regression |
| `sm.GLS(y, X, sigma=sigma).fit()` | Generalized Least Squares regression |
| `sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()` | Robust Linear Model (resistant to outliers) |
| `smf.quantreg('y ~ x', data=df).fit(q=0.5)` | Quantile regression (median regression) |
| `smf.mixedlm('y ~ x', data=df, groups=groups).fit()` | Mixed Linear Model (random effects) |
| `PanelOLS(y, X, entity_effects=True).fit()` | Panel data fixed effects model (from the companion `linearmodels` package, not statsmodels itself) |
| `smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()` | Poisson regression (GLM) |
| `smf.glm('y ~ x', data=df, family=sm.families.Gamma()).fit()` | Gamma regression (GLM) |
| `smf.glm('y ~ x', data=df, family=sm.families.NegativeBinomial()).fit()` | Negative binomial regression |
| `sm.MNLogit(y, X).fit()` | Multinomial logistic regression |
| `sm.Probit(y, X).fit()` | Probit regression model |
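As a worked example of the GLM rows above, a minimal Poisson regression on simulated count data (the coefficients 0.5 and 0.8 are arbitrary illustrative values):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulate count data from a Poisson model with a log link
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.5 + 0.8 * x))
df = pd.DataFrame({'x': x, 'y': y})

# Fit the Poisson GLM; the log link is the family default
results = smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()
print(results.summary())
print(np.exp(results.params))  # exponentiated coefficients = rate ratios
```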
## Advanced Usage - Time Series Analysis

| Command | Description |
| --- | --- |
| `ARIMA(data, order=(p,d,q)).fit()` | Fit ARIMA model with specified order |
| `SARIMAX(data, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit()` | Seasonal ARIMA with exogenous variables |
| `adfuller(timeseries)` | Augmented Dickey-Fuller stationarity test |
| `acf(timeseries, nlags=40)` | Calculate autocorrelation function |
| `pacf(timeseries, nlags=40)` | Calculate partial autocorrelation function |
| `plot_acf(timeseries, lags=40)` | Plot autocorrelation function |
| `plot_pacf(timeseries, lags=40)` | Plot partial autocorrelation function |
| `VAR(data).fit(maxlags=5)` | Fit Vector Autoregression model |
| `results.irf(10).plot()` | Plot impulse response functions |
| `results.fevd(10)` | Forecast error variance decomposition |
| `UnobservedComponents(data, level='local linear trend').fit()` | Structural time series model |
| `ExponentialSmoothing(data, seasonal='add', seasonal_periods=12).fit()` | Exponential smoothing (Holt-Winters) |
| `results.forecast(steps=10)` | Generate forecasts for future periods |
| `results.plot_diagnostics()` | Plot diagnostic plots for time series model |
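These names are not attributes of `sm`; they live in the `tsa` and `graphics` submodules. For reference, the import paths:

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.statespace.structural import UnobservedComponents
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```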
## Advanced Usage - Statistical Tests

| Command | Description |
| --- | --- |
| `jarque_bera(residuals)` | Jarque-Bera normality test |
| `durbin_watson(residuals)` | Durbin-Watson autocorrelation test |
| `het_breuschpagan(residuals, X)` | Breusch-Pagan heteroscedasticity test |
| `acorr_ljungbox(residuals, lags=10)` | Ljung-Box autocorrelation test |
| `omni_normtest(residuals)` | Omnibus normality test |
| `pairwise_tukeyhsd(data, groups)` | Tukey's HSD multiple comparison test |
| `anova_lm(model1, model2)` | ANOVA for model comparison |
| `results.test_causality('var1', ['var2'])` | Granger causality test (VAR models) |
| `proportions_ztest(counts, nobs)` | Z-test for proportions |
| `ttest_ind(sample1, sample2)` | Independent samples t-test |
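As with the time series tools, these test functions must be imported from the `stats` submodules:

```python
from statsmodels.stats.stattools import jarque_bera, durbin_watson, omni_normtest
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.weightstats import ttest_ind
```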
## Advanced Usage - Survival & Nonparametric

| Command | Description |
| --- | --- |
| `SurvfuncRight(durations, status).plot()` | Kaplan-Meier survival curve |
| `PHReg(durations, X, status=status).fit()` | Cox Proportional Hazards model |
| `KDEUnivariate(data).fit()` | Kernel density estimation |
| `KernelReg(y, X, var_type='c').fit()` | Kernel regression (nonparametric) |
| `lowess(y, x, frac=0.1)` | LOWESS smoothing |
| `PCA(data, ncomp=3, standardize=True)` | Principal Component Analysis |
| `Factor(data, n_factor=2).fit()` | Factor Analysis |
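Again, the import paths for this table:

```python
from statsmodels.duration.survfunc import SurvfuncRight
from statsmodels.duration.hazard_regression import PHReg
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_regression import KernelReg
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.multivariate.pca import PCA
from statsmodels.multivariate.factor import Factor
```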
## Configuration

### Formula Syntax

```python
# Basic formula syntax (Patsy)
'y ~ x1 + x2'                   # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                   # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'               # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms
```
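A quick sketch exercising a couple of these formula features in one fit (simulated data; all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with one numeric and one categorical predictor
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'x1': rng.normal(size=100),
    'category': rng.choice(['a', 'b', 'c'], size=100),
})
df['y'] = 1 + 2 * df['x1'] + df['x1'] ** 2 + rng.normal(size=100)

# Categorical dummy coding plus a polynomial term in one formula
results = smf.ols('y ~ C(category) + x1 + I(x1**2)', data=df).fit()
print(results.params)
```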
### Model Fitting Options

```python
# Common fitting parameters
results = model.fit(
    method='lbfgs',    # Optimization method
    maxiter=1000,      # Maximum iterations
    disp=True,         # Display convergence messages
    cov_type='HC3',    # Robust covariance type
)

# Time series specific (legacy ARMA/ARIMA fit options; in the current
# statsmodels.tsa.arima.model.ARIMA, trend is a model argument, not a fit argument)
results = model.fit(
    start_params=None,   # Initial parameter values
    method='css-mle',    # Estimation method
    trend='c',           # Trend component
    solver='lbfgs',      # Optimization solver
    maxiter=500,         # Maximum iterations
    full_output=True,    # Return additional information
)
```
### Display Options

```python
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,               # Significance level
    title='Model Results',    # Custom title
    xname=['Var1', 'Var2'],   # Custom variable names
)
```
## Common Use Cases

### Use Case 1: Linear Regression Analysis

```python
import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

# Make predictions (X_new must have the same columns as X, constant included)
predictions = results.predict(X_new)
```
### Use Case 2: Time Series Forecasting with ARIMA

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()
```
### Use Case 3: Logistic Regression for Classification

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios with confidence intervals
conf = results.conf_int()
odds_ratios = pd.DataFrame({
    'OR': np.exp(results.params),
    'CI_lower': np.exp(conf[0]),
    'CI_upper': np.exp(conf[1]),
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")
```
### Use Case 4: Panel Data Analysis

Note: `PanelOLS` is not part of statsmodels itself; it comes from the companion `linearmodels` package (`pip install linearmodels`).

```python
import pandas as pd
from linearmodels.panel import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)

# Extract estimated fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())
```
### Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series

```python
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)
```
## Best Practices

- **Always add a constant term**: Use `sm.add_constant(X)` when using the array-based API to include an intercept in regression models
- **Check model assumptions**: Validate residuals for normality, homoscedasticity, and autocorrelation using diagnostic tests
- **Use the formula API for readability**: Prefer `smf.ols('y ~ x1 + x2', data=df)` over the array-based API for clearer, more maintainable code
- **Test for stationarity in time series**: Always run `adfuller()` before fitting ARIMA models; difference the data if it is non-stationary
- **Examine ACF/PACF plots**: Use `plot_acf()` and `plot_pacf()` to determine appropriate ARIMA order parameters
- **Compare models with information criteria**: Use AIC/BIC for model selection; lower values indicate better fit with parsimony
- **Validate out-of-sample**: Split data into train/test sets and evaluate prediction accuracy on holdout data
- **Handle multicollinearity**: Check the VIF (Variance Inflation Factor) for highly correlated predictors, as shown in the sketch after this list
- **Use robust standard errors**: Apply `cov_type='HC3'` in `.fit()` for heteroscedasticity-robust inference
- **Document model specifications**: Keep clear records of model orders, transformations, and selection criteria for reproducibility
- **Visualize diagnostics**: Always run `results.plot_diagnostics()` for time series models to check residual patterns
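A minimal VIF check with statsmodels' `variance_inflation_factor`, on simulated predictors where two columns are deliberately collinear (all names illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors; x2 is built to be highly correlated with x1
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200)})
df['x2'] = df['x1'] * 0.9 + rng.normal(scale=0.1, size=200)
df['x3'] = rng.normal(size=200)

# Include the constant, then compute one VIF per column
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 should show high VIFs
```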
## Troubleshooting

| Issue | Solution |
| --- | --- |
| `LinAlgError: Singular matrix` | Check for perfect multicollinearity; remove redundant variables or use `sm.add_constant()` only once |
| Convergence not achieved | Increase the `maxiter` parameter, try a different optimization method (`method='bfgs'`), or scale/standardize features |
| Perfect separation in logistic regression | Use penalized estimation (`fit_regularized(method='l1')`), remove problematic predictors, or collect more diverse data |
| ARIMA model won't fit | Verify the data is stationary with `adfuller()`, try different order parameters, or check for missing values |
| `ValueError: endog and exog matrices are different sizes` | Ensure X and y have the same number of observations; check for missing values and align indices |
| Non-stationary time series warnings | Difference the series (`df.diff().dropna()`), set `enforce_stationarity=False`, or transform the data (log, Box-Cox) |
| Memory error with large datasets | Use chunking, reduce the lag order in VAR/ARIMA, or consider `statsmodels.tsa.statespace` for state space models |
| Negative R-squared values | The model is worse than a mean baseline; check the model specification, add relevant features, or try a different model type |
| Heteroscedasticity detected | Use WLS with appropriate weights, apply robust standard errors (`cov_type='HC3'`), or transform the dependent variable |
| High VIF values (>10) | Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression |
| Residuals show patterns | Add polynomial terms or interaction effects, or use nonparametric methods such as kernel regression |
| `ImportError` for optional dependencies | Install missing packages: `pip install matplotlib scipy patsy`, or use `pip install statsmodels[all]` |
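For the perfect-separation row above, a minimal sketch of L1-penalized logistic regression via `fit_regularized` (the data are simulated to be perfectly separable; `alpha=1.0` is an arbitrary illustrative penalty):

```python
import numpy as np
import statsmodels.api as sm

# Simulated perfectly separable data: plain Logit.fit() would diverge here
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty keeps coefficients finite under separation
model = sm.Logit(y, sm.add_constant(X))
results = model.fit_regularized(method='l1', alpha=1.0)
print(results.params)
```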