Pular para o conteúdo

Statsmodels Cheatsheet

Statsmodels Cheatsheet

Installation

PlatformCommand
Ubuntu/Debianpip install statsmodels or sudo apt-get install python3-statsmodels
macOSpip install statsmodels or conda install -c conda-forge statsmodels
Windowspip install statsmodels or conda install -c conda-forge statsmodels
With all dependenciespip install statsmodels[all]
With plotting supportpip install statsmodels[plotting]
From source (latest)git clone https://github.com/statsmodels/statsmodels.git && cd statsmodels && pip install .
Dockerdocker run -it python:3.9-slim bash -c "pip install statsmodels pandas numpy"
Requirements: Python 3.8+, NumPy >= 1.18, SciPy >= 1.4, Pandas >= 1.0, Patsy >= 0.5.2

Basic Commands

CommandDescription
import statsmodels.api as smImport main statsmodels API
import statsmodels.formula.api as smfImport formula API (R-style syntax)
sm.add_constant(X)Add intercept column to feature matrix
sm.OLS(y, X).fit()Fit Ordinary Least Squares regression model
smf.ols('y ~ x1 + x2', data=df).fit()Fit OLS using formula notation
results.summary()Display comprehensive model summary
results.paramsGet model coefficients/parameters
results.pvaluesGet p-values for coefficients
results.rsquaredGet R-squared value
results.aicGet Akaike Information Criterion
results.bicGet Bayesian Information Criterion
results.residGet model residuals
results.fittedvaluesGet fitted/predicted values
results.predict(X_new)Make predictions on new data
results.conf_int()Get confidence intervals for parameters
smf.logit('y ~ x1 + x2', data=df).fit()Fit logistic regression model
sm.datasets.get_rdataset('mtcars')Load example dataset
results.get_prediction(X_new).summary_frame()Get predictions with confidence intervals

Advanced Usage - Regression Models

CommandDescription
sm.WLS(y, X, weights=w).fit()Weighted Least Squares regression
sm.GLS(y, X, sigma=sigma).fit()Generalized Least Squares regression
sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()Robust Linear Model (resistant to outliers)
smf.quantreg('y ~ x', data=df).fit(q=0.5)Quantile regression (median regression)
smf.mixedlm('y ~ x', data=df, groups=groups).fit()Mixed Linear Model (random effects)
sm.PanelOLS(y, X, entity_effects=True).fit()Panel data fixed effects model
smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()Poisson regression (GLM)
smf.glm('y ~ x', data=df, family=sm.families.Gamma()).fit()Gamma regression (GLM)
smf.glm('y ~ x', data=df, family=sm.families.NegativeBinomial()).fit()Negative binomial regression
sm.MNLogit(y, X).fit()Multinomial logistic regression
sm.Probit(y, X).fit()Probit regression model

Advanced Usage - Time Series Analysis

CommandDescription
ARIMA(data, order=(p,d,q)).fit()Fit ARIMA model with specified order
SARIMAX(data, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit()Seasonal ARIMA with exogenous variables
adfuller(timeseries)Augmented Dickey-Fuller stationarity test
acf(timeseries, nlags=40)Calculate autocorrelation function
pacf(timeseries, nlags=40)Calculate partial autocorrelation function
plot_acf(timeseries, lags=40)Plot autocorrelation function
plot_pacf(timeseries, lags=40)Plot partial autocorrelation function
VAR(data).fit(maxlags=5)Fit Vector Autoregression model
results.irf(10).plot()Plot impulse response functions
results.fevd(10)Forecast error variance decomposition
UnobservedComponents(data, level='local linear trend').fit()Structural time series model
ExponentialSmoothing(data, seasonal='add', seasonal_periods=12).fit()Exponential smoothing (Holt-Winters)
results.forecast(steps=10)Generate forecasts for future periods
results.plot_diagnostics()Plot diagnostic plots for time series model

Advanced Usage - Statistical Tests

CommandDescription
jarque_bera(residuals)Jarque-Bera normality test
durbin_watson(residuals)Durbin-Watson autocorrelation test
het_breuschpagan(residuals, X)Breusch-Pagan heteroscedasticity test
acorr_ljungbox(residuals, lags=10)Ljung-Box autocorrelation test
omni_normtest(residuals)Omnibus normality test
pairwise_tukeyhsd(data, groups)Tukey’s HSD multiple comparison test
anova_lm(model1, model2)ANOVA for model comparison
results.test_causality('var1', ['var2'])Granger causality test (VAR models)
proportions_ztest(counts, nobs)Z-test for proportions
ttest_ind(sample1, sample2)Independent samples t-test

Advanced Usage - Survival & Nonparametric

CommandDescription
SurvfuncRight(durations, status).plot()Kaplan-Meier survival curve
PHReg(durations, X, status=status).fit()Cox Proportional Hazards model
KDEUnivariate(data).fit()Kernel density estimation
KernelReg(y, X, var_type='c').fit()Kernel regression (nonparametric)
lowess(y, x, frac=0.1)LOWESS smoothing
PCA(data, ncomp=3, standardize=True)Principal Component Analysis
Factor(data, n_factor=2).fit()Factor Analysis

Configuration

Model Formula Syntax (R-style)

# Basic formula syntax
'y ~ x1 + x2'                    # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                    # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'                # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms

Model Fitting Options

# Common fitting parameters
results = model.fit(
    method='lbfgs',              # Optimization method
    maxiter=1000,                # Maximum iterations
    disp=True,                   # Display convergence messages
    cov_type='HC3'               # Robust covariance type
)

# Time series specific
results = model.fit(
    start_params=None,           # Initial parameter values
    method='css-mle',            # Estimation method
    trend='c',                   # Trend component
    solver='lbfgs',              # Optimization solver
    maxiter=500,                 # Maximum iterations
    full_output=True             # Return additional information
)

Display Options

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,                  # Significance level
    title='Model Results',       # Custom title
    xname=['Var1', 'Var2']      # Custom variable names
)

Common Use Cases

Use Case 1: Linear Regression Analysis

import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

# Make predictions
predictions = results.predict(X_new)

Use Case 2: Time Series Forecasting with ARIMA

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()

Use Case 3: Logistic Regression for Classification

import statsmodels.formula.api as smf
import pandas as pd

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios
odds_ratios = pd.DataFrame({
    'OR': results.params.apply(lambda x: np.exp(x)),
    'CI_lower': results.conf_int()[0].apply(lambda x: np.exp(x)),
    'CI_upper': results.conf_int()[1].apply(lambda x: np.exp(x))
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")

Use Case 4: Panel Data Analysis

import pandas as pd
from statsmodels.regression.linear_model import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)

# Extract fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())

Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series

import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)

Best Practices

  • Always add constant term: Use sm.add_constant(X) when using array-based API to include intercept in regression models
  • Check model assumptions: Validate residuals for normality, homoscedasticity, and autocorrelation using diagnostic tests
  • Use formula API for readability: Prefer smf.ols('y ~ x1 + x2', data=df) over array-based API for clearer, more maintainable code
  • Test for stationarity in time series: Always run adfuller() test before fitting ARIMA models; difference data if non-stationary
  • Examine ACF/PACF plots: Use plot_acf() and plot_pacf() to determine appropriate ARIMA order parameters
  • Compare models with information criteria: Use AIC/BIC for model selection; lower values indicate better fit with parsimony
  • Validate out-of-sample: Split data into train/test sets and evaluate prediction accuracy on holdout data
  • Handle multicollinearity: Check VIF (Variance Inflation Factor) for highly correlated predictors in regression models
  • Use robust standard errors: Apply cov_type='HC3' in .fit() for heteroscedasticity-robust inference
  • Document model specifications: Keep clear records of model orders, transformations, and selection criteria for reproducibility
  • Visualize diagnostics: Always run results.plot_diagnostics() for time series models to check residual patterns

Troubleshooting

IssueSolution
LinAlgError: Singular matrixCheck for perfect multicollinearity; remove redundant variables or use sm.add_constant() only once
Convergence not achievedIncrease maxiter parameter, try different optimization method (method='bfgs'), or scale/standardize features
Perfect separation in logistic regressionUse penalized regression (method='l1'), remove problematic predictors, or collect more diverse data
ARIMA model won’t fitVerify data is stationary with adfuller(), try different order parameters, or check for missing values
ValueError: endog and exog matrices are different sizesEnsure X and y have same number of observations; check for missing values and align indices
Non-stationary time series warningsDifference the series (df.diff().dropna()), set enforce_stationarity=False, or transform data (log, Box-Cox)
Memory error with large datasetsUse chunking, reduce lag order in VAR/ARIMA, or consider statsmodels.tsa.statespace for state space models
Negative R-squared valuesModel is worse than mean baseline; check model specification, add relevant features, or try different model type
Heteroscedasticity detectedUse WLS with appropriate weights, apply robust standard errors (cov_type='HC3'), or transform dependent variable
High VIF values (>10)Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression
Residuals show patternsAdd polynomial terms, interaction effects, or use nonparametric methods like kernel regression