Statsmodels Cheat Sheet

Installation

Platform	Command
Ubuntu/Debian	`pip install statsmodels` or `sudo apt-get install python3-statsmodels`
macOS	`pip install statsmodels` or `conda install -c conda-forge statsmodels`
Windows	`pip install statsmodels` or `conda install -c conda-forge statsmodels`
With common companion packages	`pip install statsmodels pandas matplotlib`
With plotting support	`pip install statsmodels matplotlib`
From source (latest)	`git clone https://github.com/statsmodels/statsmodels.git && cd statsmodels && pip install .`
Docker	`docker run -it python:3.9-slim bash -c "pip install statsmodels pandas numpy"`

Basic Commands

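Command	Description
`import statsmodels.api as sm`	Import the main statsmodels API
`import statsmodels.formula.api as smf`	Import the formula API (R-style syntax)
`sm.add_constant(X)`	Add an intercept column to the feature matrix
`sm.OLS(y, X).fit()`	Fit an ordinary least squares (OLS) regression model
`smf.ols('y ~ x1 + x2', data=df).fit()`	Fit OLS using formula notation
`results.summary()`	Display a comprehensive model summary
`results.params`	Get model coefficients/parameters
`results.pvalues`	Get p-values for the coefficients
`results.rsquared`	Get the R-squared value
`results.aic`	Get the Akaike information criterion (AIC)
`results.bic`	Get the Bayesian information criterion (BIC)
`results.resid`	Get model residuals
`results.fittedvalues`	Get fitted (in-sample predicted) values
`results.predict(X_new)`	Predict on new data
`results.conf_int()`	Get confidence intervals for the parameters
`smf.logit('y ~ x1 + x2', data=df).fit()`	Fit a logistic regression model
`sm.datasets.get_rdataset('mtcars')`	Load an example dataset (downloaded from the Rdatasets repository; the DataFrame is in `.data`)
`results.get_prediction(X_new).summary_frame()`	Get predictions with confidence intervals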

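Command	Description
`sm.WLS(y, X, weights=w).fit()`	Weighted least squares regression
`sm.GLS(y, X, sigma=sigma).fit()`	Generalized least squares regression
`sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()`	Robust linear model (resistant to outliers)
`smf.quantreg('y ~ x', data=df).fit(q=0.5)`	Quantile regression (median regression at `q=0.5`)
`smf.mixedlm('y ~ x', data=df, groups=groups).fit()`	Linear mixed-effects model (random effects)
`PanelOLS(y, X, entity_effects=True).fit()`	Panel data fixed-effects model (provided by the separate `linearmodels` package, not statsmodels: `from linearmodels.panel import PanelOLS`)
`smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()`	Poisson regression (GLM)
`smf.glm('y ~ x', data=df, family=sm.families.Gamma()).fit()`	Gamma regression (GLM)
`smf.glm('y ~ x', data=df, family=sm.families.NegativeBinomial()).fit()`	Negative binomial regression (GLM)
`sm.MNLogit(y, X).fit()`	Multinomial logistic regression
`sm.Probit(y, X).fit()`	Probit regression model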

Command	Description
`ARIMA(data, order=(p,d,q)).fit()`	Fit an ARIMA model with the specified order
`SARIMAX(data, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit()`	Seasonal ARIMA with optional exogenous variables
`adfuller(timeseries)`	Augmented Dickey-Fuller stationarity test
`acf(timeseries, nlags=40)`	Compute the autocorrelation function
`pacf(timeseries, nlags=40)`	Compute the partial autocorrelation function
`plot_acf(timeseries, lags=40)`	Plot the autocorrelation function
`plot_pacf(timeseries, lags=40)`	Plot the partial autocorrelation function
`VAR(data).fit(maxlags=5)`	Fit a vector autoregression (VAR) model
`results.irf(10).plot()`	Plot impulse response functions
`results.fevd(10)`	Forecast error variance decomposition
`UnobservedComponents(data, level='local linear trend').fit()`	Structural time series model
`ExponentialSmoothing(data, seasonal='add', seasonal_periods=12).fit()`	Exponential smoothing (Holt-Winters)
`results.forecast(steps=10)`	Generate forecasts for future periods
`results.plot_diagnostics()`	Plot diagnostics for a time series model

Command	Description
`jarque_bera(residuals)`	Jarque-Bera normality test
`durbin_watson(residuals)`	Durbin-Watson autocorrelation statistic
`het_breuschpagan(residuals, X)`	Breusch-Pagan heteroscedasticity test
`acorr_ljungbox(residuals, lags=10)`	Ljung-Box autocorrelation test
`omni_normtest(residuals)`	Omnibus normality test
`pairwise_tukeyhsd(data, groups)`	Tukey's HSD multiple comparison test
`anova_lm(model1, model2)`	ANOVA for comparing nested fitted models
`results.test_causality('var1', ['var2'])`	Granger causality test (VAR models)
`proportions_ztest(counts, nobs)`	Z-test for proportions
`ttest_ind(sample1, sample2)`	Independent two-sample t-test

Command	Description
`SurvfuncRight(durations, status).plot()`	Kaplan-Meier survival curve
`PHReg(durations, X, status=status).fit()`	Cox proportional hazards model
`KDEUnivariate(data).fit()`	Kernel density estimation
`KernelReg(y, X, var_type='c').fit()`	Kernel regression (nonparametric)
`lowess(y, x, frac=0.1)`	LOWESS smoothing
`PCA(data, ncomp=3, standardize=True)`	Principal component analysis
`Factor(data, n_factor=2).fit()`	Factor analysis
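
Two of the nonparametric tools above, sketched with their import paths. The noisy sine data and the smoothing fraction `frac=0.3` are assumptions chosen only to make the example readable.

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.smoothers_lowess import lowess

# Noisy sine curve as toy data
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# LOWESS: note the argument order (y first, then x).
# The result is an (n, 2) array sorted by x: column 0 is x, column 1 the fit.
smoothed = lowess(y, x, frac=0.3)
print(smoothed[:3])

# Kernel density estimate of y (Gaussian kernel by default)
kde = KDEUnivariate(y)
kde.fit()
print(kde.support[:3], kde.density[:3])
```

A larger `frac` gives a smoother curve at the cost of more bias; the value `0.1` in the table follows the data more closely.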

Model Formula Syntax (R-style)

# Basic formula syntax
'y ~ x1 + x2'                    # Multiple predictors
'y ~ x1 + x2 + x1:x2'           # With interaction term
'y ~ x1 * x2'                    # Shorthand for x1 + x2 + x1:x2
'y ~ C(category)'                # Categorical variable
'y ~ np.log(x1) + np.sqrt(x2)'  # Transformations
'y ~ x1 + I(x1**2)'             # Polynomial terms
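
To see how these formula pieces expand into model terms, the following sketch fits one formula that mixes a transformation, a categorical variable, and a polynomial term. The DataFrame and its column names (`x1`, `x2`, `group`) are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({
    "x1": rng.uniform(1, 10, n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
df["y"] = (
    1.0
    + 0.5 * np.log(df["x1"])
    + 2.0 * df["x2"]
    + 1.5 * (df["group"] == "c")
    + rng.normal(scale=0.3, size=n)
)

# C(group) expands into dummy columns with one level as the baseline;
# I(...) protects arithmetic from being read as formula operators
res = smf.ols("y ~ np.log(x1) + x2 + C(group) + I(x2**2)", data=df).fit()

# One parameter per expanded term: intercept, two group dummies,
# log(x1), x2, and the squared term
print(res.params)
```

Functions such as `np.log` are looked up in the calling scope, so `numpy` must be imported where the formula is evaluated.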

Model Fitting Options

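# Common fitting parameters for likelihood-based models (e.g. Logit, Poisson);
# OLS.fit() has no optimizer and accepts only method='pinv' or 'qr'
results = model.fit(
    method='lbfgs',              # Optimization method
    maxiter=1000,                # Maximum iterations
    disp=True,                   # Display convergence messages
    cov_type='HC3'               # Robust covariance type
)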

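# Time series specific (statsmodels.tsa.arima.model.ARIMA)
# The trend is set on the model, not in fit():
#   model = ARIMA(data, order=(1, 1, 1), trend='t')
# Options such as method='css-mle' and solver= belonged to the removed
# statsmodels.tsa.arima_model.ARIMA and are not accepted here
results = model.fit(
    start_params=None,               # Initial parameter values
    method='statespace',             # Estimation method (default)
    method_kwargs={'maxiter': 500},  # Passed on to the optimizer
    cov_type='opg'                   # Parameter covariance estimator
)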

Display Options

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 4)

# Statsmodels summary options
results.summary(
    alpha=0.05,                  # Significance level
    title='Model Results',       # Custom title
    xname=['Var1', 'Var2']       # Custom variable names (one per model parameter)
)

Common Use Cases

Use Case 1: Linear Regression Analysis

import statsmodels.api as sm
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Prepare variables
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X = sm.add_constant(X)  # Add intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

# Check assumptions
print(f"Jarque-Bera test: {sm.stats.jarque_bera(results.resid)}")
print(f"Durbin-Watson: {sm.stats.durbin_watson(results.resid)}")

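# Make predictions; X_new needs the same columns as X, including the constant
X_new = X.head()  # Placeholder: replace with genuinely new observations
predictions = results.predict(X_new)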

Use Case 2: Time Series Forecasting with ARIMA

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Load time series data
ts_data = pd.read_csv('timeseries.csv', index_col='date', parse_dates=True)

# Check stationarity
adf_result = adfuller(ts_data['value'])
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')

# Plot ACF and PACF to determine order
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data['value'], lags=40, ax=axes[0])
plot_pacf(ts_data['value'], lags=40, ax=axes[1])
plt.show()

# Fit ARIMA model
model = ARIMA(ts_data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=12)
print(forecast)

# Plot diagnostics
results.plot_diagnostics(figsize=(15, 10))
plt.show()

Use Case 3: Logistic Regression for Classification

import numpy as np  # Needed for np.exp below
import pandas as pd
import statsmodels.formula.api as smf

# Load data
df = pd.read_csv('classification_data.csv')

# Fit logistic regression
model = smf.logit('outcome ~ age + income + education', data=df)
results = model.fit()

# Display results
print(results.summary())

# Get odds ratios (exponentiated coefficients) with 95% confidence intervals
conf = results.conf_int()
odds_ratios = pd.DataFrame({
    'OR': np.exp(results.params),
    'CI_lower': np.exp(conf[0]),
    'CI_upper': np.exp(conf[1])
})
print(odds_ratios)

# Predict probabilities
df['predicted_prob'] = results.predict(df)

# Classification accuracy
df['predicted_class'] = (df['predicted_prob'] > 0.5).astype(int)
accuracy = (df['outcome'] == df['predicted_class']).mean()
print(f"Accuracy: {accuracy:.2%}")

Use Case 4: Panel Data Analysis

import pandas as pd
# PanelOLS is provided by the separate linearmodels package
# (pip install linearmodels), not by statsmodels
from linearmodels.panel import PanelOLS

# Load panel data (MultiIndex: entity, time)
df = pd.read_csv('panel_data.csv')
df = df.set_index(['entity_id', 'time'])

# Prepare variables
y = df['dependent_var']
X = df[['var1', 'var2', 'var3']]

# Fixed effects model
fe_model = PanelOLS(y, X, entity_effects=True, time_effects=True)
fe_results = fe_model.fit()
print(fe_results.summary)

# Extract fixed effects
entity_effects = fe_results.estimated_effects
print(entity_effects.head())

Use Case 5: Vector Autoregression (VAR) for Multivariate Time Series

import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller

# Load multivariate time series
df = pd.read_csv('multivariate_ts.csv', index_col='date', parse_dates=True)

# Check stationarity for all variables
for col in df.columns:
    result = adfuller(df[col])
    print(f'{col}: ADF = {result[0]:.4f}, p-value = {result[1]:.4f}')

# Fit VAR model
model = VAR(df)
results = model.fit(maxlags=5, ic='aic')
print(results.summary())

# Granger causality test
granger_results = results.test_causality('var1', ['var2', 'var3'], kind='f')
print(granger_results)

# Impulse response analysis
irf = results.irf(10)
irf.plot(orth=True)

# Forecast
forecast = results.forecast(df.values[-results.k_ar:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)

Best Practices

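Always add a constant: Use sm.add_constant(X) to include an intercept when using the array-based API (the formula API adds one automatically)
Check model assumptions: Use diagnostic tests to verify normality, homoscedasticity, and absence of autocorrelation in the residuals
Use the formula API for readability: Prefer smf.ols('y ~ x1 + x2', data=df) over the array-based API for clearer, more maintainable code
Test time series for stationarity: Run the adfuller() test before fitting an ARIMA model; difference the series if it is non-stationary
Inspect ACF/PACF plots: Use plot_acf() and plot_pacf() to choose appropriate orders for the ARIMA model
Compare information criteria: Use AIC/BIC for model selection; lower values indicate a better balance between fit and parsimony
Validate out of sample: Split the data into training and test sets and evaluate forecast accuracy on the held-out data
Handle multicollinearity: Check the VIF (variance inflation factor) for highly correlated predictors in regression models
Use robust standard errors: Pass cov_type='HC3' to .fit() for heteroscedasticity-robust inference
Document model specifications: Keep clear records of model orders, transformations, and selection criteria for reproducibility
Visualize diagnostics: Run results.plot_diagnostics() to check for residual patterns in time series models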

Troubleshooting

Problem	Solution
LinAlgError: Singular matrix	Check for perfect multicollinearity; remove redundant variables or use `sm.add_constant()` only once
Convergence not achieved	Increase `maxiter` parameter, try different optimization method (`method='bfgs'`), or scale/standardize features
Perfect separation in logistic regression	Use penalized regression (`model.fit_regularized(method='l1')` with a positive `alpha`), remove problematic predictors, or collect more diverse data
ARIMA model won’t fit	Verify data is stationary with `adfuller()`, try different order parameters, or check for missing values
ValueError: endog and exog matrices are different sizes	Ensure X and y have the same number of observations; check for missing values and align indices
Non-stationary time series warnings	Difference the series (`df.diff().dropna()`), set `enforce_stationarity=False` in the model constructor, or transform data (log, Box-Cox)
Memory error with large datasets	Use chunking, reduce lag order in VAR/ARIMA, or consider `statsmodels.tsa.statespace` for state space models
Negative R-squared values	The model performs worse than a mean-only baseline; check the model specification, add relevant features, or try a different model type
Heteroscedasticity detected	Use WLS with appropriate weights, apply robust standard errors (`cov_type='HC3'`), or transform dependent variable
High VIF values (>10)	Remove or combine correlated predictors, use PCA for dimensionality reduction, or apply ridge regression
Residuals show patterns	Add polynomial terms or interaction effects, or use a nonparametric method such as kernel regression