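Study priorities

Understand the three ways of computing feature importance from the lecture note, and the overall XGBoost training process.
Submit the assignment
Combine my own data
Use that for the assignment
- Understand the concepts behind xgboost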

Feature Importances: summary of questions

Understand the methods for computing feature importance (permutation importances, feature importances, ...) and use them to interpret models and to select features.

Understand gradient boosting and build a model with xgboost.

Warm up

Watch the videos on today's topics.

Bagging review
- Bootstrap aggregating (bagging)

Boosting
- AdaBoost
  - What are the three key differences between AdaBoost and RandomForest?
- Gradient Boosting

Lecture Note

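Computing permutation importances for model interpretation and feature selection

Three ways to compute feature importance

1. Feature Importances (Mean Decrease Impurity, MDI)

: This is the default feature importance in sklearn's tree-based classifiers. It is fast to compute, but the results need to be read with care because it tends to favor high-cardinality features. For each feature, it is the impurity decrease from the splits on that feature, averaged over all trees (mean decrease impurity).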

Sklearn, DecisionTreeClassifier, min_impurity_decrease

sklearn, property feature_importances_

Visualizing random forest feature importances
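As a rough illustration of where these MDI values come from, here is a minimal sketch on synthetic data (the dataset, the `f0`..`f7` feature names, and the `mdi` variable are assumptions for this example, not part of the lecture data). It only reads the `feature_importances_` property of a fitted random forest and sorts it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# feature_importances_ holds the impurity-based (MDI) scores, normalized to sum to 1.
mdi = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(mdi)
```

If matplotlib is installed, something like `mdi.sort_values().plot.barh()` should draw the usual horizontal bar chart.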

2. Drop-Column Importance

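: Remove one independent variable, retrain the model, and see how much the score changes (this seems similar to forward/backward selection in regression analysis).

Code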

# OrdinalEncoder is assumed to come from the category_encoders package
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

column = 'opinion_seas_risk'

# Fit without opinion_seas_risk
pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train.drop(columns=column), y_train)
score_without = pipe.score(X_val.drop(columns=column), y_val)
print(f'Validation accuracy (without {column}): {score_without}')

# Fit again with opinion_seas_risk included
pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train, y_train)
score_with = pipe.score(X_val, y_val)
print(f'Validation accuracy (with {column}): {score_with}')

# Drop-Column importance: accuracy with the column minus accuracy without it
print(f'Drop-Column importance of {column}: {score_with - score_without}')

Validation accuracy (without opinion_seas_risk): 0.733127742853754
Validation accuracy (with opinion_seas_risk): 0.7526983750444787
Drop-Column importance of opinion_seas_risk: 0.019570632190724635
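The same idea can be looped over every column. The sketch below is one way to do that on synthetic data; `drop_column_importances` is a hypothetical helper written for this note, and the dataset is an assumption. It makes the cost visible: one full refit per feature.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption), split into train and validation sets.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)


def drop_column_importances(model, X_tr, y_tr, X_va, y_va):
    """Baseline validation score minus the score after dropping each column."""
    baseline = clone(model).fit(X_tr, y_tr).score(X_va, y_va)
    importances = []
    for col in range(X_tr.shape[1]):
        # Retrain from scratch without this column.
        reduced = clone(model).fit(np.delete(X_tr, col, axis=1), y_tr)
        score = reduced.score(np.delete(X_va, col, axis=1), y_va)
        importances.append(baseline - score)
    return np.array(importances)


model = RandomForestClassifier(n_estimators=50, random_state=0)
importances = drop_column_importances(model, X_tr, y_tr, X_va, y_va)
print(importances)
```

A value near zero or below zero suggests the model does about as well without that column.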

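3. Permutation Importance (Mean Decrease Accuracy, MDA)

: Sits between the default feature importance and Drop-Column importance

It measures importance by adding random noise to only the feature of interest and checking how much a performance metric (accuracy, F1, R2, etc.) drops when predicting.

Unlike Drop-Column importance, it requires no retraining, so it takes less time.

The simplest way to add that noise is to shuffle (permute) the feature's values across the samples. The ELI5 library documentation explains permutation importance.

In other words, only the values of the feature whose importance we want to know are randomly shuffled, so that the feature can no longer contribute anything useful.

Permutation importance code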

import numpy as np

# Compute the permutation importance of doctor_recc_seasonal
feature = 'doctor_recc_seasonal'
X_val_permuted = X_val.copy()
X_val_permuted[feature] = np.random.permutation(X_val_permuted[feature])
score_permuted = pipe.score(X_val_permuted, y_val)

print(f'Validation accuracy ({feature}): {score_with}')
print(f'Validation accuracy (permuted "{feature}"): {score_permuted}')
print(f'Permutation importance: {score_with - score_permuted}')

Validation accuracy (doctor_recc_seasonal): 0.7526983750444787
Validation accuracy (permuted "doctor_recc_seasonal"): 0.6863954453801447
Permutation importance: 0.06630292966433393

Computing permutation importance with the eli5 library

How to compute it with a library

Code

# Install the library first (run in a shell, or prefix with ! in a notebook)
# pip install eli5

import pandas as pd
from sklearn.pipeline import Pipeline
# The encoder and imputer are grouped as 'preprocessing'; this is used later for the eli5 permutation calculation
pipe = Pipeline([
    ('preprocessing', make_pipeline(OrdinalEncoder(), SimpleImputer())),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)) 
])

pipe.fit(X_train, y_train)
print('Validation accuracy: ', pipe.score(X_val, y_val))

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import eli5
from eli5.sklearn import PermutationImportance

# Define the permuter
permuter = PermutationImportance(
    pipe.named_steps['rf'], # model
    scoring='accuracy', # metric
    n_iter=5, # repeat 5 times with different random seeds
    random_state=2
)

# The permuter works on the preprocessed X_val
X_val_transformed = pipe.named_steps['preprocessing'].transform(X_val)

# This does not really fit anything; it recomputes the score on shuffled data
permuter.fit(X_val_transformed, y_val);

feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

# Check the score for each feature
eli5.show_weights(
    permuter, 
    top=None, # top n can be given; None shows all features
    feature_names=feature_names # must be passed as a list
) # This prints the feature importances sorted in descending order
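For comparison, scikit-learn also ships its own implementation, `sklearn.inspection.permutation_importance`. The sketch below uses it on synthetic data instead of eli5; this is a swapped-in alternative, not what the lecture code uses, and the dataset and `f{i}` labels are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the first 3 columns are informative.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set 5 times and record the accuracy drop.
result = permutation_importance(
    rf, X_va, y_va, scoring="accuracy", n_repeats=5, random_state=0
)

for i in result.importances_mean.argsort()[::-1]:
    print(f"f{i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Because the function accepts any fitted estimator, a whole fitted pipeline can be passed together with the raw validation data, which avoids transforming `X_val` by hand.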

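Selecting features using the importances (feature selection)

Excluding features whose importance is negative or close to zero has almost no effect on performance, and it speeds up model training.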

minimum_importance = 0.001
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train_selected = X_train[features]
X_val_selected = X_val[features]

print('After dropping features:', X_train_selected.shape, X_val_selected.shape)

After dropping features: (33723, 8) (8431, 8)

# Define the pipeline again
pipe = Pipeline([
    ('preprocessing', make_pipeline(OrdinalEncoder(), SimpleImputer())),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)) 
], verbose=1)

pipe.fit(X_train_selected, y_train);

[Pipeline] ..... (step 1 of 2) Processing preprocessing, total= 0.1s
[Pipeline] ................ (step 2 of 2) Processing rf, total= 0.4s

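Using boosting (xgboost for gradient boosting)

Example of the AdaBoost algorithm

Step 0. Set the same weight for every observation.

Step 1. Sample the observations with replacement, train a weak learner Dn, and classify them as + or -.

Step 2. Give more weight to the misclassified observations so that they are more likely to be sampled in the next round.

Step 3. Repeat Steps 1-2 n times (n = 3).

Step 4. Combine the classifiers (D1, D2, D3) to make the final prediction.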
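The steps above can be sketched roughly as follows. This is a minimal discrete AdaBoost with decision stumps, under two assumptions: labels are encoded as -1/+1, and the weights are passed directly to the weak learner through `sample_weight` rather than used for resampling as in Step 1. The helper names `adaboost_fit` and `adaboost_predict` are made up for this note.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=3):
    """Train n_rounds weighted stumps. y must contain only -1 and +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # Step 0: equal weights
    ensemble = []
    for _ in range(n_rounds):                    # Step 3: repeat n times
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # Step 1: train a weak learner
        pred = stump.predict(X)
        # Weighted error, clipped to keep the log finite.
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # vote strength of this learner
        w = w * np.exp(-alpha * y * pred)        # Step 2: up-weight the mistakes
        w = w / w.sum()
        ensemble.append((alpha, stump))
    return ensemble


def adaboost_predict(ensemble, X):
    """Step 4: weighted vote of all weak learners."""
    score = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)


X, y01 = make_classification(n_samples=500, n_features=5, random_state=0)
y = 2 * y01 - 1                                  # map {0, 1} to {-1, +1}
ensemble = adaboost_fit(X, y, n_rounds=3)
train_acc = (adaboost_predict(ensemble, X) == y).mean()
print(train_acc)
```

For real work, `sklearn.ensemble.AdaBoostClassifier` is the ready-made version.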

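Gradient boosting

Gradient boosting is similar to AdaBoost but differs in how it optimizes the loss function.

Instead of adjusting sample weights, gradient boosting has each new learner fit the residuals of the current ensemble.

This has the effect of making the model learn more from the data points with larger residuals.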
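To make "fit the residuals" concrete, here is a minimal sketch of gradient boosting for regression with squared error, where the residual is exactly the negative gradient of the loss. The data, the learning rate, and the number of rounds are assumptions chosen for illustration; real libraries add regularization and many other details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption): noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.2
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees, mse_history = [], []

for _ in range(50):
    residual = y - prediction            # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # take a small step toward the residual
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Points with large residuals dominate the squared-error split criterion of the next tree, which is the "learn more from larger residuals" effect described above.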

Besides sklearn, several other libraries implement boosting:

Python libraries for Gradient Boosting

scikit-learn Gradient Tree Boosting: can be relatively slow.
- Anaconda: already installed
- Google Colab: already installed

xgboost: accepts missing values and can enforce monotonic constraints.
- Anaconda, Mac/Linux: conda install -c conda-forge xgboost
- Windows: conda install -c anaconda py-xgboost
- Google Colab: already installed

LightGBM: accepts missing values and can enforce monotonic constraints.
- Anaconda: conda install -c conda-forge lightgbm
- Google Colab: already installed

CatBoost: accepts missing values and can use categorical features without preprocessing.
- Anaconda: conda install -c conda-forge catboost
- Google Colab: pip install catboost

XGBoost

xgboost code summary
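```
from xgboost import XGBClassifier

pipe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    XGBClassifier(n_estimators=200
                  , random_state=2
                  , n_jobs=-1
                  , max_depth=7
                  , learning_rate=0.2
                 )
)

pipe.fit(X_train, y_train);

from sklearn.metrics import accuracy_score, classification_report
y_pred = pipe.predict(X_val)
print('Validation accuracy: ', accuracy_score(y_val, y_pred))

# Note: sklearn's signature is classification_report(y_true, y_pred).
# The call below passes the predictions first, so in its output the
# precision and recall columns are swapped relative to the usual reading.
print(classification_report(y_pred, y_val))
```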
Validation accuracy: 0.7601707982445736

              precision    recall  f1-score   support

           0       0.79      0.77      0.78      4724
           1       0.72      0.74      0.73      3707

    accuracy                           0.76      8431
   macro avg       0.76      0.76      0.76      8431
weighted avg       0.76      0.76      0.76      8431

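Preventing overfitting with early stopping
Why use early stopping instead of GridSearchCV or a loop to optimize n_estimators:
If n_iterations is the number of boosting rounds, early stopping needs to train at most n_iterations trees.
With GridSearchCV or a loop over every candidate value, however, we would have to train sum(range(1, n_iterations + 1)) trees. On top of that, the count multiplies with every extra value tried for max_depth, learning_rate, and so on.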
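A quick check of that arithmetic, assuming `n_iterations = 1000` as in the model below: trying every value of n_estimators from 1 to n_iterations trains 1 + 2 + ... + n_iterations trees, which is n_iterations * (n_iterations + 1) / 2.

```python
n_iterations = 1000  # assumed upper bound on boosting rounds

# Early stopping: a single run, at most n_iterations trees.
trees_with_early_stopping = n_iterations

# Exhaustive search over n_estimators = 1, 2, ..., n_iterations:
# each candidate retrains from scratch, so the tree counts add up.
trees_with_search = sum(range(1, n_iterations + 1))

# Same value via the closed form n(n+1)/2.
closed_form = n_iterations * (n_iterations + 1) // 2

print(trees_with_early_stopping, trees_with_search)  # 1000 500500
```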
```
encoder = OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train) # training data
X_val_encoded = encoder.transform(X_val) # validation data

model = XGBClassifier(
    n_estimators=1000,  # <= set to 1000 trees, but early stopping decides how many are actually trained
    max_depth=7,        # raised above XGBoost's default of 6 for high-cardinality features
    learning_rate=0.2,
#     scale_pos_weight=ratio, # apply the class ratio when the data is imbalanced
    n_jobs=-1,
    # xgboost >= 1.6 takes the next two in the constructor; older versions took them in fit()
    eval_metric='error', # #(wrong cases)/#(all cases)
    early_stopping_rounds=50 # stop if the score has not improved for 50 rounds
)

eval_set = [(X_train_encoded, y_train), 
            (X_val_encoded, y_val)]

# Early stopping monitors the last entry in eval_set (here, the validation set)
model.fit(X_train_encoded, y_train, 
          eval_set=eval_set
         )
```
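If xgboost is not available, the same idea can be tried with scikit-learn's own `GradientBoostingClassifier`, which supports early stopping through `n_iter_no_change` and `validation_fraction`. This is a swapped-in sketch on synthetic data, not the xgboost API: here the validation split is carved out of the training data internally instead of being passed as an `eval_set`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.2,
    max_depth=3,
    validation_fraction=0.2,  # share of the training data held out for monitoring
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
gbm.fit(X, y)

# n_estimators_ is the number of trees actually kept after early stopping.
print(gbm.n_estimators_)
```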
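Hyperparameter tuning
Random Forest
- max_depth (tune by decreasing from a high value; too deep overfits)
- n_estimators (too few underfits; too many means long training time)
- min_samples_leaf (increase it when overfitting)
- max_features (lower values produce more diverse trees; higher values make more trees use the same features, which reduces diversity)
- class_weight (try it when the classes are imbalanced)
XGBoost
- learning_rate (a high value risks overfitting)
- max_depth (tune by increasing from a low value; too deep risks overfitting; 0 means no depth limit in XGBoost, while -1 is the LightGBM convention; set it deeper when there are many features)
- n_estimators (too large means long training time; use together with early_stopping_rounds)
- scale_pos_weight (try it for imbalanced problems)
See the following for more details
- Notes on Parameter Tuning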
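
As one possible way to search over the random forest parameters listed above, here is a small sketch with `RandomizedSearchCV` on synthetic data. The value ranges, `n_iter=5`, and `cv=3` are assumptions kept small so the example runs quickly; they are not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Illustrative search space over the parameters from the list above.
param_distributions = {
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 20),
    "max_features": [0.3, 0.5, 0.8, "sqrt"],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```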

mindsee Ai

[Applied Predictive Modeling] Feature Importances
