2021-09-22 (Completing the 따릉이 (Ttareungi) Project 8)

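I ended the last post by saying I would build derived features that resemble the distribution of count. As promised, I made two derived features of my own to help explain the target variable. I also ran a few other experiments, which are listed in the summary below.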

*Summary

Two derived features that describe the target distribution
Checking feature skewness & removing outliers from the target variable
Stacking model

1. Two derived features that describe the target distribution

1.1 Mean usage count by hour

import pandas as pd

# Tag each row with its origin (0 = train, 1 = test) and stack the two sets
train_df = train.copy()
test_df = test.copy()
train_df['cue'] = 0
test_df['cue'] = 1
df = pd.concat([train_df,test_df],axis=0).reset_index(drop=True)

# Extract the rows that belong to train (cue is stored as an integer, so compare with a number)
train_data = df.query('cue == 0').reset_index(drop=True)

# Mean count for each hour, computed on the train rows only
hour_mean_by_hour = train_data.groupby('hour')['count'].mean()

# Assign each row the mean of its hour as the "hour_mean" variable
# (an hour missing from train keeps the placeholder value 1, as in the first version of this code)
df['hour_mean'] = df['hour'].map(hour_mean_by_hour).fillna(1)

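I built the hour_mean variable from the mean count for each hour. When I tried it, it turned out to be strongly tied to the hour variable: dropping hour from the features made the importance of hour_mean rise sharply. That said, the importance of hour_mean came out about 9% higher than that of hour. (I had a screenshot of this earlier, but lost it while running other experiments.)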

df.tail()

	id	hour	hour_bef_temperature	hour_bef_precipitation	hour_bef_windspeed	hour_bef_humidity	hour_bef_visibility	hour_bef_ozone	hour_bef_pm10	hour_bef_pm2.5	count	cue	hour_mean
2169	2148	1	24.6	0.0	2.4	60.0	1745.0	0.023833	46.0	30.25	NaN	1	47.606557
2170	2149	1	18.1	0.0	1.0	55.0	2000.0	0.027000	30.0	20.25	NaN	1	47.606557
2171	2165	9	23.3	0.0	2.3	66.0	1789.0	0.020000	17.0	15.00	NaN	1	93.540984
2172	2166	16	27.0	0.0	1.6	46.0	1956.0	0.032000	40.0	26.00	NaN	1	169.100000
2173	2177	8	22.3	0.0	1.0	63.0	1277.0	0.007000	30.0	24.00	NaN	1	136.688525

1.2 Mean usage count by hour_bef_precipitation

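# Placeholder value; rows with a missing hour_bef_precipitation keep it
df['precipitation_mean'] = 1.0
# hour_bef_precipitation is numeric (0.0 = no rain, 1.0 = rain), so compare with numbers
index0 = df.query('hour_bef_precipitation == 0.0').index
index1 = df.query('hour_bef_precipitation == 1.0').index

# precipitation_mean is the last column, so -1 selects it; means come from train rows only
df.iloc[index0,-1] = train_data.query('hour_bef_precipitation == 0.0')['count'].mean()
df.iloc[index1,-1] = train_data.query('hour_bef_precipitation == 1.0')['count'].mean()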

I created a variable holding the mean usage count with and without rain. The intent was to separate rainy hours from dry ones clearly, but its feature importance came out as 0.
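One likely reason for the zero importance (my reading, not something measured in this post) is that precipitation_mean takes exactly one value per value of hour_bef_precipitation, so it carries the same information as the original 0/1 flag. A minimal sketch on made-up data; the toy frame and its numbers are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for the train rows; the values are invented for illustration.
toy = pd.DataFrame({
    'hour_bef_precipitation': [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    'count': [120, 80, 20, 100, 30, 140],
})

# Same idea as precipitation_mean: the mean count of each rain group.
group_mean = toy.groupby('hour_bef_precipitation')['count'].mean()
toy['precipitation_mean'] = toy['hour_bef_precipitation'].map(group_mean)

# Each rain flag maps to exactly one mean, so the new column is only a
# relabeling of the flag and offers no split a tree could not already make.
values_per_flag = toy.groupby('hour_bef_precipitation')['precipitation_mean'].nunique()
print(values_per_flag.to_dict())
```

If the model already sees hour_bef_precipitation, a tree can reach the same partition of the rows from either column, which would be consistent with the derived column adding nothing.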

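2. Checking feature skewness & removing outliers from the target variable

2.1 Checking feature skewness (a skewness of 1 or more is usually treated as skewed and handled with a log transform)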

# Check the skewness of the features
from scipy.stats import skew
feature_df = df.drop(['id','hour','count','cue','hour_bef_precipitation'], axis=1)
feature_index = feature_df.dtypes[feature_df.dtypes != 'object'].index
skew_features = feature_df[feature_index].apply(lambda x: skew(x))

# Keep only the columns whose skewness is greater than 1
skew_features_top = skew_features[skew_features>1]
print(skew_features_top.sort_values(ascending=False))

hour_bef_pm10     2.645937
hour_bef_pm2.5    1.387923
dtype: float64

The two variables above show skewness. However, they are not used for training, so I skipped the transform.
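If these columns were used for training, the log transform mentioned in the heading could look like the sketch below. The data is a synthetic right-skewed sample standing in for a column such as hour_bef_pm10 (an assumption; the real column is not reproduced here), and np.log1p is just one common choice of transform.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic right-skewed values standing in for a column such as hour_bef_pm10.
rng = np.random.default_rng(2021)
pm10 = pd.Series(rng.lognormal(mean=3.5, sigma=0.8, size=1000))

skew_before = skew(pm10)
# log1p computes log(1 + x), which stays defined when a reading is 0.
pm10_log = np.log1p(pm10)
skew_after = skew(pm10_log)

print(f'skew before: {skew_before:.3f}, after log1p: {skew_after:.3f}')
```

The same call would be applied to both train and test rows, since the transform does not depend on the target.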

2.2 Removing outliers from the target variable

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of count for each hour of the day
plt.figure(figsize=(15,10))
sns.boxplot(x='hour',y='count', data=train)
plt.show()

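The most distinctive trait of the target distribution is that demand spikes during commuting hours. I expect that the better a model captures this trait, the more its performance will improve.

The distribution of count in the evening commute hours contains outliers. I suspect they were recorded on holidays or weekends, so I remove them.
May 2017 had three public holidays: Buddha's Birthday, Children's Day, and the 19th presidential election. That is close to the number of count outliers per hour.
A holiday feature cannot be built, because test has no count values.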

# 18:00
train[train['hour']==18]['count']<100  # 19,1035,1113
# 19:00
train[train['hour']==19]['count']<50  # 110, 306, 713

The outlier indices found by running the code above are noted in the comment next to each line. The thresholds 100 and 50 were picked by eye from the box plot. I will remove those indices.

del_index = [19,1035,1113,110,306,713]
df.drop(del_index, axis=0, inplace =True)
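The manual lookup above can also be folded into one step, so the indices do not have to be copied by hand. This is a sketch under assumptions: the toy frame, its values, and the helper name low_count_index are hypothetical; only the thresholds 100 and 50 come from this post.

```python
import pandas as pd

# Hypothetical miniature of train; only hour and count matter here.
toy_train = pd.DataFrame({
    'hour':  [18, 18, 18, 19, 19, 19, 8],
    'count': [310, 95, 280, 40, 260, 300, 150],
})

def low_count_index(frame, hour, threshold):
    """Return the index labels of rows at `hour` whose count is below `threshold`."""
    mask = (frame['hour'] == hour) & (frame['count'] < threshold)
    return frame[mask].index.tolist()

# Thresholds taken from the post: 100 for 18:00 and 50 for 19:00.
del_index = low_count_index(toy_train, 18, 100) + low_count_index(toy_train, 19, 50)
cleaned = toy_train.drop(del_index, axis=0)
print(del_index, len(cleaned))
```

Because the combined df keeps the train rows first with a fresh index, labels found on train this way should point at the same rows in df.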

3. CV-based stacking

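import numpy as np
import lightgbm as lgb
import sklearn.linear_model as lm  # assumed alias behind the lm.Ridge / lm.Lasso calls
from scipy.stats import uniform
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# On hold -> hyperparameters to be decided through grid search
# Create the individual models
knn=KNeighborsRegressor(n_jobs = -1)
rf = RandomForestRegressor(n_jobs = -1, random_state=2021)
dt = DecisionTreeRegressor(random_state=2021)
xgb = XGBRegressor(verbosity = 0, random_state=2021)
ada = AdaBoostRegressor(random_state=2021)
ridge=lm.Ridge()
lasso=lm.Lasso()
final = lm.Ridge()
lgb_reg=lgb.LGBMRegressor()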

def print_best_params(model, params):
  best_model, best_score = None, float('inf')
  grid_model = GridSearchCV(model,param_grid=params,
                            scoring='neg_mean_squared_error',cv=5)
  grid_model.fit(X_train,Y_train)
  # predictions = grid_model.predict(X_test)
  # score = evaluate(Y_test, predictions)['mse'][0]
  print("model name is {0},Grid best score:{1}, Grid best_params_:{2} ".format(model.__class__.__name__,grid_model.best_score_,grid_model.best_params_))

# Parameter grids
knn_params ={"n_neighbors": range(2,7)}
rf_params={"max_depth": range(2, 5),"min_samples_split": range(2, 5),"min_samples_leaf": range(2, 5), "n_estimators": [100,200,300]}
dt_params={"max_depth": range(2, 5),"min_samples_split": range(2, 5),"min_samples_leaf": range(2, 5)}
xgb_params={"gamma": uniform(0, 0.5).rvs(3),"max_depth": range(2, 7), "n_estimators": [100,200,300]}
ada_params={"n_estimators": [40,50,60]}
# Note: gamma is an XGBoost parameter name; LightGBM's counterpart is min_split_gain, so gamma may be ignored here
lgb_params={"gamma": uniform(0, 0.5).rvs(3),"max_depth": range(2, 7), "n_estimators": [100,200,300,400]}
# Note: normalize was removed from Ridge and Lasso in scikit-learn 1.2; drop it from these grids on newer versions
Ridge_params={'alpha': [0.01, 0.1, 1.0, 10, 100],'fit_intercept': [True, False],'normalize': [True, False]}
lasso_params={'alpha': [0.1, 1.0, 10],'fit_intercept': [True, False],'normalize': [True, False]}

print_best_params(knn,knn_params)
print_best_params(rf,rf_params)
print_best_params(dt,dt_params)
print_best_params(ada,ada_params)
print_best_params(xgb,xgb_params)
print_best_params(lgb_reg,lgb_params)
print_best_params(ridge,Ridge_params)
print_best_params(lasso,lasso_params)

model name is KNeighborsRegressor,Grid best score:-3458.5501989157733, Grid best_params_:{'n_neighbors': 4} 
model name is RandomForestRegressor,Grid best score:-1715.3496565899343, Grid best_params_:{'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100} 
model name is DecisionTreeRegressor,Grid best score:-2033.204198431899, Grid best_params_:{'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2} 
model name is AdaBoostRegressor,Grid best score:-2149.0152774244843, Grid best_params_:{'n_estimators': 40} 
model name is XGBRegressor,Grid best score:-1477.5776373356193, Grid best_params_:{'gamma': 0.12889993895241564, 'max_depth': 4, 'n_estimators': 100} 
model name is LGBMRegressor,Grid best score:-1524.0959447354717, Grid best_params_:{'gamma': 0.273931886786571, 'max_depth': 6, 'n_estimators': 100} 
model name is Ridge,Grid best score:-1751.7217163613059, Grid best_params_:{'alpha': 0.1, 'fit_intercept': True, 'normalize': False} 
model name is Lasso,Grid best score:-1754.1340908572297, Grid best_params_:{'alpha': 0.1, 'fit_intercept': True, 'normalize': False}
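The scores are negative because scoring='neg_mean_squared_error' reports the negated MSE, so that a larger score is better. Flipping the sign and taking the square root gives a rough RMSE on the scale of count. A small sketch using three of the scores printed above:

```python
import math

# Best neg-MSE scores copied from the grid-search output above.
best_scores = {
    'XGBRegressor': -1477.5776373356193,
    'LGBMRegressor': -1524.0959447354717,
    'RandomForestRegressor': -1715.3496565899343,
}

# Flip the sign to get MSE, then take the square root to get RMSE.
rmse = {name: math.sqrt(-score) for name, score in best_scores.items()}
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f'{name}: RMSE = {value:.2f}')
```

Note that this is the square root of the fold-averaged MSE, which is not exactly the same as averaging the RMSE of each fold.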

knn=KNeighborsRegressor(n_jobs = -1)
rf = RandomForestRegressor(max_depth= 4, min_samples_leaf= 2, min_samples_split= 2,n_jobs = -1, random_state=2021)
dt = DecisionTreeRegressor(max_depth= 4, min_samples_leaf= 4, min_samples_split= 2,random_state=2021)
xgb = XGBRegressor(gamma= 0.25883224322873616, max_depth= 4, n_estimators= 100, verbosity = 0, random_state=2021)
ada = AdaBoostRegressor(n_estimators= 40,random_state=2021)
ridge=lm.Ridge(alpha= 0.01, fit_intercept= True, normalize= True)
lasso=lm.Lasso(alpha= 1.0, fit_intercept= True, normalize= False)

# Final meta model
lgb_reg=lgb.LGBMRegressor(gamma= 0.08826298344672961, max_depth= 4)

def get_stacking_base_datasets(model, x_train, y_train, test, n_folds):

  # Create a KFold with the given n_folds
  # (random_state is left out: recent scikit-learn raises an error when it is set with shuffle=False)
  kf = KFold(n_splits=n_folds, shuffle=False)
  # Initialize the NumPy arrays that will hold the data returned for the meta model
  train_fold_pred = np.zeros((x_train.shape[0],1)) #(1459,1)
  test_fold_pred = np.zeros((test.shape[0],n_folds)) #(715,5)
  print(model.__class__.__name__,"model start")

  for folder_counter,(train_index,valid_index) in enumerate(kf.split(x_train)):
    # print('train_index:',train_index,'valid_index:',valid_index)
    # print('number of valid rows:',len(valid_index))
    # print('\t fold set:', folder_counter,"start")

    # From the input training data, extract the fold data the base model will train/predict on
    x_tr = x_train[train_index]
    y_tr = y_train[train_index]
    x_te = x_train[valid_index]

    # Train the base model on the training part of this fold
    model.fit(x_tr,y_tr)

    # Predict on the validation part of this fold and store the out-of-fold predictions
    train_fold_pred[valid_index,:]=model.predict(x_te).reshape(-1,1)

    # Predict the original test data with the model trained in this fold and store the result
    test_fold_pred[:,folder_counter]=model.predict(test)

  # Average the per-fold predictions on the original test data to build the test input
  test_pred_mean = np.mean(test_fold_pred, axis =1).reshape(-1,1)

  # train_fold_pred is the training data for the final meta model, test_pred_mean is its test data
  return train_fold_pred, test_pred_mean

x_train_n=x_train.values
y_train_n=y_train.values
test_n=test.values
n_fold=5

knn_train,knn_test = get_stacking_base_datasets(knn,x_train_n,y_train_n,test_n,n_fold)
rf_train,rf_test = get_stacking_base_datasets(rf,x_train_n,y_train_n,test_n,n_fold)
xgb_train,xgb_test = get_stacking_base_datasets(xgb,x_train_n,y_train_n,test_n,n_fold)
# dt_train,dt_test = get_stacking_base_datasets(dt,x_train_n,y_train_n,test_n,n_fold)
ada_train,ada_test = get_stacking_base_datasets(ada,x_train_n,y_train_n,test_n,n_fold)

KNeighborsRegressor model start
RandomForestRegressor model start
XGBRegressor model start
AdaBoostRegressor model start

stack_final_x_train=np.concatenate((knn_train,rf_train,xgb_train,ada_train),axis=1)
stack_final_x_test=np.concatenate((knn_test,rf_test,xgb_test,ada_test), axis=1)

lgb_reg.fit(stack_final_x_train,y_train) # fit against the original training labels
stack_final=lgb_reg.predict(stack_final_x_test)
# evaluate(Y_test, stack_final)

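The stacking code follows the book "머신러닝 완벽가이드". After finding the best parameters for each individual model, the out-of-fold predictions on train and the averaged predictions on test from each base model are joined with np.concatenate, and the final model is trained and makes its predictions on those stacked columns. I strongly recommend studying the book for the details. I plan to submit the stacking result tomorrow or later this week (I have used up my submission quota).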
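For comparison, scikit-learn ships a built-in StackingRegressor that follows the same idea. This is a swapped-in alternative, not the code used in this post: it trains the final estimator on cross-validated predictions of the base models, but then refits each base model on the full training set for test-time prediction instead of averaging per-fold test predictions. The synthetic data, the choice of base models, and the Ridge meta model below are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the bike-sharing features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2021)

# Base models produce out-of-fold predictions (cv=5); Ridge is the meta model.
stack = StackingRegressor(
    estimators=[
        ('knn', KNeighborsRegressor(n_neighbors=4)),
        ('rf', RandomForestRegressor(n_estimators=50, random_state=2021)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_pred = stack.predict(X_te)
print(stack_pred.shape)
```

The hand-written function keeps full control over how the test predictions are combined, while the built-in class is shorter and plugs directly into GridSearchCV.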
