Dacon Basic: Travel Product Application Prediction
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
0.Preparation
import pandas as pd
train = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/train.csv')
test = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/test.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/sample_submission.csv')
- id : sample ID
- Age : age
- TypeofContact : how the customer heard about the product (company promotion or self enquiry)
- CityTier : tier of the city the customer lives in, based on population, facilities, and living standards (Tier 1 > Tier 2 > Tier 3)
- DurationOfPitch : duration of the pitch the salesperson gave to the customer
- Occupation : occupation
- Gender : gender
- NumberOfPersonVisiting : total number of people planning to travel with the customer
- NumberOfFollowups : number of follow-ups made after the salesperson's pitch
- ProductPitched : product pitched by the salesperson
- PreferredPropertyStar : preferred hotel/property star rating
- MaritalStatus : marital status
- NumberOfTrips : average number of trips per year
- Passport : passport ownership (0: no, 1: yes)
- PitchSatisfactionScore : satisfaction score for the sales pitch
- OwnCar : car ownership (0: no, 1: yes)
- NumberOfChildrenVisiting : number of children under 5 joining the planned trip
- Designation : job designation (rank)
- MonthlyIncome : monthly income
- ProdTaken : whether the travel package was taken (0: not taken, 1: taken)
train
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | Company Invited | 1 | 10.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Married | 3.0 | 0 | 1 | 0 | 1.0 | Executive | 20384.0 | 0 |
| 1 | 2 | 34.0 | Self Enquiry | 3 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 4.0 | Single | 1.0 | 1 | 5 | 1 | 0.0 | Manager | 19599.0 | 1 |
| 2 | 3 | 45.0 | Company Invited | 1 | NaN | Salaried | Male | 2 | 3.0 | Deluxe | 4.0 | Married | 2.0 | 0 | 4 | 1 | 0.0 | Manager | NaN | 0 |
| 3 | 4 | 29.0 | Company Invited | 1 | 7.0 | Small Business | Male | 3 | 5.0 | Basic | 4.0 | Married | 3.0 | 0 | 4 | 0 | 1.0 | Executive | 21274.0 | 1 |
| 4 | 5 | 42.0 | Self Enquiry | 3 | 6.0 | Salaried | Male | 2 | 3.0 | Deluxe | 3.0 | Divorced | 2.0 | 0 | 3 | 1 | 0.0 | Manager | 19907.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 1951 | 28.0 | Self Enquiry | 1 | 10.0 | Small Business | Male | 3 | 5.0 | Basic | 3.0 | Single | 2.0 | 0 | 1 | 1 | 2.0 | Executive | 20723.0 | 0 |
| 1951 | 1952 | 41.0 | Self Enquiry | 3 | 8.0 | Salaried | Female | 3 | 3.0 | Super Deluxe | 5.0 | Divorced | 1.0 | 0 | 5 | 1 | 1.0 | AVP | 31595.0 | 0 |
| 1952 | 1953 | 38.0 | Company Invited | 3 | 28.0 | Small Business | Female | 3 | 4.0 | Basic | 3.0 | Divorced | 7.0 | 0 | 2 | 1 | 2.0 | Executive | 21651.0 | 0 |
| 1953 | 1954 | 28.0 | Self Enquiry | 3 | 30.0 | Small Business | Female | 3 | 5.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 1 | 1 | 2.0 | Manager | 22218.0 | 0 |
| 1954 | 1955 | 22.0 | Company Invited | 1 | 9.0 | Salaried | Male | 2 | 4.0 | Basic | 3.0 | Divorced | 1.0 | 1 | 3 | 0 | 0.0 | Executive | 17853.0 | 1 |
1955 rows × 20 columns
0.1.Data preprocessing
0.1.1.Missing value handling
train.isna().sum()
id 0
Age 94
TypeofContact 10
CityTier 0
DurationOfPitch 102
Occupation 0
Gender 0
NumberOfPersonVisiting 0
NumberOfFollowups 13
ProductPitched 0
PreferredPropertyStar 10
MaritalStatus 0
NumberOfTrips 57
Passport 0
PitchSatisfactionScore 0
OwnCar 0
NumberOfChildrenVisiting 27
Designation 0
MonthlyIncome 100
ProdTaken 0
dtype: int64
test.isna().sum()
id 0
Age 132
TypeofContact 15
CityTier 0
DurationOfPitch 149
Occupation 0
Gender 0
NumberOfPersonVisiting 0
NumberOfFollowups 32
ProductPitched 0
PreferredPropertyStar 16
MaritalStatus 0
NumberOfTrips 83
Passport 0
PitchSatisfactionScore 0
OwnCar 0
NumberOfChildrenVisiting 39
Designation 0
MonthlyIncome 133
dtype: int64
def handle_na(data):
    temp = data.copy()
    for col, dtype in temp.dtypes.items():
        if dtype == 'object':
            # Fill string (object) columns with 'Unknown'.
            value = 'Unknown'
        elif dtype == int or dtype == float:
            # Fill numeric columns with 0.
            value = 0
        temp.loc[:, col] = temp[col].fillna(value)
    return temp
train_nona = handle_na(train)
# Check that the missing values are gone.
train_nona.isna().sum()
id 0
Age 0
TypeofContact 0
CityTier 0
DurationOfPitch 0
Occupation 0
Gender 0
NumberOfPersonVisiting 0
NumberOfFollowups 0
ProductPitched 0
PreferredPropertyStar 0
MaritalStatus 0
NumberOfTrips 0
Passport 0
PitchSatisfactionScore 0
OwnCar 0
NumberOfChildrenVisiting 0
Designation 0
MonthlyIncome 0
ProdTaken 0
dtype: int64
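Filling every numeric NaN with 0 keeps the pipeline simple, but it also creates artificial zeros that later sections (Age, MonthlyIncome) have to undo. As a hedged alternative, here is a sketch of a variant that fills numeric columns with the median and string columns with the mode; handle_na_median is a new name and is not used anywhere else in this notebook.
def handle_na_median(data):
    temp = data.copy()
    for col in temp.columns:
        if temp[col].dtype == 'object':
            # Most frequent category for string columns.
            temp[col] = temp[col].fillna(temp[col].mode()[0])
        else:
            # Median for numeric columns, which is robust to the income outliers seen later.
            temp[col] = temp[col].fillna(temp[col].median())
    return temp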
0.1.2.Categorical (string) variable preprocessing
object_columns = train_nona.columns[train_nona.dtypes == 'object']
print('The object columns are : ', list(object_columns))
# Let's look at just these columns
train_nona[object_columns]
The object columns are :  ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']
| TypeofContact | Occupation | Gender | ProductPitched | MaritalStatus | Designation | |
|---|---|---|---|---|---|---|
| 0 | Company Invited | Small Business | Male | Basic | Married | Executive |
| 1 | Self Enquiry | Small Business | Female | Deluxe | Single | Manager |
| 2 | Company Invited | Salaried | Male | Deluxe | Married | Manager |
| 3 | Company Invited | Small Business | Male | Basic | Married | Executive |
| 4 | Self Enquiry | Salaried | Male | Deluxe | Divorced | Manager |
| ... | ... | ... | ... | ... | ... | ... |
| 1950 | Self Enquiry | Small Business | Male | Basic | Single | Executive |
| 1951 | Self Enquiry | Salaried | Female | Super Deluxe | Divorced | AVP |
| 1952 | Company Invited | Small Business | Female | Basic | Divorced | Executive |
| 1953 | Self Enquiry | Small Business | Female | Deluxe | Married | Manager |
| 1954 | Company Invited | Salaried | Male | Basic | Divorced | Executive |
1955 rows × 6 columns
train_nona['Gender'].value_counts()
Male 1207
Female 692
Fe Male 56
Name: Gender, dtype: int64
train_nona.loc[train_nona['Gender']=='Fe Male','Gender'] = 'Female'
test.loc[test['Gender']=='Fe Male','Gender'] = 'Female'
train_nona['Gender'].value_counts()
Male 1207
Female 748
Name: Gender, dtype: int64
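For reference, the 'Fe Male' fix above can also be written as a generic normalization of the Gender strings (drop stray spaces, unify casing); this is a hedged sketch of an alternative, not what the notebook does.
# 'Fe Male' -> 'Female'; 'Male' and 'Female' are unchanged.
train_nona['Gender'] = train_nona['Gender'].str.replace(' ', '', regex=False).str.title()
test['Gender'] = test['Gender'].str.replace(' ', '', regex=False).str.title()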
# Prepare a LabelEncoder.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# A LabelEncoder first has to be fitted.
encoder.fit(train_nona['TypeofContact'])
# Use the fitted encoder to convert the string variable into integers.
encoder.transform(train_nona['TypeofContact'])
array([0, 1, 0, ..., 0, 1, 0])
train_enc = train_nona.copy()
# Apply a LabelEncoder to every string (object) column.
for o_col in object_columns:
    encoder = LabelEncoder()
    encoder.fit(train_enc[o_col])
    train_enc[o_col] = encoder.transform(train_enc[o_col])
# Check the result.
train_enc
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | 0 | 1 | 10.0 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | 20384.0 | 0 |
| 1 | 2 | 34.0 | 1 | 3 | 0.0 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 0.0 | 2 | 19599.0 | 1 |
| 2 | 3 | 45.0 | 0 | 1 | 0.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 0.0 | 2 | 0.0 | 0 |
| 3 | 4 | 29.0 | 0 | 1 | 7.0 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | 21274.0 | 1 |
| 4 | 5 | 42.0 | 1 | 3 | 6.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 0.0 | 2 | 19907.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 1951 | 28.0 | 1 | 1 | 10.0 | 3 | 1 | 3 | 5.0 | 0 | 3.0 | 2 | 2.0 | 0 | 1 | 1 | 2.0 | 1 | 20723.0 | 0 |
| 1951 | 1952 | 41.0 | 1 | 3 | 8.0 | 2 | 0 | 3 | 3.0 | 4 | 5.0 | 0 | 1.0 | 0 | 5 | 1 | 1.0 | 0 | 31595.0 | 0 |
| 1952 | 1953 | 38.0 | 0 | 3 | 28.0 | 3 | 0 | 3 | 4.0 | 0 | 3.0 | 0 | 7.0 | 0 | 2 | 1 | 2.0 | 1 | 21651.0 | 0 |
| 1953 | 1954 | 28.0 | 1 | 3 | 30.0 | 3 | 0 | 3 | 5.0 | 1 | 3.0 | 1 | 3.0 | 0 | 1 | 1 | 2.0 | 2 | 22218.0 | 0 |
| 1954 | 1955 | 22.0 | 0 | 1 | 9.0 | 2 | 1 | 2 | 4.0 | 0 | 3.0 | 0 | 1.0 | 1 | 3 | 0 | 0.0 | 1 | 17853.0 | 1 |
1955 rows × 20 columns
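The encoded table above is harder to read because the original category names are gone. If the fitted encoders are kept in a dict, the integer codes can be mapped back to their labels later (handy when reading the EDA plots below); a minimal sketch of that variant:
encoders = {}
for o_col in object_columns:
    encoders[o_col] = LabelEncoder().fit(train_nona[o_col])
    train_enc[o_col] = encoders[o_col].transform(train_nona[o_col])
# Example: which integer corresponds to which ProductPitched label.
print(dict(enumerate(encoders['ProductPitched'].classes_)))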
# Handle missing values
test = handle_na(test)
# Encode the string variables
for o_col in object_columns:
    encoder = LabelEncoder()
    # Fitting the encoder on the test data would be data leakage, so be careful!
    encoder.fit(train_nona[o_col])
    # The test data must only ever be passed to transform.
    test[o_col] = encoder.transform(test[o_col])
# Check the result.
test
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.524590 | 0 | 3 | 0.000000 | 3 | 1 | 2 | 5.0 | 1 | 3.0 | 1 | 1.0 | 0 | 2 | 0 | 1.0 | 2 | -0.363121 |
| 1 | 2 | 0.754098 | 1 | 2 | 0.305556 | 3 | 1 | 3 | 0.0 | 1 | 4.0 | 1 | 1.0 | 1 | 5 | 0 | 1.0 | 2 | -0.316471 |
| 2 | 3 | 0.606557 | 1 | 3 | 0.611111 | 3 | 1 | 3 | 4.0 | 1 | 3.0 | 1 | 5.0 | 0 | 5 | 1 | 0.0 | 2 | -0.142953 |
| 3 | 4 | 0.704918 | 1 | 1 | 1.000000 | 3 | 1 | 3 | 6.0 | 1 | 3.0 | 3 | 6.0 | 0 | 3 | 1 | 2.0 | 2 | 0.070608 |
| 4 | 5 | 0.409836 | 1 | 3 | 0.194444 | 1 | 0 | 4 | 4.0 | 0 | 4.0 | 3 | 3.0 | 1 | 4 | 1 | 3.0 | 1 | -0.070797 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2928 | 2929 | 0.885246 | 1 | 1 | 0.166667 | 3 | 0 | 2 | 3.0 | 4 | 3.0 | 2 | 7.0 | 0 | 4 | 1 | 1.0 | 0 | 1.309948 |
| 2929 | 2930 | 0.540984 | 1 | 1 | 0.250000 | 3 | 0 | 4 | 2.0 | 1 | 3.0 | 3 | 2.0 | 0 | 3 | 0 | 1.0 | 2 | 0.174085 |
| 2930 | 2931 | 0.540984 | 0 | 1 | 0.861111 | 2 | 1 | 4 | 4.0 | 1 | 3.0 | 0 | 3.0 | 0 | 4 | 1 | 1.0 | 2 | 0.207652 |
| 2931 | 2932 | 0.426230 | 1 | 1 | 0.250000 | 3 | 1 | 4 | 2.0 | 0 | 5.0 | 3 | 2.0 | 0 | 2 | 1 | 3.0 | 1 | -0.041459 |
| 2932 | 2933 | 0.508197 | 1 | 1 | 0.250000 | 2 | 1 | 3 | 5.0 | 1 | 3.0 | 0 | 3.0 | 0 | 4 | 1 | 1.0 | 2 | 0.054749 |
2933 rows × 19 columns
0.1.3.(Additional) Building a missing-value imputation module
- Build a module that re-fills the numeric columns (whose NaNs were filled with 0) by predicting them with a regression model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import ExtraTreesRegressor
def prod_val(feature: str):
    # Skip columns that no longer contain the 0 placeholder.
    if len(train_enc[train_enc[feature] == 0]) == 0:
        return 'already processed'
    # Train on the rows where the feature is known (non-zero).
    train_temp = train_enc[train_enc[feature] != 0]
    # Use every column except id, ProdTaken and the target feature itself as predictors.
    features = train_temp.columns[1:-1].drop(feature)
    target = feature
    X = train_temp[features]
    y = train_temp[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, shuffle=False)
    model_rf = ExtraTreesRegressor(n_estimators=300)
    model_rf.fit(X_train, y_train)
    train_predict = model_rf.predict(X_test)
    print(f'{feature} MAE: {mean_absolute_error(y_test, train_predict)}')
    # Overwrite the 0 placeholders in the train set with the model's predictions.
    X = train_enc[train_enc[feature] == 0][features]
    train_enc.loc[train_enc[feature] == 0, feature] = model_rf.predict(X)
    # Do the same for the test set.
    test_temp = test[test[feature] == 0]
    X = test_temp[features]
    test.loc[test[feature] == 0, feature] = model_rf.predict(X)
    print(f'\ntrain set: \n{train_enc[feature].value_counts().sort_index().head(3)}')
    print(f'\ntest set: \n{test[feature].value_counts().sort_index().head(3)}')
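One caveat with this module: because the NaNs were first replaced with 0, the value 0 itself acts as the missing marker, so any genuine zeros in a column would be overwritten as well. A hedged sketch of an alternative that imputes directly from the original NaN mask, where train_raw is a hypothetical label-encoded copy that still contains NaNs, using only columns that had no missing values in the raw data as predictors:
# Columns with no missing values in the raw train set (see the isna() counts above).
cols = ['CityTier', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'ProductPitched',
        'MaritalStatus', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'Designation']
mask = train_raw['MonthlyIncome'].isna()
imputer = ExtraTreesRegressor(n_estimators=300)
imputer.fit(train_raw.loc[~mask, cols], train_raw.loc[~mask, 'MonthlyIncome'])
train_raw.loc[mask, 'MonthlyIncome'] = imputer.predict(train_raw.loc[mask, cols])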
1.EDA
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
1.1.target
plt.figure(dpi=100)
train_enc['ProdTaken'].value_counts().plot(kind='bar')
plt.title('target distribution')
plt.xticks(np.arange(2),labels = ['not application','application'],rotation=45)
plt.show()

1.2.feature & target
1.2.1.Feature correlation (as a proxy for feature importance)
corr_with_target = train_enc.corr()['ProdTaken']
y = np.arange(len(corr_with_target))
ind = corr_with_target.index
values = abs(corr_with_target.values)
plt.figure(dpi=150)
plt.title('Feature correlation with target')
plt.barh(y,values)
plt.yticks(y,ind)
plt.show()

- Passport ownership shows the strongest correlation with the target.
1.2.2.CityTier ~ target
train_enc.groupby(['CityTier','ProdTaken'])['id'].count()
CityTier ProdTaken
1 0 1065
1 218
2 0 66
1 24
3 0 441
1 141
Name: id, dtype: int64
temp = train_enc.groupby(['CityTier','ProdTaken'])['id'].count().values
temp
array([1065, 218, 66, 24, 441, 141])
no_application = temp[[0,2,4]]
application = temp[[1,3,5]]
alpha = 0.5
no_application
array([1065, 66, 441])
plt.figure(dpi=100)
p1 = plt.bar(np.arange(3), no_application, color='b', alpha=alpha)
p2 = plt.bar(np.arange(3), application, color='r', alpha=alpha,bottom=no_application) # stacked bar chart
plt.title('Target distribution according to CityTier')
plt.xlabel('CityTier')
plt.ylabel('ProdTaken')
plt.xticks(np.arange(3),labels=[1,2,3])
plt.legend((p1[0],p2[0]),('no application','application'))
plt.show()

ratio = application / no_application
plt.figure(dpi=100)
plt.bar(np.arange(3),ratio)
plt.xlabel('CityTier')
plt.title('application ratio')
plt.xticks(np.arange(3),labels=[1,2,3])
plt.show()

- The chart shows that relatively few customers live in Tier 2 cities.
- The raw number of applicants is largest in Tier 1 cities, but the application ratio per tier is highest in Tier 2 cities (see the sketch below).
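For reference, the per-tier application rate (the share of applicants among all customers in each tier, which is slightly different from the applicants-to-non-applicants ratio plotted above) can be read off directly; a minimal sketch:
# Fraction of customers in each city tier who took the product.
print(train_enc.groupby('CityTier')['ProdTaken'].mean())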
1.2.3.Passport ~ target
train_enc.groupby(['Passport','ProdTaken'])['id'].count()
Passport ProdTaken
0 0 1218
1 168
1 0 354
1 215
Name: id, dtype: int64
temp = train_enc.groupby(['Passport','ProdTaken'])['id'].count().values
temp
array([1218, 168, 354, 215])
no_application = temp[[0,2]]
application = temp[[1,3]]
alpha = 0.5
no_application
array([1218, 354])
plt.figure(dpi=100)
p1 = plt.bar(np.arange(2), no_application, color='b', alpha=alpha)
p2 = plt.bar(np.arange(2), application, color='r', alpha=alpha,bottom=no_application) # stacked bar chart
plt.title('Target distribution according to Passport')
plt.xlabel('Passport')
plt.ylabel('ProdTaken')
plt.xticks(np.arange(2),labels=['no passport','passport'])
plt.legend((p1[0],p2[0]),('no application','application'))
plt.show()

fig=plt.figure(figsize=(10,10), dpi=100)
(ax1, ax2)=fig.subplots(1,2).flatten()
temp = train_enc[train_enc['Passport']==0].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax1)
_=ax1.set_title('no passport')
ax1.legend(bbox_to_anchor=(0.9, 1), loc=2,labels=['no application','application'])
temp = train_enc[train_enc['Passport']==1].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax2)
_=ax2.set_title('have passport')
plt.show()

- Customers who own a passport appear to apply at a noticeably higher rate than those who do not (perhaps an unsurprising result); a numeric summary is sketched below.
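The same split can be summarized numerically with a row-normalized crosstab; a small sketch:
# Application rate for customers without (0) and with (1) a passport.
print(pd.crosstab(train_enc['Passport'], train_enc['ProdTaken'], normalize='index'))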
1.2.4. ProductPitched ~ target
train_nona['ProductPitched'].unique()
array(['Basic', 'Deluxe', 'King', 'Standard', 'Super Deluxe'],
dtype=object)
train_nona.groupby(['ProductPitched','ProdTaken'])['id'].count()
ProductPitched ProdTaken
Basic 0 522
1 223
Deluxe 0 599
1 90
King 0 80
1 9
Standard 0 251
1 51
Super Deluxe 0 120
1 10
Name: id, dtype: int64
temp = train_nona.groupby(['ProductPitched','ProdTaken'])['id'].count().values
temp
array([522, 223, 599, 90, 80, 9, 251, 51, 120, 10])
ind = train_nona['ProductPitched'].unique()
ind
array(['Basic', 'Deluxe', 'King', 'Standard', 'Super Deluxe'],
dtype=object)
no_application = temp[[0,2,4,6,8]]
application = temp[[1,3,5,7,9]]
alpha = 0.5
no_application
array([522, 599, 80, 251, 120])
plt.figure(figsize=(12,7))
p1 = plt.bar(np.arange(5), no_application, color='b', alpha=alpha)
p2 = plt.bar(np.arange(5), application, color='r', alpha=alpha,bottom=no_application) # stacked bar chart
plt.title('Target distribution according to ProductPitched')
plt.xlabel('ProductPitched')
plt.ylabel('ProdTaken')
plt.xticks(np.arange(5),labels=ind)
plt.legend((p1[0],p2[0]),('no application','application'))
plt.show()

ratio = application / no_application
plt.figure(figsize=(12,7))
plt.bar(np.arange(5),ratio)
plt.xlabel('ProductPitched')
plt.title('application ratio')
plt.xticks(np.arange(5),labels=ind)
plt.show()

- The application rate is highest when the salesperson pitched the Basic product.
1.2.5.MaritalStatus ~ target
train_nona['MaritalStatus'].value_counts()
Married 949
Divorced 375
Single 349
Unmarried 282
Name: MaritalStatus, dtype: int64
train_enc['MaritalStatus'].value_counts()
1 949
0 375
2 349
3 282
Name: MaritalStatus, dtype: int64
fig=plt.figure(figsize=(10,10), dpi=100)
(ax1, ax2,ax3,ax4)=fig.subplots(1,4).flatten()
temp = train_enc[train_enc['MaritalStatus']==0].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax1)
_=ax1.set_title('Divorced')
temp = train_enc[train_enc['MaritalStatus']==1].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax2)
_=ax2.set_title('Married')
temp = train_enc[train_enc['MaritalStatus']==2].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax3)
_=ax3.set_title('Single')
temp = train_enc[train_enc['MaritalStatus']==3].reset_index(drop = True)
x = temp['ProdTaken'].value_counts()
x.plot.pie(ax=ax4)
_=ax4.set_title('Unmarried')
ax4.legend(bbox_to_anchor=(0.9, 1), loc=2,labels=['no application','application'])
plt.show()

- Divorced and married customers show similar application rates, while unmarried and single customers show application rates above 25%.
1.2.6.CityTier ~ MonthlyIncome
plt.rcParams['font.size'] = 15
data = train_enc.describe().loc['min':'max', 'MonthlyIncome']
plt.title('MonthlyIncome')
plt.plot(data, color = 'red', marker = 'o')
plt.grid(True)

plt.figure(figsize=(10,5))
sns.boxplot(data=train_nona, x="CityTier", y="MonthlyIncome")
plt.show()

plt.figure(figsize=(10,5))
sns.boxplot(data=train_nona[train_nona['MonthlyIncome']<60000], x="CityTier", y="MonthlyIncome")
plt.show()

1.2.7.Designation ~ MonthlyIncome
plt.figure(figsize=(10,5))
sns.boxplot(data=train_nona, x="Designation", y="MonthlyIncome")
plt.show()

plt.figure(figsize=(10,5))
sns.boxplot(data=train_nona[train_nona['MonthlyIncome']<60000], x="Designation", y="MonthlyIncome")
plt.show()

- Outliers in MonthlyIncome appear only at specific city tiers and designations.
train_enc[train_enc['MonthlyIncome']>40000]['ProdTaken'].value_counts()
0 2
Name: ProdTaken, dtype: int64
1.2.8.NumberOfTrips ~ MonthlyIncome
plt.figure(figsize=(10,5))
sns.boxplot(data=train_nona, x="NumberOfTrips", y="MonthlyIncome")
plt.show()

- Customers averaging zero trips per year have the highest mean monthly income, so a higher salary does not necessarily mean more travel.
1.2.9.DurationOfPitch ~ PitchSatisfactionScore
plt.figure(figsize=(10,5))
sns.boxplot(data=train_enc, x="PitchSatisfactionScore", y="DurationOfPitch")
plt.show()

- There appears to be no strong relationship between the two variables.
1.2.10.heatmap
plt.figure(figsize=(15,15))
sns.heatmap(train_enc.corr(method='pearson'),annot=True, fmt='.1f', linewidths=.5, cmap='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7fac6aeff410>

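The feature pairs explored in section 1.3 below can also be pulled straight out of this correlation matrix; a short sketch listing the most strongly correlated pairs (excluding self-correlations):
corr = train_enc.corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
print(upper.stack().sort_values(ascending=False).head(5))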
1.3.EDA based on the heatmap
1.3.1. MonthlyIncome ~ Age
plt.figure(dpi=130)
sns.scatterplot(train_enc['Age'],train_enc['MonthlyIncome'])
plt.xlabel('Age')
plt.ylabel('MonthlyIncome')
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning

- Some rows have Age equal to 0, and a few of them even have a non-zero monthly income. This looks like a data error, so the Age column needs a closer look.
train_enc[train_enc['Age']==0]
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 14 | 0.0 | 1 | 3 | 6.0 | 3 | 1 | 2 | 1.0 | 1 | 5.0 | 1 | 2.0 | 0 | 4 | 0 | 0.0 | 2 | 0.0 | 0 |
| 26 | 27 | 0.0 | 1 | 1 | 6.0 | 3 | 0 | 3 | 3.0 | 0 | 5.0 | 1 | 2.0 | 0 | 1 | 1 | 0.0 | 1 | 18591.0 | 0 |
| 35 | 36 | 0.0 | 1 | 2 | 14.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 3.0 | 0 | 3 | 1 | 1.0 | 2 | 0.0 | 0 |
| 87 | 88 | 0.0 | 1 | 2 | 8.0 | 2 | 1 | 3 | 3.0 | 0 | 3.0 | 2 | 1.0 | 0 | 1 | 0 | 0.0 | 1 | 18539.0 | 0 |
| 121 | 122 | 0.0 | 1 | 1 | 35.0 | 3 | 1 | 3 | 3.0 | 0 | 5.0 | 1 | 2.0 | 0 | 4 | 1 | 1.0 | 1 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1882 | 1883 | 0.0 | 1 | 1 | 15.0 | 2 | 1 | 1 | 4.0 | 0 | 3.0 | 2 | 1.0 | 0 | 2 | 1 | 0.0 | 1 | 0.0 | 0 |
| 1888 | 1889 | 0.0 | 1 | 1 | 12.0 | 3 | 0 | 3 | 4.0 | 1 | 3.0 | 2 | 2.0 | 1 | 4 | 1 | 1.0 | 2 | 0.0 | 0 |
| 1914 | 1915 | 0.0 | 1 | 1 | 7.0 | 2 | 0 | 3 | 3.0 | 0 | 3.0 | 1 | 2.0 | 0 | 1 | 1 | 2.0 | 1 | 0.0 | 0 |
| 1916 | 1917 | 0.0 | 1 | 2 | 26.0 | 3 | 0 | 3 | 3.0 | 0 | 4.0 | 1 | 1.0 | 1 | 3 | 0 | 1.0 | 1 | 18669.0 | 1 |
| 1923 | 1924 | 0.0 | 0 | 3 | 16.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 1 | 2.0 | 1 | 1 | 1 | 1.0 | 2 | 0.0 | 0 |
94 rows × 20 columns
test[test['Age']==0]
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 21 | 0.0 | 1 | 1 | 8.0 | 3 | 1 | 2 | 5.0 | 0 | 3.0 | 1 | 6.0 | 1 | 3 | 1 | 1.0 | 1 | 18464.0 |
| 25 | 26 | 0.0 | 1 | 1 | 12.0 | 2 | 0 | 2 | 4.0 | 0 | 3.0 | 1 | 2.0 | 0 | 3 | 0 | 1.0 | 1 | 18702.0 |
| 54 | 55 | 0.0 | 1 | 1 | 6.0 | 3 | 1 | 2 | 3.0 | 0 | 4.0 | 1 | 2.0 | 0 | 3 | 0 | 0.0 | 1 | 0.0 |
| 67 | 68 | 0.0 | 1 | 1 | 6.0 | 3 | 1 | 3 | 3.0 | 0 | 4.0 | 1 | 2.0 | 0 | 5 | 0 | 1.0 | 1 | 0.0 |
| 95 | 96 | 0.0 | 0 | 1 | 15.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 1 | 4.0 | 0 | 3 | 1 | 0.0 | 2 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2799 | 2800 | 0.0 | 1 | 1 | 8.0 | 3 | 1 | 2 | 5.0 | 0 | 3.0 | 0 | 6.0 | 1 | 3 | 1 | 0.0 | 1 | 18464.0 |
| 2811 | 2812 | 0.0 | 1 | 1 | 13.0 | 2 | 1 | 3 | 1.0 | 0 | 5.0 | 2 | 1.0 | 0 | 1 | 1 | 0.0 | 1 | 18578.0 |
| 2834 | 2835 | 0.0 | 1 | 3 | 14.0 | 3 | 1 | 2 | 3.0 | 0 | 4.0 | 1 | 1.0 | 0 | 1 | 0 | 1.0 | 1 | 0.0 |
| 2841 | 2842 | 0.0 | 0 | 1 | 22.0 | 2 | 0 | 3 | 5.0 | 0 | 5.0 | 2 | 2.0 | 1 | 4 | 0 | 0.0 | 1 | 0.0 |
| 2916 | 2917 | 0.0 | 1 | 1 | 11.0 | 3 | 1 | 2 | 4.0 | 1 | 3.0 | 1 | 2.0 | 0 | 4 | 1 | 1.0 | 2 | 0.0 |
132 rows × 19 columns
plt.figure(figsize=(10,5),dpi=100)
sns.scatterplot(train_enc['Age'],train_enc['MonthlyIncome'],hue=train_enc['ProductPitched'],s=60)
plt.xlabel('Age')
plt.ylabel('MonthlyIncome')
plt.legend(title='ProductPitched',loc='upper right')
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning

1.3.1.1.Predicting Age for the rows where Age is 0
- An age of 0 is not plausible, so let's predict Age from the other features.
temp = train_enc[train_enc['Age']!=0]
temp
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | 0 | 1 | 10.0 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | 20384.0 | 0 |
| 1 | 2 | 34.0 | 1 | 3 | 0.0 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 0.0 | 2 | 19599.0 | 1 |
| 2 | 3 | 45.0 | 0 | 1 | 0.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 0.0 | 2 | 0.0 | 0 |
| 3 | 4 | 29.0 | 0 | 1 | 7.0 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | 21274.0 | 1 |
| 4 | 5 | 42.0 | 1 | 3 | 6.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 0.0 | 2 | 19907.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 1951 | 28.0 | 1 | 1 | 10.0 | 3 | 1 | 3 | 5.0 | 0 | 3.0 | 2 | 2.0 | 0 | 1 | 1 | 2.0 | 1 | 20723.0 | 0 |
| 1951 | 1952 | 41.0 | 1 | 3 | 8.0 | 2 | 0 | 3 | 3.0 | 4 | 5.0 | 0 | 1.0 | 0 | 5 | 1 | 1.0 | 0 | 31595.0 | 0 |
| 1952 | 1953 | 38.0 | 0 | 3 | 28.0 | 3 | 0 | 3 | 4.0 | 0 | 3.0 | 0 | 7.0 | 0 | 2 | 1 | 2.0 | 1 | 21651.0 | 0 |
| 1953 | 1954 | 28.0 | 1 | 3 | 30.0 | 3 | 0 | 3 | 5.0 | 1 | 3.0 | 1 | 3.0 | 0 | 1 | 1 | 2.0 | 2 | 22218.0 | 0 |
| 1954 | 1955 | 22.0 | 0 | 1 | 9.0 | 2 | 1 | 2 | 4.0 | 0 | 3.0 | 0 | 1.0 | 1 | 3 | 0 | 0.0 | 1 | 17853.0 | 1 |
1861 rows × 20 columns
features = temp.columns[2:-1]
target = 'Age'
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
X = temp[features]
y = temp[target]
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,shuffle=False)
model_rf = RandomForestRegressor()
model_rf.fit(X_train,y_train)
predict = model_rf.predict(X_test)
y_test
1755 54.0
1756 59.0
1757 21.0
1758 43.0
1759 48.0
...
1950 28.0
1951 41.0
1952 38.0
1953 28.0
1954 22.0
Name: Age, Length: 187, dtype: float64
plt.figure(dpi=130)
plt.plot(predict)
plt.plot(y_test.values,alpha=0.7)
plt.show()

mean_absolute_error(predict,y_test)
5.528181818181818
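The MAE of roughly 5.5 years above comes from a single, non-shuffled 10% holdout; as a quick hedged cross-check, a 5-fold cross-validation sketch on the same X and y:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestRegressor(), X, y, cv=5, scoring='neg_mean_absolute_error')
print(-scores.mean())  # mean MAE across folds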
temp = train_enc[train_enc['Age']==0]
temp
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 14 | 0.0 | 1 | 3 | 6.0 | 3 | 1 | 2 | 1.0 | 1 | 5.0 | 1 | 2.0 | 0 | 4 | 0 | 0.0 | 2 | 0.0 | 0 |
| 26 | 27 | 0.0 | 1 | 1 | 6.0 | 3 | 0 | 3 | 3.0 | 0 | 5.0 | 1 | 2.0 | 0 | 1 | 1 | 0.0 | 1 | 18591.0 | 0 |
| 35 | 36 | 0.0 | 1 | 2 | 14.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 3.0 | 0 | 3 | 1 | 1.0 | 2 | 0.0 | 0 |
| 87 | 88 | 0.0 | 1 | 2 | 8.0 | 2 | 1 | 3 | 3.0 | 0 | 3.0 | 2 | 1.0 | 0 | 1 | 0 | 0.0 | 1 | 18539.0 | 0 |
| 121 | 122 | 0.0 | 1 | 1 | 35.0 | 3 | 1 | 3 | 3.0 | 0 | 5.0 | 1 | 2.0 | 0 | 4 | 1 | 1.0 | 1 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1882 | 1883 | 0.0 | 1 | 1 | 15.0 | 2 | 1 | 1 | 4.0 | 0 | 3.0 | 2 | 1.0 | 0 | 2 | 1 | 0.0 | 1 | 0.0 | 0 |
| 1888 | 1889 | 0.0 | 1 | 1 | 12.0 | 3 | 0 | 3 | 4.0 | 1 | 3.0 | 2 | 2.0 | 1 | 4 | 1 | 1.0 | 2 | 0.0 | 0 |
| 1914 | 1915 | 0.0 | 1 | 1 | 7.0 | 2 | 0 | 3 | 3.0 | 0 | 3.0 | 1 | 2.0 | 0 | 1 | 1 | 2.0 | 1 | 0.0 | 0 |
| 1916 | 1917 | 0.0 | 1 | 2 | 26.0 | 3 | 0 | 3 | 3.0 | 0 | 4.0 | 1 | 1.0 | 1 | 3 | 0 | 1.0 | 1 | 18669.0 | 1 |
| 1923 | 1924 | 0.0 | 0 | 3 | 16.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 1 | 2.0 | 1 | 1 | 1 | 1.0 | 2 | 0.0 | 0 |
94 rows × 20 columns
X = temp[features]
train_enc.loc[train_enc['Age']==0,'Age'] = model_rf.predict(X)
train_enc['Age'].value_counts().sort_index()
18.0 5
19.0 16
20.0 13
21.0 17
22.0 20
..
57.0 9
58.0 11
59.0 14
60.0 12
61.0 3
Name: Age, Length: 135, dtype: int64
plt.figure(figsize=(10,5),dpi=100)
sns.scatterplot(train_enc['Age'],train_enc['MonthlyIncome'],hue=train_enc['ProductPitched'],s=60)
plt.xlabel('Age')
plt.ylabel('MonthlyIncome')
plt.legend(title='ProductPitched',loc='upper right')
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning

- The rows with Age equal to 0 are now gone.
1.3.2.MonthlyIncome ~ NumberOfPersonVisiting
train_enc['NumberOfPersonVisiting'].value_counts()
3 988
2 543
4 412
1 11
5 1
Name: NumberOfPersonVisiting, dtype: int64
temp = train_enc['NumberOfPersonVisiting'].value_counts()
plt.figure(dpi=100)
plt.bar(temp.index,temp.values)
plt.xlabel('NumberOfPersonVisiting')
plt.ylabel('counts')
plt.show()

- Traveling alone or in a group of five is rare; groups of 2, 3, or 4 are by far the most common.
plt.figure(dpi=100)
sns.boxplot(data=train_enc, x="NumberOfPersonVisiting", y="MonthlyIncome")
plt.show()

plt.figure(dpi=100)
sns.violinplot(data=train_enc, x="NumberOfPersonVisiting", y="MonthlyIncome")
plt.show()

- The differences among groups of 2, 3, and 4 are small, but customers traveling alone clearly have lower monthly incomes.
1.3.3.MonthlyIncome ~ DurationOfPitch
train_enc['DurationOfPitch'].value_counts()
9.0 199
7.0 126
8.0 122
6.0 116
16.0 114
14.0 112
15.0 105
10.0 103
0.0 102
12.0 85
11.0 83
13.0 83
17.0 75
23.0 41
30.0 39
22.0 36
31.0 34
25.0 32
27.0 31
32.0 30
20.0 29
35.0 29
26.0 27
29.0 27
24.0 27
28.0 25
21.0 24
18.0 23
33.0 22
19.0 18
34.0 18
36.0 15
5.0 3
Name: DurationOfPitch, dtype: int64
plt.figure(dpi=130)
sns.scatterplot(train_enc['DurationOfPitch'],train_enc['MonthlyIncome'])
plt.xlabel('DurationOfPitch')
plt.ylabel('MonthlyIncome')
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning

- No clear relationship is visible.
2.Feature engineering
plt.figure(figsize=(15,15))
sns.heatmap(train_enc.corr(method='pearson'),annot=True, fmt='.1f', linewidths=.5, cmap='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7fac6aa35410>

- It would be nice to create derived features, but most of the variables are categorical, which makes this difficult.
2.1.drop
- Features with low correlation to the target are excluded (a threshold-based selection sketch follows the feature list below).
train_enc.corr()['ProdTaken']
id -0.048933
Age -0.136257
TypeofContact -0.047598
CityTier 0.085583
DurationOfPitch 0.069795
Occupation -0.042101
Gender 0.019991
NumberOfPersonVisiting 0.006483
NumberOfFollowups 0.102778
ProductPitched -0.150399
PreferredPropertyStar 0.108886
MaritalStatus 0.169245
NumberOfTrips 0.060995
Passport 0.293726
PitchSatisfactionScore 0.067736
OwnCar -0.040465
NumberOfChildrenVisiting 0.010089
Designation -0.096041
MonthlyIncome -0.077508
ProdTaken 1.000000
Name: ProdTaken, dtype: float64
train_enc.corr()['ProdTaken'].index
Index(['id', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch',
'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups',
'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome',
'ProdTaken'],
dtype='object')
features = ['Age','CityTier', 'DurationOfPitch',
'Occupation', 'NumberOfFollowups',
'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'Designation', 'MonthlyIncome',
'ProdTaken']
features
['Age',
'CityTier',
'DurationOfPitch',
'Occupation',
'NumberOfFollowups',
'ProductPitched',
'PreferredPropertyStar',
'MaritalStatus',
'NumberOfTrips',
'Passport',
'PitchSatisfactionScore',
'OwnCar',
'Designation',
'MonthlyIncome',
'ProdTaken']
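The hand-picked list above can be approximated with an absolute-correlation cutoff; a sketch where the 0.05 threshold is an arbitrary assumption (note that the manual list is not a strict cutoff: Occupation and OwnCar are kept despite correlations below 0.05, while TypeofContact at a similar magnitude is dropped):
corr_target = train_enc.corr()['ProdTaken'].drop(['id', 'ProdTaken'])
auto_features = corr_target[corr_target.abs() >= 0.05].index.tolist()
print(auto_features)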
2.2.scaling
train_enc.head()
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | 0 | 1 | 10.0 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | 20384.0 | 0 |
| 1 | 2 | 34.0 | 1 | 3 | 0.0 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 0.0 | 2 | 19599.0 | 1 |
| 2 | 3 | 45.0 | 0 | 1 | 0.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 0.0 | 2 | 0.0 | 0 |
| 3 | 4 | 29.0 | 0 | 1 | 7.0 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | 21274.0 | 1 |
| 4 | 5 | 42.0 | 1 | 3 | 6.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 0.0 | 2 | 19907.0 | 0 |
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_enc[['MonthlyIncome']] = scaler.fit_transform(train_enc[['MonthlyIncome']])
train_enc.head()
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | 0 | 1 | 10.0 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | -0.268499 | 0 |
| 1 | 2 | 34.0 | 1 | 3 | 0.0 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 0.0 | 2 | -0.372240 | 1 |
| 2 | 3 | 45.0 | 0 | 1 | 0.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 0.0 | 2 | -2.962326 | 0 |
| 3 | 4 | 29.0 | 0 | 1 | 7.0 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | -0.150882 | 1 |
| 4 | 5 | 42.0 | 1 | 3 | 6.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 0.0 | 2 | -0.331537 | 0 |
train_enc['MonthlyIncome'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fac6ad5add0>

3.Data preprocessing for modeling
train = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/train.csv')
test = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/test.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/data/여행상품신청/sample_submission.csv')
3.1.Missing value handling
def handle_na(data):
    temp = data.copy()
    for col, dtype in temp.dtypes.items():
        if dtype == 'object':
            # Fill string (object) columns with 'Unknown'.
            value = 'Unknown'
        elif dtype == int or dtype == float:
            # Fill numeric columns with 0.
            value = 0
        temp.loc[:, col] = temp[col].fillna(value)
    return temp
train_nona = handle_na(train)
# Check that the missing values are gone.
train_nona.isna().sum()
id 0
Age 0
TypeofContact 0
CityTier 0
DurationOfPitch 0
Occupation 0
Gender 0
NumberOfPersonVisiting 0
NumberOfFollowups 0
ProductPitched 0
PreferredPropertyStar 0
MaritalStatus 0
NumberOfTrips 0
Passport 0
PitchSatisfactionScore 0
OwnCar 0
NumberOfChildrenVisiting 0
Designation 0
MonthlyIncome 0
ProdTaken 0
dtype: int64
3.2.Categorical (string) variable preprocessing
train_nona.loc[train_nona['Gender']=='Fe Male','Gender'] = 'Female'
test.loc[test['Gender']=='Fe Male','Gender'] = 'Female'
train_nona['Gender'].value_counts()
Male 1207
Female 748
Name: Gender, dtype: int64
# Prepare a LabelEncoder.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# A LabelEncoder first has to be fitted.
encoder.fit(train_nona['TypeofContact'])
# Use the fitted encoder to convert the string variable into integers.
encoder.transform(train_nona['TypeofContact'])
array([0, 1, 0, ..., 0, 1, 0])
train_enc = train_nona.copy()
object_columns = train_nona.columns[train_nona.dtypes == 'object']
# Apply a LabelEncoder to every string (object) column.
for o_col in object_columns:
    encoder = LabelEncoder()
    encoder.fit(train_enc[o_col])
    train_enc[o_col] = encoder.transform(train_enc[o_col])
# Check the result.
train_enc
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28.0 | 0 | 1 | 10.0 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | 20384.0 | 0 |
| 1 | 2 | 34.0 | 1 | 3 | 0.0 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 0.0 | 2 | 19599.0 | 1 |
| 2 | 3 | 45.0 | 0 | 1 | 0.0 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 0.0 | 2 | 0.0 | 0 |
| 3 | 4 | 29.0 | 0 | 1 | 7.0 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | 21274.0 | 1 |
| 4 | 5 | 42.0 | 1 | 3 | 6.0 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 0.0 | 2 | 19907.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 1951 | 28.0 | 1 | 1 | 10.0 | 3 | 1 | 3 | 5.0 | 0 | 3.0 | 2 | 2.0 | 0 | 1 | 1 | 2.0 | 1 | 20723.0 | 0 |
| 1951 | 1952 | 41.0 | 1 | 3 | 8.0 | 2 | 0 | 3 | 3.0 | 4 | 5.0 | 0 | 1.0 | 0 | 5 | 1 | 1.0 | 0 | 31595.0 | 0 |
| 1952 | 1953 | 38.0 | 0 | 3 | 28.0 | 3 | 0 | 3 | 4.0 | 0 | 3.0 | 0 | 7.0 | 0 | 2 | 1 | 2.0 | 1 | 21651.0 | 0 |
| 1953 | 1954 | 28.0 | 1 | 3 | 30.0 | 3 | 0 | 3 | 5.0 | 1 | 3.0 | 1 | 3.0 | 0 | 1 | 1 | 2.0 | 2 | 22218.0 | 0 |
| 1954 | 1955 | 22.0 | 0 | 1 | 9.0 | 2 | 1 | 2 | 4.0 | 0 | 3.0 | 0 | 1.0 | 1 | 3 | 0 | 0.0 | 1 | 17853.0 | 1 |
1955 rows × 20 columns
# Handle missing values
test = handle_na(test)
# Encode the string variables
for o_col in object_columns:
    encoder = LabelEncoder()
    # Fitting the encoder on the test data would be data leakage, so be careful!
    encoder.fit(train_nona[o_col])
    # The test data must only ever be passed to transform.
    test[o_col] = encoder.transform(test[o_col])
# Check the result.
test
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 32.0 | 0 | 3 | 0.0 | 3 | 1 | 2 | 5.0 | 1 | 3.0 | 1 | 1.0 | 0 | 2 | 0 | 1.0 | 2 | 19668.0 |
| 1 | 2 | 46.0 | 1 | 2 | 11.0 | 3 | 1 | 3 | 0.0 | 1 | 4.0 | 1 | 1.0 | 1 | 5 | 0 | 1.0 | 2 | 20021.0 |
| 2 | 3 | 37.0 | 1 | 3 | 22.0 | 3 | 1 | 3 | 4.0 | 1 | 3.0 | 1 | 5.0 | 0 | 5 | 1 | 0.0 | 2 | 21334.0 |
| 3 | 4 | 43.0 | 1 | 1 | 36.0 | 3 | 1 | 3 | 6.0 | 1 | 3.0 | 3 | 6.0 | 0 | 3 | 1 | 2.0 | 2 | 22950.0 |
| 4 | 5 | 25.0 | 1 | 3 | 7.0 | 1 | 0 | 4 | 4.0 | 0 | 4.0 | 3 | 3.0 | 1 | 4 | 1 | 3.0 | 1 | 21880.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2928 | 2929 | 54.0 | 1 | 1 | 6.0 | 3 | 0 | 2 | 3.0 | 4 | 3.0 | 2 | 7.0 | 0 | 4 | 1 | 1.0 | 0 | 32328.0 |
| 2929 | 2930 | 33.0 | 1 | 1 | 9.0 | 3 | 0 | 4 | 2.0 | 1 | 3.0 | 3 | 2.0 | 0 | 3 | 0 | 1.0 | 2 | 23733.0 |
| 2930 | 2931 | 33.0 | 0 | 1 | 31.0 | 2 | 1 | 4 | 4.0 | 1 | 3.0 | 0 | 3.0 | 0 | 4 | 1 | 1.0 | 2 | 23987.0 |
| 2931 | 2932 | 26.0 | 1 | 1 | 9.0 | 3 | 1 | 4 | 2.0 | 0 | 5.0 | 3 | 2.0 | 0 | 2 | 1 | 3.0 | 1 | 22102.0 |
| 2932 | 2933 | 31.0 | 1 | 1 | 9.0 | 2 | 1 | 3 | 5.0 | 1 | 3.0 | 0 | 3.0 | 0 | 4 | 1 | 1.0 | 2 | 22830.0 |
2933 rows × 19 columns
3.3.Applying the missing-value imputation module
prod_val('Age')
Age MAE: 5.071390374331551
train set:
18.0 5
19.0 16
20.0 13
Name: Age, dtype: int64
test set:
18.0 9
19.0 16
20.0 25
Name: Age, dtype: int64
train_enc['Age'] = train_enc['Age'].round()
test['Age'] = test['Age'].round()
train_enc['Age'].unique()
array([28., 34., 45., 29., 42., 32., 43., 36., 35., 31., 49., 33., 52.,
22., 50., 23., 41., 37., 40., 56., 54., 39., 20., 38., 46., 27.,
25., 26., 24., 30., 21., 51., 47., 55., 44., 53., 48., 18., 57.,
60., 59., 19., 58., 61.])
prod_val('MonthlyIncome')
MonthlyIncome MAE: 1205.1359677419355
train set:
1000.00 1
14781.04 1
16009.00 1
Name: MonthlyIncome, dtype: int64
test set:
2196.853333 1
4678.000000 1
14921.480000 1
Name: MonthlyIncome, dtype: int64
prod_val('DurationOfPitch')
DurationOfPitch MAE: 6.203512544802866
train set:
5.0 3
6.0 116
7.0 126
Name: DurationOfPitch, dtype: int64
test set:
5.0 3
6.0 191
7.0 216
Name: DurationOfPitch, dtype: int64
train_enc['DurationOfPitch'] = train_enc['DurationOfPitch'].round()
test['DurationOfPitch'] = test['DurationOfPitch'].round()
test['DurationOfPitch'].unique()
array([ 16., 11., 22., 36., 7., 8., 6., 29., 9., 12., 13.,
17., 15., 10., 14., 35., 24., 31., 21., 19., 32., 27.,
18., 33., 30., 26., 34., 23., 20., 28., 25., 126., 5.,
127.])
prod_val('NumberOfTrips')
NumberOfTrips MAE: 1.2843333333333333
train set:
1.000000 234
1.963333 1
2.000000 594
Name: NumberOfTrips, dtype: int64
test set:
1.000000 386
2.000000 870
2.856667 1
Name: NumberOfTrips, dtype: int64
train_enc['NumberOfTrips'] = train_enc['NumberOfTrips'].round()
test['NumberOfTrips'] = test['NumberOfTrips'].round()
test['NumberOfTrips'].unique()
array([ 1., 5., 6., 3., 7., 4., 2., 8., 22., 21., 20.])
prod_val('NumberOfFollowups')
NumberOfFollowups MAE: 0.605094017094017
train set:
1.000000 74
2.000000 89
2.436667 1
Name: NumberOfFollowups, dtype: int64
test set:
1.000000 102
2.000000 140
2.706667 1
Name: NumberOfFollowups, dtype: int64
train_enc['NumberOfFollowups'] = train_enc['NumberOfFollowups'].round()
test['NumberOfFollowups'] = test['NumberOfFollowups'].round()
test['NumberOfFollowups'].unique()
array([5., 3., 4., 6., 1., 2.])
prod_val('PreferredPropertyStar')
PreferredPropertyStar MAE: 0.6105299145299146
train set:
3.000000 1212
3.126667 1
3.353333 1
Name: PreferredPropertyStar, dtype: int64
test set:
3.000000 1781
3.033333 1
3.100000 1
Name: PreferredPropertyStar, dtype: int64
train_enc['PreferredPropertyStar'] = train_enc['PreferredPropertyStar'].round()
test['PreferredPropertyStar'] = test['PreferredPropertyStar'].round()
test['PreferredPropertyStar'].unique()
array([3., 4., 5.])
prod_val('NumberOfChildrenVisiting')
NumberOfChildrenVisiting MAE: 0.5187938596491228
train set:
1.000000 1099
1.003333 6
1.030000 1
Name: NumberOfChildrenVisiting, dtype: int64
test set:
1.000000 1712
1.003333 3
1.043333 1
Name: NumberOfChildrenVisiting, dtype: int64
train_enc['NumberOfChildrenVisiting'] = train_enc['NumberOfChildrenVisiting'].round()
test['NumberOfChildrenVisiting'] = test['NumberOfChildrenVisiting'].round()
test['NumberOfChildrenVisiting'].unique()
array([1., 2., 3.])
3.3.(Replaced by the module) Predicting Age for the rows where Age is 0
- The Age missing values had been filled with 0, so the idea was to predict Age from the other features instead.
# temp = train_enc[train_enc['Age']!=0]
# temp.head()
# features = temp.columns[2:-1]
# target = 'Age'
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_absolute_error
# X = temp[features]
# y = temp[target]
# X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,shuffle=False)
# model_rf = ExtraTreesRegressor(n_estimators=500)
# model_rf.fit(X_train,y_train)
# predict = model_rf.predict(X_test)
# mean_absolute_error(predict,y_test)
# temp = train_enc[train_enc['Age']==0]
# temp.head()
# X = temp[features]
# train_enc.loc[train_enc['Age']==0,'Age'] = model_rf.predict(X)
# train_enc['Age'].value_counts().sort_index()
# temp = test[test['Age']==0]
# temp.head()
# X = temp[features]
# test.loc[test['Age']==0,'Age'] = model_rf.predict(X)
# test['Age'].value_counts().sort_index()
3.4.(Replaced by the module) Predicting the MonthlyIncome missing values
- Predict MonthlyIncome in the same way as Age.
# temp = train_enc[train_enc['MonthlyIncome']!=0]
# temp.head()
# features = temp.columns[1:-1].drop('MonthlyIncome')
# target = 'MonthlyIncome'
# features
# from sklearn.ensemble import ExtraTreesRegressor
# X = temp[features]
# y = temp[target]
# X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,shuffle=False)
# model_rf = ExtraTreesRegressor(n_estimators=500)
# model_rf.fit(X_train,y_train)
# predict = model_rf.predict(X_test)
# mean_absolute_error(predict,y_test)
# temp = train_enc[train_enc['MonthlyIncome']==0]
# temp.head()
# X = temp[features]
# train_enc.loc[train_enc['MonthlyIncome']==0,'MonthlyIncome'] = model_rf.predict(X)
# train_enc['MonthlyIncome'].value_counts().sort_index()
# temp = test[test['MonthlyIncome']==0]
# temp.head()
# X = temp[features]
# test.loc[test['MonthlyIncome']==0,'MonthlyIncome'] = model_rf.predict(X)
# test['MonthlyIncome'].value_counts().sort_index()
3.4.scaling
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
scaler = StandardScaler()
min_scaler = MinMaxScaler()
train_enc[['MonthlyIncome']] = scaler.fit_transform(train_enc[['MonthlyIncome']])
test[['MonthlyIncome']] = scaler.transform(test[['MonthlyIncome']])
train_enc[['Age','DurationOfPitch']] = min_scaler.fit_transform(train_enc[['Age','DurationOfPitch']])
test[['Age','DurationOfPitch']] = min_scaler.transform(test[['Age','DurationOfPitch']])
train_enc.head()
| id | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | ProdTaken | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.232558 | 0 | 1 | 0.161290 | 3 | 1 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1.0 | 1 | -0.543896 | 0 |
| 1 | 2 | 0.372093 | 1 | 3 | 0.516129 | 3 | 0 | 2 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 1.0 | 2 | -0.684698 | 1 |
| 2 | 3 | 0.627907 | 0 | 1 | 0.322581 | 2 | 1 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 1.0 | 2 | -0.697951 | 0 |
| 3 | 4 | 0.255814 | 0 | 1 | 0.064516 | 3 | 1 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1.0 | 1 | -0.384262 | 1 |
| 4 | 5 | 0.558140 | 1 | 3 | 0.032258 | 2 | 1 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 1.0 | 2 | -0.629453 | 0 |
4.Modeling
features = ['Age','CityTier', 'DurationOfPitch',
'Occupation', 'NumberOfFollowups',
'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'Designation', 'MonthlyIncome']
features
['Age',
'CityTier',
'DurationOfPitch',
'Occupation',
'NumberOfFollowups',
'ProductPitched',
'PreferredPropertyStar',
'MaritalStatus',
'NumberOfTrips',
'Passport',
'PitchSatisfactionScore',
'OwnCar',
'Designation',
'MonthlyIncome']
4.1.RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
X = train_enc[features]
y = train_enc['ProdTaken']
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=42,shuffle=False)
rf_model = RandomForestClassifier()
rf_model.fit(X_train,y_train)
y_pred_rf = rf_model.predict(X_test)
print(accuracy_score(y_pred_rf,y_test))
0.8826530612244898
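The 0.88 above again comes from a single, non-shuffled 10% split; a hedged sanity check with stratified 5-fold cross-validation on the same X and y (5 folds is an arbitrary choice):
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(RandomForestClassifier(), X, y, cv=cv, scoring='accuracy').mean())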
4.2.Xgboost
from xgboost import XGBClassifier
X = train_enc[features]
y = train_enc['ProdTaken']
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=42,shuffle=False)
xgb_model = XGBClassifier()
xgb_model.fit(X_train,y_train)
y_pred_xgb = xgb_model.predict(X_test)
print(accuracy_score(y_pred_xgb,y_test))
0.8673469387755102
4.3.catboost
!pip install catboost
X
| Age | CityTier | DurationOfPitch | Occupation | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.232558 | 1 | 0.161290 | 3 | 4.0 | 0 | 3.0 | 1 | 3.0 | 0 | 1 | 0 | 1 | -0.543896 |
| 1 | 0.372093 | 3 | 0.516129 | 3 | 4.0 | 1 | 4.0 | 2 | 1.0 | 1 | 5 | 1 | 2 | -0.684698 |
| 2 | 0.627907 | 1 | 0.322581 | 2 | 3.0 | 1 | 4.0 | 1 | 2.0 | 0 | 4 | 1 | 2 | -0.697951 |
| 3 | 0.255814 | 1 | 0.064516 | 3 | 5.0 | 0 | 4.0 | 1 | 3.0 | 0 | 4 | 0 | 1 | -0.384262 |
| 4 | 0.558140 | 3 | 0.032258 | 2 | 3.0 | 1 | 3.0 | 0 | 2.0 | 0 | 3 | 1 | 2 | -0.629453 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 0.232558 | 1 | 0.161290 | 3 | 5.0 | 0 | 3.0 | 2 | 2.0 | 0 | 1 | 1 | 1 | -0.483092 |
| 1951 | 0.534884 | 3 | 0.096774 | 2 | 3.0 | 4 | 5.0 | 0 | 1.0 | 0 | 5 | 1 | 0 | 1.466964 |
| 1952 | 0.465116 | 3 | 0.741935 | 3 | 4.0 | 0 | 3.0 | 0 | 7.0 | 0 | 2 | 1 | 1 | -0.316641 |
| 1953 | 0.232558 | 3 | 0.806452 | 3 | 5.0 | 1 | 3.0 | 1 | 3.0 | 0 | 1 | 1 | 2 | -0.214941 |
| 1954 | 0.093023 | 1 | 0.129032 | 2 | 4.0 | 0 | 3.0 | 0 | 1.0 | 1 | 3 | 0 | 1 | -0.997869 |
1955 rows × 14 columns
cat_features = [1,3,10,11,12]
train_enc[features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1955 entries, 0 to 1954
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1955 non-null float64
1 CityTier 1955 non-null int64
2 DurationOfPitch 1955 non-null float64
3 Occupation 1955 non-null int64
4 NumberOfFollowups 1955 non-null float64
5 ProductPitched 1955 non-null int64
6 PreferredPropertyStar 1955 non-null float64
7 MaritalStatus 1955 non-null int64
8 NumberOfTrips 1955 non-null float64
9 Passport 1955 non-null int64
10 PitchSatisfactionScore 1955 non-null int64
11 OwnCar 1955 non-null int64
12 Designation 1955 non-null int64
13 MonthlyIncome 1955 non-null float64
dtypes: float64(6), int64(8)
memory usage: 214.0 KB
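The cat_features indices above point into the features list: 1 is CityTier, 3 is Occupation, 10 is PitchSatisfactionScore, 11 is OwnCar, 12 is Designation. A small sketch that derives them by name, which is less fragile if the feature list ever changes:
cat_cols = ['CityTier', 'Occupation', 'PitchSatisfactionScore', 'OwnCar', 'Designation']
cat_features = [features.index(c) for c in cat_cols]
print(cat_features)  # [1, 3, 10, 11, 12]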
train_enc = train_enc.astype({'DurationOfPitch':'int','NumberOfFollowups':'int','PreferredPropertyStar':'int','NumberOfTrips':'int'})
train_enc[features].dtypes
Age float64
CityTier int64
DurationOfPitch int64
Occupation int64
NumberOfFollowups int64
ProductPitched int64
PreferredPropertyStar int64
MaritalStatus int64
NumberOfTrips int64
Passport int64
PitchSatisfactionScore int64
OwnCar int64
Designation int64
MonthlyIncome float64
dtype: object
from catboost import CatBoostClassifier
X = train_enc[features]
y = train_enc['ProdTaken']
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=42,shuffle=False)
cat_model = CatBoostClassifier()
cat_model.fit(X_train,y_train,
eval_set=(X_test,y_test),
cat_features=cat_features,
use_best_model=True,
verbose=True
)
y_pred_cat = cat_model.predict(X_test)
print(accuracy_score(y_pred_cat,y_test))
0.8826530612244898
4.4.LightGBM
from lightgbm import LGBMClassifier
X = train_enc[features]
y = train_enc['ProdTaken']
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=42,shuffle=False)
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train,y_train)
y_pred_lgbm = lgbm_model.predict(X_test)
print(accuracy_score(y_pred_lgbm,y_test))
0.8724489795918368
4.5.ExtraTrees
from sklearn.ensemble import ExtraTreesClassifier
X = train_enc[features]
y = train_enc['ProdTaken']
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=42,shuffle=False)
et_model = ExtraTreesClassifier()
et_model.fit(X_train,y_train)
y_pred_et = et_model.predict(X_test)
print(accuracy_score(y_pred_et,y_test))
0.8979591836734694
5.Model ensemble
from sklearn.model_selection import GridSearchCV
train_enc.dtypes
id int64
Age float64
TypeofContact int64
CityTier int64
DurationOfPitch int64
Occupation int64
Gender int64
NumberOfPersonVisiting int64
NumberOfFollowups int64
ProductPitched int64
PreferredPropertyStar int64
MaritalStatus int64
NumberOfTrips int64
Passport int64
PitchSatisfactionScore int64
OwnCar int64
NumberOfChildrenVisiting float64
Designation int64
MonthlyIncome float64
ProdTaken int64
dtype: object
X = train_enc[features]
y = train_enc['ProdTaken']
X
| Age | CityTier | DurationOfPitch | Occupation | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.232558 | 1 | 0 | 3 | 4 | 0 | 3 | 1 | 3 | 0 | 1 | 0 | 1 | -0.543896 |
| 1 | 0.372093 | 3 | 0 | 3 | 4 | 1 | 4 | 2 | 1 | 1 | 5 | 1 | 2 | -0.684698 |
| 2 | 0.627907 | 1 | 0 | 2 | 3 | 1 | 4 | 1 | 2 | 0 | 4 | 1 | 2 | -0.697951 |
| 3 | 0.255814 | 1 | 0 | 3 | 5 | 0 | 4 | 1 | 3 | 0 | 4 | 0 | 1 | -0.384262 |
| 4 | 0.558140 | 3 | 0 | 2 | 3 | 1 | 3 | 0 | 2 | 0 | 3 | 1 | 2 | -0.629453 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1950 | 0.232558 | 1 | 0 | 3 | 5 | 0 | 3 | 2 | 2 | 0 | 1 | 1 | 1 | -0.483092 |
| 1951 | 0.534884 | 3 | 0 | 2 | 3 | 4 | 5 | 0 | 1 | 0 | 5 | 1 | 0 | 1.466964 |
| 1952 | 0.465116 | 3 | 0 | 3 | 4 | 0 | 3 | 0 | 7 | 0 | 2 | 1 | 1 | -0.316641 |
| 1953 | 0.232558 | 3 | 0 | 3 | 5 | 1 | 3 | 1 | 3 | 0 | 1 | 1 | 2 | -0.214941 |
| 1954 | 0.093023 | 1 | 0 | 2 | 4 | 0 | 3 | 0 | 1 | 1 | 3 | 0 | 1 | -0.997869 |
1955 rows × 14 columns
models = []
rf = RandomForestClassifier()
models.append(rf)
xgb = XGBClassifier()
models.append(xgb)
cat = CatBoostClassifier()
models.append(cat)
lgbm = LGBMClassifier()
models.append(lgbm)
et = ExtraTreesClassifier()
models.append(et)
param_rf = {
'max_depth': [80, 100,None],
'min_samples_leaf': [1,3],
'min_samples_split': [2,8, 10],
'n_estimators': [100, 300, 500,700],
'criterion': ['gini','entropy']
}
param_xgb = {
'learning_rate': [0.05, 0.1],
'min_child_weight': [1, 5, 10],
'n_estimators': [100, 300,500,700],
'gamma': [0,0.5, 1],
#'subsample': [0.6, 1.0],
#'colsample_bytree': [0.6, 1.0],
'max_depth': [3, 4, 5]
}
param_cat = {
#'iterations':[600,None],
'learning_rate': [0.05, 0.1,None],
'depth': [4, 6,None],
'l2_leaf_reg': [1, 3, 5,None],
'cat_features' : [cat_features]
}
param_lgbm = {
'learning_rate': [0.05, 0.1],
'n_estimators': [100, 300, 500],
#'num_leaves': [6,16,31], # large num_leaves helps improve accuracy but might lead to over-fitting
'boosting_type' : ['dart','gbdt'], # trying 'dart' for better accuracy, keeping 'gbdt' as the default
'objective' : ['binary'],
#'colsample_bytree' : [0.64, 0.66,1],
#'subsample' : [0.7,0.75,1],
'reg_alpha' : [1,1.2,0],
'reg_lambda' : [1,1.4,0],
}
param_et = {
'max_depth': [80, 90, 100,None],
'min_samples_leaf': [1,2,3],
#'min_samples_split': [2,10, 12],
'n_estimators': [100, 300, 500,700],
'criterion': ['gini','entropy']
}
params = []
params.append(param_rf)
params.append(param_xgb)
params.append(param_cat)
params.append(param_lgbm)
params.append(param_et)
params
[{'max_depth': [80, 100, None],
'min_samples_leaf': [1, 3],
'min_samples_split': [2, 8, 10],
'n_estimators': [100, 300, 500, 700],
'criterion': ['gini', 'entropy']},
{'learning_rate': [0.05, 0.1],
'min_child_weight': [1, 5, 10],
'n_estimators': [100, 300, 500, 700],
'gamma': [0, 0.5, 1],
'max_depth': [3, 4, 5]},
{'learning_rate': [0.05, 0.1, None],
'depth': [4, 6, None],
'l2_leaf_reg': [1, 3, 5, None],
'cat_features': [[1, 3, 10, 11, 12]]},
{'learning_rate': [0.05, 0.1],
'n_estimators': [100, 300, 500],
'boosting_type': ['dart', 'gbdt'],
'objective': ['binary'],
'reg_alpha': [1, 1.2, 0],
'reg_lambda': [1, 1.4, 0]},
{'max_depth': [80, 90, 100, None],
'min_samples_leaf': [1, 2, 3],
'n_estimators': [100, 300, 500, 700],
'criterion': ['gini', 'entropy']}]
# best_models = {}
# models = GridSearchCV(models[0],param_grid = params[0], cv=7, return_train_score = True, verbose=2)
# models.fit(X,y)
# best_models[0] = models[0].best_estimator_
# models = GridSearchCV(models[1],param_grid = params[1], cv=7, return_train_score = True, verbose=2)
# models.fit(X,y)
# best_models[1] = models[1].best_estimator_
# models = GridSearchCV(models[2],param_grid = params[2], cv=7, return_train_score = True, verbose=10)
# models.fit(X,y)
# best_models[2] = models[2].best_estimator_
# models = GridSearchCV(models[3],param_grid = params[3], cv=7, return_train_score = True, verbose=2)
# models.fit(X,y)
# best_models[3] = models[3].best_estimator_
# models = GridSearchCV(models[4],param_grid = params[4], cv=7, return_train_score = True, verbose=2)
# models.fit(X,y)
# best_models[4] = models[4].best_estimator_
best_models = {}
for i, model in enumerate(models):
    model = GridSearchCV(model, param_grid=params[i], cv=7, return_train_score=True, verbose=2)
    model.fit(X, y)
    best_models[i] = model.best_estimator_
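With cv=7, the grids above amount to several thousand model fits, which is slow, especially for CatBoost. A cheaper hedged alternative is to sample the same grids with RandomizedSearchCV (n_iter=30 is an arbitrary budget); a sketch of that variant of the loop:
from sklearn.model_selection import RandomizedSearchCV
best_models = {}
for i, model in enumerate(models):
    search = RandomizedSearchCV(model, param_distributions=params[i], n_iter=30,
                                cv=7, random_state=42, verbose=2)
    search.fit(X, y)
    best_models[i] = search.best_estimator_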
test = test.astype({'DurationOfPitch':'int','NumberOfFollowups':'int','PreferredPropertyStar':'int','NumberOfTrips':'int'})
test[features].dtypes
Age float64
CityTier int64
DurationOfPitch int64
Occupation int64
NumberOfFollowups int64
ProductPitched int64
PreferredPropertyStar int64
MaritalStatus int64
NumberOfTrips int64
Passport int64
PitchSatisfactionScore int64
OwnCar int64
Designation int64
MonthlyIncome float64
dtype: object
best_models
{0: RandomForestClassifier(n_estimators=700),
1: XGBClassifier(learning_rate=0.05, max_depth=5, n_estimators=700),
2: <catboost.core.CatBoostClassifier at 0x7f2430a31e90>,
3: LGBMClassifier(n_estimators=500, objective='binary', reg_alpha=0, reg_lambda=0),
4: ExtraTreesClassifier(max_depth=100, n_estimators=300)}
pred0 = best_models[0].predict(test[features])
pred1 = best_models[1].predict(test[features])
pred2 = best_models[2].predict(test[features])
pred3 = best_models[3].predict(test[features])
pred4 = best_models[4].predict(test[features])
pred = pd.DataFrame({'pred0':pred0,'pred1':pred1,'pred2':pred2,'pred3':pred3,'pred4':pred4})
pred.head()
| pred0 | pred1 | pred2 | pred3 | pred4 | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 |
pred
| pred0 | pred1 | pred2 | pred3 | pred4 | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| 2928 | 0 | 0 | 0 | 0 | 0 |
| 2929 | 0 | 0 | 0 | 0 | 0 |
| 2930 | 0 | 0 | 0 | 0 | 0 |
| 2931 | 0 | 0 | 0 | 0 | 0 |
| 2932 | 0 | 0 | 0 | 0 | 0 |
2933 rows × 5 columns
pred['pred3'].value_counts()
0 2511
1 422
Name: pred3, dtype: int64
pred['pred'] = pred.mode(axis=1)[0].astype(int)
pred
| pred0 | pred1 | pred2 | pred3 | pred4 | pred | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2928 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2929 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2930 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2931 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2932 | 0 | 0 | 0 | 0 | 0 | 0 |
2933 rows × 6 columns
pred[pred['pred4']!= pred['pred']]
| pred0 | pred1 | pred2 | pred3 | pred4 | pred | |
|---|---|---|---|---|---|---|
| 7 | 1 | 1 | 1 | 1 | 0 | 1 |
| 22 | 1 | 0 | 0 | 0 | 1 | 0 |
| 70 | 0 | 0 | 0 | 0 | 1 | 0 |
| 83 | 0 | 0 | 0 | 0 | 1 | 0 |
| 114 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2830 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2869 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2890 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2908 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2909 | 0 | 0 | 0 | 0 | 1 | 0 |
118 rows × 6 columns
sample_submission['ProdTaken'] = pred['pred']
sample_submission.to_csv('submission1.csv',index=False)
sample_submission['ProdTaken'] = pred['pred4']
sample_submission.to_csv('submission3.csv',index=False)
pred0 = best_models[0].predict_proba(test[features])
pred1 = best_models[1].predict_proba(test[features])
pred2 = best_models[2].predict_proba(test[features])
pred3 = best_models[3].predict_proba(test[features])
pred4 = best_models[4].predict_proba(test[features])
pred = pd.DataFrame((pred0 + pred1 + pred2 + pred3 + pred4) / 5)  # average the five models' class probabilities (soft voting)
pred.head()
| 0 | 1 | |
|---|---|---|
| 0 | 3.278279 | 0.921721 |
| 1 | 3.971699 | 0.228301 |
| 2 | 4.168868 | 0.031132 |
| 3 | 4.053194 | 0.146806 |
| 4 | 0.285364 | 3.914636 |
import numpy as np
pred['pred'] = pd.DataFrame(np.argmax(np.array(pred),axis =1 ))
pred
| 0 | 1 | pred | |
|---|---|---|---|
| 0 | 3.278279 | 0.921721 | 0 |
| 1 | 3.971699 | 0.228301 | 0 |
| 2 | 4.168868 | 0.031132 | 0 |
| 3 | 4.053194 | 0.146806 | 0 |
| 4 | 0.285364 | 3.914636 | 1 |
| ... | ... | ... | ... |
| 2928 | 4.167008 | 0.032992 | 0 |
| 2929 | 4.162987 | 0.037013 | 0 |
| 2930 | 3.921309 | 0.278691 | 0 |
| 2931 | 3.899447 | 0.300553 | 0 |
| 2932 | 4.135267 | 0.064733 | 0 |
2933 rows × 3 columns
sample_submission['ProdTaken'] = pred['pred']
sample_submission.to_csv('submission2.csv',index=False)
estimators =[
('rf', best_models[0]),
('xgb', best_models[1]),
('cat', best_models[2]),
('lgbm', best_models[3]),
('et', best_models[4])
]
from sklearn.ensemble import VotingClassifier
model = VotingClassifier(estimators = estimators, voting='soft')
model.fit(X,y)
model.predict(test[features])
array([0, 0, 0, ..., 0, 0, 0])
model.score(X,y)
1.0
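The score of 1.0 above is measured on the training data itself, so it mostly reflects how easily the tree ensembles memorize the training set rather than how well they generalize. A quick hedged estimate with cross-validation on the same voting classifier:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the soft-voting ensemble (5 folds is an arbitrary choice).
print(cross_val_score(model, X, y, cv=5, scoring='accuracy').mean())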
sample_submission['ProdTaken'] = pd.DataFrame(model.predict(test[features]))
sample_submission
| id | ProdTaken | |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 4 | 0 |
| 4 | 5 | 1 |
| ... | ... | ... |
| 2928 | 2929 | 0 |
| 2929 | 2930 | 0 |
| 2930 | 2931 | 0 |
| 2931 | 2932 | 0 |
| 2932 | 2933 | 0 |
2933 rows × 2 columns
sample_submission.to_csv('submission.csv', index=False)