Titanic Dataset Challenge

  • The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

  • Download train.csv and test.csv from the Kaggle Titanic challenge.
  • Save the two files in the datasets directory as titanic_train.csv and titanic_test.csv.

1. Data Exploration

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Loading the data

train_df = pd.read_csv("./datasets/titanic_train.csv")
test_df = pd.read_csv("./datasets/titanic_test.csv")
submission = pd.read_csv("./datasets/gender_submission.csv")

2. Taking a first look at the data

train_df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
  • Survived: the target; 0 means the passenger did not survive, 1 means they survived
  • Pclass: passenger class (1st, 2nd, or 3rd)
  • Name, Sex, Age: self-explanatory
  • SibSp: number of siblings and spouses aboard
  • Parch: number of children and parents aboard
  • Ticket: ticket id
  • Fare: ticket fare (in pounds)
  • Cabin: cabin number
  • Embarked: port of embarkation: C (Cherbourg), Q (Queenstown), S (Southampton)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
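Before moving on, it is worth tallying the missing values explicitly (a quick check; the counts follow from the non-null figures above: Age 177, Cabin 687, Embarked 2):

train_df.isnull().sum().sort_values(ascending=False)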

Exploring the categorical features

  • Pclass, Sex, Embarked
  • The Embarked feature is the port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton.
train_df['Pclass'].value_counts(dropna=False)
3    491
1    216
2    184
Name: Pclass, dtype: int64
train_df['Sex'].value_counts(dropna=False)
male      577
female    314
Name: Sex, dtype: int64
train_df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

Exploring the numerical features

train_df.hist(bins=50, figsize=(20, 10))
plt.show()

[Figure: histograms of the numeric features]

train_df.corrwith(train_df['Survived']).sort_values(ascending=False)
Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
dtype: float64
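Note that corrwith only covers the numeric columns, so Sex does not appear above. As a rough sketch, mapping it to 0/1 first exposes its linear association with the target as well:

# Sex is object-typed and therefore excluded from corrwith above
train_df['Sex'].map({'male': 0, 'female': 1}).corr(train_df['Survived'])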

Checking for outliers

train_df.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
  • Check the distribution of the Fare feature with a boxplot
train_df['Fare'].plot(kind='box', figsize=(6, 6))
plt.show()

[Figure: boxplot of Fare]

  • Check the distributions of the numeric features with boxplots
num_columns = ['Age', 'Fare', 'SibSp', 'Parch']
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(12, 8))
for i, column in enumerate(num_columns):
  train_df[column].plot(kind='box', ax=axes[i])
plt.show()

[Figure: boxplots of Age, Fare, SibSp, and Parch]

# seaborn gives a more polished visualization
num_columns = ['Age', 'Fare', 'SibSp', 'Parch']
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 4))
for i, column in enumerate(num_columns):
  sns.boxplot(data=train_df, x=column, ax=axes[i])
plt.show()

[Figure: seaborn boxplots of Age, Fare, SibSp, and Parch]

q1 = train_df["Fare"].quantile(0.25)
q3 = train_df["Fare"].quantile(0.75)
iqr = q3 - q1 # length of the box (interquartile range)

cond = train_df["Fare"] >= q3 + (1.5 * iqr) # the usual upper-fence rule
outliers = train_df.loc[cond]
max_wo_outlier = outliers['Fare'].min() # smallest outlier fare, used below as the cap
print(outliers.index, len(outliers))
Int64Index([  1,  27,  31,  34,  52,  61,  62,  72,  88, 102,
            ...
            792, 802, 820, 829, 835, 846, 849, 856, 863, 879],
           dtype='int64', length=116) 116
train_df.loc[outliers.index, 'Fare'] = max_wo_outlier
train_df['Fare'].describe()
count    891.000000
mean      24.172526
std       20.738142
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max       66.600000
Name: Fare, dtype: float64
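The same capping is applied to test_df later, so the logic could also be wrapped in a small helper (a sketch; cap_outliers is a hypothetical name, and the fences are meant to be computed once on the training data):

def cap_outliers(df, column, q1, q3):
  # cap values above the upper IQR fence at the smallest outlier value
  iqr = q3 - q1
  cond = df[column] >= q3 + (1.5 * iqr)
  if cond.any():
    df.loc[cond, column] = df.loc[cond, column].min()
  return df

# e.g. cap_outliers(test_df, 'Fare', q1, q3) reuses the train quantiles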

Exploring features that influence the target

cat_columns = ['Pclass', 'Embarked', 'Sex', 'SibSp', 'Parch']
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(25, 4))
for i, column in enumerate(cat_columns):
  sns.barplot(data=train_df, x=column, y='Survived', ax=axes[i])
plt.show()

[Figure: survival rate by Pclass, Embarked, Sex, SibSp, and Parch]

# Add a new feature derived from the existing ones
train_df['Family_size'] = train_df['Parch'] + train_df['SibSp'] + 1
sns.barplot(data=train_df, x='Family_size', y='Survived')
plt.show()

[Figure: survival rate by Family_size]

train_df['Family_size'].value_counts()
1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: Family_size, dtype: int64
bins = [0, 1, 2, 4, 12]
train_df['Family_size'] = pd.cut(train_df['Family_size'], bins=bins, labels=['Single', 'SmallF', 'MedF', 'LargeF'])
sns.barplot(data=train_df, x='Family_size', y='Survived')
plt.show()

[Figure: survival rate by binned Family_size]

bins = [0, 18, 25, 60, 90]
group_names = ['Children', 'Youth', 'Adult', 'Senior']

train_df['Age_cat'] = pd.cut(train_df['Age'], bins, labels=group_names)
sns.barplot(data=train_df, x='Age_cat', y='Survived')
plt.show()

[Figure: survival rate by Age_cat]
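Keep in mind that pd.cut uses right-closed intervals by default, so with bins=[0, 18, 25, 60, 90] an age of exactly 18 still lands in 'Children'. A tiny illustration:

pd.cut(pd.Series([0.42, 18.0, 18.5, 59.0, 80.0]), bins=[0, 18, 25, 60, 90],
       labels=['Children', 'Youth', 'Adult', 'Senior'])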

Checking the target distribution

train_df['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
sns.countplot(data=train_df, x='Survived')
plt.show()

[Figure: counts of the Survived classes]

train_df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Family_size',
       'Age_cat'],
      dtype='object')
# Apply everything done to train_df so far to test_df as well:
# outlier capping, plus the new Family_size and Age_cat columns

# (1) cap Fare outliers (q3 and iqr were computed on the training set)
cond = test_df["Fare"] >= q3 + (1.5 * iqr)
outliers = test_df.loc[cond]
max_wo_outlier = outliers['Fare'].min()
test_df.loc[outliers.index, 'Fare'] = max_wo_outlier

# (2) add Family_size and bin it into the same categories
test_df['Family_size'] = test_df['Parch'] + test_df['SibSp'] + 1
bins = [0, 1, 2, 4, 12]
test_df['Family_size'] = pd.cut(test_df['Family_size'], bins=bins, labels=['Single', 'SmallF', 'MedF', 'LargeF'])

# (3) add Age_cat, using the same bins as the training set
bins = [0, 18, 25, 60, 90]
group_names = ['Children', 'Youth', 'Adult', 'Senior']

test_df['Age_cat'] = pd.cut(test_df['Age'], bins, labels=group_names)

test_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Family_size', 'Age_cat'],
      dtype='object')
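A quick sanity check (a sketch) confirms that the two frames now differ only by the target column:

# train_df and test_df should now differ only by the Survived target
assert set(train_df.columns) - set(test_df.columns) == {'Survived'}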

2. Data Preprocessing (missing values, binning, etc.)

2.1 Separating the feature data from the label data

X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']

2.2 Preprocessing the categorical data

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
import sklearn
sklearn.__version__
'0.24.1'
cat_pipeline = Pipeline([
                ('imputer', SimpleImputer(strategy='most_frequent')), # transformer that fills in missing values
                ('oh_encoder', OneHotEncoder(sparse=False)) # one-hot encoding transformer; sklearn 1.2+ uses sparse_output
            ])
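As a quick standalone check (a sketch), running just this pipeline on Embarked should fill the two missing values with the most frequent port and produce one column per category:

cat_pipeline.fit_transform(train_df[['Embarked']]).shape # expected: (891, 3)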

2.3 Preprocessing the numerical data

num_pipeline = Pipeline([
                ('imputer', SimpleImputer(strategy='median')), # transformer that fills in missing values
                ('std_scaler', StandardScaler()) # transformer that standardizes to mean 0, std 1
            ])

2.4 Combining the numerical and categorical data

from sklearn.compose import ColumnTransformer

num_attribs = ['Age', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked', 'Family_size'] #, 'Age_cat']

full_pipeline = ColumnTransformer([
                    #('num', SimpleImputer(strategy='median'), num_attribs),
                    ('num', num_pipeline, num_attribs), # numerical
                    ('cat', cat_pipeline, cat_attribs)  # categorical
            ])

X_train_prepared = full_pipeline.fit_transform(X_train)
X_train.shape
(891, 13)
X_train_prepared.shape
(891, 14)
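Rather than hard-coding the one-hot column names (as done further below), they can be recovered from the fitted pipeline. A small sketch, assuming sklearn 0.24 as printed above (sklearn 1.0+ renames the method to get_feature_names_out):

oh_encoder = full_pipeline.named_transformers_['cat'].named_steps['oh_encoder']
num_attribs + list(oh_encoder.get_feature_names(cat_attribs)) # actual category names, e.g. 'Sex_female'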

3. Model Selection and Training

  • Logistic Regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(max_iter=1000, random_state=42) # smaller C means stronger regularization
log_clf_scores = cross_val_score(log_clf, X_train_prepared, y_train, scoring="accuracy", cv=3)
log_clf_scores
array([0.7979798 , 0.81144781, 0.81481481])
  • K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf_scores = cross_val_score(knn_clf, X_train_prepared, y_train, scoring="accuracy", cv=3)
knn_clf_scores
array([0.7979798, 0.8047138, 0.8047138])
  • Decision Tree
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf_scores = cross_val_score(tree_clf, X_train_prepared, y_train, scoring="accuracy", cv=3)
tree_clf_scores
array([0.73400673, 0.78114478, 0.76430976])
  • RandomForest
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
rnd_clf_scores = cross_val_score(rnd_clf, X_train_prepared, y_train, scoring="accuracy", cv=3)
rnd_clf_scores
array([0.75757576, 0.81481481, 0.79124579])
  • SVM
from sklearn.svm import LinearSVC, SVC

# 1. LinearSVC
linear_svc = LinearSVC(random_state = 42)
linear_svc_scores = cross_val_score(linear_svc, X_train_prepared, y_train, scoring="accuracy", cv=3)
print(linear_svc_scores)

# 2. SVC with a linear kernel
linear_kernel_svc = SVC(kernel='linear')
linear_kernel_svc_scores = cross_val_score(linear_kernel_svc, X_train_prepared, y_train, scoring="accuracy", cv=3)
print(linear_kernel_svc_scores)
[0.7979798  0.81144781 0.80808081]
[0.7979798  0.81481481 0.7979798 ]
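To compare the six models at a glance, the score arrays above can be summarized (a minimal sketch using the variables already defined):

for name, scores in [('LogisticRegression', log_clf_scores),
                     ('KNN', knn_clf_scores),
                     ('DecisionTree', tree_clf_scores),
                     ('RandomForest', rnd_clf_scores),
                     ('LinearSVC', linear_svc_scores),
                     ('SVC(linear)', linear_kernel_svc_scores)]:
  print(f'{name:>18}: mean={scores.mean():.4f} (+/- {scores.std():.4f})')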

4. Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
  • LogisticRegression
param_grid = {'C':[0.1, 0.2, 0.5, 1, 10], 'max_iter':[1000, 2000]} # 5x2 = 10 combinations

grid_search = GridSearchCV(log_clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1) # 10 combinations x 5 folds = 50 fits and validations
grid_search.fit(X_train_prepared, y_train)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=1000, random_state=42),
             n_jobs=-1,
             param_grid={'C': [0.1, 0.2, 0.5, 1, 10], 'max_iter': [1000, 2000]},
             scoring='accuracy')
grid_search.best_params_
{'C': 1, 'max_iter': 1000}
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
  print(mean_score, params)
0.8047329106772958 {'C': 0.1, 'max_iter': 1000}
0.8047329106772958 {'C': 0.1, 'max_iter': 2000}
0.8103195028560668 {'C': 0.2, 'max_iter': 1000}
0.8103195028560668 {'C': 0.2, 'max_iter': 2000}
0.8114493754315486 {'C': 0.5, 'max_iter': 1000}
0.8114493754315486 {'C': 0.5, 'max_iter': 2000}
0.8136965664427844 {'C': 1, 'max_iter': 1000}
0.8136965664427844 {'C': 1, 'max_iter': 2000}
0.813684012303057 {'C': 10, 'max_iter': 1000}
0.813684012303057 {'C': 10, 'max_iter': 2000}
lr_final_model = grid_search.best_estimator_
lr_final_model.fit(X_train_prepared, y_train)
LogisticRegression(C=1, max_iter=1000, random_state=42)
  • KNeighborsClassifier
param_grid = {'n_neighbors':[3, 5, 7, 9, 11]}

grid_search = GridSearchCV(knn_clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1) # 5 combinations x 5 folds = 25 fits and validations
grid_search.fit(X_train_prepared, y_train)
GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 7, 9, 11]}, scoring='accuracy')
grid_search.best_params_
{'n_neighbors': 5}
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
  print(mean_score, params)
0.7980101688531793 {'n_neighbors': 3}
0.8148327160881299 {'n_neighbors': 5}
0.7946582135459168 {'n_neighbors': 7}
0.7968865733475614 {'n_neighbors': 9}
0.7969305128366079 {'n_neighbors': 11}
  • Decision Tree
param_grid = {'max_depth':[3, 5, 7, 9, 11]}

grid_search = GridSearchCV(tree_clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train_prepared, y_train)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'max_depth': [3, 5, 7, 9, 11]}, scoring='accuracy')
grid_search.best_params_
{'max_depth': 7}
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
  print(mean_score, params)
0.8103132257862029 {'max_depth': 3}
0.8181721172556651 {'max_depth': 5}
0.8282719226664993 {'max_depth': 7}
0.8036155922415418 {'max_depth': 9}
0.7991212102190698 {'max_depth': 11}
tree_final_model = grid_search.best_estimator_
tree_final_model.fit(X_train_prepared, y_train)
DecisionTreeClassifier(max_depth=7)
num_attribs = ['Age', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked', 'Family_size'] #, 'Age_cat']
cat_ohe_attribs = ['Pclass_1', 'Pclass_2', 'Pclass_3',
                   'Sex_1', 'Sex_2', 'Embarked_1', 'Embarked_2', 'Embarked_3',
                   'Family_size_1', 'Family_size_2', 'Family_size_3', 'Family_size_4']
                   #'Age_cat_1', 'Age_cat_2', 'Age_cat_3', 'Age_cat_4', 'Age_cat_5']

all_attribs = num_attribs + cat_ohe_attribs
target_names = ['Not survived', 'Survived']
from sklearn import tree
plt.figure(figsize=(150, 50))
res = tree.plot_tree(tree_final_model,
               feature_names = all_attribs,
               class_names = target_names,
               rounded = True,
               filled = True,
               fontsize=25)

[Figure: the fitted decision tree]

sorted(zip(tree_final_model.feature_importances_, all_attribs), reverse=True)
[(0.4799753034557417, 'Sex_1'),
 (0.1470328405953695, 'Fare'),
 (0.13179082253077198, 'Age'),
 (0.12021196678995298, 'Pclass_3'),
 (0.04869029678945934, 'Family_size_1'),
 (0.04661302686623672, 'Pclass_1'),
 (0.01001711785051913, 'Embarked_3'),
 (0.00633026610480512, 'Family_size_4'),
 (0.006172009452184985, 'Family_size_3'),
 (0.002922820151598823, 'Embarked_1'),
 (0.00024352941335955124, 'Family_size_2'),
 (0.0, 'Sex_2'),
 (0.0, 'Pclass_2'),
 (0.0, 'Embarked_2')]
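The same ranking is easier to scan as a horizontal bar chart (an optional sketch):

importances = pd.Series(tree_final_model.feature_importances_, index=all_attribs)
importances.sort_values().plot(kind='barh', figsize=(8, 6))
plt.show()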
  • RandomForest
param_grid = {'n_estimators':[100, 200, 300], 'max_depth':[5, 7, 9, 11]}

grid_search = GridSearchCV(rnd_clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train_prepared, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [5, 7, 9, 11],
                         'n_estimators': [100, 200, 300]},
             scoring='accuracy')
print(grid_search.best_params_)
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
  print(mean_score, params)

rf_final_model = grid_search.best_estimator_
rf_final_model.fit(X_train_prepared, y_train)

{'max_depth': 7, 'n_estimators': 100}
0.8193082669010107 {'max_depth': 5, 'n_estimators': 100}
0.8193019898311468 {'max_depth': 5, 'n_estimators': 200}
0.8170610758897746 {'max_depth': 5, 'n_estimators': 300}
0.8226916075575922 {'max_depth': 7, 'n_estimators': 100}
0.8226853304877284 {'max_depth': 7, 'n_estimators': 200}
0.8182035026049841 {'max_depth': 7, 'n_estimators': 300}
0.8215805661917017 {'max_depth': 9, 'n_estimators': 100}
0.8215805661917017 {'max_depth': 9, 'n_estimators': 200}
0.8193333751804659 {'max_depth': 9, 'n_estimators': 300}
0.8159625886636119 {'max_depth': 11, 'n_estimators': 100}
0.8159688657334756 {'max_depth': 11, 'n_estimators': 200}
0.8170861841692298 {'max_depth': 11, 'n_estimators': 300}

RandomForestClassifier(max_depth=7, n_jobs=-1, random_state=42)
sorted(zip(rf_final_model.feature_importances_, all_attribs), reverse=True)
[(0.22661319164248106, 'Sex_2'),
 (0.20598466532945145, 'Sex_1'),
 (0.15065113082226642, 'Fare'),
 (0.1476700317091729, 'Age'),
 (0.06749172988344757, 'Pclass_3'),
 (0.04309627388087689, 'Pclass_1'),
 (0.03810687703951553, 'Family_size_1'),
 (0.031037269118605965, 'Family_size_2'),
 (0.0198646408180714, 'Pclass_2'),
 (0.01908234726067075, 'Family_size_3'),
 (0.015539397763291867, 'Embarked_3'),
 (0.01481044882815969, 'Embarked_1'),
 (0.01111682809322282, 'Family_size_4'),
 (0.008935167810765757, 'Embarked_2')]
  • SVM

  • (1) linear_svc

param_grid = {'C':[0.1, 1, 2, 5, 10]}

grid_search = GridSearchCV(linear_svc, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train_prepared, y_train)

print(grid_search.best_params_)
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
  print(mean_score, params)
{'C': 10}
0.8080723118448307 {'C': 0.1}
0.8080785889146945 {'C': 1}
0.8080785889146945 {'C': 2}
0.8080785889146945 {'C': 5}
0.8092021844203126 {'C': 10}


C:\Users\mue\anaconda3\lib\site-packages\sklearn\svm\_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
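The warning means liblinear hit its iteration limit; a common remedy (a sketch, not re-run here) is to raise max_iter before searching again:

linear_svc = LinearSVC(random_state=42, max_iter=10000) # default max_iter is 1000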
  • (2) svc
# Takes a long time to run; reduce the parameter grid before searching.
# param_grid = [{'kernel': ['poly'], 'degree': [2, 3, 4], 'coef0':[1, 50, 100], 'C':[0.1, 1, 2, 5, 10, 50, 100, 500, 1000]},
#               {'kernel': ['rbf'], 'gamma': [0.1, 5, 10], 'C':[0.1, 1, 2, 5, 10, 100, 500, 1000]}]

# grid_search = GridSearchCV(linear_kernel_svc, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# grid_search.fit(X_train_prepared, y_train)

# print(grid_search.best_params_)
# cvres = grid_search.cv_results_

# for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
#   print(mean_score, params)

5. Prediction and Performance Evaluation (via Kaggle)

X_test_preprocessed = full_pipeline.transform(test_df)
X_test_preprocessed
array([[ 0.39488658, -0.78852313,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [ 1.35550962, -0.82852989,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [ 2.50825727, -0.69886497,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.70228595, -0.81646803,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [-0.1046374 , -0.7778701 ,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [-0.1046374 , -0.08753169,  0.        , ...,  1.        ,
         0.        ,  0.        ]])
test_df.shape, X_test_preprocessed.shape
((418, 13), (418, 14))
final_pred = rf_final_model.predict(X_test_preprocessed)
submission["Survived"] = final_pred
ver = 10
submission.to_csv('./datasets/titanic_ver_{}_submission.csv'.format(ver), index=False)
