Machine learning 14 (pca 활용)

2023-03-09 7 분 소요

PCA 활용

default of credit card clients dataset

UCI Machine Learning Repository의 default of credit card clients dataset에서 default of credit card clients.xls를 다운로드
신용카드 데이터 세트 PCA 변환하여 분류 예측 성능 확인하기

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel('./datasets/pca_credit_card.xls', header=1)
df.head(3).T

	0	1	2
ID	1	2	3
LIMIT_BAL	20000	120000	90000
SEX	2	2	2
EDUCATION	2	2	2
MARRIAGE	1	2	2
AGE	24	26	34
PAY_0	2	-1	0
PAY_2	2	2	0
PAY_3	-1	0	0
PAY_4	-1	0	0
PAY_5	-2	0	0
PAY_6	-2	2	0
BILL_AMT1	3913	2682	29239
BILL_AMT2	3102	1725	14027
BILL_AMT3	689	2682	13559
BILL_AMT4	0	3272	14331
BILL_AMT5	0	3455	14948
BILL_AMT6	0	3261	15549
PAY_AMT1	0	0	1518
PAY_AMT2	689	1000	1500
PAY_AMT3	0	1000	1000
PAY_AMT4	0	1000	1000
PAY_AMT5	0	0	1000
PAY_AMT6	0	2000	5000
default payment next month	1	1	0

ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 ID                          30000 non-null  int64
 LIMIT_BAL                   30000 non-null  int64
 SEX                         30000 non-null  int64
 EDUCATION                   30000 non-null  int64
 MARRIAGE                    30000 non-null  int64
 AGE                         30000 non-null  int64
 PAY_0                       30000 non-null  int64
 PAY_2                       30000 non-null  int64
 PAY_3                       30000 non-null  int64
 PAY_4                       30000 non-null  int64
PAY_5                       30000 non-null  int64
PAY_6                       30000 non-null  int64
BILL_AMT1                   30000 non-null  int64
BILL_AMT2                   30000 non-null  int64
BILL_AMT3                   30000 non-null  int64
BILL_AMT4                   30000 non-null  int64
BILL_AMT5                   30000 non-null  int64
BILL_AMT6                   30000 non-null  int64
PAY_AMT1                    30000 non-null  int64
PAY_AMT2                    30000 non-null  int64
PAY_AMT3                    30000 non-null  int64
PAY_AMT4                    30000 non-null  int64
PAY_AMT5                    30000 non-null  int64
PAY_AMT6                    30000 non-null  int64
default payment next month  30000 non-null  int64
dtypes: int64(25)
memory usage: 5.7 MB

corr_matrix = df.corr()
corr_matrix

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
ID	1.000000	0.026179	0.018497	0.039177	-0.029079	0.018678	-0.030575	-0.011215	-0.018494	-0.002735	...	0.040351	0.016705	0.016730	0.009742	0.008406	0.039151	0.007793	0.000652	0.003000	-0.013952
LIMIT_BAL	0.026179	1.000000	0.024755	-0.219161	-0.108139	0.144713	-0.271214	-0.296382	-0.286123	-0.267460	...	0.293988	0.295562	0.290389	0.195236	0.178408	0.210167	0.203242	0.217202	0.219595	-0.153520
SEX	0.018497	0.024755	1.000000	0.014232	-0.031389	-0.090874	-0.057643	-0.070771	-0.066096	-0.060173	...	-0.021880	-0.017005	-0.016733	-0.000242	-0.001391	-0.008597	-0.002229	-0.001667	-0.002766	-0.039961
EDUCATION	0.039177	-0.219161	0.014232	1.000000	-0.143464	0.175061	0.105364	0.121566	0.114025	0.108793	...	-0.000451	-0.007567	-0.009099	-0.037456	-0.030038	-0.039943	-0.038218	-0.040358	-0.037200	0.028006
MARRIAGE	-0.029079	-0.108139	-0.031389	-0.143464	1.000000	-0.414170	0.019917	0.024199	0.032688	0.033122	...	-0.023344	-0.025393	-0.021207	-0.005979	-0.008093	-0.003541	-0.012659	-0.001205	-0.006641	-0.024339
AGE	0.018678	0.144713	-0.090874	0.175061	-0.414170	1.000000	-0.039447	-0.050148	-0.053048	-0.049722	...	0.051353	0.049345	0.047613	0.026147	0.021785	0.029247	0.021379	0.022850	0.019478	0.013890
PAY_0	-0.030575	-0.271214	-0.057643	0.105364	0.019917	-0.039447	1.000000	0.672164	0.574245	0.538841	...	0.179125	0.180635	0.176980	-0.079269	-0.070101	-0.070561	-0.064005	-0.058190	-0.058673	0.324794
PAY_2	-0.011215	-0.296382	-0.070771	0.121566	0.024199	-0.050148	0.672164	1.000000	0.766552	0.662067	...	0.222237	0.221348	0.219403	-0.080701	-0.058990	-0.055901	-0.046858	-0.037093	-0.036500	0.263551
PAY_3	-0.018494	-0.286123	-0.066096	0.114025	0.032688	-0.053048	0.574245	0.766552	1.000000	0.777359	...	0.227202	0.225145	0.222327	0.001295	-0.066793	-0.053311	-0.046067	-0.035863	-0.035861	0.235253
PAY_4	-0.002735	-0.267460	-0.060173	0.108793	0.033122	-0.049722	0.538841	0.662067	0.777359	1.000000	...	0.245917	0.242902	0.239154	-0.009362	-0.001944	-0.069235	-0.043461	-0.033590	-0.026565	0.216614
PAY_5	-0.022199	-0.249411	-0.055064	0.097520	0.035629	-0.053826	0.509426	0.622780	0.686775	0.819835	...	0.271915	0.269783	0.262509	-0.006089	-0.003191	0.009062	-0.058299	-0.033337	-0.023027	0.204149
PAY_6	-0.020270	-0.235195	-0.044008	0.082316	0.034345	-0.048773	0.474553	0.575501	0.632684	0.716449	...	0.266356	0.290894	0.285091	-0.001496	-0.005223	0.005834	0.019018	-0.046434	-0.025299	0.186866
BILL_AMT1	0.019389	0.285430	-0.033642	0.023581	-0.023472	0.056239	0.187068	0.234887	0.208473	0.202812	...	0.860272	0.829779	0.802650	0.140277	0.099355	0.156887	0.158303	0.167026	0.179341	-0.019644
BILL_AMT2	0.017982	0.278314	-0.031183	0.018749	-0.021602	0.054283	0.189859	0.235257	0.237295	0.225816	...	0.892482	0.859778	0.831594	0.280365	0.100851	0.150718	0.147398	0.157957	0.174256	-0.014193
BILL_AMT3	0.024354	0.283236	-0.024563	0.013002	-0.024909	0.053710	0.179785	0.224146	0.227494	0.244983	...	0.923969	0.883910	0.853320	0.244335	0.316936	0.130011	0.143405	0.179712	0.182326	-0.014076
BILL_AMT4	0.040351	0.293988	-0.021880	-0.000451	-0.023344	0.051353	0.179125	0.222237	0.227202	0.245917	...	1.000000	0.940134	0.900941	0.233012	0.207564	0.300023	0.130191	0.160433	0.177637	-0.010156
BILL_AMT5	0.016705	0.295562	-0.017005	-0.007567	-0.025393	0.049345	0.180635	0.221348	0.225145	0.242902	...	0.940134	1.000000	0.946197	0.217031	0.181246	0.252305	0.293118	0.141574	0.164184	-0.006760
BILL_AMT6	0.016730	0.290389	-0.016733	-0.009099	-0.021207	0.047613	0.176980	0.219403	0.222327	0.239154	...	0.900941	0.946197	1.000000	0.199965	0.172663	0.233770	0.250237	0.307729	0.115494	-0.005372
PAY_AMT1	0.009742	0.195236	-0.000242	-0.037456	-0.005979	0.026147	-0.079269	-0.080701	0.001295	-0.009362	...	0.233012	0.217031	0.199965	1.000000	0.285576	0.252191	0.199558	0.148459	0.185735	-0.072929
PAY_AMT2	0.008406	0.178408	-0.001391	-0.030038	-0.008093	0.021785	-0.070101	-0.058990	-0.066793	-0.001944	...	0.207564	0.181246	0.172663	0.285576	1.000000	0.244770	0.180107	0.180908	0.157634	-0.058579
PAY_AMT3	0.039151	0.210167	-0.008597	-0.039943	-0.003541	0.029247	-0.070561	-0.055901	-0.053311	-0.069235	...	0.300023	0.252305	0.233770	0.252191	0.244770	1.000000	0.216325	0.159214	0.162740	-0.056250
PAY_AMT4	0.007793	0.203242	-0.002229	-0.038218	-0.012659	0.021379	-0.064005	-0.046858	-0.046067	-0.043461	...	0.130191	0.293118	0.250237	0.199558	0.180107	0.216325	1.000000	0.151830	0.157834	-0.056827
PAY_AMT5	0.000652	0.217202	-0.001667	-0.040358	-0.001205	0.022850	-0.058190	-0.037093	-0.035863	-0.033590	...	0.160433	0.141574	0.307729	0.148459	0.180908	0.159214	0.151830	1.000000	0.154896	-0.055124
PAY_AMT6	0.003000	0.219595	-0.002766	-0.037200	-0.006641	0.019478	-0.058673	-0.036500	-0.035861	-0.026565	...	0.177637	0.164184	0.115494	0.185735	0.157634	0.162740	0.157834	0.154896	1.000000	-0.053183
default payment next month	-0.013952	-0.153520	-0.039961	0.028006	-0.024339	0.013890	0.324794	0.263551	0.235253	0.216614	...	-0.010156	-0.006760	-0.005372	-0.072929	-0.058579	-0.056250	-0.056827	-0.055124	-0.053183	1.000000

25 rows × 25 columns

df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

df.corrwith(df['default payment next month']).sort_values(ascending=False)

default payment next month    1.000000
PAY_0                         0.324794
PAY_2                         0.263551
PAY_3                         0.235253
PAY_4                         0.216614
PAY_5                         0.204149
PAY_6                         0.186866
EDUCATION                     0.028006
AGE                           0.013890
BILL_AMT6                    -0.005372
BILL_AMT5                    -0.006760
BILL_AMT4                    -0.010156
ID                           -0.013952
BILL_AMT3                    -0.014076
BILL_AMT2                    -0.014193
BILL_AMT1                    -0.019644
MARRIAGE                     -0.024339
SEX                          -0.039961
PAY_AMT6                     -0.053183
PAY_AMT5                     -0.055124
PAY_AMT3                     -0.056250
PAY_AMT4                     -0.056827
PAY_AMT2                     -0.058579
PAY_AMT1                     -0.072929
LIMIT_BAL                    -0.153520
dtype: float64

plt.figure(figsize=(16, 16))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='Blues')
plt.show()

png

df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

from sklearn.model_selection import train_test_split
X = df.drop(['ID', 'default payment next month'], axis=1)
y = df['default payment next month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape)

(24000, 23) (24000,)

상관관계가 높은 BILL_AMT1 ~ BILL_AMT6 6개 열의 변동 비율

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

bill_columns = ['BILL_AMT' + str(i) for i in range(1, 7)]


scaler = StandardScaler()
bill_columns_sclaed = scaler.fit_transform(X_train[bill_columns])

pca = PCA(n_components = 2)
bill_columns_scaled_pca = pca.fit_transform(bill_columns_sclaed)

pca.explained_variance_ratio_

array([0.90358461, 0.05178874])

상관관계가 높은 PAY_0, PAY_2 ~ PAY_6 6개 열의 변동 비율

pay_columns = ['PAY_' + str(i) for i in range(0, 7) if i != 1]


scaler = StandardScaler()
pay_columns_sclaed = scaler.fit_transform(X_train[pay_columns])

pca = PCA(n_components = 2)
pay_columns_scaled_pca = pca.fit_transform(pay_columns_sclaed)
print(pca.explained_variance_ratio_)

[0.71700378 0.11658426]

원본 데이터로 모델 성능 측정하기

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf_clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf_clf, X_train, y_train, scoring='accuracy', cv=3)
scores

array([0.8155  , 0.818125, 0.810125])

PCA로 압축된 데이터로 모델 성능 측정하기

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA(n_components=8)
X_train_scaled_pca = pca.fit_transform(X_train_scaled)
X_train_scaled_pca.shape

(24000, 8)

rf_clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf_clf, X_train_scaled_pca, y_train, scoring='accuracy', cv=3)
scores

array([0.796875, 0.795125, 0.797125])

pca.explained_variance_ratio_ # 각 component의 분산 비율

array([0.28465193, 0.17799685, 0.06810077, 0.06421023, 0.04460868,
       0.04197626, 0.03963272, 0.0382991 ])

pca.explained_variance_ratio_.sum()

0.7594765578111686

X_train_scaled_pca.shape

(24000, 8)

Reference

파이썬 머신러닝 완벽 가이드 (권철민 저)

Twitter Facebook LinkedIn

Machine learning 14 (pca 활용)

PCA 활용

default of credit card clients dataset

Reference

공유하기

댓글남기기

참고

소개(about me)

Rnn 20 (attention and seq2seq learning using date dataset in pytorch)

Rnn 19 (attention and seq2seq learning using addition dataset in pytorch)

Rnn 18 (어텐션)