Data analysis 1 (overview)

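2023-02-15 6 min read

Data Analysis

1. Overview of Data Analysis

Step 1: Ask questions

You can start from a given dataset and ask questions about it, or start from a question and collect the data needed to answer it.

Step 2: Wrangle data

Data wrangling: the process of cleaning and consolidating raw data so that it is easier to access and analyze (see Wikipedia).
It breaks down into three tasks: gathering, assessing, and cleaning the data.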
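
The three tasks can be sketched on a tiny made-up table. The column names and values below are invented for illustration and are not part of the case study; this is a minimal sketch under those assumptions, not a complete wrangling workflow.

```python
import io
import pandas as pd

# Gather: read raw data (an in-memory CSV stands in for a real file here).
raw_csv = io.StringIO(
    "date,temp,rentals\n"
    "2023-01-01,3.5,10\n"
    "2023-01-02,,12\n"
    "2023-01-03,5.0,15\n"
)
raw = pd.read_csv(raw_csv)

# Assess: look for missing values and unsuitable data types.
missing_per_column = raw.isna().sum()
print(missing_per_column)
print(raw.dtypes)

# Clean: parse the date strings and fill the missing temperature
# with the column mean (one possible choice among many).
clean = raw.copy()
clean["date"] = pd.to_datetime(clean["date"])
clean["temp"] = clean["temp"].fillna(clean["temp"].mean())
print(clean)
```

Filling with the mean is only one option; dropping the row or interpolating may suit other datasets better.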

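Step 3: Exploratory Data Analysis

Build intuition about the data by looking for patterns and visualizing relationships between variables.

Step 4: Draw conclusions or make predictions

The analysis from Step 3 provides the evidence for answering the questions and drawing conclusions.
Applying machine learning or statistical estimation also makes it possible to produce predictions.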
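
As a rough illustration of the prediction part, the sketch below fits a straight line by least squares and uses it to predict a new value. The numbers are hypothetical, and a simple line is only an assumption chosen for brevity; the post itself does not prescribe a model.

```python
import numpy as np

# Hypothetical observations: temperature (°C) and rentals in the same hour.
temps = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
rentals = np.array([20.0, 45.0, 60.0, 85.0, 100.0])

# Fit rentals ≈ slope * temp + intercept by least squares.
slope, intercept = np.polyfit(temps, rentals, deg=1)

# Use the fitted line to predict rentals at a temperature not in the data.
predicted = slope * 18.0 + intercept
print(slope, intercept, predicted)
```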

Step 5: Communicate the results

Share the insights you found through reports, emails, blog posts, and other channels.

2. Case Study

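Predicting usage of a city bike-sharing system
Download train.csv and test.csv from Kaggle's Bike Sharing Demand competition
Save the two files in the datasets directory as bike_train.csv and bike_test.csv

datetime: hourly date + timestamp
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: 1 = a holiday (such as a public holiday) other than a Saturday or Sunday, 0 = not a holiday
workingday: 1 = a weekday that is neither a weekend day nor a holiday, 0 = a weekend day or holiday
weather:

1 = clear, partly cloudy
2 = mist, mist + cloudy
3 = light snow, light rain + thunderstorm
4 = heavy snow/rain, thunder/lightning

temp: temperature (°C)
atemp: feels-like temperature (°C)
humidity: relative humidity
windspeed: wind speed
casual: number of rentals by non-registered users
registered: number of rentals by registered users
count: total number of rentals (casual + registered)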
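
The coded columns are easier to read once they are mapped to labels, and the definition of count can be checked directly. The sketch below uses three rows copied from the bike.head() output shown later in this post; the season_names dictionary and the season_name column are names introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the head of bike_train.csv (subset of columns).
sample = pd.DataFrame({
    "season": [1, 1, 1],
    "weather": [1, 1, 1],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# Map the integer season codes to the labels from the data dictionary.
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
sample["season_name"] = sample["season"].map(season_names)

# Check that count equals casual + registered in every row.
totals_match = (sample["casual"] + sample["registered"] == sample["count"]).all()
print(sample)
print(totals_match)
```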

Step 1: Ask questions

Examples

(Question 1) Which weather variables affect bike rentals?
(Question 2) On which dates (day of week, month, season) are rentals high (or low)?
(Question 3) When would be a good time to run a promotion?

Step 2: Wrangle data

Load the data

import pandas as pd

bike = pd.read_csv('./datasets/bike_train.csv')

type(bike)

pandas.core.frame.DataFrame

Assess the data

bike.head()  # take a quick look at the first rows

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1

bike.tail()

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
10881	2012-12-19 19:00:00	4	1	1	15.58	19.695	50	26.0027	7	329	336
10882	2012-12-19 20:00:00	4	1	1	14.76	17.425	57	15.0013	10	231	241
10883	2012-12-19 21:00:00	4	1	1	13.94	15.910	61	15.0013	4	164	168
10884	2012-12-19 22:00:00	4	1	1	13.94	17.425	61	6.0032	12	117	129
10885	2012-12-19 23:00:00	4	1	1	13.12	16.665	66	8.9981	4	84	88

bike.info()  # data types, non-null counts (missing values), number of columns, number of rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
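
Beyond info(), a few explicit checks can make the assessment more concrete. The sketch below runs on three rows copied from the bike.head() output above; the specific checks (duplicates, valid codes, humidity range) are suggestions added here, not steps from the original analysis.

```python
import pandas as pd

# Rows copied from the bike.head() output (subset of columns).
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00", "2011-01-01 02:00:00"],
    "season": [1, 1, 1],
    "weather": [1, 1, 1],
    "humidity": [81, 80, 80],
    "count": [16, 40, 32],
})

missing = sample.isna().sum()                # missing values per column
duplicated_rows = sample.duplicated().sum()  # number of fully duplicated rows

# Coded columns should only contain the values listed in the data dictionary.
valid_codes = (
    sample["season"].isin([1, 2, 3, 4]).all()
    and sample["weather"].isin([1, 2, 3, 4]).all()
)
# Relative humidity is a percentage, so it should lie between 0 and 100.
valid_humidity = sample["humidity"].between(0, 100).all()

print(missing)
print(duplicated_rows, valid_codes, valid_humidity)
```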

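Clean the data (handle missing values, fix incorrect data types)

bike.info() reports no missing values, but the datetime column is stored as object (strings), so its type needs to be fixed.

# bike.datetime
bike['datetime']  # values of the datetime column of the bike DataFrame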

      2011-01-01 00:00:00
      2011-01-01 01:00:00
      2011-01-01 02:00:00
      2011-01-01 03:00:00
      2011-01-01 04:00:00
                ...         
  2012-12-19 19:00:00
  2012-12-19 20:00:00
  2012-12-19 21:00:00
  2012-12-19 22:00:00
  2012-12-19 23:00:00
Name: datetime, Length: 10886, dtype: object

bike['datetime'] = pd.to_datetime(bike['datetime'])  # parse the strings into datetime64 values

bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB

# Derive calendar features with the vectorized .dt accessor
bike['year'] = bike['datetime'].dt.year
bike['month'] = bike['datetime'].dt.month
bike['hour'] = bike['datetime'].dt.hour
bike['dayofweek'] = bike['datetime'].dt.dayofweek  # Monday = 0, Sunday = 6

bike.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	year	month	hour	dayofweek
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	2011	1	0	5
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	2011	1	1	5
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	2011	1	2	5
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13	2011	1	3	5
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1	2011	1	4	5

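Step 3: Exploratory Data Analysis

Analysis for Question 1: relationship between weather variables (temperature, feels-like temperature, wind speed, humidity) and bike rentals

To check how numeric features are related to each other:

(1) Inspect a scatter plot
(2) Check the correlation coefficient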
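
To make the correlation coefficient less of a black box, the sketch below computes Pearson's r by hand and compares it with the pandas result. It uses the five rows of temp and count from the bike.head() output; five rows say nothing about the full dataset, so only the agreement between the two calculations matters here.

```python
import numpy as np
import pandas as pd

# temp and count from the first five rows of bike_train.csv.
sample = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02, 9.84, 9.84],
    "count": [16, 40, 32, 13, 1],
})

x = sample["temp"].to_numpy()
y = sample["count"].to_numpy()

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = sample["temp"].corr(sample["count"])
print(r_manual, r_pandas)
```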

bike.plot(kind='scatter', x='temp', y='count', alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x7ff37f1ba820>

[Figure: scatter plot of count against temp]

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))
axes[0][0].scatter(bike['temp'], bike['count'], alpha=0.3)       # temperature vs. rentals
axes[0][1].scatter(bike['atemp'], bike['count'], alpha=0.3)      # feels-like temperature vs. rentals
axes[1][0].scatter(bike['windspeed'], bike['count'], alpha=0.3)  # wind speed vs. rentals
axes[1][1].scatter(bike['humidity'], bike['count'], alpha=0.3)   # humidity vs. rentals

<matplotlib.collections.PathCollection at 0x7ff37cadac10>

[Figure: 2x2 grid of scatter plots of count against temp, atemp, windspeed, and humidity]

bike.corr(numeric_only=True)  # correlate numeric columns only; the datetime column is skipped

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count	year	month	hour	dayofweek
season	1.000000	0.029368	-0.008126	0.008879	0.258689	0.264744	0.190610	-0.147121	0.096758	0.164011	0.163439	-0.004797	0.971524	-0.006546	-0.010553
holiday	0.029368	1.000000	-0.250491	-0.007074	0.000295	-0.005215	0.001929	0.008409	0.043799	-0.020956	-0.005393	0.012021	0.001731	-0.000354	-0.191832
workingday	-0.008126	-0.250491	1.000000	0.033772	0.029966	0.024660	-0.010880	0.013373	-0.319111	0.119460	0.011594	-0.002482	-0.003394	0.002780	-0.704267
weather	0.008879	-0.007074	0.033772	1.000000	-0.055035	-0.055376	0.406244	0.007261	-0.135918	-0.109340	-0.128655	-0.012548	0.012144	-0.022740	-0.047692
temp	0.258689	0.000295	0.029966	-0.055035	1.000000	0.984948	-0.064949	-0.017852	0.467097	0.318571	0.394454	0.061226	0.257589	0.145430	-0.038466
atemp	0.264744	-0.005215	0.024660	-0.055376	0.984948	1.000000	-0.043536	-0.057473	0.462067	0.314635	0.389784	0.058540	0.264173	0.140343	-0.040235
humidity	0.190610	0.001929	-0.010880	0.406244	-0.064949	-0.043536	1.000000	-0.318607	-0.348187	-0.265458	-0.317371	-0.078606	0.204537	-0.278011	-0.026507
windspeed	-0.147121	0.008409	0.013373	0.007261	-0.017852	-0.057473	-0.318607	1.000000	0.092276	0.091052	0.101369	-0.015221	-0.150192	0.146631	-0.024804
casual	0.096758	0.043799	-0.319111	-0.135918	0.467097	0.462067	-0.348187	0.092276	1.000000	0.497250	0.690414	0.145241	0.092722	0.302045	0.246959
registered	0.164011	-0.020956	0.119460	-0.109340	0.318571	0.314635	-0.265458	0.091052	0.497250	1.000000	0.970948	0.264265	0.169451	0.380540	-0.084427
count	0.163439	-0.005393	0.011594	-0.128655	0.394454	0.389784	-0.317371	0.101369	0.690414	0.970948	1.000000	0.260403	0.166862	0.400601	-0.002283
year	-0.004797	0.012021	-0.002482	-0.012548	0.061226	0.058540	-0.078606	-0.015221	0.145241	0.264265	0.260403	1.000000	-0.004932	-0.004234	-0.003785
month	0.971524	0.001731	-0.003394	0.012144	0.257589	0.264173	0.204537	-0.150192	0.092722	0.169451	0.166862	-0.004932	1.000000	-0.006818	-0.002266
hour	-0.006546	-0.000354	0.002780	-0.022740	0.145430	0.140343	-0.278011	0.146631	0.302045	0.380540	0.400601	-0.004234	-0.006818	1.000000	-0.002925
dayofweek	-0.010553	-0.191832	-0.704267	-0.047692	-0.038466	-0.040235	-0.026507	-0.024804	0.246959	-0.084427	-0.002283	-0.003785	-0.002266	-0.002925	1.000000

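Findings

Among the weather variables, temperature and feels-like temperature appear to influence rental counts: each has a correlation of about 0.39 with count. Humidity shows a moderate negative correlation (about -0.32), and wind speed only a weak one (about 0.10).

Analysis for Question 2: relationship between date information (year, month, hour, day of week) and bike rentals

Note

year, month, hour, dayofweek: categorical data
count (bike rentals): numeric data
A bar plot (barplot) is useful for seeing how a numeric variable changes across the values of a categorical variable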
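
By default, seaborn's barplot draws the mean of the numeric variable for each category, so the bar heights can be reproduced with a groupby. The rows below are made up for illustration and are not taken from bike_train.csv.

```python
import pandas as pd

# Made-up rows: two observations per year.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012],
    "count": [10, 30, 50, 70],
})

# Mean of count within each year: the height of each bar in a barplot.
mean_by_year = toy.groupby("year")["count"].mean()
print(mean_by_year)
```

Looking at the grouped means alongside the plot makes it easier to quote exact values in a report.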

import seaborn as sns

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))

sns.barplot(data=bike, x='year', y='count', ax=axes[0][0])
sns.barplot(data=bike, x='month', y='count', ax=axes[0][1])
sns.barplot(data=bike, x='hour', y='count', hue='workingday', ax=axes[1][0])
sns.barplot(data=bike, x='dayofweek', y='count', ax=axes[1][1])
plt.show()

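[Figure: 2x2 grid of bar plots of mean count by year, month, hour (split by workingday), and dayofweek]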

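Findings

Average rentals per year are higher in 2012 than in 2011
Average rentals per month peak in June and stay high from July through October; January is the lowest
Average rentals per hour peak around 8 a.m. and around 5-6 p.m.
Splitting the hourly averages by workingday shows that the rental pattern on days off differs from that on working days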
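
The hour-by-workingday comparison can also be read off a table. The sketch below builds one with pivot_table on made-up numbers; the values are invented to mimic a commute peak on working days and a midday peak on days off, and are not results from the dataset.

```python
import pandas as pd

# Made-up rows: two observations per (hour, workingday) pair.
toy = pd.DataFrame({
    "hour": [8, 8, 8, 8, 14, 14, 14, 14],
    "workingday": [1, 1, 0, 0, 1, 1, 0, 0],
    "count": [300, 340, 60, 80, 120, 140, 260, 300],
})

# Rows are hours, columns are workingday values, cells are mean rentals.
hourly = toy.pivot_table(index="hour", columns="workingday", values="count", aggfunc="mean")
print(hourly)
```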

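Step 4: Draw conclusions or make predictions

Review the findings for Question 1 and Question 2
Rentals are expected to vary with temperature, so plan inventory management accordingly
Rentals are expected to vary by period (year, month, hour), so plan promotions accordingly

Step 5: Communicate the results

Prepare a report, slide deck, or similar material that explains the key features to consider when predicting bike rentals (weather and time period)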
