Data analysis 6 (pandas)
Pandas (4)
10 시계열 데이터 다루기
import pandas as pd
(1) 다른 자료형을 시계열 객체로 변환
- 문자열을 Timestamp로 변환
df = pd.read_csv('./examples/stock-data.csv')
df.head()
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
0 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
1 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
2 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
3 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
4 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 20 non-null object
1 Close 20 non-null int64
2 Start 20 non-null int64
3 High 20 non-null int64
4 Low 20 non-null int64
5 Volume 20 non-null int64
dtypes: int64(5), object(1)
memory usage: 1.1+ KB
# 문자열 데이터를 Timestamp 로 변환환
# df['new_Date']= pd.to_datetime(df['Date'])
df['new_Date']= df['Date'].apply(pd.to_datetime)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 20 non-null object
1 Close 20 non-null int64
2 Start 20 non-null int64
3 High 20 non-null int64
4 Low 20 non-null int64
5 Volume 20 non-null int64
6 new_Date 20 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 1.2+ KB
df['new_Date'][0], type(df['new_Date'][0]) # new_Date 컬럼의 행 하나의 값과 타입 확인인
(Timestamp('2018-07-02 00:00:00'), pandas._libs.tslibs.timestamps.Timestamp)
df.head()
Date | Close | Start | High | Low | Volume | new_Date | |
---|---|---|---|---|---|---|---|
0 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 | 2018-07-02 |
1 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 | 2018-06-29 |
2 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 | 2018-06-28 |
3 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 | 2018-06-27 |
4 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 | 2018-06-26 |
df.set_index('new_Date', inplace=True)
df.drop('Date', axis=1, inplace=True)
df.head()
Close | Start | High | Low | Volume | |
---|---|---|---|---|---|
new_Date | |||||
2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20 entries, 2018-07-02 to 2018-06-01
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Close 20 non-null int64
1 Start 20 non-null int64
2 High 20 non-null int64
3 Low 20 non-null int64
4 Volume 20 non-null int64
dtypes: int64(5)
memory usage: 960.0 bytes
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
'2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
'2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
'2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
'2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
dtype='datetime64[ns]', name='new_Date', freq=None)
- TimeStamp를 Period로 변환
df.index.to_period(freq='D') # day (1일)
PeriodIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
'2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
'2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
'2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
'2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
dtype='period[D]', name='new_Date')
df.index.to_period(freq='M') # month end
PeriodIndex(['2018-07', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
'2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
'2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
'2018-06', '2018-06'],
dtype='period[M]', name='new_Date')
df.index.to_period(freq='A') # year end
PeriodIndex(['2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
'2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
'2018', '2018', '2018', '2018'],
dtype='period[A-DEC]', name='new_Date')
(2) 시계열 데이터 만들기
- TimeStamp 배열
ts_ms = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
end=None, # 날짜 범위의 끝
periods = 10, # 생성할 Timestamp의 개수수
freq='MS', # 시간 간격 (MS : 월의 시작일)
tz='Asia/Seoul') # 시간대 (timezone)
ts_ms
DatetimeIndex(['2023-01-01 00:00:00+09:00', '2023-02-01 00:00:00+09:00',
'2023-03-01 00:00:00+09:00', '2023-04-01 00:00:00+09:00',
'2023-05-01 00:00:00+09:00', '2023-06-01 00:00:00+09:00',
'2023-07-01 00:00:00+09:00', '2023-08-01 00:00:00+09:00',
'2023-09-01 00:00:00+09:00', '2023-10-01 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Seoul]', freq='MS')
ts_me = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
end=None, # 날짜 범위의 끝
periods = 10, # 생성할 Timestamp의 개수수
freq='M', # 시간 간격 (MS : 월의 마지막날날)
tz='Asia/Seoul') # 시간대 (timezone)
ts_me
DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-02-28 00:00:00+09:00',
'2023-03-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
'2023-05-31 00:00:00+09:00', '2023-06-30 00:00:00+09:00',
'2023-07-31 00:00:00+09:00', '2023-08-31 00:00:00+09:00',
'2023-09-30 00:00:00+09:00', '2023-10-31 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Seoul]', freq='M')
# 분기(3개월) 간격, 월의 마지막 날 기준준
ts_3m = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
end=None, # 날짜 범위의 끝
periods = 10, # 생성할 Timestamp의 개수수
freq='3M', # 시간 간격 (3M : 3개월)
tz='Asia/Seoul') # 시간대 (timezone)
ts_3m
DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
'2023-07-31 00:00:00+09:00', '2023-10-31 00:00:00+09:00',
'2024-01-31 00:00:00+09:00', '2024-04-30 00:00:00+09:00',
'2024-07-31 00:00:00+09:00', '2024-10-31 00:00:00+09:00',
'2025-01-31 00:00:00+09:00', '2025-04-30 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Seoul]', freq='3M')
- Period 배열
# 1개월 길이로 Period 배열 만들기기
pr_m = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
end = None, # 날짜 범위의 끝
periods=3, # 생성할 Period 개수
freq='M') # 기간의 길이 (M:월)
pr_m
PeriodIndex(['2023-01', '2023-02', '2023-03'], dtype='period[M]')
# 1시간 길이로 Period 배열 만들기기
pr_h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
end = None, # 날짜 범위의 끝
periods=3, # 생성할 Period 개수
freq='H') # 기간의 길이 (H:시간)
pr_h
PeriodIndex(['2023-01-01 00:00', '2023-01-01 01:00', '2023-01-01 02:00'], dtype='period[H]')
# 2시간 길이로 Period 배열 만들기기
pr_2h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
end = None, # 날짜 범위의 끝
periods=3, # 생성할 Period 개수
freq='2H') # 기간의 길이 (H:시간)
pr_2h
PeriodIndex(['2023-01-01 00:00', '2023-01-01 02:00', '2023-01-01 04:00'], dtype='period[2H]')
(3) 시계열 데이터 활용
- 날짜 데이터 분리
df = pd.read_csv('./examples/stock-data.csv')
df.head()
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
0 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
1 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
2 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
3 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
4 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
df['new_Date'] = pd.to_datetime(df['Date'])
# dt 속성
df['Year']= df['new_Date'].dt.year
df['Month'] = df['new_Date'].dt.month
df['Day'] = df['new_Date'].dt.day
df.head()
Date | Close | Start | High | Low | Volume | new_Date | Year | Month | Day | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 | 2018-07-02 | 2018 | 7 | 2 |
1 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 | 2018-06-29 | 2018 | 6 | 29 |
2 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 | 2018-06-28 | 2018 | 6 | 28 |
3 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 | 2018-06-27 | 2018 | 6 | 27 |
4 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 | 2018-06-26 | 2018 | 6 | 26 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 20 non-null object
1 Close 20 non-null int64
2 Start 20 non-null int64
3 High 20 non-null int64
4 Low 20 non-null int64
5 Volume 20 non-null int64
6 new_Date 20 non-null datetime64[ns]
7 Year 20 non-null int64
8 Month 20 non-null int64
9 Day 20 non-null int64
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 1.7+ KB
# Timestamp를 Period로 변환하여 년월일 표기 변경하기
df['Date_yr'] = df['new_Date'].dt.to_period(freq='A')
df['Date_m'] = df['new_Date'].dt.to_period(freq='M')
df.head()
Date | Close | Start | High | Low | Volume | new_Date | Year | Month | Day | Date_yr | Date_m | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 | 2018-07-02 | 2018 | 7 | 2 | 2018 | 2018-07 |
1 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 | 2018-06-29 | 2018 | 6 | 29 | 2018 | 2018-06 |
2 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 | 2018-06-28 | 2018 | 6 | 28 | 2018 | 2018-06 |
3 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 | 2018-06-27 | 2018 | 6 | 27 | 2018 | 2018-06 |
4 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 | 2018-06-26 | 2018 | 6 | 26 | 2018 | 2018-06 |
# 원하는 열을 새로운 행 인덱스로 지정
df.set_index('Date_m', inplace=True)
df.head()
Date | Close | Start | High | Low | Volume | new_Date | Year | Month | Day | Date_yr | |
---|---|---|---|---|---|---|---|---|---|---|---|
Date_m | |||||||||||
2018-07 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 | 2018-07-02 | 2018 | 7 | 2 | 2018 |
2018-06 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 | 2018-06-29 | 2018 | 6 | 29 | 2018 |
2018-06 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 | 2018-06-28 | 2018 | 6 | 28 | 2018 |
2018-06 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 | 2018-06-27 | 2018 | 6 | 27 | 2018 |
2018-06 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 | 2018-06-26 | 2018 | 6 | 26 | 2018 |
- 날짜 인덱스 활용
df = pd.read_csv('./examples/stock-data.csv')
df['new_Date'] = pd.to_datetime(df['Date'])
df.set_index('new_Date', inplace=True)
df.head()
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
new_Date | ||||||
2018-07-02 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
2018-06-29 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
2018-06-28 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
2018-06-27 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
2018-06-26 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
'2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
'2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
'2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
'2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
dtype='datetime64[ns]', name='new_Date', freq=None)
df.loc['2018']
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
new_Date | ||||||
2018-07-02 | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
2018-06-29 | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
2018-06-28 | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
2018-06-27 | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
2018-06-26 | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
2018-06-25 | 2018-06-25 | 11150 | 11400 | 11450 | 11000 | 55519 |
2018-06-22 | 2018-06-22 | 11300 | 11250 | 11450 | 10750 | 134805 |
2018-06-21 | 2018-06-21 | 11200 | 11350 | 11750 | 11200 | 133002 |
2018-06-20 | 2018-06-20 | 11550 | 11200 | 11600 | 10900 | 308596 |
2018-06-19 | 2018-06-19 | 11300 | 11850 | 11950 | 11300 | 180656 |
2018-06-18 | 2018-06-18 | 12000 | 13400 | 13400 | 12000 | 309787 |
2018-06-15 | 2018-06-15 | 13400 | 13600 | 13600 | 12900 | 201376 |
2018-06-14 | 2018-06-14 | 13450 | 13200 | 13700 | 13150 | 347451 |
2018-06-12 | 2018-06-12 | 13200 | 12200 | 13300 | 12050 | 558148 |
2018-06-11 | 2018-06-11 | 11950 | 12000 | 12250 | 11950 | 62293 |
2018-06-08 | 2018-06-08 | 11950 | 11950 | 12200 | 11800 | 59258 |
2018-06-07 | 2018-06-07 | 11950 | 12200 | 12300 | 11900 | 49088 |
2018-06-05 | 2018-06-05 | 12150 | 11800 | 12250 | 11800 | 42485 |
2018-06-04 | 2018-06-04 | 11900 | 11900 | 12200 | 11700 | 25171 |
2018-06-01 | 2018-06-01 | 11900 | 11800 | 12100 | 11750 | 32062 |
df.loc['2018', 'Start':'High'] # 열 범위 추가로 슬라이싱
Start | High | |
---|---|---|
new_Date | ||
2018-07-02 | 10850 | 10900 |
2018-06-29 | 10550 | 10900 |
2018-06-28 | 10900 | 10950 |
2018-06-27 | 10800 | 11050 |
2018-06-26 | 10900 | 11000 |
2018-06-25 | 11400 | 11450 |
2018-06-22 | 11250 | 11450 |
2018-06-21 | 11350 | 11750 |
2018-06-20 | 11200 | 11600 |
2018-06-19 | 11850 | 11950 |
2018-06-18 | 13400 | 13400 |
2018-06-15 | 13600 | 13600 |
2018-06-14 | 13200 | 13700 |
2018-06-12 | 12200 | 13300 |
2018-06-11 | 12000 | 12250 |
2018-06-08 | 11950 | 12200 |
2018-06-07 | 12200 | 12300 |
2018-06-05 | 11800 | 12250 |
2018-06-04 | 11900 | 12200 |
2018-06-01 | 11800 | 12100 |
df.loc['2018-06-07']
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
new_Date | ||||||
2018-06-07 | 2018-06-07 | 11950 | 12200 | 12300 | 11900 | 49088 |
df.loc['2018-06-07':'2018-06-19']
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
new_Date | ||||||
2018-06-19 | 2018-06-19 | 11300 | 11850 | 11950 | 11300 | 180656 |
2018-06-18 | 2018-06-18 | 12000 | 13400 | 13400 | 12000 | 309787 |
2018-06-15 | 2018-06-15 | 13400 | 13600 | 13600 | 12900 | 201376 |
2018-06-14 | 2018-06-14 | 13450 | 13200 | 13700 | 13150 | 347451 |
2018-06-12 | 2018-06-12 | 13200 | 12200 | 13300 | 12050 | 558148 |
2018-06-11 | 2018-06-11 | 11950 | 12000 | 12250 | 11950 | 62293 |
2018-06-08 | 2018-06-08 | 11950 | 11950 | 12200 | 11800 | 59258 |
today = pd.to_datetime('2018-12-25') # 기준일 생성
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
'2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
'2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
'2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
'2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
dtype='datetime64[ns]', name='new_Date', freq=None)
today - df.index # 기준일로부터 날짜 차이 생성
TimedeltaIndex(['176 days', '179 days', '180 days', '181 days', '182 days',
'183 days', '186 days', '187 days', '188 days', '189 days',
'190 days', '193 days', '194 days', '196 days', '197 days',
'200 days', '201 days', '203 days', '204 days', '207 days'],
dtype='timedelta64[ns]', name='new_Date', freq=None)
df['time_delta']= today - df.index
df.set_index('time_delta', inplace=True)
df.head()
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
time_delta | ||||||
176 days | 2018-07-02 | 10100 | 10850 | 10900 | 10000 | 137977 |
179 days | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
180 days | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
181 days | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
182 days | 2018-06-26 | 10800 | 10900 | 11000 | 10700 | 63039 |
df.loc['179 days':'181 days']
Date | Close | Start | High | Low | Volume | |
---|---|---|---|---|---|---|
time_delta | ||||||
179 days | 2018-06-29 | 10700 | 10550 | 10900 | 9990 | 170253 |
180 days | 2018-06-28 | 10400 | 10900 | 10950 | 10150 | 155769 |
181 days | 2018-06-27 | 10900 | 10800 | 11050 | 10500 | 133548 |
댓글남기기