Data analysis 6 (pandas)

2023-02-22 8 분 소요

Pandas (4)

10 시계열 데이터 다루기

import pandas as pd

(1) 다른 자료형을 시계열 객체로 변환

문자열을 Timestamp로 변환

df = pd.read_csv('./examples/stock-data.csv')
df.head()

	Date	Close	Start	High	Low	Volume
0	2018-07-02	10100	10850	10900	10000	137977
1	2018-06-29	10700	10550	10900	9990	170253
2	2018-06-28	10400	10900	10950	10150	155769
3	2018-06-27	10900	10800	11050	10500	133548
4	2018-06-26	10800	10900	11000	10700	63039

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    20 non-null     object
 1   Close   20 non-null     int64 
 2   Start   20 non-null     int64 
 3   High    20 non-null     int64 
 4   Low     20 non-null     int64 
 5   Volume  20 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.1+ KB

# 문자열 데이터를 Timestamp 로 변환환
# df['new_Date']= pd.to_datetime(df['Date'])
df['new_Date']= df['Date'].apply(pd.to_datetime)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      20 non-null     object        
 1   Close     20 non-null     int64         
 2   Start     20 non-null     int64         
 3   High      20 non-null     int64         
 4   Low       20 non-null     int64         
 5   Volume    20 non-null     int64         
 6   new_Date  20 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 1.2+ KB

df['new_Date'][0], type(df['new_Date'][0]) # new_Date 컬럼의 행 하나의 값과 타입 확인인

(Timestamp('2018-07-02 00:00:00'), pandas._libs.tslibs.timestamps.Timestamp)

df.head()

	Date	Close	Start	High	Low	Volume	new_Date
0	2018-07-02	10100	10850	10900	10000	137977	2018-07-02
1	2018-06-29	10700	10550	10900	9990	170253	2018-06-29
2	2018-06-28	10400	10900	10950	10150	155769	2018-06-28
3	2018-06-27	10900	10800	11050	10500	133548	2018-06-27
4	2018-06-26	10800	10900	11000	10700	63039	2018-06-26

df.set_index('new_Date', inplace=True)
df.drop('Date', axis=1, inplace=True)
df.head()

	Close	Start	High	Low	Volume
new_Date
2018-07-02	10100	10850	10900	10000	137977
2018-06-29	10700	10550	10900	9990	170253
2018-06-28	10400	10900	10950	10150	155769
2018-06-27	10900	10800	11050	10500	133548
2018-06-26	10800	10900	11000	10700	63039

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20 entries, 2018-07-02 to 2018-06-01
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Close   20 non-null     int64
 1   Start   20 non-null     int64
 2   High    20 non-null     int64
 3   Low     20 non-null     int64
 4   Volume  20 non-null     int64
dtypes: int64(5)
memory usage: 960.0 bytes

df.index

DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)

TimeStamp를 Period로 변환

df.index.to_period(freq='D') # day (1일)

PeriodIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
             '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
             '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
             '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
             '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
            dtype='period[D]', name='new_Date')

df.index.to_period(freq='M') # month end

PeriodIndex(['2018-07', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06'],
            dtype='period[M]', name='new_Date')

df.index.to_period(freq='A') # year end

PeriodIndex(['2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
             '2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
             '2018', '2018', '2018', '2018'],
            dtype='period[A-DEC]', name='new_Date')

(2) 시계열 데이터 만들기

TimeStamp 배열

ts_ms = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='MS',          # 시간 간격 (MS : 월의 시작일)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_ms              

DatetimeIndex(['2023-01-01 00:00:00+09:00', '2023-02-01 00:00:00+09:00',
               '2023-03-01 00:00:00+09:00', '2023-04-01 00:00:00+09:00',
               '2023-05-01 00:00:00+09:00', '2023-06-01 00:00:00+09:00',
               '2023-07-01 00:00:00+09:00', '2023-08-01 00:00:00+09:00',
               '2023-09-01 00:00:00+09:00', '2023-10-01 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='MS')

ts_me = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='M',          # 시간 간격 (MS : 월의 마지막날날)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_me  

DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-02-28 00:00:00+09:00',
               '2023-03-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
               '2023-05-31 00:00:00+09:00', '2023-06-30 00:00:00+09:00',
               '2023-07-31 00:00:00+09:00', '2023-08-31 00:00:00+09:00',
               '2023-09-30 00:00:00+09:00', '2023-10-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='M')

# 분기(3개월) 간격, 월의 마지막 날 기준준
ts_3m = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='3M',          # 시간 간격 (3M : 3개월)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_3m  

DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
               '2023-07-31 00:00:00+09:00', '2023-10-31 00:00:00+09:00',
               '2024-01-31 00:00:00+09:00', '2024-04-30 00:00:00+09:00',
               '2024-07-31 00:00:00+09:00', '2024-10-31 00:00:00+09:00',
               '2025-01-31 00:00:00+09:00', '2025-04-30 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='3M')

Period 배열

# 1개월 길이로 Period 배열 만들기기
pr_m = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='M')           # 기간의 길이 (M:월)
pr_m                       

PeriodIndex(['2023-01', '2023-02', '2023-03'], dtype='period[M]')

# 1시간 길이로 Period 배열 만들기기
pr_h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='H')           # 기간의 길이 (H:시간)
pr_h  

PeriodIndex(['2023-01-01 00:00', '2023-01-01 01:00', '2023-01-01 02:00'], dtype='period[H]')

# 2시간 길이로 Period 배열 만들기기
pr_2h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='2H')           # 기간의 길이 (H:시간)
pr_2h  

PeriodIndex(['2023-01-01 00:00', '2023-01-01 02:00', '2023-01-01 04:00'], dtype='period[2H]')

(3) 시계열 데이터 활용

날짜 데이터 분리

df = pd.read_csv('./examples/stock-data.csv')
df.head()

	Date	Close	Start	High	Low	Volume
0	2018-07-02	10100	10850	10900	10000	137977
1	2018-06-29	10700	10550	10900	9990	170253
2	2018-06-28	10400	10900	10950	10150	155769
3	2018-06-27	10900	10800	11050	10500	133548
4	2018-06-26	10800	10900	11000	10700	63039

df['new_Date'] = pd.to_datetime(df['Date'])

# dt 속성
df['Year']= df['new_Date'].dt.year
df['Month'] = df['new_Date'].dt.month
df['Day'] = df['new_Date'].dt.day
df.head()

	Date	Close	Start	High	Low	Volume	new_Date	Year	Month	Day
0	2018-07-02	10100	10850	10900	10000	137977	2018-07-02	2018	7	2
1	2018-06-29	10700	10550	10900	9990	170253	2018-06-29	2018	6	29
2	2018-06-28	10400	10900	10950	10150	155769	2018-06-28	2018	6	28
3	2018-06-27	10900	10800	11050	10500	133548	2018-06-27	2018	6	27
4	2018-06-26	10800	10900	11000	10700	63039	2018-06-26	2018	6	26

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      20 non-null     object        
 1   Close     20 non-null     int64         
 2   Start     20 non-null     int64         
 3   High      20 non-null     int64         
 4   Low       20 non-null     int64         
 5   Volume    20 non-null     int64         
 6   new_Date  20 non-null     datetime64[ns]
 7   Year      20 non-null     int64         
 8   Month     20 non-null     int64         
 9   Day       20 non-null     int64         
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 1.7+ KB

# Timestamp를 Period로 변환하여 년월일 표기 변경하기
df['Date_yr'] = df['new_Date'].dt.to_period(freq='A')
df['Date_m'] = df['new_Date'].dt.to_period(freq='M')
df.head()

	Date	Close	Start	High	Low	Volume	new_Date	Year	Month	Day	Date_yr	Date_m
0	2018-07-02	10100	10850	10900	10000	137977	2018-07-02	2018	7	2	2018	2018-07
1	2018-06-29	10700	10550	10900	9990	170253	2018-06-29	2018	6	29	2018	2018-06
2	2018-06-28	10400	10900	10950	10150	155769	2018-06-28	2018	6	28	2018	2018-06
3	2018-06-27	10900	10800	11050	10500	133548	2018-06-27	2018	6	27	2018	2018-06
4	2018-06-26	10800	10900	11000	10700	63039	2018-06-26	2018	6	26	2018	2018-06

# 원하는 열을 새로운 행 인덱스로 지정
df.set_index('Date_m', inplace=True)
df.head()

	Date	Close	Start	High	Low	Volume	new_Date	Year	Month	Day	Date_yr
Date_m
2018-07	2018-07-02	10100	10850	10900	10000	137977	2018-07-02	2018	7	2	2018
2018-06	2018-06-29	10700	10550	10900	9990	170253	2018-06-29	2018	6	29	2018
2018-06	2018-06-28	10400	10900	10950	10150	155769	2018-06-28	2018	6	28	2018
2018-06	2018-06-27	10900	10800	11050	10500	133548	2018-06-27	2018	6	27	2018
2018-06	2018-06-26	10800	10900	11000	10700	63039	2018-06-26	2018	6	26	2018

날짜 인덱스 활용

df = pd.read_csv('./examples/stock-data.csv')
df['new_Date'] = pd.to_datetime(df['Date'])
df.set_index('new_Date', inplace=True)
df.head()

	Date	Close	Start	High	Low	Volume
new_Date
2018-07-02	2018-07-02	10100	10850	10900	10000	137977
2018-06-29	2018-06-29	10700	10550	10900	9990	170253
2018-06-28	2018-06-28	10400	10900	10950	10150	155769
2018-06-27	2018-06-27	10900	10800	11050	10500	133548
2018-06-26	2018-06-26	10800	10900	11000	10700	63039

df.index

DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)

df.loc['2018']

	Date	Close	Start	High	Low	Volume
new_Date
2018-07-02	2018-07-02	10100	10850	10900	10000	137977
2018-06-29	2018-06-29	10700	10550	10900	9990	170253
2018-06-28	2018-06-28	10400	10900	10950	10150	155769
2018-06-27	2018-06-27	10900	10800	11050	10500	133548
2018-06-26	2018-06-26	10800	10900	11000	10700	63039
2018-06-25	2018-06-25	11150	11400	11450	11000	55519
2018-06-22	2018-06-22	11300	11250	11450	10750	134805
2018-06-21	2018-06-21	11200	11350	11750	11200	133002
2018-06-20	2018-06-20	11550	11200	11600	10900	308596
2018-06-19	2018-06-19	11300	11850	11950	11300	180656
2018-06-18	2018-06-18	12000	13400	13400	12000	309787
2018-06-15	2018-06-15	13400	13600	13600	12900	201376
2018-06-14	2018-06-14	13450	13200	13700	13150	347451
2018-06-12	2018-06-12	13200	12200	13300	12050	558148
2018-06-11	2018-06-11	11950	12000	12250	11950	62293
2018-06-08	2018-06-08	11950	11950	12200	11800	59258
2018-06-07	2018-06-07	11950	12200	12300	11900	49088
2018-06-05	2018-06-05	12150	11800	12250	11800	42485
2018-06-04	2018-06-04	11900	11900	12200	11700	25171
2018-06-01	2018-06-01	11900	11800	12100	11750	32062

df.loc['2018', 'Start':'High'] # 열 범위 추가로 슬라이싱

	Start	High
new_Date
2018-07-02	10850	10900
2018-06-29	10550	10900
2018-06-28	10900	10950
2018-06-27	10800	11050
2018-06-26	10900	11000
2018-06-25	11400	11450
2018-06-22	11250	11450
2018-06-21	11350	11750
2018-06-20	11200	11600
2018-06-19	11850	11950
2018-06-18	13400	13400
2018-06-15	13600	13600
2018-06-14	13200	13700
2018-06-12	12200	13300
2018-06-11	12000	12250
2018-06-08	11950	12200
2018-06-07	12200	12300
2018-06-05	11800	12250
2018-06-04	11900	12200
2018-06-01	11800	12100

df.loc['2018-06-07']

	Date	Close	Start	High	Low	Volume
new_Date
2018-06-07	2018-06-07	11950	12200	12300	11900	49088

df.loc['2018-06-07':'2018-06-19']

	Date	Close	Start	High	Low	Volume
new_Date
2018-06-19	2018-06-19	11300	11850	11950	11300	180656
2018-06-18	2018-06-18	12000	13400	13400	12000	309787
2018-06-15	2018-06-15	13400	13600	13600	12900	201376
2018-06-14	2018-06-14	13450	13200	13700	13150	347451
2018-06-12	2018-06-12	13200	12200	13300	12050	558148
2018-06-11	2018-06-11	11950	12000	12250	11950	62293
2018-06-08	2018-06-08	11950	11950	12200	11800	59258

today = pd.to_datetime('2018-12-25') # 기준일 생성

df.index

DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)

today - df.index # 기준일로부터 날짜 차이 생성

TimedeltaIndex(['176 days', '179 days', '180 days', '181 days', '182 days',
                '183 days', '186 days', '187 days', '188 days', '189 days',
                '190 days', '193 days', '194 days', '196 days', '197 days',
                '200 days', '201 days', '203 days', '204 days', '207 days'],
               dtype='timedelta64[ns]', name='new_Date', freq=None)

df['time_delta']= today - df.index
df.set_index('time_delta', inplace=True)
df.head()

	Date	Close	Start	High	Low	Volume
time_delta
176 days	2018-07-02	10100	10850	10900	10000	137977
179 days	2018-06-29	10700	10550	10900	9990	170253
180 days	2018-06-28	10400	10900	10950	10150	155769
181 days	2018-06-27	10900	10800	11050	10500	133548
182 days	2018-06-26	10800	10900	11000	10700	63039

df.loc['179 days':'181 days']

	Date	Close	Start	High	Low	Volume
time_delta
179 days	2018-06-29	10700	10550	10900	9990	170253
180 days	2018-06-28	10400	10900	10950	10150	155769
181 days	2018-06-27	10900	10800	11050	10500	133548

Reference

파이썬 라이브러리를 활용한 데이터 분석 (웨스 맥키니 저)

Twitter Facebook LinkedIn

Data analysis 6 (pandas)

Pandas (4)

10 시계열 데이터 다루기

(1) 다른 자료형을 시계열 객체로 변환

(2) 시계열 데이터 만들기

(3) 시계열 데이터 활용

Reference

공유하기

댓글남기기

참고

소개(about me)

Rnn 20 (attention and seq2seq learning using date dataset in pytorch)

Rnn 19 (attention and seq2seq learning using addition dataset in pytorch)

Rnn 18 (어텐션)