8 분 소요

Pandas (4)

10 시계열 데이터 다루기

import pandas as pd

(1) 다른 자료형을 시계열 객체로 변환

  • 문자열을 Timestamp로 변환
df = pd.read_csv('./examples/stock-data.csv')
df.head()
Date Close Start High Low Volume
0 2018-07-02 10100 10850 10900 10000 137977
1 2018-06-29 10700 10550 10900 9990 170253
2 2018-06-28 10400 10900 10950 10150 155769
3 2018-06-27 10900 10800 11050 10500 133548
4 2018-06-26 10800 10900 11000 10700 63039
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    20 non-null     object
 1   Close   20 non-null     int64 
 2   Start   20 non-null     int64 
 3   High    20 non-null     int64 
 4   Low     20 non-null     int64 
 5   Volume  20 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.1+ KB
# 문자열 데이터를 Timestamp 로 변환환
# df['new_Date']= pd.to_datetime(df['Date'])
df['new_Date']= df['Date'].apply(pd.to_datetime)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      20 non-null     object        
 1   Close     20 non-null     int64         
 2   Start     20 non-null     int64         
 3   High      20 non-null     int64         
 4   Low       20 non-null     int64         
 5   Volume    20 non-null     int64         
 6   new_Date  20 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 1.2+ KB
df['new_Date'][0], type(df['new_Date'][0]) # new_Date 컬럼의 행 하나의 값과 타입 확인인
(Timestamp('2018-07-02 00:00:00'), pandas._libs.tslibs.timestamps.Timestamp)
df.head()
Date Close Start High Low Volume new_Date
0 2018-07-02 10100 10850 10900 10000 137977 2018-07-02
1 2018-06-29 10700 10550 10900 9990 170253 2018-06-29
2 2018-06-28 10400 10900 10950 10150 155769 2018-06-28
3 2018-06-27 10900 10800 11050 10500 133548 2018-06-27
4 2018-06-26 10800 10900 11000 10700 63039 2018-06-26
df.set_index('new_Date', inplace=True)
df.drop('Date', axis=1, inplace=True)
df.head()
Close Start High Low Volume
new_Date
2018-07-02 10100 10850 10900 10000 137977
2018-06-29 10700 10550 10900 9990 170253
2018-06-28 10400 10900 10950 10150 155769
2018-06-27 10900 10800 11050 10500 133548
2018-06-26 10800 10900 11000 10700 63039
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20 entries, 2018-07-02 to 2018-06-01
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Close   20 non-null     int64
 1   Start   20 non-null     int64
 2   High    20 non-null     int64
 3   Low     20 non-null     int64
 4   Volume  20 non-null     int64
dtypes: int64(5)
memory usage: 960.0 bytes
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)
  • TimeStamp를 Period로 변환
df.index.to_period(freq='D') # day (1일)
PeriodIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
             '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
             '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
             '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
             '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
            dtype='period[D]', name='new_Date')
df.index.to_period(freq='M') # month end
PeriodIndex(['2018-07', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06', '2018-06', '2018-06', '2018-06', '2018-06',
             '2018-06', '2018-06'],
            dtype='period[M]', name='new_Date')
df.index.to_period(freq='A') # year end
PeriodIndex(['2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
             '2018', '2018', '2018', '2018', '2018', '2018', '2018', '2018',
             '2018', '2018', '2018', '2018'],
            dtype='period[A-DEC]', name='new_Date')

(2) 시계열 데이터 만들기

  • TimeStamp 배열
ts_ms = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='MS',          # 시간 간격 (MS : 월의 시작일)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_ms              
DatetimeIndex(['2023-01-01 00:00:00+09:00', '2023-02-01 00:00:00+09:00',
               '2023-03-01 00:00:00+09:00', '2023-04-01 00:00:00+09:00',
               '2023-05-01 00:00:00+09:00', '2023-06-01 00:00:00+09:00',
               '2023-07-01 00:00:00+09:00', '2023-08-01 00:00:00+09:00',
               '2023-09-01 00:00:00+09:00', '2023-10-01 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='MS')
ts_me = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='M',          # 시간 간격 (MS : 월의 마지막날날)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_me  
DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-02-28 00:00:00+09:00',
               '2023-03-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
               '2023-05-31 00:00:00+09:00', '2023-06-30 00:00:00+09:00',
               '2023-07-31 00:00:00+09:00', '2023-08-31 00:00:00+09:00',
               '2023-09-30 00:00:00+09:00', '2023-10-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='M')
# 분기(3개월) 간격, 월의 마지막 날 기준준
ts_3m = pd.date_range(start='2023-01-01', # 날짜 범위의 시작
              end=None,           # 날짜 범위의 끝
              periods = 10,       # 생성할 Timestamp의 개수수
              freq='3M',          # 시간 간격 (3M : 3개월)
              tz='Asia/Seoul')    # 시간대 (timezone)
ts_3m  
DatetimeIndex(['2023-01-31 00:00:00+09:00', '2023-04-30 00:00:00+09:00',
               '2023-07-31 00:00:00+09:00', '2023-10-31 00:00:00+09:00',
               '2024-01-31 00:00:00+09:00', '2024-04-30 00:00:00+09:00',
               '2024-07-31 00:00:00+09:00', '2024-10-31 00:00:00+09:00',
               '2025-01-31 00:00:00+09:00', '2025-04-30 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='3M')
  • Period 배열
# 1개월 길이로 Period 배열 만들기기
pr_m = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='M')           # 기간의 길이 (M:월)
pr_m                       
PeriodIndex(['2023-01', '2023-02', '2023-03'], dtype='period[M]')
# 1시간 길이로 Period 배열 만들기기
pr_h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='H')           # 기간의 길이 (H:시간)
pr_h  
PeriodIndex(['2023-01-01 00:00', '2023-01-01 01:00', '2023-01-01 02:00'], dtype='period[H]')
# 2시간 길이로 Period 배열 만들기기
pr_2h = pd.period_range(start='2023-01-01', # 날짜 범위의 시작
                       end = None,         # 날짜 범위의 끝
                       periods=3,           # 생성할 Period 개수
                       freq='2H')           # 기간의 길이 (H:시간)
pr_2h  
PeriodIndex(['2023-01-01 00:00', '2023-01-01 02:00', '2023-01-01 04:00'], dtype='period[2H]')

(3) 시계열 데이터 활용

  • 날짜 데이터 분리
df = pd.read_csv('./examples/stock-data.csv')
df.head()
Date Close Start High Low Volume
0 2018-07-02 10100 10850 10900 10000 137977
1 2018-06-29 10700 10550 10900 9990 170253
2 2018-06-28 10400 10900 10950 10150 155769
3 2018-06-27 10900 10800 11050 10500 133548
4 2018-06-26 10800 10900 11000 10700 63039
df['new_Date'] = pd.to_datetime(df['Date'])
# dt 속성
df['Year']= df['new_Date'].dt.year
df['Month'] = df['new_Date'].dt.month
df['Day'] = df['new_Date'].dt.day
df.head()
Date Close Start High Low Volume new_Date Year Month Day
0 2018-07-02 10100 10850 10900 10000 137977 2018-07-02 2018 7 2
1 2018-06-29 10700 10550 10900 9990 170253 2018-06-29 2018 6 29
2 2018-06-28 10400 10900 10950 10150 155769 2018-06-28 2018 6 28
3 2018-06-27 10900 10800 11050 10500 133548 2018-06-27 2018 6 27
4 2018-06-26 10800 10900 11000 10700 63039 2018-06-26 2018 6 26
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      20 non-null     object        
 1   Close     20 non-null     int64         
 2   Start     20 non-null     int64         
 3   High      20 non-null     int64         
 4   Low       20 non-null     int64         
 5   Volume    20 non-null     int64         
 6   new_Date  20 non-null     datetime64[ns]
 7   Year      20 non-null     int64         
 8   Month     20 non-null     int64         
 9   Day       20 non-null     int64         
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 1.7+ KB
# Timestamp를 Period로 변환하여 년월일 표기 변경하기
df['Date_yr'] = df['new_Date'].dt.to_period(freq='A')
df['Date_m'] = df['new_Date'].dt.to_period(freq='M')
df.head()
Date Close Start High Low Volume new_Date Year Month Day Date_yr Date_m
0 2018-07-02 10100 10850 10900 10000 137977 2018-07-02 2018 7 2 2018 2018-07
1 2018-06-29 10700 10550 10900 9990 170253 2018-06-29 2018 6 29 2018 2018-06
2 2018-06-28 10400 10900 10950 10150 155769 2018-06-28 2018 6 28 2018 2018-06
3 2018-06-27 10900 10800 11050 10500 133548 2018-06-27 2018 6 27 2018 2018-06
4 2018-06-26 10800 10900 11000 10700 63039 2018-06-26 2018 6 26 2018 2018-06
# 원하는 열을 새로운 행 인덱스로 지정
df.set_index('Date_m', inplace=True)
df.head()
Date Close Start High Low Volume new_Date Year Month Day Date_yr
Date_m
2018-07 2018-07-02 10100 10850 10900 10000 137977 2018-07-02 2018 7 2 2018
2018-06 2018-06-29 10700 10550 10900 9990 170253 2018-06-29 2018 6 29 2018
2018-06 2018-06-28 10400 10900 10950 10150 155769 2018-06-28 2018 6 28 2018
2018-06 2018-06-27 10900 10800 11050 10500 133548 2018-06-27 2018 6 27 2018
2018-06 2018-06-26 10800 10900 11000 10700 63039 2018-06-26 2018 6 26 2018
  • 날짜 인덱스 활용
df = pd.read_csv('./examples/stock-data.csv')
df['new_Date'] = pd.to_datetime(df['Date'])
df.set_index('new_Date', inplace=True)
df.head()
Date Close Start High Low Volume
new_Date
2018-07-02 2018-07-02 10100 10850 10900 10000 137977
2018-06-29 2018-06-29 10700 10550 10900 9990 170253
2018-06-28 2018-06-28 10400 10900 10950 10150 155769
2018-06-27 2018-06-27 10900 10800 11050 10500 133548
2018-06-26 2018-06-26 10800 10900 11000 10700 63039
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)
df.loc['2018']
Date Close Start High Low Volume
new_Date
2018-07-02 2018-07-02 10100 10850 10900 10000 137977
2018-06-29 2018-06-29 10700 10550 10900 9990 170253
2018-06-28 2018-06-28 10400 10900 10950 10150 155769
2018-06-27 2018-06-27 10900 10800 11050 10500 133548
2018-06-26 2018-06-26 10800 10900 11000 10700 63039
2018-06-25 2018-06-25 11150 11400 11450 11000 55519
2018-06-22 2018-06-22 11300 11250 11450 10750 134805
2018-06-21 2018-06-21 11200 11350 11750 11200 133002
2018-06-20 2018-06-20 11550 11200 11600 10900 308596
2018-06-19 2018-06-19 11300 11850 11950 11300 180656
2018-06-18 2018-06-18 12000 13400 13400 12000 309787
2018-06-15 2018-06-15 13400 13600 13600 12900 201376
2018-06-14 2018-06-14 13450 13200 13700 13150 347451
2018-06-12 2018-06-12 13200 12200 13300 12050 558148
2018-06-11 2018-06-11 11950 12000 12250 11950 62293
2018-06-08 2018-06-08 11950 11950 12200 11800 59258
2018-06-07 2018-06-07 11950 12200 12300 11900 49088
2018-06-05 2018-06-05 12150 11800 12250 11800 42485
2018-06-04 2018-06-04 11900 11900 12200 11700 25171
2018-06-01 2018-06-01 11900 11800 12100 11750 32062
df.loc['2018', 'Start':'High'] # 열 범위 추가로 슬라이싱
Start High
new_Date
2018-07-02 10850 10900
2018-06-29 10550 10900
2018-06-28 10900 10950
2018-06-27 10800 11050
2018-06-26 10900 11000
2018-06-25 11400 11450
2018-06-22 11250 11450
2018-06-21 11350 11750
2018-06-20 11200 11600
2018-06-19 11850 11950
2018-06-18 13400 13400
2018-06-15 13600 13600
2018-06-14 13200 13700
2018-06-12 12200 13300
2018-06-11 12000 12250
2018-06-08 11950 12200
2018-06-07 12200 12300
2018-06-05 11800 12250
2018-06-04 11900 12200
2018-06-01 11800 12100
df.loc['2018-06-07']
Date Close Start High Low Volume
new_Date
2018-06-07 2018-06-07 11950 12200 12300 11900 49088
df.loc['2018-06-07':'2018-06-19']
Date Close Start High Low Volume
new_Date
2018-06-19 2018-06-19 11300 11850 11950 11300 180656
2018-06-18 2018-06-18 12000 13400 13400 12000 309787
2018-06-15 2018-06-15 13400 13600 13600 12900 201376
2018-06-14 2018-06-14 13450 13200 13700 13150 347451
2018-06-12 2018-06-12 13200 12200 13300 12050 558148
2018-06-11 2018-06-11 11950 12000 12250 11950 62293
2018-06-08 2018-06-08 11950 11950 12200 11800 59258
today = pd.to_datetime('2018-12-25') # 기준일 생성
df.index
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new_Date', freq=None)
today - df.index # 기준일로부터 날짜 차이 생성
TimedeltaIndex(['176 days', '179 days', '180 days', '181 days', '182 days',
                '183 days', '186 days', '187 days', '188 days', '189 days',
                '190 days', '193 days', '194 days', '196 days', '197 days',
                '200 days', '201 days', '203 days', '204 days', '207 days'],
               dtype='timedelta64[ns]', name='new_Date', freq=None)
df['time_delta']= today - df.index
df.set_index('time_delta', inplace=True)
df.head()
Date Close Start High Low Volume
time_delta
176 days 2018-07-02 10100 10850 10900 10000 137977
179 days 2018-06-29 10700 10550 10900 9990 170253
180 days 2018-06-28 10400 10900 10950 10150 155769
181 days 2018-06-27 10900 10800 11050 10500 133548
182 days 2018-06-26 10800 10900 11000 10700 63039
df.loc['179 days':'181 days']
Date Close Start High Low Volume
time_delta
179 days 2018-06-29 10700 10550 10900 9990 170253
180 days 2018-06-28 10400 10900 10950 10150 155769
181 days 2018-06-27 10900 10800 11050 10500 133548

Reference

파이썬 라이브러리를 활용한 데이터 분석 (웨스 맥키니 저)

댓글남기기