Pandas Time Series 2: Datetime Index

天上飛雞 2020-05-27

Contents: Partial string indexing | Slice vs. exact match | Exact indexing | Truncating and fancy indexing | Date/time components

DatetimeIndex is mainly used as the index of pandas objects. The DatetimeIndex class contains many optimizations for time series:

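Date ranges for the various offsets are pre-computed and cached in the background, so generating subsequent date ranges is very fast (it only needs to grab a slice).
Fast shifting with the shift and tshift methods on pandas objects (tshift was deprecated in pandas 1.1 and removed in 2.0; shift with a freq argument covers the same use).
Taking the union of overlapping DatetimeIndex objects that share a frequency is very fast (important for fast data alignment).
Quick access to date fields through properties such as year and month.
Regularization functions such as snap, and very fast asof logic.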
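A few of the conveniences listed above can be tried on a small daily index. This is a minimal sketch with made-up data; the variable names are illustrative, and tshift is left out because recent pandas releases no longer ship it.

```python
import pandas as pd

# Made-up daily data, for illustration only.
idx = pd.date_range("2011-01-01", periods=5, freq="D")
s = pd.Series(range(5), index=idx)

# Date fields are exposed as attributes of the index.
years = idx.year.tolist()
days = idx.day.tolist()

# shift with freq moves the index labels and leaves the values alone.
shifted = s.shift(1, freq="D")

# asof returns the last value at or before the requested timestamp.
last_known = s.asof(pd.Timestamp("2011-01-03 12:00"))

# Union of two overlapping indexes that share a frequency.
other = pd.date_range("2011-01-04", periods=4, freq="D")
merged = idx.union(other)

print(shifted.index[0], last_known, len(merged))
```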

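DatetimeIndex objects have all the basic functionality of regular Index objects, plus a set of advanced time-series-specific methods that simplify frequency handling.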

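See also: Reindexing
Note: pandas does not force a date index to be sorted, but some of these operations may behave unexpectedly or incorrectly if the dates are unsorted.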
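One way to guard against the problem in the note is to check whether the index is sorted and sort it before slicing. The sketch below uses made-up data; the variable names are illustrative.

```python
import pandas as pd

# A deliberately unsorted date index (made-up data).
s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(["2011-03-01", "2011-01-01", "2011-02-01"]),
)

# False here, because the dates are out of order.
is_sorted_before = s.index.is_monotonic_increasing

# Sort once, then label-based slicing behaves predictably.
s_sorted = s.sort_index()
jan_to_feb = s_sorted.loc["2011-01":"2011-02"]

print(is_sorted_before, jan_to_feb.tolist())
```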

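A DatetimeIndex can be used like a regular index and supports selection, slicing, and so on. The sessions below assume that datetime, numpy (as np) and pandas (as pd) have been imported. The names start and end are not defined in this part; the output shows that they span the year 2011.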

In [94]: rng = pd.date_range(start, end, freq='BM')

In [95]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [96]: ts.index
Out[96]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [97]: ts[:5].index
Out[97]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')

In [98]: ts[::2].index
Out[98]: 
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')

Partial string indexing

Dates, and strings that parse to timestamps, can be passed as indexing parameters:

In [99]: ts['1/31/2011']
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]: 
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts['10/31/2011':'12/31/2011']
Out[101]: 
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

For convenient access to longer time series, pandas also accepts a year, or a year and month, as a string:

In [102]: ts['2011']
Out[102]: 
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

In [103]: ts['2011-6']
Out[103]: 
2011-06-30    1.071804
Freq: BM, dtype: float64
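The same year and year-month selection can be written with .loc. The sketch below uses a made-up daily series, so the row counts are easy to check by hand; a full date string matches the daily resolution exactly and returns a scalar.

```python
import numpy as np
import pandas as pd

# One value per day of 2011 (made-up data: 0, 1, 2, ...).
idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
ts_daily = pd.Series(np.arange(len(idx)), index=idx)

whole_year = ts_daily.loc["2011"]       # every row in 2011
june = ts_daily.loc["2011-06"]          # every row in June 2011
one_day = ts_daily.loc["2011-06-15"]    # exact match on a daily index: a scalar

print(len(whole_year), len(june), one_day)
```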

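A DataFrame with a DatetimeIndex supports this kind of slicing as well. Partial string selection is a form of label slicing, so the endpoints are included: every time that falls on a matching date is part of the result.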

In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
   .....:                    index=pd.date_range('20130101', periods=100000, freq='T'))
   .....: 

In [105]: dft
Out[105]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft['2013']
Out[106]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

The following selects everything from the first timestamp of January through the last timestamp of February:

In [107]: dft['2013-1':'2013-2']
Out[107]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following specifies a stop date, and every time on that last day is included:

In [108]: dft['2013-1':'2013-2-28']
Out[108]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following specifies an exact stop time; note how the result differs from the one above:

In [109]: dft['2013-1':'2013-2-28 00:00:00']
Out[109]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

The stop time is part of the index, so it is included in the selection:

In [110]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[110]: 
                            A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00  0.941451
2013-01-15 00:02:00  1.559365
2013-01-15 00:03:00  1.034374
2013-01-15 00:04:00 -1.480656
...                       ...
2013-01-15 12:26:00  0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00  0.066510
2013-01-15 12:30:00 -0.003945

[751 rows x 1 columns]
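The row counts in the outputs above follow from the endpoint rules and can be checked directly. The sketch below rebuilds a minute-frequency frame of the same shape with constant made-up values and uses .loc for the row slices; dft_demo is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same shape as dft above: 100000 rows, one per minute from 2013-01-01.
dft_demo = pd.DataFrame(
    {"A": np.zeros(100000)},
    index=pd.date_range("2013-01-01", periods=100000, freq="min"),
)

# Month-level stop: all of February is included (59 days * 1440 minutes).
n_month_end = len(dft_demo.loc["2013-1":"2013-2"])

# Day-level stop: all of February 28 is included, so the count is the same.
n_day_end = len(dft_demo.loc["2013-1":"2013-2-28"])

# Exact stop time: only 00:00 on February 28 is included (58 * 1440 + 1).
n_exact_end = len(dft_demo.loc["2013-1":"2013-2-28 00:00:00"])

print(n_month_end, n_day_end, n_exact_end)
```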

New in version 0.18.0.

DatetimeIndex partial string indexing also works on a DataFrame with a MultiIndex.

In [111]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   .....:                     columns=['A'],
   .....:                     index=pd.MultiIndex.from_product(
   .....:                         [pd.date_range('20130101', periods=10, freq='12H'),
   .....:                          ['a', 'b']]))
   .....: 

In [112]: dft2
Out[112]: 
                              A
2013-01-01 00:00:00 a -0.298694
                    b  0.823553
2013-01-01 12:00:00 a  0.943285
                    b -1.479399
2013-01-02 00:00:00 a -1.643342
...                         ...
2013-01-04 12:00:00 b  0.069036
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

[20 rows x 1 columns]

In [113]: dft2.loc['2013-01-05']
Out[113]: 
                              A
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

In [114]: idx = pd.IndexSlice

In [115]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [116]: dft2.loc[idx[:, '2013-01-05'], :]
Out[116]: 
                              A
a 2013-01-05 00:00:00  0.122297
  2013-01-05 12:00:00  0.370079
b 2013-01-05 00:00:00  1.422060
  2013-01-05 12:00:00  1.016331
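A smaller version of the MultiIndex example makes the selected rows easy to verify. This sketch uses sequential integers instead of random values, and writes the 12-hour spacing as 720 minutes; both are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Four timestamps, 12 hours apart, crossed with two labels.
times = pd.date_range("2013-01-01", periods=4, freq="720min")
mi = pd.MultiIndex.from_product([times, ["a", "b"]])
frame = pd.DataFrame({"A": np.arange(8)}, index=mi)

# A partial string on the first (datetime) level selects the whole day.
day2 = frame.loc["2013-01-02"]

# After swapping levels, IndexSlice addresses the datetime level explicitly.
swapped = frame.swaplevel(0, 1).sort_index()
idx = pd.IndexSlice
day2_swapped = swapped.loc[idx[:, "2013-01-02"], :]

print(day2["A"].tolist(), day2_swapped["A"].tolist())
```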

New in version 0.25.0.

Slicing with string indexing also honors the UTC offset.

In [117]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [118]: df
Out[118]: 
                           0
2019-01-01 00:00:00-08:00  0

In [119]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[119]: 
                           0
2019-01-01 00:00:00-08:00  0
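The slice above matches because 12:00 at +04:00 and midnight at -08:00 are the same instant. The sketch below checks that, assuming the IANA name America/Los_Angeles as a stand-in for US/Pacific and using .loc for the slice.

```python
import pandas as pd

# Assumption: America/Los_Angeles stands in for US/Pacific.
df_tz = pd.DataFrame(
    [0], index=pd.DatetimeIndex(["2019-01-01"], tz="America/Los_Angeles")
)

# 12:00 at +04:00 is 08:00 UTC, which is 00:00 at -08:00.
same_instant = pd.Timestamp("2019-01-01 12:00:00+04:00") == df_tz.index[0]

# The slice bounds carry their own offset and are compared as instants.
hit = df_tz.loc["2019-01-01 12:00:00+04:00":"2019-01-01 13:00:00+04:00"]

print(same_instant, len(hit))
```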

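Slice vs. exact match

New in version 0.20.0.

Depending on the resolution of the index, the same string can be treated either as a slice or as an exact match. If the string is less precise than the index, it is treated as a slice; if it is as precise as the index or more precise, it is treated as an exact match.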

In [120]: series_minute = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:00',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:02:00']))
   .....: 

In [121]: series_minute.index.resolution
Out[121]: 'minute'

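The timestamp string in the next example is less precise than the index of the Series: series_minute has minute resolution, while the string only goes down to the hour, so the result is a Series.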

In [122]: series_minute['2011-12-31 23']
Out[122]: 
2011-12-31 23:59:00    1
dtype: int64

A timestamp string with minute resolution (or finer) returns a scalar instead; it is not treated as a slice.

In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1

In [124]: series_minute['2011-12-31 23:59:00']
Out[124]: 1

When the index has second resolution, a minute-resolution timestamp string returns a Series.

In [125]: series_second = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:59',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:00:01']))
   .....: 

In [126]: series_second.index.resolution
Out[126]: 'second'

In [127]: series_second['2011-12-31 23:59']
Out[127]: 
2011-12-31 23:59:59    1
dtype: int64

When the timestamp string is treated as a slice, it can also be used to index a DataFrame with [].

In [128]: dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
   .....:                           index=series_minute.index)
   .....: 

In [129]: dft_minute['2011-12-31 23']
Out[129]: 
                     a  b
2011-12-31 23:59:00  1  4

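Warning: When the string is an exact match, [] on a DataFrame selects by column, not by row; see Indexing Basics. For example, dft_minute['2011-12-31 23:59'] raises a KeyError, because '2011-12-31 23:59' has the same resolution as the index and no column has that name.

To make the selection unambiguous, whether the key is treated as a slice or as a single label, use .loc to select rows.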

In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]: 
a    1
b    4
Name: 2011-12-31 23:59:00, dtype: int64
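The slice versus exact-match rule can be summarized in one short session. The sketch below rebuilds the minute-resolution frame under the illustrative name frame and relies on .loc throughout, so the result type depends only on how precise the string is.

```python
import pandas as pd

idx_min = pd.DatetimeIndex(
    ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
)
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=idx_min)

resolution = frame.index.resolution           # resolution of the index

# Hour-level string is coarser than the index: treated as a slice.
hour_match = frame.loc["2011-12-31 23"]       # DataFrame with one row

# Minute-level string matches the index resolution: exact match.
minute_match = frame.loc["2011-12-31 23:59"]  # Series holding that row

# Plain [] with an exact-match string looks for a column and fails.
try:
    frame["2011-12-31 23:59"]
    raised = False
except KeyError:
    raised = True

print(resolution, type(hour_match).__name__, type(minute_match).__name__, raised)
```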

Note that the resolution of a DatetimeIndex cannot be coarser than a day.

In [131]: series_monthly = pd.Series([1, 2, 3],
   .....:                            pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
   .....: 

In [132]: series_monthly.index.resolution
Out[132]: 'day'

In [133]: series_monthly['2011-12']  # returns a Series
Out[133]: 
2011-12-01    1
dtype: int64

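Exact indexing

As described in the previous section, indexing a DatetimeIndex with a partial string depends on the precision of the period, that is, on how specific the interval is relative to the resolution of the index. In contrast, indexing with Timestamp or datetime objects is exact, because these objects denote a precise point in time. Exact indexing also includes both endpoints.

Timestamp and datetime objects have exact hours, minutes, and seconds even when these are not given explicitly; they default to 0.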

In [134]: dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
Out[134]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

Without the defaults:

In [135]: dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
   .....:     datetime.datetime(2013, 2, 28, 10, 12, 0)]
   .....: 
Out[135]: 
                            A
2013-01-01 10:12:00  0.565375
2013-01-01 10:13:00  0.068184
2013-01-01 10:14:00  0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00  0.931536
...                       ...
2013-02-28 10:08:00  0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00  0.139348
2013-02-28 10:11:00  0.085288
2013-02-28 10:12:00  0.950146

[83521 rows x 1 columns]
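The difference between exact bounds and partial-string bounds shows up clearly on a small frame. The sketch below covers three days of minute data with made-up values and compares the two kinds of bound for the same pair of dates.

```python
import datetime

import numpy as np
import pandas as pd

# Three days of minute data (made-up values).
idx = pd.date_range("2013-01-01", periods=3 * 1440, freq="min")
frame = pd.DataFrame({"A": np.arange(len(idx))}, index=idx)

# datetime bounds are exact: the stop is 2013-01-02 00:00:00, inclusive.
exact = frame.loc[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 1, 2)]

# String bounds at day resolution cover all of both days.
by_string = frame.loc["2013-01-01":"2013-01-02"]

print(len(exact), len(by_string))
```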

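Truncating and fancy indexing

The truncate() convenience function is similar to slicing. Note that, unlike slicing, which returns every partially matching date, truncate treats any unspecified date or time component of its bounds as 0, so after='2011-12' means the very start of December.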

In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')

In [137]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)

In [138]: ts2.truncate(before='2011-11', after='2011-12')
Out[138]: 
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
Freq: W-SUN, dtype: float64

In [139]: ts2['2011-11':'2011-12']
Out[139]: 
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
2011-12-04    0.046611
2011-12-11    0.059478
2011-12-18   -0.286539
2011-12-25    0.841669
Freq: W-SUN, dtype: float64
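The two results above differ only in how the bound '2011-12' is read. The sketch below repeats the comparison on a weekly series with made-up values and counts the rows.

```python
import numpy as np
import pandas as pd

# Weekly (Sunday) series over 2011, made-up values.
rng = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts_weekly = pd.Series(np.arange(len(rng)), index=rng)

# truncate reads '2011-12' as the start of December: no December rows survive.
truncated = ts_weekly.truncate(before="2011-11", after="2011-12")

# Slicing reads '2011-12' as the whole month: December rows are included.
sliced = ts_weekly.loc["2011-11":"2011-12"]

print(len(truncated), len(sliced))
```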

Fancy indexing still returns a DatetimeIndex, but because it breaks the regularity of the index, the frequency information is dropped (note freq=None):

In [140]: ts2[[0, 2, 6]].index
Out[140]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)
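The contrast with regular slicing can be checked through the freq attribute. The sketch below uses .iloc for positional selection, which is an assumption about the safest spelling across pandas versions, on a weekly series with made-up values.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts_weekly = pd.Series(np.arange(len(rng)), index=rng)

# Irregular positions: the result has no single frequency.
picked = ts_weekly.iloc[[0, 2, 6]]

# A regular step keeps a (coarser) frequency.
stepped = ts_weekly.iloc[::2]

print(picked.index.freq, stepped.index.freq)
```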

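Date/time components

The following date/time properties can be accessed from a Timestamp or from a collection of timestamps such as a DatetimeIndex.

Property	Description
year	The year of the datetime
month	The month of the datetime
day	The day of the datetime
hour	The hour of the datetime
minute	The minutes of the datetime
second	The seconds of the datetime
microsecond	The microseconds of the datetime
nanosecond	The nanoseconds of the datetime
date	Returns datetime.date (without timezone information)
time	Returns datetime.time (without timezone information)
timetz	Returns datetime.time as local time, with timezone information
dayofyear	The ordinal day of the year
weekofyear	The week ordinal of the year
week	The week ordinal of the year
dayofweek	The number of the day of the week, Monday=0, Sunday=6
weekday	The number of the day of the week, Monday=0, Sunday=6
weekday_name	The name of the day of the week (e.g. Friday)
quarter	The quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month	The number of days in the month of the datetime
is_month_start	Logical indicating if it is the first day of the month (defined by frequency)
is_month_end	Logical indicating if it is the last day of the month (defined by frequency)
is_quarter_start	Logical indicating if it is the first day of the quarter (defined by frequency)
is_quarter_end	Logical indicating if it is the last day of the quarter (defined by frequency)
is_year_start	Logical indicating if it is the first day of the year (defined by frequency)
is_year_end	Logical indicating if it is the last day of the year (defined by frequency)
is_leap_year	Logical indicating if the date belongs to a leap year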
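Several of these properties can be read off a three-day index that crosses the end of February in a leap year. This is a minimal sketch with made-up dates; it uses the day_name() method instead of weekday_name, on the understanding that the attribute was dropped in pandas 1.0.

```python
import pandas as pd

# 2020-02-28 (Friday), 2020-02-29, 2020-03-01.
idx = pd.date_range("2020-02-28", periods=3, freq="D")

dayofweek = idx.dayofweek.tolist()          # Monday=0, Sunday=6
is_month_end = idx.is_month_end.tolist()    # True only for Feb 29
is_leap_year = idx.is_leap_year.tolist()    # 2020 is a leap year
days_in_month = idx.days_in_month.tolist()  # 29 for February 2020, 31 for March
quarter = idx.quarter.tolist()              # Jan-Mar is quarter 1
names = idx.day_name().tolist()             # English day names by default

# The same properties exist on a single Timestamp.
leap_day_of_year = pd.Timestamp("2020-02-29").dayofyear

print(dayofweek, is_month_end, days_in_month, names, leap_day_of_year)
```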