[Original] 2022-07 Advanced Pandas Review Roundup

By 小小明-代碼實體, published 2022-07-18 from Guangdong


📢 Blog homepage: https://blog.csdn.net/as604049322

📢 Likes 👍, bookmarks, and comments 📝 are welcome. Let's discuss!

📢 This article is an original work by 小小明-代碼實體, first published on CSDN 🙉

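This month I reread 《Pandas 百問百答》 (Pandas: 100 Questions and Answers) and 《joyful-pandas》 and, together with questions raised by members of my chat group, reviewed and summarized a number of topics.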

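The material below focuses on things that are easy to forget, that deepen understanding of how pandas works, that speed up execution, or that solve practical problems more simply.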

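If group members raise further questions this month that touch on new or obscure points, I will keep updating this article, so consider bookmarking it and reading it at your own pace.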


Batch-restoring Excel date columns

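When pandas reads an Excel file, date columns are always parsed as datetime, so when the data is written back to Excel the original date columns carry an all-zero time component. For example: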

import pandas as pd

df = pd.read_excel("time_data.xlsx")
df.to_excel("test.xlsx", index=False)

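The laborious fix is to reach into the underlying workbook object and set the column's display format. The simpler fix is to extract the date part: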

import pandas as pd

df = pd.read_excel("time_data.xlsx")
df.b = df.b.dt.date
df.to_excel("test.xlsx", index=False)

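That works when there are only a few date columns and you know which ones they are. If you need to batch-process many Excel files whose date columns are not known in advance, can all date columns be restored in bulk? My approach is as follows: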

import datetime

for column, s in df.select_dtypes("datetime").items():  # iteritems() was removed in pandas 2.0
    if (s.dt.time == datetime.time(0)).all():
        df[column] = s.dt.date

After this loop, every date-only column is restored, and Excel output automatically uses a pure date format for those columns.
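As a quick check that does not need an Excel file, here is a minimal sketch of the same idea on an in-memory frame. The frame and its column names are made up for illustration, and the loop iterates over column names rather than (name, Series) pairs so that it does not depend on the pandas version.

```python
import datetime

import pandas as pd

# Hypothetical frame standing in for a sheet loaded with pd.read_excel
df = pd.DataFrame({
    "a": [1, 2],
    "b": pd.to_datetime(["2022-07-01", "2022-07-02"]),                    # date-only values
    "c": pd.to_datetime(["2022-07-01 08:30:00", "2022-07-02 00:00:00"]),  # has a real time
})

for column in df.select_dtypes("datetime").columns:
    s = df[column]
    # Treat a column as date-only when every time component is 00:00:00
    if (s.dt.time == datetime.time(0)).all():
        df[column] = s.dt.date

restored = df.dtypes.astype(str).to_dict()
```

Column b becomes an object column of datetime.date values, while column c keeps its datetime dtype because one of its values carries a real time.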

Type hierarchy and type checking

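The select_dtypes method selects columns of the given dtypes. According to the official documentation:

To select all numeric columns, use np.number or 'number'.
To select string columns, by default you can only use object, which returns every column whose dtype is object. Once strings have been converted to the nullable string dtype, they can only be selected with 'string'.
To select datetime columns, use np.datetime64, 'datetime', or 'datetime64'.
To select timedelta columns, use np.timedelta64, 'timedelta', or 'timedelta64'.
To select category columns, use 'category'.
To select datetimetz columns, use 'datetimetz' or 'datetime64[ns, tz]'.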
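The selectors listed above can be tried on a small frame. This is only an illustrative sketch; the frame and its column names are assumptions, not data from the article.

```python
import pandas as pd

# Hypothetical frame with one column per kind of dtype
df = pd.DataFrame({
    "n": [1, 2],
    "f": [1.5, 2.5],
    "t": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "c": pd.Categorical(["a", "b"]),
})

numeric_cols = list(df.select_dtypes("number").columns)
datetime_cols = list(df.select_dtypes("datetime").columns)
category_cols = list(df.select_dtypes("category").columns)
# exclude works the same way and can be combined with include
non_numeric_cols = list(df.select_dtypes(exclude="number").columns)
```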

A simplified implementation of select_dtypes:

def select_dtypes(df, dtypes):
    if not pd.api.types.is_array_like(dtypes):
        dtypes = [dtypes]
    return df[df.columns[df.dtypes.isin(dtypes)]]

The NumPy website illustrates the type hierarchy:

Source: https:///doc/stable/reference/arrays.scalars.html

To select string columns, besides df.select_dtypes(include="object") you can use the shorthand df.select_dtypes("O").

However, the df.b.dt.date result from earlier also has object dtype, so such columns are selected too.

Let's write some code to inspect the NumPy and pandas type hierarchies ourselves. A simple approach:

def subdtypes1(dtype):
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes1(dt) for dt in subs]]

For NumPy:

subdtypes1(np.generic)

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.intc,
        numpy.int32,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8, numpy.uint16, numpy.uintc, numpy.uint32, numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.longdouble]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.clongdouble]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

For pandas:

subdtypes1(pd.core.dtypes.base.ExtensionDtype)

[pandas.core.dtypes.base.ExtensionDtype,
 [[pandas.core.dtypes.dtypes.PandasExtensionDtype,
   [pandas.core.dtypes.dtypes.CategoricalDtype,
    pandas.core.dtypes.dtypes.DatetimeTZDtype,
    pandas.core.dtypes.dtypes.PeriodDtype,
    pandas.core.dtypes.dtypes.IntervalDtype]],
  pandas.core.dtypes.dtypes.CategoricalDtype,
  pandas.core.dtypes.dtypes.PandasDtype,
  [pandas.core.arrays.masked.BaseMaskedDtype,
   [pandas.core.arrays.boolean.BooleanDtype,
    [pandas.core.arrays.numeric.NumericDtype,
     [[pandas.core.arrays.integer._IntegerDtype,
       [pandas.core.arrays.integer.Int8Dtype,
        pandas.core.arrays.integer.Int16Dtype,
        pandas.core.arrays.integer.Int32Dtype,
        pandas.core.arrays.integer.Int64Dtype,
        pandas.core.arrays.integer.UInt8Dtype,
        pandas.core.arrays.integer.UInt16Dtype,
        pandas.core.arrays.integer.UInt32Dtype,
        pandas.core.arrays.integer.UInt64Dtype]],
      [pandas.core.arrays.floating.FloatingDtype,
       [pandas.core.arrays.floating.Float32Dtype,
        pandas.core.arrays.floating.Float64Dtype]]]]]],
  pandas.core.arrays.sparse.dtype.SparseDtype,
  pandas.core.arrays.string_.StringDtype]]

Nested lists are hard to read, so we can use the rich library to render a tree instead:

from rich.tree import Tree

def subdtypes(dtype, tree=None):
    if tree is None:
        tree = Tree(f'{dtype.__module__}.{dtype.__qualname__}')
    subs = dtype.__subclasses__()
    if not subs:
        return
    for dt in subs:
        sub_tree=tree.add(f'{dt.__module__}.{dt.__qualname__}')
        subdtypes(dt, sub_tree)
    return tree

Displaying NumPy:

import numpy as np
subdtypes(np.generic)

numpy.generic
├── numpy.number
│   ├── numpy.integer
│   │   ├── numpy.signedinteger
│   │   │   ├── numpy.int8
│   │   │   ├── numpy.int16
│   │   │   ├── numpy.intc
│   │   │   ├── numpy.int32
│   │   │   ├── numpy.int64
│   │   │   └── numpy.timedelta64
│   │   └── numpy.unsignedinteger
│   │       ├── numpy.uint8
│   │       ├── numpy.uint16
│   │       ├── numpy.uintc
│   │       ├── numpy.uint32
│   │       └── numpy.uint64
│   └── numpy.inexact
│       ├── numpy.floating
│       │   ├── numpy.float16
│       │   ├── numpy.float32
│       │   ├── numpy.float64
│       │   └── numpy.longdouble
│       └── numpy.complexfloating
│           ├── numpy.complex64
│           ├── numpy.complex128
│           └── numpy.clongdouble
├── numpy.flexible
│   ├── numpy.character
│   │   ├── numpy.bytes_
│   │   └── numpy.str_
│   └── numpy.void
│       └── numpy.record
├── numpy.bool_
├── numpy.datetime64
└── numpy.object_

Displaying pandas:

import pandas as pd
subdtypes(pd.core.dtypes.base.ExtensionDtype)

pandas.core.dtypes.base.ExtensionDtype
├── pandas.core.dtypes.dtypes.PandasExtensionDtype
│   ├── pandas.core.dtypes.dtypes.CategoricalDtype
│   ├── pandas.core.dtypes.dtypes.DatetimeTZDtype
│   ├── pandas.core.dtypes.dtypes.PeriodDtype
│   └── pandas.core.dtypes.dtypes.IntervalDtype
├── pandas.core.dtypes.dtypes.CategoricalDtype
├── pandas.core.dtypes.dtypes.PandasDtype
├── pandas.core.arrays.masked.BaseMaskedDtype
│   ├── pandas.core.arrays.boolean.BooleanDtype
│   └── pandas.core.arrays.numeric.NumericDtype
│       ├── pandas.core.arrays.integer._IntegerDtype
│       │   ├── pandas.core.arrays.integer.Int8Dtype
│       │   ├── pandas.core.arrays.integer.Int16Dtype
│       │   ├── pandas.core.arrays.integer.Int32Dtype
│       │   ├── pandas.core.arrays.integer.Int64Dtype
│       │   ├── pandas.core.arrays.integer.UInt8Dtype
│       │   ├── pandas.core.arrays.integer.UInt16Dtype
│       │   ├── pandas.core.arrays.integer.UInt32Dtype
│       │   └── pandas.core.arrays.integer.UInt64Dtype
│       └── pandas.core.arrays.floating.FloatingDtype
│           ├── pandas.core.arrays.floating.Float32Dtype
│           └── pandas.core.arrays.floating.Float64Dtype
├── pandas.core.arrays.sparse.dtype.SparseDtype
└── pandas.core.arrays.string_.StringDtype

Type checking

To check whether a variable is numeric, we used to write:

isinstance(num, (int, float))

or

from numbers import Number
isinstance(num, Number)

pandas also provides its own check:

pd.api.types.is_number(num)

pd.api.types contains many more type-checking functions:

print([_ for _ in dir(pd.api.types) if _.startswith("is")])

['is_array_like', 'is_bool', 'is_bool_dtype', 'is_categorical', 'is_categorical_dtype', 'is_complex', 'is_complex_dtype', 'is_datetime64_any_dtype', 'is_datetime64_dtype', 'is_datetime64_ns_dtype', 'is_datetime64tz_dtype', 'is_dict_like', 'is_dtype_equal', 'is_extension_array_dtype', 'is_extension_type', 'is_file_like', 'is_float', 'is_float_dtype', 'is_hashable', 'is_int64_dtype', 'is_integer', 'is_integer_dtype', 'is_interval', 'is_interval_dtype', 'is_iterator', 'is_list_like', 'is_named_tuple', 'is_number', 'is_numeric_dtype', 'is_object_dtype', 'is_period_dtype', 'is_re', 'is_re_compilable', 'is_scalar', 'is_signed_integer_dtype', 'is_sparse', 'is_string_dtype', 'is_timedelta64_dtype', 'is_timedelta64_ns_dtype', 'is_unsigned_integer_dtype']

These make it easy to check whether a variable or column is of a given type.
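A few of these checks in action, as a small sketch. The frame is made up for illustration; the functions are the ones from the list printed above.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical frame used only to have some dtypes to test
df = pd.DataFrame({
    "n": [1, 2],
    "f": [0.5, 1.5],
    "t": pd.to_datetime(["2022-01-01", "2022-01-02"]),
})

checks = {
    "scalar_is_number": ptypes.is_number(3.14),          # scalar check
    "string_is_number": ptypes.is_number("3.14"),        # a numeric-looking string is not a number
    "n_is_integer_dtype": ptypes.is_integer_dtype(df["n"]),
    "f_is_float_dtype": ptypes.is_float_dtype(df["f"]),
    "t_is_datetime": ptypes.is_datetime64_any_dtype(df["t"]),
    "list_is_list_like": ptypes.is_list_like([1, 2]),
    "str_is_list_like": ptypes.is_list_like("ab"),        # strings are deliberately excluded
}
```

Note the split between scalar checks (is_number, is_integer, ...) and dtype checks (the *_dtype functions), which accept a dtype or an array-like.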

Type conversion

Type inference with infer_objects: any type can be stored as object dtype.

For example, when many object columns actually hold integers, floats, booleans, and so on, infer_objects restores them to the correct dtypes automatically:

df = pd.DataFrame({"A": [1, 2], "B": [2., 3.4],
                   "C": [True, False], "D": ["xxm", "dmst"]}, dtype="object")
print(df.dtypes)
df = df.infer_objects()
print("After inference:")
print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object
After inference:
A      int64
B    float64
C       bool
D     object
dtype: object

By default strings can only be stored as object dtype, but since version 1.0.0 they can be converted to the nullable string dtype, which is dedicated to text.

Type conversion is normally done with astype, for example converting text to numbers:

s = pd.Series(["1", "5", "8"])
s.astype("int")

If any of the strings cannot be converted to a number, this raises ValueError: invalid literal for int() with base 10: xxx

Besides cleaning the strings into numeric form beforehand, you can use pd.to_numeric:

m = ['apple', 2, 3]
pd.to_numeric(m, errors='coerce')

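The errors parameter controls what happens when a value cannot be converted: coerce outputs the missing value np.nan, ignore returns the original content unchanged (so the column stays object dtype), and the default raise throws an error.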

The downcast parameter converts the result to the smallest dtype that can hold the data. For example, the following data is converted to uint8:

m = ['1', 2, 3]
pd.to_numeric(m, errors='coerce', downcast='unsigned')
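To make the effect of errors and downcast concrete, here is a small sketch on made-up input values.

```python
import pandas as pd

raw = pd.Series(["1", "5", "x8"])  # hypothetical input; the last value is not numeric

# coerce turns unparseable values into NaN, which forces a float result
converted = pd.to_numeric(raw, errors="coerce")
n_bad = int(converted.isna().sum())

# downcast picks the smallest dtype able to hold the parsed values
small = pd.to_numeric(pd.Series(["1", "2", "3"]), downcast="unsigned")
small_dtype = str(small.dtype)
```

Counting the NaN values after a coerce conversion is a cheap way to find out how many inputs failed to parse.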

pandas has two similar conversion functions for time data. First, batch conversion of time intervals.

From interval strings:

m = ['5s', '1day', "3days", "4H", "6Min"]
pd.to_timedelta(m)

TimedeltaIndex(['0 days 00:00:05', '1 days 00:00:00', '3 days 00:00:00',
                '0 days 04:00:00', '0 days 00:06:00'],
               dtype='timedelta64[ns]', freq=None)

When all values share a unit, pass the numbers and the unit:

pd.to_timedelta([5, 6, 3, 1], unit="D")

TimedeltaIndex(['5 days', '6 days', '3 days', '1 days'], dtype='timedelta64[ns]', freq=None)

pd.to_datetime has many parameters; here are some common usages.

Conversion with an explicit date format:

pd.to_datetime(['18000101',"19810102"], format='%Y%m%d', errors='ignore')

DatetimeIndex(['1800-01-01', '1981-01-02'], dtype='datetime64[ns]', freq=None)

Note: missing values in a datetime series are represented as pd.NaT:

s = pd.Series(['5/11/2010', '3-12-a020', '3/13/2011'])
pd.to_datetime(s, errors="coerce")

0   2010-05-11
1          NaT
2   2011-03-13
dtype: datetime64[ns]

A DataFrame whose column names all come from ['year', 'month', 'day', 'hour', 'minute', 'second', 'ms', 'us', 'ns'] can be converted as a whole:

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
pd.to_datetime(df)

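The infer_datetime_format parameter tells pandas to try to infer the date format from the first non-null time string. If the format can be inferred, pandas switches to a faster method for parsing all the strings (the parameter is ignored when format is given). Note that it is deprecated from pandas 2.0, where format inference is the default behaviour: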

s = pd.Series(['5/11/2010', '3/12/2020', '3/13/2011'])
pd.to_datetime(s, infer_datetime_format=True)

Converting timestamps to dates:

pd.to_datetime([1575119387, 1575119687, 1575212636], unit='s')

DatetimeIndex(['2019-11-30 13:09:47', '2019-11-30 13:14:47',
               '2019-12-01 15:03:56'],
              dtype='datetime64[ns]', freq=None)

pd.to_datetime([1575119387982, 1575119687867, 1575212636675], unit='ms')

DatetimeIndex(['2019-11-30 13:09:47.982000', '2019-11-30 13:14:47.867000',
               '2019-12-01 15:03:56.675000'],
              dtype='datetime64[ns]', freq=None)

pd.to_datetime([1575119387982502912, 1575119687867502912, 1575212636675502912])

DatetimeIndex(['2019-11-30 13:09:47.982502912',
               '2019-11-30 13:14:47.867502912',
               '2019-12-01 15:03:56.675502912'],
              dtype='datetime64[ns]', freq=None)

pd.to_datetime interprets numeric timestamps as nanoseconds by default, so any other unit must be specified explicitly.

You can also specify the origin:

pd.to_datetime([0, 1, 2, 3], unit='D', origin=pd.Timestamp('2022-01-01'))

This is equivalent to:

pd.Timestamp('2022-01-01')+pd.to_timedelta(range(4), unit="D")

DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'], dtype='datetime64[ns]', freq=None)

Handling missing values with nullable dtypes

In everyday data work, an integer column automatically becomes float as soon as it contains a missing value:

pd.Series([1, np.nan]).dtype

dtype('float64')

Can a column keep its integer dtype while still holding missing values? Yes, with a nullable dtype.

For example, a column that became float because of missing values can be converted to a nullable integer dtype:

s = pd.Series([np.nan, 1])
s.astype(pd.Int16Dtype())

0    <NA>
1       1
dtype: Int16

You can also pass the dtype as a string:

s.astype("Int16")

The four nullable dtypes:

Kind	Nullable dtype	String alias
Integer	pd.Int64Dtype()	"Int64"
Float	pd.Float64Dtype()	"Float64"
Boolean	pd.BooleanDtype()	"boolean"
String	pd.StringDtype()	"string"

Missing values stored in any of these four nullable dtypes are converted to pandas' built-in pd.NA.

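A nullable boolean series differs from a plain bool series in two ways when missing values are present. When used as a mask in an indexer, boolean treats missing values as False, whereas a bool-style list containing missing values raises an error. In logical operations, bool always yields False at missing positions, while boolean returns either a missing value or a definite value depending on whether the result is determined. For example, True | pd.NA is True whatever the missing value is; False | pd.NA depends on the missing value, so it returns pd.NA; False & pd.NA is False whatever the missing value is.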
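The three-valued logic described above can be checked directly. The sketch below evaluates the scalar cases and then uses a nullable boolean mask for selection; the sample values are arbitrary.

```python
import pandas as pd

# Scalar cases from the paragraph above
kleene = {
    "True | NA": True | pd.NA,     # determined regardless of the missing value
    "False | NA": False | pd.NA,   # undetermined, stays missing
    "False & NA": False & pd.NA,   # determined regardless of the missing value
    "True & NA": True & pd.NA,     # undetermined, stays missing
}

# A nullable boolean mask: the missing entry is treated as False when selecting
s = pd.Series([10, 20, 30])
mask = pd.array([True, pd.NA, False], dtype="boolean")
picked = s[mask]
```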

convert_dtypes converts each column to a nullable dtype automatically:

df = pd.DataFrame({"A": [1, np.nan], "B": [2., np.nan],
                   "C": [True, np.nan], "D": ["xxm", np.nan]})
print(df.dtypes)
df = df.convert_dtypes()
print("After conversion:")
print(df.dtypes)

A    float64
B    float64
C     object
D     object
dtype: object
After conversion:
A      Int64
B      Int64
C    boolean
D     string
dtype: object

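Strings were originally all stored in object-dtype Series, but object dtype can also hold dicts, lists, even DataFrames, and so on. After conversion to string dtype, the values are stored strictly as strings.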

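The str accessor of an object-dtype Series does not require every value to be a string; it only requires the series to contain at least one iterable. For a column that holds Python lists, s.str[0] therefore takes the first element of each list.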
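A minimal sketch of the str accessor applied to non-string iterables; the sample values are made up.

```python
import pandas as pd

# An object column holding lists and a tuple rather than strings
s = pd.Series([[1, 2, 3], ["a", "b"], (9, 8)])

first = s.str[0]        # first element of each container
lengths = s.str.len()   # length of each container
```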

Building and iterating over DataFrames

pd.DataFrame.from_records behaves the same as passing the object directly to pd.DataFrame. Below is a pd.DataFrame.from_dict usage that is hard to achieve by passing the data directly to pd.DataFrame:

pd.DataFrame.from_dict(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    orient="index", columns=['X', 'Y', 'Z'])

To transpose, simply use .T.
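For comparison, this sketch builds the same table twice: once with from_dict(orient="index") and once by constructing the frame normally, transposing with .T, and renaming the columns. The variable names are illustrative only.

```python
import pandas as pd

data = {"A": [1, 2, 3], "B": [4, 5, 6]}

# Each dict key becomes a row; column labels are given explicitly
by_row = pd.DataFrame.from_dict(data, orient="index", columns=["X", "Y", "Z"])

# Same table via the regular constructor plus a transpose
via_transpose = pd.DataFrame(data).T
via_transpose.columns = ["X", "Y", "Z"]
```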

For JSON data, use pd.json_normalize:

data = [{
        'CreatedBy': {'Name': 'User001'},
        'Lookup': {'TextField': 'Some text',
                   'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
        'Image': {'a': 'b'}
        }]
pd.json_normalize(data)

CreatedBy.Name	Lookup.TextField	Lookup.UserField.Id	Lookup.UserField.Name	Image.a
User001	Some text	ID001	Name001	b

You can cap the nesting depth that gets flattened:

pd.json_normalize(data, max_level=1)

CreatedBy.Name	Lookup.TextField	Lookup.UserField	Image.a
User001	Some text	{'Id': 'ID001', 'Name': 'Name001'}	b

An example with a nested JSON array:

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]
pd.json_normalize(data)

state	shortname	counties	info.governor
Florida	FL	[{'name': 'Dade', 'population': 12345}, {'name…	Rick Scott
Ohio	OH	[{'name': 'Summit', 'population': 1234}, {'nam…	John Kasich

In this case, specify the record_path parameter:

result = pd.json_normalize(
    data, record_path="counties", meta=["state", "shortname", ["info", "governor"]]
)
result

name	population	state	shortname	info.governor
Dade	12345	Florida	FL	Rick Scott
Broward	40000	Florida	FL	Rick Scott
Palm Beach	60000	Florida	FL	Rick Scott
Summit	1234	Ohio	OH	John Kasich
Cuyahoga	1337	Ohio	OH	John Kasich

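Iterating over a DataFrame: iterrows is well known to be extremely slow, so it is not shown here. The methods below are each faster than the last. First, prepare 100,000 rows of test data: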

df = pd.DataFrame({"a": np.random.randint(0, 1000, 100000),
                  "b": np.random.rand(100000)})

Test results: zipping over each column's NumPy array is the fastest.
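The benchmark output itself is not reproduced here, so the following is only a sketch of two of the iteration styles being compared, on a tiny made-up frame. It shows that they compute the same result; actual timings depend on the machine and the pandas version.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (the article's test uses 100,000 rows)
df = pd.DataFrame({"a": np.arange(5), "b": np.linspace(0.0, 1.0, 5)})

# itertuples yields named tuples; fields are read by attribute
total_itertuples = sum(row.a * row.b for row in df.itertuples())

# zip over the underlying NumPy arrays, avoiding per-row object creation
total_zip = sum(a * b for a, b in zip(df["a"].values, df["b"].values))

# Vectorised reference result
total_vectorised = float((df["a"] * df["b"]).sum())
```

When the per-row logic can be expressed with column arithmetic, as in the last line, a vectorised expression is usually preferable to any explicit loop.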

itertuples yields named tuples, so fields can be read directly as attributes. For example:

Pandas(Index=0, a=637, b=0.849218922664699)

Intersection, union, difference, and symmetric difference

In plain Python we used to write:

a = set('abracadabra')
b = set('alacazam')
print("Difference:", a - b)  # elements in a but not in b
# {'r', 'd', 'b'}
print("Union:", a | b)  # every element in a or b
# {'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}
print("Intersection:", a & b)  # elements in both a and b
# {'a', 'c'}
print("Symmetric difference:", a ^ b)  # elements in exactly one of a and b

Difference: {'r', 'd', 'b'}
Union: {'m', 'a', 'd', 'b', 'l', 'r', 'z', 'c'}
Intersection: {'a', 'c'}
Symmetric difference: {'r', 'd', 'b', 'l', 'm', 'z'}

pandas Index objects support the same operations:

a = pd.Index(list('abracadabra'))
b = pd.Index(list('alacazam'))
print("Difference:", a.difference(b))
print("Union:", a.union(b).unique())
print("Intersection:", a.intersection(b))
print("Symmetric difference:", a.symmetric_difference(b))

Difference: Index(['b', 'd', 'r'], dtype='object')
Union: Index(['a', 'b', 'c', 'd', 'l', 'm', 'r', 'z'], dtype='object')
Intersection: Index(['a', 'c'], dtype='object')
Symmetric difference: Index(['b', 'd', 'l', 'm', 'r', 'z'], dtype='object')

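An Index converts easily to and from a Series, so these methods effectively give you intersection, difference, and union on a single column.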

To intersect a Series without converting it to an Index, use isin directly:

a = pd.Series(list('abracadabra'))
a[a.isin(list('alacazam'))]

0     a
3     a
4     c
5     a
7     a
10    a
dtype: object

Unlike Index intersection, this keeps duplicates.

For the intersection, union, and difference of two DataFrames, use the following:

# Difference (rows in df1 but not in df2)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
# Union
df1.merge(df2, how='outer')
# Intersection
df1.merge(df2)
# Symmetric difference
pd.concat([df1, df2]).drop_duplicates(keep=False)
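A small worked example of these four recipes, using two made-up frames. One assumption worth stating: the concat plus drop_duplicates(keep=False) tricks rely on neither frame containing duplicate rows of its own, since such rows would be dropped as well.

```python
import pandas as pd

# Hypothetical inputs sharing the same columns, with no internal duplicates
df1 = pd.DataFrame({"k": [1, 2, 3], "v": ["a", "b", "c"]})
df2 = pd.DataFrame({"k": [2, 3, 4], "v": ["b", "c", "d"]})

# df2 is added twice so that rows unique to df2 are also dropped
difference = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
union = df1.merge(df2, how="outer")
intersection = df1.merge(df2)
sym_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
```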

Related article:

Custom shift in pandas and DataFrame set difference
https://xxmdmst.blog.csdn.net/article/details/118887322

Index filtering, alignment, and MultiIndex

Using reindex

Suppose we have a code table and a table of letters and want to look up the code for each letter. When the code table covers every letter:

s = pd.Series({"a": 1, "b": 2, "c": 3})
df = pd.DataFrame({"s": list("acbaac")})
df["d"] = s[df.s].values
df

Now suppose the letter table contains letters that are missing from the code table:

df = pd.DataFrame({"s": list("acbddaac")})

The approach above now raises an error because the target labels are not found. Use reindex instead:

df["d"] = s.reindex(df.s).values

An even simpler way:

df["d"] = df.s.map(s)
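The two lookups can be compared side by side. This sketch reuses the code table and the letter column from the text and checks that reindex and map agree, including where the letter has no code.

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})          # code table
df = pd.DataFrame({"s": list("acbddaac")})        # contains "d", which has no code

via_reindex = pd.Series(s.reindex(df["s"]).values)
via_map = pd.Series(df["s"].map(s).values)

n_missing = int(via_map.isna().sum())             # letters without a code become NaN
```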

An Index's get_indexer method returns the positions of many targets at once:

a = pd.Index(['c', 'b', 'a'])
a.get_indexer(['c', 'a', 'd', 'b', 'b', 'c', 'a'])

array([ 0,  2, -1,  1,  1,  0,  2], dtype=int64)

Missing elements return -1. get_loc returns the position of a single element and raises an error if it is absent:

a.get_loc("b")

Automatic index alignment on assignment

df = pd.DataFrame({"s": range(6)})
df.s = pd.Series({3: "v3", 5: "v5", 1: "v1", 7: "v7"})
df

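In the result, values are assigned label by label: rows 1, 3, and 5 receive their values, the other rows become NaN, and the extra entry (label 7) is silently dropped.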

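To keep the parts of the DataFrame that were not assigned, besides restoring them afterwards with fillna or combine_first, you can restrict the assignment to the matching rows: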

df = pd.DataFrame({"s": range(6)})
t = pd.Series({3: "v3", 5: "v5", 1: "v1", 7: "v7"})
df.loc[t.index.intersection(df.index), "s"] = t
df

Note: every label passed to loc must exist in the target index, otherwise an error is raised.

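So when assigning a Series to a DataFrame column, always check that the indexes correspond. If the indexes do not match and only the order of values does, assign the underlying NumPy array instead.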
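The difference between label alignment and positional assignment is easy to see on a frame whose index does not start at zero. The frame and column names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])
other = pd.Series([100, 200, 300])      # default index 0, 1, 2

# Assignment aligns on labels: none of 0, 1, 2 exist in df, so everything is NaN
df["aligned"] = other

# Taking the NumPy array drops the index, so values are assigned by position
df["by_position"] = other.values
```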

MultiIndex

Creating a MultiIndex:

pd.MultiIndex.from_product([("a", "b"), range(2)])

MultiIndex([('a', 0),
            ('a', 1),
            ('b', 0),
            ('b', 1)],
           )

The standard library can generate similar tuples:

import itertools

list(itertools.product(("a", "b"), range(2)))

[('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Tuples you build yourself can be turned into a MultiIndex with pd.MultiIndex.from_tuples:

t = itertools.product(("a", "b"), range(2))
pd.MultiIndex.from_tuples(t)

get_level_values returns the index of a given level:

muti = pd.MultiIndex.from_product([("a", "b"), range(2)])
print(muti.get_level_values(0))
print(muti.get_level_values(1))

Index(['a', 'a', 'b', 'b'], dtype='object')
Int64Index([0, 1, 0, 1], dtype='int64')

Filtering a MultiIndex

Take the following MultiIndexed data:

np.random.seed(0)
L1, L2 = ['A', 'B', 'C'], ['a', 'b', 'c']
mul_index1 = pd.MultiIndex.from_product([L1, L2], names=('Upper', 'Lower'))
L3, L4 = ['D', 'E', 'F'], ['d', 'e', 'f']
mul_index2 = pd.MultiIndex.from_product([L3, L4], names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9, 10, (9, 9)),
                     index=mul_index1,
                     columns=mul_index2)
df_ex

To specify a separate filter for each level, use the pd.IndexSlice object:

idx = pd.IndexSlice
df_ex.loc[idx[['C', 'A'], 'b':], idx['E':, ["d", "f"]]]

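This keeps the rows whose Upper is 'C' or 'A' and whose Lower is 'b' or later, and the columns whose Big is 'E' or later and whose Small is 'd' or 'f'.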
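Here is a smaller, self-contained sketch of per-level filtering with pd.IndexSlice. The frame is a reduced, deterministic stand-in for the one in the text, and the assertions on it concern only which labels survive, not their order.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]],
                                  names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]],
                                  names=("Big", "Small"))
df_ex = pd.DataFrame(np.arange(36).reshape(9, 4), index=rows, columns=cols)

idx = pd.IndexSlice
# Rows: Upper in {A, C} and Lower from 'b' onward; columns: Big from 'E' onward and Small == 'd'
sub = df_ex.loc[idx[["A", "C"], "b":], idx["E":, ["d"]]]

kept_upper = sorted(sub.index.get_level_values("Upper").unique())
kept_lower = sorted(sub.index.get_level_values("Lower").unique())
```

Label slices such as "b": require the index to be lexicographically sorted, which from_product guarantees here.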

Now a three-level example:

np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
             names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
             names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                        index=mul_index1,
                        columns=mul_index2)
df_ex

The same per-level filtering applies here.

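Levels are swapped with swaplevel and reorder_levels. The former can only swap two levels, while the latter can rearrange any number of levels. Both accept an axis argument to choose between the row index and the column index:

(
    df_ex.swaplevel(1, 2, axis=1)   # swap the second and third levels of the column index
    .reorder_levels([2, 0, 1], axis=0)   # put the row-index levels in the given order
    .head()
)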

Dropping index levels:

df_ex.droplevel([1, 2], axis=0)

Use rename_axis to change level names and rename to change index values. For a MultiIndex, rename needs the level to modify (level) and a mapping dict (or function).
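A short sketch of both calls on a two-level index; the frame and the replacement names are invented for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=["Upper", "Lower"])
df = pd.DataFrame({"v": range(4)}, index=idx)

# rename_axis changes the level names
renamed_axis = df.rename_axis(index={"Upper": "Changed_Upper"})

# rename changes the values of one level, via a dict ...
renamed_vals = df.rename(index={"a": "alpha"}, level="Lower")
# ... or via a function
lowered = df.rename(index=str.lower, level=0)
```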

To replace the values of a specific level, use:

df_ex.index = df_ex.index.set_levels(list('abcdefgh'), level=2)

Time-series resampling and faster rolling windows

Take the following time series:

s = pd.Series(range(1, 6), pd.date_range(
    "2022-01-01", periods=5, freq="2D"))
s

2022-01-01    1
2022-01-03    2
2022-01-05    3
2022-01-07    4
2022-01-09    5
Freq: 2D, dtype: int64

Downsampling aggregates the data to a regular, lower frequency:

s.resample("5D").sum()

2022-01-01    6
2022-01-06    9
Freq: 5D, dtype: int64

Upsampling converts data from a lower frequency to a higher one:

s.asfreq("D")

2022-01-01    1.0
2022-01-02    NaN
2022-01-03    2.0
2022-01-04    NaN
2022-01-05    3.0
2022-01-06    NaN
2022-01-07    4.0
2022-01-08    NaN
2022-01-09    5.0
Freq: D, dtype: float64

For the missing values that upsampling produces, besides fill methods such as fillna and ffill, you can interpolate with interpolate:

s.asfreq("D").interpolate()

2022-01-01    1.0
2022-01-02    1.5
2022-01-03    2.0
2022-01-04    2.5
2022-01-05    3.0
2022-01-06    3.5
2022-01-07    4.0
2022-01-08    4.5
2022-01-09    5.0
Freq: D, dtype: float64

For the full usage of interpolate, see: https://pandas./docs/reference/api/pandas.Series.interpolate.html

When a rolling window needs a custom function and the data is large, passing engine='numba' to apply may speed things up considerably. For example:

s.rolling('30D').apply(lambda x: x.sum()/x.size, engine='numba', raw=True)

Building test data:

idx = pd.date_range('19800101', '20221231', freq='B')
data = np.random.randint(-1, 2, len(idx)).cumsum()  # random walk used as a simulated series
s = pd.Series(data, index=idx)
s

1980-01-01      1
1980-01-02      2
1980-01-03      3
1980-01-04      2
1980-01-07      3
             ... 
2022-12-26   -177
2022-12-27   -178
2022-12-28   -179
2022-12-29   -180
2022-12-30   -179
Freq: B, Length: 11219, dtype: int32

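Note: at the time of writing, only the apply method of pandas.core.window.rolling.Rolling objects has this parameter; the apply methods of DataFrame and pandas.core.groupby.GroupBy objects do not support it. The engine parameter of Rolling.apply was introduced in version 1.0.0.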
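Because numba is an optional dependency, the sketch below leaves engine at its default and only demonstrates the raw=True part: the custom function receives a plain NumPy array per window, which is also the form the numba engine requires. The series is a small made-up one, and the result is checked against the built-in rolling mean.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", periods=10, freq="D")
s = pd.Series(np.arange(10, dtype="float64"), index=idx)

def window_mean(x):
    # With raw=True, x is a NumPy array rather than a Series
    return x.sum() / x.size

via_apply = s.rolling("3D").apply(window_mean, raw=True)
builtin = s.rolling("3D").mean()
```

If numba is installed, adding engine='numba' to the apply call is the only change needed. When a built-in method such as mean exists, it is normally faster than any custom function.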

Grouping and aggregation

For the complete list of groupby attributes and methods, see: https://pandas./docs/reference/groupby.html

Attributes of the groupby object

The methods called in grouped operations all come from pandas groupby objects:

import pandas as pd

animals = pd.DataFrame({'品種': ['貓', '狗', '貓', '狗'],
                        '身高': [9.1, 6.0, 9.5, 34.0],
                        '體重': [7.9, 7.5, 9.9, 198.0]})
gb = animals.groupby("品種")
gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002985A85B548>

Number of groups:

gb.ngroups

Index labels for each group:

gb.groups

{'狗': [1, 3], '貓': [0, 2]}

Getting the contents of a specific group, for example:

gb.get_group("狗")

is equivalent to:

animals.query("品種=='狗'")

The agg aggregation function

The most practical form is named aggregation with (column, function) tuples:

animals.groupby('品種').agg(
    最低身高=('身高', "min"),
    最高身高=('身高', "max"),
    平均體重=('體重', "mean"),
)

If you apply a single operation to some columns and do not need to rename them, use the basic form:

animals.groupby('品種').agg({'身高': 'mean', '體重': 'mean'})

Multiple aggregations can be applied to a single column:

animals.groupby('品種').身高.agg(["min", "max"])

Renaming each aggregation applied to a single column:

animals.groupby('品種').身高.agg(
    最低="min",
    最高="max",
)

Or:

animals.groupby('品種').身高.agg([("最低", "min"), ("最高", "max")])

Note: agg can also be used on an ungrouped DataFrame or Series.

transform

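When given a custom function, transform and agg both pass the same argument: the Series for each column.

When aggregating, the custom function passed to agg must return a scalar.

transform returns a DataFrame whose row and column indexes match the source. When the custom function returns a scalar, the value is broadcast across its whole group: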

animals.groupby('品種').transform("min")

   身高  體重
0  9.1  7.9
1  6.0  7.5
2  9.1  7.9
3  6.0  7.5

Usually we select just the column to broadcast, for example:

animals.groupby('品種').身高.transform("min")
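A typical use of this broadcasting is to compare each row with its own group's statistic. The sketch below uses English column names (kind, height) as stand-ins for the article's columns; they are assumptions made for readability.

```python
import pandas as pd

animals = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
})

# Group minimum broadcast back to every row of the group
group_min = animals.groupby("kind")["height"].transform("min")

# Deviation of each row from its group mean; same length and index as the source
demeaned = animals["height"] - animals.groupby("kind")["height"].transform("mean")
```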

apply

Calling apply directly on a DataFrame also passes each column's Series to the custom function. Calling apply on a DataFrameGroupBy object passes DataFrames, one per group, split along the row index.

Scalar example: mean BMI

animals.groupby('品種').apply(lambda x: (x.體重/x.身高**2).mean())

品種
狗    0.189807
貓    0.102547
dtype: float64

Returning a Series: the Series index becomes the column index

animals.groupby('品種').agg(最低身高=('身高', "min"), 平均體重=('體重', "mean"))

The same result can be obtained with apply by returning a Series:

animals.groupby('品種').apply(lambda x: pd.Series([x.身高.min(), x.體重.mean()], index=['最低身高', '平均體重']))

Returning a DataFrame: the returned DataFrame's columns become the column index

The result for each group can be shaped freely:

animals.groupby('品種').apply(lambda x: pd.DataFrame(
    np.ones((2, 2), "int8"), index=['a', 'b'],
    columns=pd.Index([('w', 'x'), ('y', 'z')])
))

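The result only gains a MultiIndex when the DataFrame returned by the custom function has a different index from the input DataFrame. In that case we usually remove the extra level with droplevel, for example: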

animals.groupby('品種').apply(lambda x: x.query("身高>9")).droplevel(0)

wide_to_long, an extended version of melt

The commonly used pandas reshaping functions are pivot, pivot_table, melt, crosstab, explode, and get_dummies, plus the index-reshaping functions stack and unstack.

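Briefly: pivot and melt convert between long and wide tables, what SQL users call turning rows into columns and columns into rows. pivot_table implements an Excel-style pivot table, and crosstab is a special case of the pivot table that by default only counts occurrences. explode expands lists into multiple rows, and get_dummies generates dummy (one-hot) encodings.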

unstack moves a row-index level into the column index, and stack pushes a column-index level into the row index. Combined with groupby, stack and unstack can do the same job as pivot and melt.

These common functions are not covered further here; only wide_to_long is introduced. Here is what it does:

df = pd.DataFrame({'Class':[1,2],'Name':['San Zhang', 'Si Li'],
                   'Chinese_Mid':[80, 75], 'Math_Mid':[90, 85],
                   'Chinese_Final':[80, 75], 'Math_Final':[90, 85]})
pd.wide_to_long(df,
                stubnames=['Chinese', 'Math'],
                i=['Class', 'Name'],
                j='Examination',
                sep='_',
                suffix='.+')

Achieving the same result with melt requires the following code:

df_melt = df.melt(id_vars=['Class', 'Name'], value_vars=[
    "Chinese_Mid", "Math_Mid", "Chinese_Final", "Math_Final"],
    var_name="Subject_Examination",
    value_name='grade')
df_melt
df_melt[["Subject", "Examination"]] = df_melt.Subject_Examination.str.split(
    "_", expand=True)
df_melt.drop(columns=["Subject_Examination"], inplace=True)
df_melt.set_index(["Class","Name", "Examination", "Subject"]).unstack("Subject").droplevel(0, axis="columns")

Filtering whole groups

The filter method selects whole groups from a groupby object; the custom function receives each group's sub-DataFrame of the source data.

Below we remove, for each station, the years in which every value is 0, and then the stations that have only one year of data. Prepare the test data:

import pandas as pd
import numpy as np
np.random.seed(0)
date = np.random.choice(pd.date_range(
    "2019-02-01", "2022-07-17", freq="4M"), 20)
stcd = np.random.choice(["X1005", "X1092", "Y7205"], 20)
p = np.random.permutation([0]*15+list(range(1, 6)))
df = pd.DataFrame({"date": date, "stcd": stcd, "p": p})
df.sort_values(["stcd", "date"], inplace=True, ignore_index=True)
df

Remove each station's all-zero years:

df = df.groupby([
    "stcd",
    df.date.dt.year
]).filter(lambda x: (x.p != 0).any())

Remove stations that have only one year of data:

df.groupby("stcd").filter(lambda x: x.date.dt.year.nunique() > 1)
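Since the random test data makes the outcome hard to follow by eye, here is the same pair of filters on a tiny hand-written frame. The station codes and the precomputed year column are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "stcd": ["X1", "X1", "X2", "X2", "X3"],
    "year": [2020, 2021, 2020, 2020, 2021],
    "p":    [0,    3,    0,    0,    5],
})

# Keep only (station, year) groups that contain at least one non-zero value
non_zero_years = df.groupby(["stcd", "year"]).filter(lambda g: (g["p"] != 0).any())

# Keep only stations whose rows span more than one distinct year
multi_year = df.groupby("stcd").filter(lambda g: g["year"].nunique() > 1)
```

The function passed to filter must return a single boolean per group; the rows of groups that return True are kept with their original index.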

pandas option settings

Available options (from: https://pandas./docs/reference/api/pandas.describe_option.html):

compute.[use_bottleneck, use_numba, use_numexpr]
display.[chop_threshold, colheader_justify, column_space, date_dayfirst, date_yearfirst, encoding, expand_frame_repr, float_format]
display.html.[border, table_schema, use_mathjax]
display.[large_repr]
display.latex.[escape, longtable, multicolumn, multicolumn_format, multirow, repr]
display.[max_categories, max_columns, max_colwidth, max_dir_items, max_info_columns, max_info_rows, max_rows, max_seq_items, memory_usage, min_rows, multi_sparse, notebook_repr_html, pprint_nest_depth, precision, show_dimensions]
display.unicode.[ambiguous_as_wide, east_asian_width]
display.[width]
io.excel.ods.[reader, writer]
io.excel.xls.[reader, writer]
io.excel.xlsb.[reader]
io.excel.xlsm.[reader, writer]
io.excel.xlsx.[reader, writer]
io.hdf.[default_format, dropna_table]
io.parquet.[engine]
io.sql.[engine]
mode.[chained_assignment, data_manager, sim_interactive, string_storage, use_inf_as_na, use_inf_as_null]
plotting.[backend]
plotting.matplotlib.[register_converters]
styler.format.[decimal, escape, formatter, na_rep, precision, thousands]
styler.html.[mathjax]
styler.latex.[environment, hrules, multicol_align, multirow_align]
styler.render.[encoding, max_columns, max_elements, max_rows, repr]
styler.sparse.[columns, index]

Show descriptions of all options:

pd.describe_option()

Pass a name to show only the options that contain it:

pd.describe_option("display")

Setting an option:

pd.options.display.max_rows = 100

With this form you can type pd.options. and press Tab for completion to find the option you need.

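Options can also be set with set_option:

pd.set_option("display.max_r", 100)

The full option name is display.max_rows; set_option finds the unique option matching the pattern by regular-expression search. If the pattern matches more than one option (as a bare "max_r" does once styler.render.max_rows exists in the option list above), an error is raised.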

reset_option can reset several options at once (using a regular expression):

pd.reset_option("^display")

option_context() applies options only within a with block:

In [21]: with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
   ....:     print(pd.get_option("display.max_rows"))
   ....:     print(pd.get_option("display.max_columns"))
   ....: 
10
5

In [22]: print(pd.get_option("display.max_rows"))
60

In [23]: print(pd.get_option("display.max_columns"))
0

For more, see: https://pandas./docs/user_guide/options.html

Small worked examples

Converting between multiple columns and a single column of lists

df = pd.DataFrame([
    ['90', '51', '07'],
    ['99', '35', '33'],
    ['100', '14', '30'],
    ['99', '01', '11'],
    ['100', '08', '16']
])
df
# Convert multiple columns into a single column of lists
s = df.apply(list, axis=1)
s
# Convert the single column of lists back into a multi-column DataFrame
s.apply(pd.Series)

More column-splitting examples:

Three examples of splitting list and dict columns with pandas
https://xxmdmst.blog.csdn.net/article/details/112789571

Binary search

We used to do binary search with the bisect library:

import bisect

a = [1, 3, 5]
print(bisect.bisect(a, 1), bisect.bisect(a, 2), bisect.bisect(a, 3))
print(bisect.bisect_left(a, 1), bisect.bisect_left(a, 2), bisect.bisect_left(a, 3))
print(bisect.bisect_right(a, 1), bisect.bisect_right(a, 2), bisect.bisect_right(a, 3))

1 1 2
0 1 1
1 1 2

pandas in fact has a built-in batch binary search:

ser = pd.Series([1, 3, 5])
print(ser.searchsorted([1, 2, 3]))
print(ser.searchsorted([1, 2, 3], side='left'))
print(ser.searchsorted([1, 2, 3], side='right'))

[0 1 1]
[0 1 1]
[1 1 2]

The only difference is that bisect is equivalent to bisect_right, whereas searchsorted's side defaults to left.
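The correspondence between the two APIs can be verified element by element. This sketch reuses the small sorted list from above.

```python
import bisect

import pandas as pd

a = [1, 3, 5]
targets = [1, 2, 3]
ser = pd.Series(a)

# One vectorised call per side instead of one bisect call per target
left = ser.searchsorted(targets, side="left").tolist()
right = ser.searchsorted(targets, side="right").tolist()

left_ref = [bisect.bisect_left(a, t) for t in targets]
right_ref = [bisect.bisect_right(a, t) for t in targets]
```

As with bisect, searchsorted assumes the data is already sorted; it does not check this.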

Sorting in a custom order

Besides a helper column, custom-order sorting in pandas can use the category dtype to define the order. Given the following data:

sales = pd.DataFrame({'分公司': ['上海', '廣州', '深圳', '北京', '上海', '深圳', '廣州', '北京', '北京'],
                  '銷售額': [26677, 16544, 15655, 36986, 18923, 44161, 26409, 93223, 56586],
                   '門店': ['上海一店', '廣州二店',  '深圳二店', '北京一店', '上海二店',
                          '深圳一店',  '廣州一店',  '北京二店', '北京三店']},
                  index=pd.Index(range(1, 10), name="序號"))
sales

To sort in the order 北京, 上海, 廣州, 深圳, just set the categories:

sales.分公司 = sales.分公司.astype("category").cat.set_categories(['北京', '上海', '廣州', '深圳'])

Or create the Categorical directly:

sales.分公司 = pd.Categorical(sales.分公司, categories=['北京', '上海', '廣州', '深圳'])

Then sort by 分公司:

sales.sort_values(by='分公司')
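A compact, self-contained version of the same technique, with English city names and a made-up sales column standing in for the article's data. It also sorts by a second key to show that the categorical order combines with ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Shanghai", "Guangzhou", "Beijing", "Shenzhen", "Beijing"],
    "sales": [5, 4, 3, 2, 1],
})

order = ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"]   # desired custom order
df["city"] = pd.Categorical(df["city"], categories=order, ordered=True)

# Sorting follows the category order rather than alphabetical order
result = df.sort_values(["city", "sales"])
```

Any value that is not listed in categories becomes NaN, so the category list should cover every value in the column.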

Joining column contents within groups

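Goal: within each company and department, join the headcount and operating-cost pairs into a single string.

Complete code: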

import pandas as pd

df = pd.DataFrame({'公司': ['蘋果', '蘋果', '谷歌', '谷歌', '谷歌', '谷歌', '谷歌'],
                   '部門': ['產品部', '研發部', '產品部', '產品部', '研發部', '研發部', '研發部'],
                   '部門人數': [1, 2, 3, 4, 5, 6, 7],
                   '運營成本': [10, 20, 30, 40, 50, 60, 70]})

df['部門人數：運營成本'] = df.部門人數.astype("str")+"："+df.運營成本.astype("str")
df.groupby(['公司', '部門'], as_index=False)['部門人數：運營成本'].agg('；'.join)

Converting coordinates between degrees-minutes-seconds and decimal degrees

Test data:

df = pd.DataFrame({'lon': ['905107', '993533', '1001430', '990111', '1000816',
                           '1013637', '945430', '1014359', '1012210',
                           '101°34′37″', '930450', '1001542', '995847']})

Degrees-minutes-seconds to decimal degrees:

import re


def func(x):
    # Raw string so that \d is a regex escape rather than a Python string escape
    return sum(int(num) / (60 ** i)
               for i, num in enumerate(re.match(r"(\d{2,3})[^\d]*(\d{2})[^\d]*(\d{2})[^\d]*$", str(x)).groups()))


df["r1"] = df.lon.apply(func)

Decimal degrees to degrees-minutes-seconds:

def func(x):
    d, r = divmod(x, 1)
    m, r = divmod(r*60, 1)
    s = round(r*60)
    return f"{int(d):0>2}°{int(m):0>2}′{s:0>2}″"

df["r2"] = df.r1.apply(func)
df
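One edge case in the decimal-to-DMS direction is worth a sketch: rounding the seconds after splitting off degrees and minutes can produce 60″ for values just below a whole minute. The variant below is an alternative, not the article's code; it rounds once on total seconds and then splits. The function names are hypothetical.

```python
import re

def dms_to_decimal(text):
    # Accepts both compact ("905107") and symbol-separated ("101°34′37″") input
    d, m, s = re.match(r"(\d{2,3})\D*(\d{2})\D*(\d{2})\D*$", str(text)).groups()
    return int(d) + int(m) / 60 + int(s) / 3600

def decimal_to_dms(x):
    total_seconds = round(x * 3600)        # round once, on whole seconds
    d, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{d:0>2}°{m:0>2}′{s:0>2}″"

compact = decimal_to_dms(dms_to_decimal("905107"))
symbols = decimal_to_dms(dms_to_decimal("101°34′37″"))
near_whole = decimal_to_dms(29.9999999)    # carries into the degree instead of showing 60″
```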


Generating and sorting one million rows of test data in 2 seconds

import pandas as pd
import numpy as np

sales_people = pd.Series({"陳天浩": "上海", "孫健": "上海", "王梓戎": "廣東", "劉丹": "上海",
                          "劉穎": "上海", "劉雪": "天津", "章洋": "上海", "殷琳": "廣東",
                          "李輝": "北京", "王玉": "吉林", "侯寧": "上海", "吳中岳": "廣東",
                          "張林": "廣東", "莊雷": "上海", "王宇": "吉林", "利坤": "上海",
                          "董丹丹": "廣東", "蔡建平": "山東", "陳楊": "吉林", "蔡勇": "廣東",
                          "李琳": "上海", "魏蒼生": "天津", "劉帆": "天津", "戴雪": "上海",
                          "許亮": "吉林", "李智童": "山東", "錢國": "山東", "郭華鋒": "吉林",
                          "閻云": "山東", "江敏": "上海"})
products = pd.Series({"蘋果": 10, "梨": 8, "桃": 6.5, "葡萄": 15, "椰子": 20,
                      "西瓜": 30, "百香果": 12, "榴蓮": 50, "桔子": 6, "香蕉": 7.5})
size = 1000000
date = np.random.choice(pd.date_range('2022-01-01', '2022-12-31'), size)
customer_id = np.random.randint(1, 1000, size)
sale_name = np.random.choice(sales_people.index, size)
region = sales_people[sale_name].values
product = np.random.choice(products.index, size)
price = products[product].values
quantity = np.random.randint(1, 10000, size)
revenue = price * quantity
df = pd.DataFrame({"交易日期": date, "客戶ID": customer_id, "售貨員": sale_name, "分公司": region,
                   "產品": product, "單價": price, "數量": quantity, "訂單金額": revenue})
df.客戶ID = "C"+df.客戶ID.astype("str").str.zfill(4)
df.sort_values(['交易日期', '分公司', '售貨員'], ignore_index=True, inplace=True)
df

Compared with the 3 minutes for 10,000 rows in 呆叔's original article, this is roughly 10,000 times faster per row. Original article: 《不會爬，沒數據？沒關系！3分鐘搞定1w+數據，超實用！》

Sequential numbering when adjacent dates are more than 4 days apart

Generating test data:

import pandas as pd
import numpy as np

size = 5000000
df = pd.DataFrame({
    "id": np.random.randint(1, 501, size),
    "date": pd.date_range("2010-01-01", periods=size, freq="5min")  # the "5T" alias is deprecated in newer pandas
})
df.sort_values(["id", "date"], ascending=[True, False], inplace=True)

Processing code:

diff = (df.groupby("id")["date"].shift()-df.date) > pd.Timedelta("4 days")
diff_cumsum = diff.groupby(df.id).cumsum()+1
df["new_id"] = df.id.astype("str")+"-"+diff_cumsum.astype("str")
df
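To see what the three processing lines do, here is the same logic on six hand-written rows, already sorted by id ascending and date descending as in the text. The dates are invented so that exactly one gap exceeds 4 days.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2022-01-20", "2022-01-18", "2022-01-10",
                            "2022-01-09", "2022-01-05", "2022-01-01"]),
})

# Gap to the previous (later) row within the same id; the first row of each id compares as False
gap = (df.groupby("id")["date"].shift() - df["date"]) > pd.Timedelta("4 days")

# Each gap starts a new block; the running count per id gives the block number
block = gap.groupby(df["id"]).cumsum() + 1

df["new_id"] = df["id"].astype(str) + "-" + block.astype(str)
```

For id 1 the 8-day gap between 2022-01-18 and 2022-01-10 starts block 2. For id 2 the gap is exactly 4 days, which is not greater than the threshold, so both rows stay in block 1.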