Python Basics (11) | A Very Detailed 30,000-Word Summary of the Pandas Library (Part 5)

Posted by timerring on 2022/10/07 09:38:19

11.7 Other topics

(1) Vectorized string operations
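
The source lists this topic without an example. As a minimal sketch, assuming a small hypothetical Series of city names (the data and variable names are not from the original), the `.str` accessor applies a string method to every element and leaves missing values missing:

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
cities = pd.Series(["beijing", "ShangHai", None, "shenzhen"])

# .str applies the method element-wise; the missing entry stays missing
capitalized = cities.str.capitalize()
lengths = cities.str.len()
prefix = cities.str[:3]  # slicing also works through the accessor

print(capitalized.tolist())
print(lengths.tolist())
print(prefix.tolist())
```

Calling `cities.capitalize()` directly would fail, because a Series has no such method; the accessor is what makes the operation vectorized.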

(2) Working with time series
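
The source gives no example for this topic either. A minimal sketch, assuming six hypothetical daily values (not from the original), shows a `DatetimeIndex`, selection by date strings, and downsampling with `resample()`:

```python
import numpy as np
import pandas as pd

# Hypothetical daily values 0..5, for illustration only
dates = pd.date_range("2022-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6), index=dates)

# Slicing by date strings includes both endpoints
first_three = ts.loc["2022-01-01":"2022-01-03"]

# Downsample to 2-day bins and sum the values in each bin
two_day_sum = ts.resample("2D").sum()

print(first_three.tolist())   # [0, 1, 2]
print(two_day_sum.tolist())   # [1, 5, 9]
```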

(3) Multi-level indexing: for multi-dimensional data

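import numpy as np
import pandas as pd

# One row per (city, year) pair; columns are population and GDP
base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]])
# Passing two parallel lists as the index builds a two-level MultiIndex
data = pd.DataFrame(
    base_data,
    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
           [2008, 2018] * 4],
    columns=["population", "GDP"],
)
data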
               population    GDP
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
data.index.names = ["city", "year"]
data
               population    GDP
city     year
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
data["GDP"]
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008    14070
2018    32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
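
The selections above can also be written without chained indexing. The following is a sketch under stated assumptions: it rebuilds only the BeiJing and ShangHai rows with `pd.MultiIndex.from_product`, and the names `gdp_sh_2018`, `year_2018`, and `gdp_wide` are illustrative, not from the original.

```python
import numpy as np
import pandas as pd

# Subset of the data above: two cities, two years
base_data = np.array([[1771, 11115], [2154, 30320],
                      [2141, 14070], [2424, 32680]])
index = pd.MultiIndex.from_product([["BeiJing", "ShangHai"], [2008, 2018]],
                                   names=["city", "year"])
data = pd.DataFrame(base_data, index=index, columns=["population", "GDP"])

# A tuple addresses one row of the MultiIndex; the column label goes second
gdp_sh_2018 = data.loc[("ShangHai", 2018), "GDP"]

# xs() takes a cross-section on the inner level: every city in 2018
year_2018 = data.xs(2018, level="year")

# unstack() pivots the inner level into columns
gdp_wide = data["GDP"].unstack("year")

print(gdp_sh_2018)   # 32680
print(year_2018)
print(gdp_wide)
```

A single `.loc` call with a tuple does the row and column lookup in one step, whereas `data.loc["ShangHai", 2018]["GDP"]` first builds an intermediate row and then indexes it.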

(4) High-performance Pandas: eval()

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Reduces memory allocation for the intermediate results of compound algebraic expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
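
`%timeit` is an IPython magic and is not available in a plain Python script. A rough equivalent, sketched here with the standard `timeit` module, is shown below. The frame size is reduced and the helper names `plain` and `with_eval` are assumptions made for illustration; timings depend on the machine and on whether `numexpr` is installed, so no figures are claimed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Smaller frames than in the text, so the sketch runs quickly
df1, df2, df3, df4 = (pd.DataFrame(rng.random((1000, 100))) for _ in range(4))


def plain(a, b, c, d):
    # Each operator allocates a full temporary DataFrame
    return (a + b) / (c + d)


def with_eval(a, b, c, d):
    # pd.eval resolves the names a, b, c, d from this function's local scope
    return pd.eval("(a + b) / (c + d)")


t_plain = timeit.timeit(lambda: plain(df1, df2, df3, df4), number=20)
t_eval = timeit.timeit(lambda: with_eval(df1, df2, df3, df4), number=20)
print(f"plain: {t_plain:.4f} s, eval: {t_eval:.4f} s")

# Both approaches must give the same values
same = np.allclose(plain(df1, df2, df3, df4), with_eval(df1, df2, df3, df4))
print(same)
```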
  • Performs operations between columns
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
          A         B         C
0  0.418071  0.381836  0.500556
1  0.059432  0.749066  0.302429
2  0.489147  0.739153  0.777161
3  0.175441  0.016556  0.348979
4  0.766534  0.559252  0.310635
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
          A         B         C         D
0  0.418071  0.381836  0.500556 -1.601593
1  0.059432  0.749066  0.302429 -1.159019
2  0.489147  0.739153  0.777161 -5.512052
3  0.175441  0.016556  0.348979 -0.294917
4  0.766534  0.559252  0.310635 -1.923199
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
          A         B         C         D
0  0.418071  0.381836  0.500556 -1.601593
1  0.059432  0.749066  0.302429 -1.159019
2  0.489147  0.739153  0.777161 -5.512052
3  0.175441  0.016556  0.348979 -0.294917
4  0.766534  0.559252  0.310635 -1.923199
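
The `inplace=True` flag above decides whether `df` itself is modified. As a minimal sketch, assuming a small random frame with the same column names (the data and the name `with_d` are illustrative), an assignment expression without `inplace=True` returns a new DataFrame and leaves the original untouched:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=list("ABC"))

# Without inplace=True, eval returns a copy that carries the new column
with_d = df.eval("D = (A + B) / (C - 1)")
print(list(df.columns))       # ['A', 'B', 'C']
print(list(with_d.columns))   # ['A', 'B', 'C', 'D']

# With inplace=True, df itself gains the column and nothing is returned
df.eval("D = (A + B) / (C - 1)", inplace=True)
print(list(df.columns))       # ['A', 'B', 'C', 'D']
```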
  • Using local variables
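# Note: axis=1 averages across columns, so column_mean holds one mean per row (over A-D)
column_mean = df.mean(axis=1)
# The @ prefix refers to a Python variable rather than a column
res = df.eval("A+@column_mean")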
res.head()
0    0.342788
1    0.047409
2   -0.387501
3    0.236956
4    0.694839
dtype: float64
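
To make the `@` syntax concrete, here is a small sketch with both a scalar and a Series as local variables. The frame, `offset`, and `row_mean` are hypothetical names introduced for illustration. The `@` prefix works in `DataFrame.eval()` and `DataFrame.query()`; top-level `pd.eval()` does not accept it, because there local names are written without a prefix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=list("ABC"))

offset = 10.0                # plain Python scalar, not a column
row_mean = df.mean(axis=1)   # Series aligned to df's index

# Bare names (A) are columns; names prefixed with @ are Python variables
res_scalar = df.eval("A + @offset")
res_series = df.eval("A + @row_mean")

ok_scalar = np.allclose(res_scalar, df["A"] + offset)
ok_series = np.allclose(res_series, df["A"] + row_mean)
print(ok_scalar, ok_series)
```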

(4) High-performance Pandas: query()

df.head()
          A         B         C         D
0  0.418071  0.381836  0.500556 -1.601593
1  0.059432  0.749066  0.302429 -1.159019
2  0.489147  0.739153  0.777161 -5.512052
3  0.175441  0.016556  0.348979 -0.294917
4  0.766534  0.559252  0.310635 -1.923199
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
           A         B         C         D
1   0.059432  0.749066  0.302429 -1.159019
2   0.489147  0.739153  0.777161 -5.512052
7   0.073950  0.730144  0.646190 -2.272672
10  0.393200  0.610467  0.697096 -3.313485
11  0.065734  0.764699  0.179380 -1.011958
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
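
`query()` accepts the same `@` prefix for local variables as `eval()`, and the keywords `and` / `or` in place of `&` / `|`. A minimal sketch, assuming a hypothetical cutoff named `threshold` and a freshly generated frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=list("ABC"))

threshold = 0.5  # hypothetical cutoff, not from the original

# Boolean masking and query() select the same rows
masked = df[(df.A < threshold) & (df.B > threshold)]
queried = df.query("A < @threshold and B > @threshold")

same_rows = masked.equals(queried)
print(same_rows)
```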

(5) When to use eval() and query()

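For small arrays, the plain approach is actually faster: on the 1000-row df above, boolean masking took 1.11 ms versus 2.55 ms for query(), while eval() was faster on the much larger df1 (10.5 ms versus 17.6 ms). The sizes of the two, in bytes: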

df.values.nbytes
32000
df1.values.nbytes
8000000
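
The two byte counts follow directly from the shapes: rows × columns × 8 bytes per float64 value. A short sketch that reproduces the arithmetic, using zero-filled frames named `small` and `large` as stand-ins for `df` (1000 × 4 once column D exists) and `df1` (10000 × 100):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.zeros((1000, 4)))      # same shape as df with column D
large = pd.DataFrame(np.zeros((10000, 100)))   # same shape as df1

for name, frame in [("small", small), ("large", large)]:
    rows, cols = frame.shape
    itemsize = frame.values.dtype.itemsize  # 8 bytes for float64
    print(name, rows * cols * itemsize, frame.values.nbytes)
```

The larger frame is 250 times the size of the smaller one, which is consistent with eval() helping on `df1` but query() not helping on `df`.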