Python Basics (11) | A Very Detailed 30,000-Word Summary of the Pandas Library (5)
11.7 Other topics
(1) Vectorized string operations
(2) Working with time series
(3) MultiIndex: for multidimensional data
import numpy as np
import pandas as pd

# Population and GDP for four cities, 2008 and 2018
base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]])
# Outer index level: city; inner index level: year
data = pd.DataFrame(
    base_data,
    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
           [2008, 2018] * 4],
    columns=["population", "GDP"])
data
| | | population | GDP |
|---|---|---|---|
| BeiJing | 2008 | 1771 | 11115 |
| | 2018 | 2154 | 30320 |
| ShangHai | 2008 | 2141 | 14070 |
| | 2018 | 2424 | 32680 |
| ShenZhen | 2008 | 1077 | 7806 |
| | 2018 | 1303 | 24222 |
| HangZhou | 2008 | 798 | 4789 |
| | 2018 | 981 | 13468 |
# Name the two index levels
data.index.names = ["city", "year"]
data
| city | year | population | GDP |
|---|---|---|---|
| BeiJing | 2008 | 1771 | 11115 |
| | 2018 | 2154 | 30320 |
| ShangHai | 2008 | 2141 | 14070 |
| | 2018 | 2424 | 32680 |
| ShenZhen | 2008 | 1077 | 7806 |
| | 2018 | 1303 | 24222 |
| HangZhou | 2008 | 798 | 4789 |
| | 2018 | 981 | 13468 |
# Select one column; the result keeps the two-level index
data["GDP"]
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
# Select all years for one city, restricted to a single column
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
# Select one (city, year) row, then pick the column from it
data.loc["ShangHai", 2018]["GDP"]
32680
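The lookup above selects a row first and then a column from it. As a minimal sketch (the tuple key and `xs` are alternatives not used in the original, and the frame is rebuilt with `pd.MultiIndex.from_product` so the block runs on its own), the same value can be fetched in a single `.loc` call, and a cross-section can be taken on the inner `year` level:

```python
import numpy as np
import pandas as pd

base_data = np.array([[1771, 11115], [2154, 30320],
                      [2141, 14070], [2424, 32680],
                      [1077, 7806], [1303, 24222],
                      [798, 4789], [981, 13468]])
cities = ["BeiJing", "ShangHai", "ShenZhen", "HangZhou"]
# Every city paired with every year, in the same row order as the article
index = pd.MultiIndex.from_product([cities, [2008, 2018]],
                                   names=["city", "year"])
data = pd.DataFrame(base_data, index=index, columns=["population", "GDP"])

# Single-step lookup: a tuple addresses both index levels, then the column
gdp_sh_2018 = data.loc[("ShangHai", 2018), "GDP"]

# Cross-section on the inner level: every city's row for 2018
rows_2018 = data.xs(2018, level="year")

print(gdp_sh_2018)
print(rows_2018)
```

Passing the row key as a tuple keeps row and column selection in one indexing operation, which is usually easier to read than chaining two lookups.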
(4) High-performance Pandas: eval()
# Four random 10000 x 100 DataFrames used for the benchmark
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000, 100))) for _ in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- Reduces memory allocation for the intermediate results of compound algebraic expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
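To see where the saving can come from, the ordinary expression can be written out step by step: each operation materializes a full-size temporary before the division runs. The sketch below uses smaller frames than the benchmark, and the names `tmp_sum_12` and `tmp_sum_34` are illustrative only. Whether `pd.eval` is actually faster depends on the data size and on whether the optional numexpr engine is installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((1000, 10)))
df2 = pd.DataFrame(rng.random((1000, 10)))
df3 = pd.DataFrame(rng.random((1000, 10)))
df4 = pd.DataFrame(rng.random((1000, 10)))

# What the plain expression does: two full-size temporaries, then the result
tmp_sum_12 = df1 + df2   # temporary 1
tmp_sum_34 = df3 + df4   # temporary 2
plain = tmp_sum_12 / tmp_sum_34

# pd.eval receives the whole expression as a single string, so the engine
# can evaluate it without first building each intermediate DataFrame
evaluated = pd.eval("(df1 + df2) / (df3 + df4)")

print(np.allclose(plain, evaluated))
```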
- Performs operations between columns
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
| | A | B | C |
|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 |
| 1 | 0.059432 | 0.749066 | 0.302429 |
| 2 | 0.489147 | 0.739153 | 0.777161 |
| 3 | 0.175441 | 0.016556 | 0.348979 |
| 4 | 0.766534 | 0.559252 | 0.310635 |
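# Top-level pd.eval: columns are reached through the DataFrame name
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
# DataFrame.eval: column names can be used directly
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)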
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
# Assign column D inside the expression; inplace=True modifies df directly
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
- Uses local variables
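# Mean of each row across all columns (axis=1); one value per row
column_mean = df.mean(axis=1)
# The @ prefix lets the expression use a local Python variable
res = df.eval("A+@column_mean")
res.head()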
0 0.342788
1 0.047409
2 -0.387501
3 0.236956
4 0.694839
dtype: float64
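A minimal sketch of the `@` prefix with a scalar and with a Series, using a small hand-made frame instead of the random data above (the names `offset` and `row_mean` are illustrative). Inside `df.eval`, bare names refer to columns, while `@name` refers to a Python variable in the calling scope.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [3.0, 4.0, 5.0]})

offset = 10.0                 # plain Python scalar
row_mean = df.mean(axis=1)    # Series: mean of each row across the columns

# Column A plus a scalar taken from the local scope
with_scalar = df.eval("A + @offset")

# Column A plus a local Series, aligned on the row index
with_series = df.eval("A + @row_mean")

print(with_scalar.tolist())
print(with_series.tolist())
```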
(4) High-performance Pandas: query()
df.head()
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
| | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 7 | 0.073950 | 0.730144 | 0.646190 | -2.272672 |
| 10 | 0.393200 | 0.610467 | 0.697096 | -3.313485 |
| 11 | 0.065734 | 0.764699 | 0.179380 | -1.011958 |
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
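The equivalence between the boolean mask and the query string can be checked by hand on a tiny frame. This is a sketch with made-up values; the threshold variables `a_max` and `b_min` are illustrative and show that `query` accepts the same `@` syntax as `eval` for local variables.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.6, 0.3, 0.9],
                   "B": [0.7, 0.8, 0.2, 0.6]})

# Boolean-mask form
masked = df[(df.A < 0.5) & (df.B > 0.5)]

# Equivalent query string; column names are written bare
queried = df.query("(A < 0.5) & (B > 0.5)")

# Thresholds held in Python variables are referenced with @
a_max, b_min = 0.5, 0.5
queried_vars = df.query("(A < @a_max) & (B > @b_min)")

print(masked.equals(queried), queried.equals(queried_vars))
print(queried.index.tolist())
```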
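(5) When to use eval() and query()
On small arrays the ordinary approach is actually faster. In the timings above, boolean masking on the 1000-row df took 1.11 ms against 2.55 ms for query(), while pd.eval() came out ahead on the 10000 x 100 frames (10.5 ms against 17.6 ms). The sizes of the two kinds of frame in bytes:
# df: 1000 rows x 4 float64 columns
df.values.nbytes
32000
# df1: 10000 rows x 100 float64 columns
df1.values.nbytes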
8000000
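One way to decide for a given workload is to measure both forms on data of the real size. The sketch below is an assumption-laden illustration: it uses a frame named `small` with the same shape as `df`, times 100 repetitions with `time.perf_counter`, and makes no claim about which side wins, since the numbers vary by machine, pandas version, and whether numexpr is installed.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random((1000, 4)), columns=list("ABCD"))

# Size of the underlying data: 1000 rows * 4 columns * 8 bytes per float64
small_bytes = small.values.nbytes

# Time the boolean-mask form
start = time.perf_counter()
for _ in range(100):
    masked = small[(small.A < 0.5) & (small.B > 0.5)]
t_mask = time.perf_counter() - start

# Time the query form on the same data
start = time.perf_counter()
for _ in range(100):
    queried = small.query("(A < 0.5) & (B > 0.5)")
t_query = time.perf_counter() - start

print(f"{small_bytes} bytes, mask: {t_mask:.4f}s, query: {t_query:.4f}s")
```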