Python基础(十一) | 超详细的Pandas库三万字总结(三)
【摘要】 11.4 缺失值处理1、发现缺失值import pandas as pdimport numpy as npdata = pd.DataFrame(np.array([[1, np.nan, 2], [np.nan, 3, 4], [5, 6, None]]), column...
11.4 缺失值处理
1、发现缺失值
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | 1 | NaN | 2 |
1 | NaN | 3 | 4 |
2 | 5 | 6 | None |
注意:有None、字符串等,数据类型全部变为object,它比int和float更消耗资源
np.nan是一个特殊的浮点数,类型是浮点类型,所以表示缺失值时最好使用NaN。
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | False | True | False |
1 | True | False | False |
2 | False | False | True |
data.notnull()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | True | False | True |
1 | False | True | True |
2 | True | True | False |
2、删除缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | 3.0 |
1 | NaN | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | NaN | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
注意:np.nan是一种特殊的浮点数
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1)删除整行
data.dropna()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
3 | 10.0 | 11.0 | 12.0 | 13.0 |
(2)删除整列
data.dropna(axis="columns")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
D | |
---|---|
0 | 3.0 |
1 | 6.0 |
2 | 9.0 |
3 | 13.0 |
data["D"] = np.nan
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3 | 10.0 | 11.0 | 12.0 | NaN |
data.dropna(axis="columns", how="all")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | 1.0 | NaN | 2.0 |
1 | NaN | 4.0 | 5.0 |
2 | 7.0 | 8.0 | NaN |
3 | 10.0 | 11.0 | 12.0 |
all表示都是缺失值时才删除。
data.dropna(axis="columns", how="any")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
0 |
---|
1 |
2 |
3 |
data.loc[3] = np.nan
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3 | NaN | NaN | NaN | NaN |
data.dropna(how="all")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3、填充缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | 3.0 |
1 | NaN | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | NaN | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
data.fillna(value=5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 5.0 | 2.0 | 3.0 |
1 | 5.0 | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 5.0 | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
- 用均值进行替换
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 7.666667 | 2.000000 | 3.0 |
1 | 6.0 | 4.000000 | 5.000000 | 6.0 |
2 | 7.0 | 8.000000 | 6.333333 | 9.0 |
3 | 10.0 | 11.000000 | 12.000000 | 13.0 |
全部数据的平均值,先进行摊平,再进行填充即可。
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 7.0 | 2.0 | 3.0 |
1 | 7.0 | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 7.0 | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
11.5 合并数据
- 构造一个生产DataFrame的函数
import pandas as pd
import numpy as np
def make_df(cols, ind):
"一个简单的DataFrame"
data = {c: [str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
make_df("ABC", range(3))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | A0 | B0 | C0 |
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
- 垂直合并
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
4 | A4 | B4 |
- 水平合并
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
- 索引重叠
行重叠
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
1 | A1 | B1 |
2 | A2 | B2 |
pd.concat([df_5, df_6],ignore_index=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | |
---|---|---|
0 | A1 | B1 |
1 | A2 | B2 |
2 | A1 | B1 |
3 | A2 | B2 |
列重叠
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | B | C | D | |
---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | B2 | C2 | D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | B2 | C2 | D2 |
- 对齐合并merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | A1 | B1 | C1 |
1 | A2 | B2 | C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
A | B | C | |
---|---|---|---|
0 | A1 | B1 | C1 |
1 | A2 | B2 | C2 |
【例】 合并城市信息
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
city | pop | |
---|---|---|
0 | BeiJing | 2154 |
1 | HangZhou | 981 |
2 | ShenZhen | 1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
city | GDP | |
---|---|---|
0 | BeiJing | 30320 |
1 | ShangHai | 32680 |
2 | HangZhou | 13468 |
city_info = pd.merge(population, GDP)
city_info
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
city | pop | GDP | |
---|---|---|---|
0 | BeiJing | 2154 | 30320 |
1 | HangZhou | 981 | 13468 |
这里outer是求并集
city_info = pd.merge(population, GDP, how="outer")
city_info
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
city | pop | GDP | |
---|---|---|---|
0 | BeiJing | 2154.0 | 30320.0 |
1 | HangZhou | 981.0 | 13468.0 |
2 | ShenZhen | 1303.0 | NaN |
3 | ShangHai | NaN | 32680.0 |
【版权声明】本文为华为云社区用户原创内容,未经允许不得转载,如需转载请自行联系原作者进行授权。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)