Python基础(十一) | 超详细的Pandas库三万字总结(三)

举报
timerring 发表于 2022/10/07 09:36:06 2022/10/07
【摘要】 11.4 缺失值处理1、发现缺失值import pandas as pdimport numpy as npdata = pd.DataFrame(np.array([[1, np.nan, 2], [np.nan, 3, 4], [5, 6, None]]), column...

11.4 缺失值处理

1、发现缺失值

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 1 NaN 2
1 NaN 3 4
2 5 6 None

注意:有None、字符串等,数据类型全部变为object,它比int和float更消耗资源

np.nan是一个特殊的浮点数,类型是浮点类型,所以表示缺失值时最好使用NaN。

data.dtypes
A    object
B    object
C    object
dtype: object
data.isnull()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 False True False
1 True False False
2 False False True
data.notnull()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 True False True
1 False True True
2 True True False

2、删除缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0

注意:np.nan是一种特殊的浮点数

data.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object

(1)删除整行

data.dropna()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
3 10.0 11.0 12.0 13.0

(2)删除整列

data.dropna(axis="columns")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
D
0 3.0
1 6.0
2 9.0
3 13.0
data["D"] = np.nan
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 10.0 11.0 12.0 NaN
data.dropna(axis="columns", how="all")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 1.0 NaN 2.0
1 NaN 4.0 5.0
2 7.0 8.0 NaN
3 10.0 11.0 12.0

all表示都是缺失值时才删除。

data.dropna(axis="columns", how="any")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0
1
2
3
data.loc[3] = np.nan
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 NaN NaN NaN NaN
data.dropna(how="all")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN

3、填充缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0
data.fillna(value=5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 5.0 2.0 3.0
1 5.0 4.0 5.0 6.0
2 7.0 8.0 5.0 9.0
3 10.0 11.0 12.0 13.0
  • 用均值进行替换
fill = data.mean()
fill
A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64
data.fillna(value=fill)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 7.666667 2.000000 3.0
1 6.0 4.000000 5.000000 6.0
2 7.0 8.000000 6.333333 9.0
3 10.0 11.000000 12.000000 13.0

全部数据的平均值,先进行摊平,再进行填充即可。

fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 1.0 7.0 2.0 3.0
1 7.0 4.0 5.0 6.0
2 7.0 8.0 7.0 9.0
3 10.0 11.0 12.0 13.0

11.5 合并数据

  • 构造一个生产DataFrame的函数
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "一个简单的DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
  • 垂直合并
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
pd.concat([df_1, df_2])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
  • 水平合并
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
pd.concat([df_3, df_4], axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
  • 索引重叠

行重叠

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B
1 A1 B1
2 A2 B2
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6],ignore_index=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B
0 A1 B1
1 A2 B2
2 A1 B1
3 A2 B2

列重叠

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2
pd.concat([df_7, df_8], axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C B C D
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
pd.concat([df_7, df_8],axis=1, ignore_index=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2 3 4 5
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
  • 对齐合并merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2
pd.merge(df_9, df_10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 A1 B1 C1
1 A2 B2 C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1
pd.merge(df_9, df_10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
A B C
0 A1 B1 C1
1 A2 B2 C2

【例】 合并城市信息

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city pop
0 BeiJing 2154
1 HangZhou 981
2 ShenZhen 1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city GDP
0 BeiJing 30320
1 ShangHai 32680
2 HangZhou 13468
city_info = pd.merge(population, GDP)
city_info
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city pop GDP
0 BeiJing 2154 30320
1 HangZhou 981 13468

这里outer是求并集

city_info = pd.merge(population, GDP, how="outer")
city_info
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city pop GDP
0 BeiJing 2154.0 30320.0
1 HangZhou 981.0 13468.0
2 ShenZhen 1303.0 NaN
3 ShangHai NaN 32680.0
【版权声明】本文为华为云社区用户原创内容,未经允许不得转载,如需转载请自行联系原作者进行授权。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。