数据清洗之 缺失值处理

举报
ruochen 发表于 2021/03/28 02:03:54 2021/03/28
【摘要】 缺失值处理 缺失值首先需要根据实际情况定义可以采取直接删除法有时候需要使用替换法或者插值法常用的替换法有均值替换、前向、后向替换和常数替换 import pandas as pd import numpy as np import os 123 os.getcwd() 1 'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据...

缺失值处理

  • 缺失值首先需要根据实际情况定义
  • 可以采取直接删除法
  • 有时候需要使用替换法或者插值法
  • 常用的替换法有均值替换、前向、后向替换和常数替换
import pandas as pd
import numpy as np
import os

  
 
  • 1
  • 2
  • 3
os.getcwd()

  
 
  • 1
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'

  
 
  • 1
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

  
 
  • 1
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

  
 
  • 1
def f(x): if '$' in str(x): x = str(x).strip('$') x = str(x).replace(',', '') else: x = str(x).replace(',', '') return float(x)

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
df['Price'] = df['Price'].apply(f)

  
 
  • 1
df['Mileage'] = df['Mileage'].apply(f)

  
 
  • 1
# 计算缺失比例
df.apply(lambda x: sum(x.isnull())/len(x), axis=0)

  
 
  • 1
  • 2
Condition 0.000000
Condition_Desc 0.778994
Price 0.000000
Location 0.000267
Model_Year 0.000534
Mileage 0.003470
Exterior_Color 0.095422
Make 0.000534
Warranty 0.318297
Model 0.016415
Sub_Model 0.676231
Type 0.197785
Vehicle_Title 0.964233
OBO 0.008808
Feedback_Perc 0.117710
Watch_Count 0.530629
N_Reviews 0.000801
Seller_Status 0.083411
Vehicle_Tile 0.007207
Auction 0.002269
Buy_Now 0.031630
Bid_Count 0.707727
dtype: float64

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
df.head(3)

  
 
  • 1
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles 11412.0 McHenry, Illinois, United States 2013.0 16000.0 Black Harley-Davidson Unspecified Touring ... NaN FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition 17200.0 Fort Recovery, Ohio, United States 2016.0 60.0 Black Harley-Davidson Vehicle has an existing warranty Touring ... NaN FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used NaN 3872.0 Chicago, Illinois, United States 1970.0 25763.0 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... NaN FALSE 100 NaN 136 NaN Clear True FALSE 26.0

3 rows × 22 columns

# how = 'all', 只有当前行都是缺失值才删除
# how = 'any', 只要当前行有一个缺失值就删除
df.dropna(how = 'any', axis=0)

  
 
  • 1
  • 2
  • 3
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count

0 rows × 22 columns

# subset 根据指定字段判断
# df.dropna(how='any', subset=['Condition', 'Price', 'Mileage'])

  
 
  • 1
  • 2
# 缺失值使用0填补
df.fillna(0).head(5)

  
 
  • 1
  • 2
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles 11412.0 McHenry, Illinois, United States 2013.0 16000.0 Black Harley-Davidson Unspecified Touring ... 0 FALSE 8.1 0 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition 17200.0 Fort Recovery, Ohio, United States 2016.0 60.0 Black Harley-Davidson Vehicle has an existing warranty Touring ... 0 FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used 0 3872.0 Chicago, Illinois, United States 1970.0 25763.0 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... 0 FALSE 100 0 136 0 Clear True FALSE 26.0
3 Used CLEAN TITLE READY TO RIDE HOME 6575.0 Green Bay, Wisconsin, United States 2009.0 33142.0 Red Harley-Davidson 0 Touring ... 0 FALSE 100 0 2920 Dealer Clear True FALSE 11.0
4 Used 0 10000.0 West Bend, Wisconsin, United States 2012.0 17800.0 Blue Harley-Davidson NO WARRANTY Touring ... 0 FALSE 100 13 271 OWNER Clear True TRUE 0.0

5 rows × 22 columns

# 针对一个变量进行缺失值判断,使用其均值进行填补
df.Mileage.fillna(df.Mileage.mean()).head(5)

  
 
  • 1
  • 2
0 16000.0
1 60.0
2 25763.0
3 33142.0
4 17800.0
Name: Mileage, dtype: float64

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
df.columns

  
 
  • 1
Index(['Condition', 'Condition_Desc', 'Price', 'Location', 'Model_Year', 'Mileage', 'Exterior_Color', 'Make', 'Warranty', 'Model', 'Sub_Model', 'Type', 'Vehicle_Title', 'OBO', 'Feedback_Perc', 'Watch_Count', 'N_Reviews', 'Seller_Status', 'Vehicle_Tile', 'Auction', 'Buy_Now', 'Bid_Count'], dtype='object')

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
df[df['Exterior_Color'].isnull()].head(5)

  
 
  • 1
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
14 Used NaN 5500.0 Davenport, Iowa, United States 2008.0 22102.0 NaN Harley-Davidson Vehicle does NOT have an existing warranty Touring ... NaN FALSE 9.3 NaN 244 Private Seller Clear True FALSE 16.0
35 Used NaN 7700.0 Roselle, Illinois, United States 2007.0 10893.0 NaN Harley-Davidson NaN Other ... NaN FALSE 100 NaN 236 NaN Clear False TRUE NaN
41 Used NaN 6800.0 Hampshire, Illinois, United States 2003.0 55782.0 NaN Harley-Davidson Vehicle does NOT have an existing warranty Softail ... NaN TRUE 100 2< 298 Private Seller Clear False TRUE NaN
55 Used NaN 29500.0 Parma, Michigan, United States 1950.0 8471.0 NaN Harley-Davidson NaN Other ... NaN TRUE 100 24 216 NaN Clear False TRUE NaN
72 Used NaN 6500.0 Bourbonnais, Illinois, United States 1986.0 55300.0 NaN Harley-Davidson NaN Touring ... NaN TRUE 100 2< 1 Private Seller Clear False TRUE NaN

5 rows × 22 columns

# 求众数
df['Exterior_Color'].mode()[0]

  
 
  • 1
  • 2
'Black'

  
 
  • 1
# 缺失用众数填补
df['Exterior_Color'].fillna(df['Exterior_Color'].mode()[0]).head(5)

  
 
  • 1
  • 2
0 Black
1 Black
2 Silver/Blue
3 Red
4 Blue
Name: Exterior_Color, dtype: object

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
df['Mileage'].median()

  
 
  • 1
7083.0

  
 
  • 1
# 对不同变量使用不同数据填补
# 不加inplace=True,不会对原数据生效
df.fillna(value={'Exterior_Color': df['Exterior_Color'].mode()[0], 'Mileage': df['Mileage'].median(),}).head(5)

  
 
  • 1
  • 2
  • 3
  • 4
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles 11412.0 McHenry, Illinois, United States 2013.0 16000.0 Black Harley-Davidson Unspecified Touring ... NaN FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition 17200.0 Fort Recovery, Ohio, United States 2016.0 60.0 Black Harley-Davidson Vehicle has an existing warranty Touring ... NaN FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used NaN 3872.0 Chicago, Illinois, United States 1970.0 25763.0 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... NaN FALSE 100 NaN 136 NaN Clear True FALSE 26.0
3 Used CLEAN TITLE READY TO RIDE HOME 6575.0 Green Bay, Wisconsin, United States 2009.0 33142.0 Red Harley-Davidson NaN Touring ... NaN FALSE 100 NaN 2920 Dealer Clear True FALSE 11.0
4 Used NaN 10000.0 West Bend, Wisconsin, United States 2012.0 17800.0 Blue Harley-Davidson NO WARRANTY Touring ... NaN FALSE 100 13 271 OWNER Clear True TRUE 0.0

5 rows × 22 columns

# 前向填补
df['Exterior_Color'].fillna(method='ffill').tail(10)

  
 
  • 1
  • 2
7483 Purple
7484 Purple
7485 Black
7486 Black
7487 Gray
7488 Black
7489 Black
7490 Red
7491 TWO TONE
7492 Gray
Name: Exterior_Color, dtype: object

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
# 后向填补
df['Exterior_Color'].fillna(method='bfill').tail(10)

  
 
  • 1
  • 2
7483 Purple
7484 Black
7485 Black
7486 Black
7487 Gray
7488 Black
7489 Black
7490 Red
7491 TWO TONE
7492 Gray
Name: Exterior_Color, dtype: object

  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11

文章来源: ruochen.blog.csdn.net,作者:若尘,版权归原作者所有,如需转载,请联系作者。

原文链接:ruochen.blog.csdn.net/article/details/105611038

【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。