Data Cleaning: Handling Outliers
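Handling Outliers
- Outliers are values that deviate from the normal range; they are not errors.
- Outliers occur rarely, but they can still bias the analysis in a real project.
- Outliers are usually identified with a box plot (the interquartile-range method) or a distribution plot (the standard-deviation method).
- To detect them, flag values that lie more than two standard deviations from the mean, or values that fall beyond fences built from the interquartile range (here, 1.5 × IQR below the lower quartile or above the upper quartile).
- Outliers are typically handled by capping or by discretizing the data.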
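The walkthrough in this post demonstrates capping but not discretization. As a minimal sketch of the latter, assuming a small made-up price series (the values, bin edges, and labels are hypothetical and not taken from MotorcycleData.csv), `pd.cut` maps each value to a coarse band, so an extreme value simply lands in the top band:

```python
import pandas as pd

# Hypothetical prices; 100000 is an extreme value.
prices = pd.Series([1200, 4158, 7995, 13000, 100000])

# Fixed bin edges chosen for illustration only.
# Intervals are right-closed by default: (0, 5000], (5000, 15000], (15000, inf].
bins = [0, 5000, 15000, float("inf")]
labels = ["low", "mid", "high"]
price_band = pd.cut(prices, bins=bins, labels=labels)

print(price_band.tolist())
```

Whether fixed edges or quantile-based bins (`pd.qcut`) fit better depends on the data and the downstream model.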
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
# Read the data; treat the string 'Na' as a missing value
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
def f(x):
    # Strip a leading or trailing '$', remove thousands separators, and convert to float
    x = str(x).strip('$').replace(',', '')
    return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
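The function `f` converts one value at a time. The same cleaning can also be written as a vectorized operation. A sketch, assuming the column holds strings such as `'$11,412'` (the sample values below are made up). Unlike `strip('$')`, the regular expression removes `$` anywhere in the string, not only at the ends:

```python
import pandas as pd

# Made-up raw values, including one missing entry.
raw = pd.Series(["$11,412", "17,200", "3872", None])

# Remove dollar signs and thousands separators, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(
    raw.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned.tolist())
```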
df.head(5)
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
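# Outlier handling for Price
# Compute the mean price
x_bar = df['Price'].mean()
# Compute the standard deviation of the price
x_std = df['Price'].std()
# Check for outliers above the upper bound (mean + 2 standard deviations)
any(df['Price'] > x_bar + 2 * x_std)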
True
# Check for outliers below the lower bound (mean - 2 standard deviations)
any(df['Price'] < x_bar - 2 * x_std)
False
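`any(...)` only reports whether at least one outlier exists. To see which values are affected and how many, a boolean mask is more useful. A minimal sketch on made-up data (the helper name `two_sigma_mask` is hypothetical and not part of the original walkthrough):

```python
import pandas as pd

def two_sigma_mask(s, k=2.0):
    """Return a boolean mask marking values outside mean ± k * std."""
    mean, std = s.mean(), s.std()
    return (s > mean + k * std) | (s < mean - k * std)

# Made-up data: nine ordinary values and one extreme one.
s = pd.Series([10, 11, 9, 10, 12, 10, 11, 9, 10, 50])
mask = two_sigma_mask(s)

print(mask.sum())        # number of flagged values
print(s[mask].tolist())  # the flagged values themselves
```

Note that the extreme value inflates both the mean and the standard deviation, which is one reason the quantile-based rule is often preferred for skewed data.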
# Descriptive statistics
df['Price'].describe()
count 7493.000000
mean 9968.811557
std 8497.326850
min 0.000000
25% 4158.000000
50% 7995.000000
75% 13000.000000
max 100000.000000
Name: Price, dtype: float64
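These summary statistics explain the two detection results for the standard-deviation rule. A short worked check, using only the numbers printed by `describe()` (the variable names are chosen for illustration):

```python
# Values copied from the describe() output of df['Price'].
mean, std = 9968.811557, 8497.326850
price_min, price_max = 0.0, 100000.0

upper = mean + 2 * std  # roughly 26963.47
lower = mean - 2 * std  # roughly -7025.84

# The maximum price exceeds the upper bound, so high outliers exist.
print(price_max > upper)
# The lower bound is negative while the minimum price is 0, so no low outliers exist.
print(price_min < lower)
```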
# 25th percentile (lower quartile)
Q1 = df['Price'].quantile(q = 0.25)
# 75th percentile (upper quartile)
Q3 = df['Price'].quantile(q = 0.75)
# Interquartile range
IQR = Q3 - Q1
# Check for outliers above the upper fence
any(df['Price'] > Q3 + 1.5 * IQR)
True
# Check for outliers below the lower fence
any(df['Price'] < Q1 - 1.5 * IQR)
False
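With the quartiles reported by `describe()` (Q1 = 4158, Q3 = 13000), the IQR is 8842, so the fences are 4158 − 13263 = −9105 and 13000 + 13263 = 26263. Since no price is negative, only the upper fence can be violated. The rule can be wrapped in a small helper; the sketch below uses made-up data, and the name `iqr_fences` is hypothetical:

```python
import pandas as pd

def iqr_fences(s, k=1.5):
    """Return the (lower, upper) fences Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data with one extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_fences(s)

# Keep only the values outside the fences.
outliers = s[(s < lower) | (s > upper)]
print(lower, upper)
print(outliers.tolist())
```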
import matplotlib.pyplot as plt
%matplotlib inline
df['Price'].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x11ddad20ac8>
# Set the plotting style (on Matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# Plot a histogram
df.Price.plot(kind='hist', bins=30, density=True)
# Overlay the kernel density estimate
df.Price.plot(kind='kde')
# Show the figure
plt.show()
# Replace extremes with the 99th and 1st percentiles
# Compute P1 and P99
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)
P99
39995.32
df['Price_new'] = df['Price']
# Capping: values above P99 become P99, values below P1 become P1
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1
df[['Price', 'Price_new']].describe()
| | Price | Price_new |
|---|---|---|
| count | 7493.000000 | 7493.000000 |
| mean | 9968.811557 | 9821.220873 |
| std | 8497.326850 | 7737.092537 |
| min | 0.000000 | 100.000000 |
| 25% | 4158.000000 | 4158.000000 |
| 50% | 7995.000000 | 7995.000000 |
| 75% | 13000.000000 | 13000.000000 |
| max | 100000.000000 | 39995.320000 |
# df['Price_new'].plot(kind='box')
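The two `df.loc` assignments used for capping can also be expressed with `Series.clip`, which applies both limits in one call. A sketch on made-up prices; the caps reuse the P99 value printed earlier and the minimum of `Price_new` from the table (which equals P1 after capping), while in practice they would come from `quantile(0.01)` and `quantile(0.99)`:

```python
import pandas as pd

# Made-up prices standing in for df['Price'].
price = pd.Series([0.0, 150.0, 4158.0, 7995.0, 13000.0, 100000.0])

# Caps taken from the walkthrough's output, used here for illustration.
p1, p99 = 100.0, 39995.32

# Values below p1 are raised to p1; values above p99 are lowered to p99.
capped = price.clip(lower=p1, upper=p99)
print(capped.tolist())
```

Values inside the range are left unchanged, which is why the quartiles in the comparison table are identical for both columns.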
Source: ruochen.blog.csdn.net. Author: 若尘. Copyright belongs to the original author; contact the author before reposting.
Original link: ruochen.blog.csdn.net/article/details/105636205