Data Cleaning: Data Discretization
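Data Discretization
- Data discretization means binning: mapping continuous values into a small number of intervals.
- The most common binning methods are equal-frequency binning and equal-width binning.
- In pandas this is usually done with pd.cut (equal-width or custom edges) or pd.qcut (equal-frequency).
pandas.cut(x, bins, right=True, labels=None)
- x: the data to bin
- bins: the number of bins, or the sequence of bin edges
- labels: the labels for the resulting categories
- right: whether each interval includes its right edge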
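The parameters listed above can be illustrated with a minimal sketch. The `ages` series and the `child`/`teen`/`adult` labels are hypothetical and are not part of the motorcycle dataset used below.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the parameters
ages = pd.Series([1, 5, 10, 18, 30, 65])

# Integer `bins`: three equal-width intervals spanning the data range
equal_width = pd.cut(ages, bins=3)

# Explicit edges with right=True (the default):
# intervals are (0, 10], (10, 18], (18, 65]
right_closed = pd.cut(ages, bins=[0, 10, 18, 65],
                      labels=['child', 'teen', 'adult'])

# right=False: intervals are [0, 10), [10, 18), [18, 66)
left_closed = pd.cut(ages, bins=[0, 10, 18, 66], right=False,
                     labels=['child', 'teen', 'adult'])

print(right_closed.tolist())
print(left_closed.tolist())
```

With `right=True` the value 10 falls into the first interval, while with `right=False` it opens the second one, so the choice matters whenever data points sit exactly on an edge.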
import pandas as pd
import numpy as np
import os
  
os.getcwd()
  
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据'
  
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
  
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
  
def f(x):
    # Strip the dollar sign and thousands separators, then convert to float
    return float(str(x).strip('$').replace(',', ''))
df['Price'] = df['Price'].apply(f)
  
df['Mileage'] = df['Mileage'].apply(f)
  
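As an alternative to applying a Python function row by row, the same cleanup can be sketched with vectorized string methods. The `raw` series below is a hypothetical stand-in for the unparsed `Price` column, and the regular expression removes every `$` and `,` rather than only leading or trailing dollar signs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw, unparsed Price column
raw = pd.Series(['$11,412', '17200', '$3,872.50', np.nan])

# Remove '$' and ',' everywhere, then convert to float;
# anything that still cannot be parsed becomes NaN
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)

print(cleaned)
```

Using `errors='coerce'` is a design choice: malformed entries turn into missing values instead of raising, which may or may not be what a given cleaning pipeline wants.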
df.head(5)
  
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 | 
| 1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 | 
| 2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 | 
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 | 
| 4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 | 
5 rows × 22 columns
df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))
  
# Count the number of rows in each bin
df['Price_bin'].value_counts()
0 6762
1 659
2 50
3 20
4 2
Name: Price_bin, dtype: int64
  
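The counts above are heavily concentrated in bin 0, which is typical of equal-width binning on skewed data. A minimal sketch of why, using a hypothetical `prices` series rather than the real dataset: `retbins=True` returns the computed edges, which shows that a single large value stretches every interval.

```python
import pandas as pd

# Hypothetical skewed prices: most are small, one is very large
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 10000])

# retbins=True also returns the bin edges that pd.cut computed
binned, edges = pd.cut(prices, bins=5, labels=range(5), retbins=True)

# Count rows per bin, in bin order (empty bins are reported as 0)
counts = binned.value_counts().sort_index()

print(edges)
print(counts.tolist())
```

Each interval here is about 1980 wide, so nine of the ten values land in the first bin and the three middle bins stay empty.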
%matplotlib inline
  
df['Price_bin'].value_counts().plot(kind='bar')
  
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fba9048>
![Bar chart of Price_bin counts for five equal-width bins](https://img-blog.csdnimg.cn/2020042015282530.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzI5MzM5NDY3,size_16,color_FFFFFF,t_70#pic_center)
df['Price_bin'].hist()
  
<matplotlib.axes._subplots.AxesSubplot at 0x1b35f681278>
![Histogram of Price_bin for five equal-width bins](https://img-blog.csdnimg.cn/2020042015284080.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzI5MzM5NDY3,size_16,color_FFFFFF,t_70#pic_center)
w = [100, 1000, 5000, 10000, 20000, 100000]
  
df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))
  
df[['Price', 'Price_bin']].head(5)
  
| | Price | Price_bin |
|---|---|---|
| 0 | 11412.0 | 3 | 
| 1 | 17200.0 | 3 | 
| 2 | 3872.0 | 1 | 
| 3 | 6575.0 | 2 | 
| 4 | 10000.0 | 2 | 
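One consequence of passing explicit edges is worth checking: values outside the edges, and values equal to the lowest edge, are not assigned to any bin. A minimal sketch with the same edges as above and a hypothetical `prices` series:

```python
import pandas as pd

edges = [100, 1000, 5000, 10000, 20000, 100000]

# Hypothetical prices, including values below, on, and above the edges
prices = pd.Series([50, 100, 3872, 10000, 11412, 150000])

# Intervals are (100, 1000], (1000, 5000], ..., (20000, 100000]
binned = pd.cut(prices, bins=edges, labels=range(5))

# 50 and 150000 are out of range, and 100 sits on the open left edge,
# so all three become NaN
print(binned.tolist())
```

Counting `binned.isna()` after a custom cut is a cheap way to see how many rows fell outside the chosen edges.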
df['Price_bin'].hist()
  
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fb99898>
![Histogram of Price_bin for the custom bin edges](https://img-blog.csdnimg.cn/20200420152859200.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzI5MzM5NDY3,size_16,color_FFFFFF,t_70#pic_center)
  
# Quantile levels
k = 5
w = [1.0 * i/k for i in range(k+1)]
w
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
  
# Equal-frequency binning into 5 segments
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))
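A minimal sketch of what equal-frequency binning yields, on hypothetical data rather than the real `Price` column. It also shows one common pitfall: when many values are tied, several quantile edges coincide, and `pd.qcut` raises unless `duplicates='drop'` is passed, in which case fewer bins come back.

```python
import pandas as pd

# Hypothetical data: 100 distinct values
prices = pd.Series(range(1, 101))
binned = pd.qcut(prices, q=5, labels=range(5))

# Each of the 5 bins receives the same number of rows
counts = binned.value_counts().sort_index()
print(counts.tolist())

# Hypothetical data with heavy ties: several quantile edges equal 0
tied = pd.Series([0] * 6 + [1, 2, 3, 4])

# duplicates='drop' merges the repeated edges instead of raising an error,
# so fewer than 5 bins are produced (labels are omitted for that reason)
binned_tied = pd.qcut(tied, q=5, duplicates='drop')
print(binned_tied.cat.categories)
```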
df['Price_bin'].hist()
  
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fe2a080>
![Histogram of Price_bin for five equal-frequency bins](https://img-blog.csdnimg.cn/20200420152911559.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzI5MzM5NDY3,size_16,color_FFFFFF,t_70#pic_center)
# Compute the quantile cut points
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])
w1
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 100000.0
Name: Price, dtype: float64
  
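# The first cut point should generally be slightly below the actual minimum,
# and the last cut point slightly above the actual maximum.
# Note: the minimum here is 0.0, so scaling it by 0.95 leaves it unchanged.
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1
w1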
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 110000.0
Name: Price, dtype: float64
  
# Bin the prices using the adjusted cut points
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5))
df['Price_bin'].hist()
  
<matplotlib.axes._subplots.AxesSubplot at 0x1b35e53fa20>
![Histogram of Price_bin for the adjusted quantile cut points](https://img-blog.csdnimg.cn/2020042015292217.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzI5MzM5NDY3,size_16,color_FFFFFF,t_70#pic_center)
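Because the minimum price in this dataset is 0, multiplying the first cut point by 0.95 still gives 0, and `pd.cut` leaves the lowest edge open by default. A minimal sketch, assuming the cut points printed above and a hypothetical `prices` series, of how `include_lowest=True` keeps a value equal to the lowest edge in the first bin:

```python
import pandas as pd

# Cut points as printed above; the prices themselves are hypothetical
edges = [0, 3500, 6491, 9777, 14999, 110000]
prices = pd.Series([0, 3500, 6491, 9777, 14999, 100000])

# Default: the first interval is (0, 3500], so a price of 0 becomes NaN
default = pd.cut(prices, bins=edges, labels=range(5))

# include_lowest=True closes the first interval on the left
inclusive = pd.cut(prices, bins=edges, labels=range(5), include_lowest=True)

print(default.tolist())
print(inclusive.tolist())
```

An alternative with the same effect is to subtract a small constant from the first cut point instead of scaling it.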
Source: ruochen.blog.csdn.net. Author: 若尘. Copyright belongs to the original author; please contact the author before reposting.
Original link: ruochen.blog.csdn.net/article/details/105636309