
Data Cleaning: Data Discretization

ruochen, posted on 2021/03/28 03:18:42


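Data discretization

Data discretization means binning: mapping continuous values to a small number of intervals.
The commonly used binning methods are equal-frequency binning and equal-width binning.
They are usually done with pd.cut (equal-width bins or explicit edges) or pd.qcut (equal-frequency bins).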
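
To make the difference between the two methods concrete, here is a minimal sketch on toy data. The `values` series is hypothetical and is not part of the motorcycle dataset used later.

```python
import pandas as pd

# Hypothetical toy data: mostly small values plus one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range [1, 100] is split into 2 intervals of equal length
equal_width = pd.cut(values, 2, labels=False)

# Equal-frequency: the edges are quantiles, so each bin gets about the same count
equal_freq = pd.qcut(values, 2, labels=False)

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(width_counts)  # [9, 1]
print(freq_counts)   # [5, 5]
```

With an outlier present, equal-width binning puts almost everything into one bin, while equal-frequency binning keeps the bins balanced.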

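pandas.cut(x, bins, right=True, labels=None)

x: the data to bin
bins: the number of bins, or a sequence of bin edges
labels: the labels for the resulting categories
right: whether each interval includes its right edge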
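
The effect of `right` and `labels` is easiest to see on values that sit exactly on a bin edge. The following is a small sketch; the `ages` series, the `edges` list, and the label names are made up for illustration.

```python
import pandas as pd

ages = pd.Series([10, 20, 30])  # hypothetical values, each exactly on an edge
edges = [0, 10, 20, 30]
names = ['low', 'mid', 'high']

# right=True (default): intervals are (0, 10], (10, 20], (20, 30]
right_closed = pd.cut(ages, bins=edges, labels=names)

# right=False: intervals are [0, 10), [10, 20), [20, 30), so 30 falls outside
left_closed = pd.cut(ages, bins=edges, right=False, labels=names)

print(right_closed.tolist())  # ['low', 'mid', 'high']
print(left_closed.tolist())   # ['mid', 'high', nan]
```

A value that falls outside every interval is not an error; it simply becomes NaN.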

import pandas as pd
import numpy as np
import os

  
 

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

def f(x):
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)
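
As an alternative to applying a Python function row by row, the same cleanup can be sketched with pandas' vectorized string methods. The `raw` series below is hypothetical; it only imitates the kind of values the function above is written to handle (a leading `$` and thousands separators).

```python
import pandas as pd

# Hypothetical raw values in the style the cleaning function expects
raw = pd.Series(['$11,412', '17,200', '3872'])

# Remove every '$' and ',' in one pass, then convert to float
cleaned = raw.astype(str).str.replace(r'[$,]', '', regex=True).astype(float)
print(cleaned.tolist())  # [11412.0, 17200.0, 3872.0]
```

Unlike `strip('$')`, the regular expression removes a `$` anywhere in the string, which is a slightly broader rule than the original function.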

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	6575.0	Green Bay, Wisconsin, United States	2009.0	33142.0	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	10000.0	West Bend, Wisconsin, United States	2012.0	17800.0	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))

# Count the number of rows in each bin
df['Price_bin'].value_counts()

0 6762
1 659
2 50
3 20
4 2
Name: Price_bin, dtype: int64
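
The counts above are heavily skewed toward bin 0 because equal-width binning divides the price range evenly, regardless of where the data is concentrated. One way to see which edges were used is `retbins=True`. This is a sketch on a hypothetical `prices` series, not on the real dataset.

```python
import pandas as pd

# Hypothetical prices: a few cheap items and one very expensive one
prices = pd.Series([0.0, 500.0, 1000.0, 2000.0, 100000.0])

# retbins=True also returns the array of bin edges that pd.cut computed
binned, edges = pd.cut(prices, 5, labels=range(5), retbins=True)

# The interior edges are evenly spaced across the full range
interior = [float(e) for e in edges[1:-1]]
print(interior)
print(binned.tolist())
```

Here every cheap item lands in bin 0 and the single expensive item lands in bin 4, which mirrors the skew in the real output.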

  
 

%matplotlib inline

df['Price_bin'].value_counts().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fba9048>

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35f681278>

w = [100, 1000, 5000, 10000, 20000, 100000]

df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))

df[['Price', 'Price_bin']].head(5)

	Price	Price_bin
0	11412.0	3
1	17200.0	3
2	3872.0	1
3	6575.0	2
4	10000.0	2
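
One thing to watch with explicit edges: any price at or below the first edge, or above the last edge, falls outside every interval and becomes NaN. The sketch below reuses the edge list `w` from above on a hypothetical `prices` sample.

```python
import pandas as pd

w = [100, 1000, 5000, 10000, 20000, 100000]

# Hypothetical sample, including values at or below the first edge
prices = pd.Series([0.0, 100.0, 3872.0, 10000.0, 100000.0])

binned = pd.cut(prices, bins=w, labels=range(5))

# 0.0 and 100.0 are not inside (100, 1000], so they are left unbinned
print(binned.isna().tolist())  # [True, True, False, False, False]
```

Checking `isna().sum()` on the binned column is a quick way to find out how many rows the chosen edges do not cover.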

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fb99898>

# Quantile levels
k = 5
w = [1.0 * i/k for i in range(k+1)]
w

[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

  
 

# Equal-frequency binning into 5 bins
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fe2a080>
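
`pd.qcut` also accepts an integer for `q`, which is shorthand for evenly spaced quantile levels like the list built above. A minimal sketch, using a hypothetical `prices` series of 100 distinct values:

```python
import pandas as pd

# Hypothetical prices 1.0 .. 100.0, all distinct
prices = pd.Series(range(1, 101), dtype=float)

# Explicit quantile levels versus the integer shorthand
by_list = pd.qcut(prices, q=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=range(5))
by_count = pd.qcut(prices, q=5, labels=range(5))

same = by_list.tolist() == by_count.tolist()
counts = by_count.value_counts().sort_index().tolist()
print(same)    # True
print(counts)  # [20, 20, 20, 20, 20]
```

On real data with many repeated values the bins will only be approximately equal in size, since tied values cannot be split across bins.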

# Compute the quantile cut points
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 100000.0
Name: Price, dtype: float64

  
 

# The first edge should generally be slightly below the actual minimum,
# and the last edge slightly above the actual maximum
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1

0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 110000.0
Name: Price, dtype: float64
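
Note that the first edge is still 0.0 after the adjustment, because scaling zero leaves it at zero. With the default `right=True`, a price of exactly 0.0 is therefore not inside the first interval. One possible way to keep such rows, assuming that is the intent, is `include_lowest=True`. The sketch uses the edges printed above and a hypothetical `prices` sample.

```python
import pandas as pd

# Edges as printed above
edges = [0.0, 3500.0, 6491.0, 9777.0, 14999.0, 110000.0]

# Hypothetical sample: the minimum, an interior edge, and the maximum
prices = pd.Series([0.0, 3500.0, 100000.0])

default_bins = pd.cut(prices, bins=edges, labels=range(5))
with_lowest = pd.cut(prices, bins=edges, labels=range(5), include_lowest=True)

print(default_bins.isna().tolist())  # [True, False, False]
print(with_lowest.tolist())          # [0, 0, 4]
```

Subtracting a small constant from the first edge would work as well; `include_lowest=True` just avoids picking that constant by hand.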

  
 

# Bin the prices using the new edges
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5))

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35e53fa20>

Source: ruochen.blog.csdn.net. Author: 若尘. Copyright belongs to the original author; to reprint, please contact the author.

Original link: ruochen.blog.csdn.net/article/details/105636309
