Hive和Pandas实现wordcount
【摘要】
本文Pandas版本必须为0.25以上才能使用explode,可以通过以下命令看Pandas版本:
pip show pandas
可以通过以下命令升级Pandas到最新版:
pip install pandas --upgrade
也可以通过以下命令安装指定版本的Pandas:
pip install ...
本文Pandas版本必须为0.25以上才能使用explode,可以通过以下命令看Pandas版本:
pip show pandas
可以通过以下命令升级Pandas到最新版:
pip install pandas --upgrade
也可以通过以下命令安装指定版本的Pandas:
pip install pandas==1.0.3
目录
效果展示
在hive中:
hive> select s from wc_t;
s
apple apple cdh
dest cdh firend dest
english firend apple dest
girl gift hit dest girl
Time taken: 0.191 seconds, Fetched 4 row(s)
select wc.word, count(1) count
from
(select explode(split(s, ' ')) as word from wc_t) wc
group by wc.word
order by count desc;
word count
--------------
dest 4
apple 3
girl 2
cdh 2
firend 2
hit 1
english 1
gift 1
Time taken: 2.119 seconds, Fetched 8 row(s)
在python中:
import pandas as pd
In[1]:
df = pd.read_csv("word.txt", header=None)
df
out[1]:
0
0 apple apple cdh
1 dest cdh firend dest
2 english firend apple dest
3 girl gift hit dest girl
In[2]:
df = pd.read_csv("word.txt", header=None, names=['s'])
df["s"] = df["s"].str.split(" ")
se = df.explode("s").rename(columns={"s": "word"}).groupby("word").apply(len)
se.sort_values(ascending=False, inplace=True)
se.reset_index(name="count")
out[2]:
word count
0 dest 4
1 apple 3
2 girl 2
3 firend 2
4 cdh 2
5 hit 1
6 gift 1
7 english 1
hive实现的详解
word.txt文件的内容:
apple apple cdh
dest cdh firend dest
english firend apple dest
girl gift hit dest girl
hive表数据准备:
create table wc_t(s string);
load data local inpath 'word.txt' into table wc_t;
首先,使用 split 函数将数据切割成一个一个的单词:
hive> select split(s, ' ') from wc_t;
["apple","apple","cdh"]
["dest","cdh","firend","dest"]
["english","firend","apple","dest"]
["girl","gift","hit","dest","girl"]
Time taken: 0.36 seconds, Fetched 4 row(s)
然后,使用 explode 函数将集合中的元素拆分成多行元素:
hvie> select explode(split(s, ' ')) word from wc_t;
word
apple
apple
cdh
dest
cdh
firend
dest
english
firend
apple
dest
girl
gift
hit
dest
girl
Time taken: 0.207 seconds, Fetched 16 row(s)
最后,使用聚合函数统计多行数据:
select wc.word, count(1) count
from
(select explode(split(s, ' ')) word from wc_t) wc
group by wc.word
order by count desc;
word count
--------------
dest 4
apple 3
girl 2
cdh 2
firend 2
hit 1
english 1
gift 1
Time taken: 2.119 seconds, Fetched 8 row(s)
python实现的详解
读取数据:
import pandas as pd
df = pd.read_csv("word.txt", header=None, names=['s'])
df
s
0 apple apple cdh
1 dest cdh firend dest
2 english firend apple dest
3 girl gift hit dest girl
将单词切割成数组:
df["s"] = df["s"].str.split(" ")
df
s
0 [apple, apple, cdh]
1 [dest, cdh, firend, dest]
2 [english, firend, apple, dest]
3 [girl, gift, hit, dest, girl]
将数组中的元素拆分成多行元素:
df = df.explode("s")
df
s
0 apple
0 apple
0 cdh
1 dest
1 cdh
1 firend
1 dest
2 english
2 firend
2 apple
2 dest
3 girl
3 gift
3 hit
3 dest
3 girl
修改列名:
df = df.rename(columns={"s": "word"})
df.head()
word
0 apple
0 apple
0 cdh
1 dest
1 cdh
分组聚合,计算每个单词出现的次数(返回一个Series):
se = df.groupby("word").apply(len)
se
word
apple 3
cdh 2
dest 4
english 1
firend 2
gift 1
girl 2
hit 1
dtype: int64
由于无多余的数值列进行数值统计,故只能通过apply传递函数进行计算。
对次数进行排序:
se.sort_values(ascending=False, inplace=True)
se
word
dest 4
apple 3
girl 2
firend 2
cdh 2
hit 1
gift 1
english 1
dtype: int64
最后将结果还原为DataFrame:
se.reset_index(name="count")
word count
0 dest 4
1 apple 3
2 girl 2
3 firend 2
4 cdh 2
5 hit 1
6 gift 1
7 english 1
一步到位:
df = pd.read_csv("word.txt", header=None, names=['s'])
df["s"] = df["s"].str.split(" ")
se = df.explode("s").rename(columns={"s": "word"}).groupby("word").apply(len)
se.sort_values(ascending=False, inplace=True)
se.reset_index(name="count")
word count
0 dest 4
1 apple 3
2 girl 2
3 firend 2
4 cdh 2
5 hit 1
6 gift 1
7 english 1
小例子
有一个gross.csv文件,内容如下:
Action|Adventure|Fantasy|Sci-Fi,760505847.0
Action|Adventure|Fantasy,309404152.0
Action|Adventure|Thriller,200074175.0
Action|Thriller,448130642.0
Documentary,
Action|Adventure|Sci-Fi,73058679.0
Action|Adventure|Romance,336530303.0
Adventure|Animation|Comedy|Family|Fantasy|Musical|Romance,200807262.0
Action|Adventure|Sci-Fi,458991599.0
Adventure|Family|Fantasy|Mystery,301956980.0
Action|Adventure|Sci-Fi,330249062.0
Action|Adventure|Sci-Fi,200069408.0
Action|Adventure,168368427.0
Action|Adventure|Fantasy,423032628.0
Action|Adventure|Western,89289910.0
Action|Adventure|Fantasy|Sci-Fi,291021565.0
Action|Adventure|Family|Fantasy,141614023.0
每行数据表示某部电影所属的电影类型,和该部电影的票房总数。现在要求用python或hive 按照电影类型分类,统计出不同类型的票房总数。
hive实现
加载数据:
CREATE TABLE movie_gross (
genres string,
gross bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
load data local inpath 'gross.csv' into table movie_gross;
sql查询语句:
select a.genre,sum(a.gross) gross
from
(select genre,gross from movie_gross
lateral view explode(split(genres,'\\|')) tmp as genre) a
group by a.genre
order by gross desc;
结果:
genre gross
-------------------------
Adventure 4284974020
Action 4230340420
Fantasy 2428342457
Sci-Fi 2113896160
Thriller 648204817
Family 644378265
Romance 537337565
Mystery 301956980
Musical 200807262
Animation 200807262
Comedy 200807262
Western 89289910
Documentary NULL
Time taken: 2.592 seconds, Fetched 13 row(s)
python实现
import pandas as pd
df = pd.read_csv("gross.csv", header=None, names=["genres", "gross"])
df["genres"] = df["genres"].str.split("|")
df.explode("genres").groupby('genres').sum().sort_values("gross", ascending=False)
结果:
gross
genres
Adventure 4.284974e+09
Action 4.230340e+09
Fantasy 2.428342e+09
Sci-Fi 2.113896e+09
Thriller 6.482048e+08
Family 6.443783e+08
Romance 5.373376e+08
Mystery 3.019570e+08
Animation 2.008073e+08
Comedy 2.008073e+08
Musical 2.008073e+08
Western 8.928991e+07
Documentary 0.000000e+00
文章来源: xxmdmst.blog.csdn.net,作者:小小明-代码实体,版权归原作者所有,如需转载,请联系作者。
原文链接:xxmdmst.blog.csdn.net/article/details/105985770
【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)