HIVE 优化(一)-COUNT DISTINCT
1.少用COUNT DISTINCT
数据量小的时候无所谓,数据量大的情况下,由于COUNT DISTINCT操作需要用一个Reduce Task来完成,这一个Reduce需要处理的数据量太大,就会导致整个Job很难完成,一般COUNT DISTINCT使用先GROUP BY再COUNT的方式替换:
SELECT day,
COUNT(DISTINCT id) AS uv
FROM lxw1234
GROUP BY day
可以转换成:
SELECT day,
COUNT(id) AS uv
FROM (SELECT day,id FROM lxw1234 GROUP BY day,id) a
GROUP BY day;
Count(distinct)
数据量小的时候无所谓,数据量大的情况下,由于COUNT DISTINCT操作需要用一个Reduce Task来完成,这一个Reduce需要处理的数据量太大,就会导致整个Job很难完成,一般COUNT DISTINCT使用先GROUP BY再COUNT的方式替换:
环境准备:
create table bigtable(
id bigint,
time bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/admin/softwares/data/100万条大表数据(id除以10取整)/bigtable' into table bigtable;
【关键点】
set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(DISTINCT id) FROM bigtable;
结果:
c0
10000
Time taken: 35.49 seconds, Fetched: 1 row(s)
【关键点】可以转换成:
set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;
结果:
Stage-Stage-2: Map: 3 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 8987 HDFS Write: 7 SUCCESS
_c0
10000
Time taken: 51.202 seconds, Fetched: 1 row(s)
虽然会多用一个Job来完成,但在数据量大的情况下,这个绝对是值得的。
文章来源: www.jianshu.com,作者:百忍成金的虚竹,版权归原作者所有,如需转载,请联系作者。
原文链接:www.jianshu.com/p/cbfe92a11fbe
- 点赞
- 收藏
- 关注作者
评论(0)