[Big Data] Comparison of Mainstream Hive File Storage Formats
1. Compression Ratio Test of Stored Files
1.1 Test Data
https://github.com/Chenhonli/Compression_Format_Data
The test file log.txt is 18.1 M in size.
1.2 TextFile
- Create a table stored as TextFile
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile ;
- Load data into the table
load data local inpath '/home/hadoop/log.txt' into table log_text ;
- Check the table's data size on HDFS (the dfs command is run from the Hive/Beeline shell)
dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
| DFS Output |
+------------------------------------------------+--+
| 18.1 M /user/hive/warehouse/log_text/log.txt |
+------------------------------------------------+--+
1.3 Parquet
- Create a table stored as Parquet
create table log_parquet (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
- Load data into the table (a columnar table is populated with INSERT ... SELECT rather than LOAD DATA, because LOAD DATA only copies the raw text file without converting its format)
insert into table log_parquet select * from log_text;
- Check the table's data size on HDFS
dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
| DFS Output |
+----------------------------------------------------+--+
| 13.1 M /user/hive/warehouse/log_parquet/000000_0 |
+----------------------------------------------------+--+
1.4 ORC
- Create a table stored as ORC
create table log_orc (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc ;
- Load data into the table
insert into table log_orc select * from log_text ;
- Check the table's data size on HDFS
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
| DFS Output |
+-----------------------------------------------+--+
| 2.8 M /user/hive/warehouse/log_orc/000000_0 |
+-----------------------------------------------+--+
1.5 Compression Ratio Summary
ORC > Parquet > TextFile (2.8 M vs. 13.1 M vs. 18.1 M for the same data: ORC compresses best)
2. Query Speed Test of Stored Files
2.1 TextFile
select count(*) from log_text;
+---------+--+
| _c0 |
+---------+--+
| 100000 |
+---------+--+
1 row selected (16.99 seconds)
2.2 Parquet
select count(*) from log_parquet;
+---------+--+
| _c0 |
+---------+--+
| 100000 |
+---------+--+
1 row selected (17.994 seconds)
2.3 ORC
select count(*) from log_orc;
+---------+--+
| _c0 |
+---------+--+
| 100000 |
+---------+--+
1 row selected (15.943 seconds)
2.4 Query Speed Summary
ORC (15.94 s) > TextFile (16.99 s) > Parquet (17.99 s)
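The count(*) timings above differ by only a couple of seconds. Columnar formats typically show a bigger advantage on queries that read or filter only a few columns, because ORC and Parquet read just the columns a query touches while TextFile must scan every full line. A hypothetical follow-up query against the same tables, not part of the original benchmark:
-- Touches only city_id; ORC/Parquet read a single column, TextFile reads whole rows
select city_id, count(*) as pv
from log_orc
group by city_id;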
3. Combining Storage Format and Compression
- The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.
- Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
- ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). ORC uses ZLIB by default.
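To check which codec an existing ORC table actually uses, its table properties can be inspected from the Hive shell; a minimal sketch using the log_orc table from above:
-- Prints the full DDL, including tblproperties such as orc.compress (if it was set explicitly)
show create table log_orc;
-- Lists detailed metadata: storage format, SerDe, and table parameters
describe formatted log_orc;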
3.1 Create an Uncompressed ORC Table
- 1. Create an uncompressed ORC table
create table log_orc_none (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="NONE") ;
- 2. Load data
insert into table log_orc_none select * from log_text ;
- 3. Check the table's data size on HDFS
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
| DFS Output |
+----------------------------------------------------+--+
| 7.7 M /user/hive/warehouse/log_orc_none/000000_0 |
+----------------------------------------------------+--+
3.2 Create a SNAPPY-compressed ORC Table
- 1. Create a SNAPPY-compressed ORC table
create table log_orc_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="SNAPPY") ;
- 2. Load data
insert into table log_orc_snappy select * from log_text ;
- 3. Check the table's data size on HDFS
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
| DFS Output |
+------------------------------------------------------+--+
| 3.8 M /user/hive/warehouse/log_orc_snappy/000000_0 |
+------------------------------------------------------+--+
3.3 Create a ZLIB-compressed ORC Table
- If no compression format is specified, ORC falls back to its default ZLIB compression.
- The log_orc table created above is exactly this case, so its size can be reused here.
- Check the table's data size on HDFS
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
| DFS Output |
+-----------------------------------------------+--+
| 2.8 M /user/hive/warehouse/log_orc/000000_0 |
+-----------------------------------------------+--+
3.4 Storage Format and Compression Summary
- ORC's default ZLIB compression produces an even smaller file than Snappy (2.8 M vs. 3.8 M on this data set).
- In real-world projects, Hive table data is generally stored as ORC or Parquet.
- Because Snappy compresses and decompresses efficiently, Snappy is usually the compression codec of choice.
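If Parquet is chosen instead of ORC, Snappy can be enabled the same way through a table property. A sketch using the same schema; the table name log_parquet_snappy is illustrative, and it assumes a Hive version that honors the parquet.compression table property:
create table log_parquet_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compression"="SNAPPY");

-- populate it the same way as the other columnar tables
insert into table log_parquet_snappy select * from log_text;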
Source: blog.csdn.net. Author: 橙子园. Copyright belongs to the original author; please contact the author for reprint permission.
Original link: blog.csdn.net/Chenftli/article/details/122561777