78. Hive Data Warehouse in Practice (Hands-On Exercises)

Posted by 托马斯-酷涛 on 2022/05/25 23:22:20
[Abstract] Hive data warehouse operations: creating and dropping databases; creating, modifying, and dropping tables; importing and exporting table data; creating, modifying, and dropping table partitions and buckets. Contents: Hive Environment Setup; Hive Data Warehouse Operations; Hive Table Operations; Importing and Exporting Data in Hive. Hive Environment Setup: Installing Hive 3.1.2 on CentOS (in-depth guide) https://bl...

Hive data warehouse operations covered here:

  • Creating and dropping databases
  • Creating, modifying, and dropping tables
  • Importing and exporting table data
  • Creating, modifying, and dropping table partitions and buckets

Contents

Hive Environment Setup

Hive Data Warehouse Operations

Hive Table Operations

Importing and Exporting Data in Hive


Hive Environment Setup

Installing Hive 3.1.2 on CentOS (in-depth guide): https://blog.csdn.net/m0_54925305/article/details/120554242?spm=1001.2014.3001.5502

Hive Data Warehouse Operations

I. Creating a database

hive> show databases;
OK
default
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> create database if not exists DB;
OK
Time taken: 0.064 seconds
hive> show databases;
OK
db
default
Time taken: 0.018 seconds, Fetched: 2 row(s)

II. Viewing the information and path of database DB (Hive folds identifiers to lowercase, which is why DB is listed as db)

hive> describe database DB;
OK
db		hdfs://master:9000/user/hive/warehouse/db.db	root	USER	
Time taken: 0.065 seconds, Fetched: 1 row(s)

Hive Table Operations

Hive tables come in two kinds: internal (managed) tables and external tables.


When Hive creates an internal table, it moves the data into the path the warehouse points to; when it creates an external table, it only records the path where the data lives and does not move the data at all. When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the data is left in place. External tables are therefore somewhat safer, organize data more flexibly, and make it easy to share source data, so they are commonly used in production.


The commands for operating on tables are described in detail below.
        A table to be created must not have the same name as an existing table, or an error is raised, so first run show tables to see which tables already exist.

I. Create an internal table named cat with two fields, cat_id and cat_name, both of type string

hive> create table cat(cat_id string,cat_name string);
OK
Time taken: 0.72 seconds
hive> show tables;
OK
cat
Time taken: 0.057 seconds, Fetched: 1 row(s)

II. Create an external table named goods with two fields, group_id and group_name, both of type string

hive> create external table if not exists goods(group_id string,group_name string) row format delimited fields terminated by '\t' location '/user/root/goods';
OK
Time taken: 0.155 seconds
hive> show tables;
OK
cat
goods
Time taken: 0.026 seconds, Fetched: 2 row(s)

III. Modify the structure of table cat: add two fields, group_id and cat_code

hive> alter table cat add columns(group_id string,cat_code string);
OK
Time taken: 0.372 seconds
hive> desc cat;
OK
cat_id              	string              	                    
cat_name            	string              	                    
group_id            	string              	                    
cat_code            	string              	                    
Time taken: 0.087 seconds, Fetched: 4 row(s)

IV. Rename table cat to cat2

hive> alter table cat rename to cat2;
OK
Time taken: 0.275 seconds

This command renames a table; the location of the data and the partition names do not change.

V. Create a table with the same structure as an existing table: create a table named cat3 with the same structure as cat2, using the like keyword

hive> create table cat3 like cat2;
OK
Time taken: 1.391 seconds
hive> show tables;
OK
cat2
cat3
goods
Time taken: 0.047 seconds, Fetched: 3 row(s)
hive> desc cat3;
OK
cat_id              	string              	                    
cat_name            	string              	                    
group_id            	string              	                    
cat_code            	string              	                    
Time taken: 0.118 seconds, Fetched: 4 row(s)

Importing and Exporting Data in Hive

I. Importing data into a Hive table from the local file system

First, create a cat_group table in Hive with two fields, group_id and group_name, both of type string, using "\t" as the delimiter, and check the result.

hive> create table cat_group(group_id string,group_name string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.218 seconds
hive> show tables;
OK
cat2
cat3
cat_group
goods
Time taken: 0.048 seconds, Fetched: 4 row(s)

The [row format delimited] clause sets the column delimiter the table uses when loading data.
The [stored as textfile] clause sets the storage format of the loaded data; the default is TEXTFILE. If the file is plain text, use [stored as textfile]: the file can then be copied straight from the local system to HDFS and Hive can read the data directly.
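The effect of the declared delimiter can be sketched outside Hive: each line of the text file is split on the delimiter to produce the column values. A minimal Python sketch (the sample line is illustrative; Hive's LazySimpleSerDe does the real parsing):

```python
# Sketch: how one tab-delimited text line maps to the columns of
# cat_group (group_id string, group_name string). Illustrative only.
def parse_line(line, delimiter="\t"):
    """Split one text-file line into its column values."""
    return line.rstrip("\n").split(delimiter)

row = parse_line("101\tSun Wukong\n")
print(row)  # ['101', 'Sun Wukong']
```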

II. Load the myhive file from the local Linux directory /input/hive/ into the Hive table cat_group

hive> load data local inpath '/input/hive/myhive' into table cat_group;
Loading data to table db.cat_group
OK
Time taken: 1.081 seconds

Check with a select statement whether the data was imported into cat_group successfully, using the limit keyword to restrict the output to 5 records.

hive> select * from cat_group limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 2.088 seconds, Fetched: 5 row(s)

III. Importing data from HDFS into Hive

        1. First, open another terminal window and create the /output/hive directory on HDFS

[root@master hive]# hadoop fs -mkdir /output/hive

        2. Upload the myhive file from the local /input/hive/ directory to /output/hive on HDFS, and check that it is there

[root@master hive]# hadoop fs -put /input/hive/myhive /output/hive/
[root@master hive]# hadoop fs -ls /output/hive 
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-05 22:19 /output/hive/myhive

        3. Create a table named cat_group1 in Hive with the following statement

hive> create table cat_group1(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.243 seconds

        4. Load the myhive file under /output/hive on HDFS into the Hive table cat_group1, and check the result

hive> load data inpath '/output/hive/myhive'  into table cat_group1;
Loading data to table db.cat_group1
OK
Time taken: 0.539 seconds
hive> select * from cat_group1 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.262 seconds, Fetched: 5 row(s)

        Note: the import succeeded.

        The difference between loading from HDFS and loading from the local file system is that load data is written without the local keyword.

IV. Importing data queried from another table

        1. First create a table named cat_group2 in Hive.

hive> create table cat_group2(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.111 seconds

        2. Load the data from cat_group1 into cat_group2 in the following two ways.

hive> insert into table cat_group2 select * from cat_group1;
Query ID = root_20220306040659_42572420-db7d-4412-bbc3-495abd9ce479
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0003, Tracking URL = http://master:8088/proxy/application_1646528951444_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:07:31,799 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:07:51,642 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec
2022-03-06 04:08:00,165 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1646528951444_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-06-59_043_3456913091663343579-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.47 sec   HDFS Read: 13409 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 63.711 seconds
hive> insert overwrite  table cat_group2 select * from cat_group1;
Query ID = root_20220306041024_bf920fd1-b42d-4ed7-ad7b-66955905fa19
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0004, Tracking URL = http://master:8088/proxy/application_1646528951444_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:10:47,981 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:11:12,568 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.33 sec
2022-03-06 04:11:22,231 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1646528951444_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-10-24_167_6531779411761470258-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 13494 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 100 msec
OK
Time taken: 60.895 seconds

        Note: insert overwrite replaces the table's existing data, while insert into appends to it.
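The difference between the two statements above can be sketched with a plain list standing in for the table's rows (illustrative only; Hive actually rewrites files on HDFS):

```python
# Sketch: INSERT INTO appends the query result to the rows the table
# already holds; INSERT OVERWRITE discards the old rows first.
def insert_into(existing_rows, new_rows):
    # INTO: keep old rows, append the query result
    return existing_rows + new_rows

def insert_overwrite(existing_rows, new_rows):
    # OVERWRITE: keep only the query result
    return list(new_rows)

table = [("101", "a")]
print(insert_into(table, [("102", "b")]))       # [('101', 'a'), ('102', 'b')]
print(insert_overwrite(table, [("102", "b")]))  # [('102', 'b')]
```

This is why running both statements back to back, as in the transcript, leaves cat_group2 with a single copy of the data: the second (overwrite) run replaces what the first run appended.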

        3. Query table cat_group2

hive> select * from cat_group2 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.33 seconds, Fetched: 5 row(s)

        4. Query data from another table while creating a table, inserting the result into the new table

Create table cat_group3 in Hive, taking its data directly from cat_group2.

hive> create table  cat_group3 as  select * from cat_group2;
Query ID = root_20220306041630_3200b863-b9b3-4c2e-ac0d-c7caff9b6611
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0005, Tracking URL = http://master:8088/proxy/application_1646528951444_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 04:16:54,438 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:17:02,430 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.58 sec
MapReduce Total cumulative CPU time: 1 seconds 580 msec
Ended Job = job_1646528951444_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/.hive-staging_hive_2022-03-06_04-16-30_327_7813330832683742274-1/-ext-10002
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group3
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.58 sec   HDFS Read: 4969 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec
OK
Time taken: 34.65 seconds

        5. Query table cat_group3

hive> select * from cat_group3 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.229 seconds, Fetched: 5 row(s)

V. Three common ways to export data

        1. Exporting to the local file system

Create the local directory /output/hive and export the Hive table cat_group to /output/hive/ on the local file system.

[root@master hive]# mkdir -p /output/hive/
hive> insert overwrite local directory '/output/hive/'
    > row format delimited fields terminated by '\t' select * from cat_group;
Query ID = root_20220306062829_b059a3f5-e4ad-4dd7-a000-e294c4ccbee2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0006, Tracking URL = http://master:8088/proxy/application_1646528951444_0006/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:28:51,743 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:29:00,515 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.59 sec
MapReduce Total cumulative CPU time: 1 seconds 590 msec
Ended Job = job_1646528951444_0006
Moving data to local directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.59 sec   HDFS Read: 4738 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
Time taken: 32.116 seconds
[root@master out]# cd /output/hive/
[root@master hive]# ll
total 4
-rw-r--r--. 1 root root 64 Mar  6 06:29 000000_0
[root@master hive]# cat 000000_0 
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯

Note: this works differently from importing data into Hive; insert into cannot be used to export data.

        2. Exporting Hive data to HDFS

Export the data in the Hive table cat_group to the /output/hive directory on HDFS.

hive> insert overwrite directory '/output/hive' 
    > row format delimited fields terminated by '\t' select group_id,
    > group_name from cat_group;
Query ID = root_20220306063621_b359d338-77ee-4571-a425-5415f9c6fb03
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0007, Tracking URL = http://master:8088/proxy/application_1646528951444_0007/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:36:41,866 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:36:55,679 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1646528951444_0007
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/output/hive/.hive-staging_hive_2022-03-06_06-36-21_452_7432529204143275493-1/-ext-10000
Moving data to directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.75 sec   HDFS Read: 4772 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 36.494 seconds

Check the result on HDFS

[root@master hive]# hadoop fs -ls /output/hive
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-06 06:36 /output/hive/000000_0

        3. Exporting to another Hive table

Load the data in the Hive table cat_group into cat_group4 (the two tables have identical fields and types).
First create a table cat_group4 in Hive with two fields, group_id and group_name, both of type string, using '\t' as the delimiter.

hive> create table cat_group4(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.195 seconds

Then load the data from cat_group into cat_group4.

hive> insert into table cat_group4 select * from cat_group;
Query ID = root_20220306064421_722364dd-7475-4ae5-ba44-553f3df856e2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0008, Tracking URL = http://master:8088/proxy/application_1646528951444_0008/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 06:44:47,514 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:44:58,359 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
2022-03-06 06:45:11,880 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1646528951444_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group4/.hive-staging_hive_2022-03-06_06-44-21_318_6696628966307745769-1/-ext-10000
Loading data to table db.cat_group4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 13474 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
Time taken: 52.617 seconds

After the import completes, check the data in cat_group4.

hive> select * from cat_group4 limit 10;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.249 seconds, Fetched: 5 row(s)

VI. Operations on Hive partitioned tables

Create a partitioned table: create a table goods in Hive with two fields, goods_id and goods_status, both of type string, partitioned by cat_id (string), using "\t" as the delimiter.

hive> create table goods(goods_id string,goods_status string) partitioned by (cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.107 seconds

View the structure of table goods

hive> desc goods;
OK
goods_id            	string              	                    
goods_status        	string              	                    
cat_id              	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
cat_id              	string              	                    
Time taken: 0.108 seconds, Fetched: 7 row(s)
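Note that the partition column cat_id appears in the schema above but is not stored inside the data files: each distinct partition value becomes a subdirectory named col=value under the table's warehouse directory. A sketch of the path Hive builds (the warehouse root below matches the transcript output but is otherwise just an example):

```python
# Sketch: a partition value maps to a directory named col=value
# under the table's directory. The warehouse root is illustrative.
def partition_path(table_root, col, value):
    return f"{table_root}/{col}={value}"

print(partition_path("/user/hive/warehouse/db.db/goods", "cat_id", "52052"))
# /user/hive/warehouse/db.db/goods/cat_id=52052
```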

Insert data into the partitioned table: insert the data from the goods file under the local /input/hive/ directory into the partitioned table goods.

[root@master hive]# cat goods 
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052

Create a non-partitioned table goods_1 in Hive to hold the data from the goods file under the local /input/hive/ directory.

hive> create table goods_1(goods_id string,goods_status string,cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.179 seconds

Load the goods file under the local /input/hive/ directory into the Hive table goods_1.

hive> load data local inpath '/input/hive/goods' into table goods_1;
Loading data to table db.goods_1
OK
Time taken: 0.511 seconds

Then insert the data from table goods_1 into the partitioned table goods

hive> insert into table db.goods partition(cat_id = '52052') select goods_id, goods_status from db.goods_1 where cat_id = '52052';
Query ID = root_20220307041832_30f47fc3-629d-4eda-821a-5f0c3a9edb0d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0002, Tracking URL = http://master:8088/proxy/application_1646636256603_0002/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 04:19:05,274 Stage-1 map = 0%,  reduce = 0%
2022-03-07 04:19:18,487 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
2022-03-07 04:19:27,292 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.59 sec
MapReduce Total cumulative CPU time: 4 seconds 590 msec
Ended Job = job_1646636256603_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-07_04-18-32_060_6446641423854979060-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.59 sec   HDFS Read: 14777 HDFS Write: 320 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 590 msec
OK
Time taken: 59.931 seconds

View the data in table goods

hive> select goods_id, goods_status from goods;
OK
1624123	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
Time taken: 0.252 seconds, Fetched: 8 row(s)

Modify a table partition: rename the partition cat_id = 52052 of the partitioned table goods to cat_id = 52051, then view the renamed partition.

hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.678 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.139 seconds, Fetched: 1 row(s)

Deleting a table partition

Before deleting the partition from goods, back the data up into a goods_2 table

hive> create table goods_2(goods_id string,goods_status string) partitioned by (cat_id string) row format delimited fields terminated by '\t';
OK
Time taken: 0.178 seconds
hive> insert into table goods_2 partition(cat_id='52052') select goods_id,goods_status from goods_1 where cat_id = '52052';
Query ID = root_20220307054238_db58a379-17f6-4ecb-86e0-402e0d7bbf54
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0003, Tracking URL = http://master:8088/proxy/application_1646636256603_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 05:43:04,534 Stage-1 map = 0%,  reduce = 0%
2022-03-07 05:43:17,542 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.76 sec
2022-03-07 05:43:26,197 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1646636256603_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods_2/cat_id=52052/.hive-staging_hive_2022-03-07_05-42-38_498_2225361888387483704-1/-ext-10000
Loading data to table db.goods_2 partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 14813 HDFS Write: 322 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 49.84 seconds

Drop the cat_id partition from the goods table

hive> alter table goods drop if exists partition(cat_id = '52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.405 seconds
hive> show partitions goods;
OK
Time taken: 0.137 seconds

VII. Operations on Hive buckets

Before creating buckets, set the hive.enforce.bucketing property to true so that Hive recognizes and enforces buckets.

1. Creating buckets

Create a table named goods_t with two fields, goods_id and goods_status, both of type string, partitioned by cat_id (string), clustered by the goods_status column and sorted by the goods_id column into 2 buckets.

hive> create table goods_t(goods_id string, goods_status string) partitioned by (cat_id string) clustered by(goods_status) sorted by(goods_id) into 2 buckets;
OK
Time taken: 0.148 seconds
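A bucketed table routes each row to one of N files by hashing the clustered-by column and taking the result modulo the bucket count. Hive uses its own hash function; in the sketch below Python's hash() stands in for it (an assumption), since only the routing rule is the point:

```python
# Sketch: each row goes to bucket hash(clustering_column) % num_buckets.
# Python's hash() is a stand-in for Hive's internal hash function.
def bucket_of(value, num_buckets, hash_fn=hash):
    return hash_fn(value) % num_buckets

# Identical goods_status values always land in the same bucket,
# which is what makes bucket sampling and bucketed joins work.
b = bucket_of("6", 2)
assert bucket_of("6", 2) == b and 0 <= b < 2
```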

2. Set the property: set hive.enforce.bucketing=true;

hive> set hive.enforce.bucketing=true;

3. Insert the data from table goods_2 into table goods_t

hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods_2;
Query ID = root_20220307060336_c76fa90c-ea59-4fa4-9dd5-654c843421fd
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0004, Tracking URL = http://master:8088/proxy/application_1646636256603_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-07 06:04:01,531 Stage-1 map = 0%,  reduce = 0%
2022-03-07 06:04:12,389 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.73 sec
2022-03-07 06:04:29,170 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 4.23 sec
2022-03-07 06:04:30,371 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.99 sec
MapReduce Total cumulative CPU time: 7 seconds 410 msec
Ended Job = job_1646636256603_0004
Loading data to table db.goods_t partition (cat_id=52063)
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0005, Tracking URL = http://master:8088/proxy/application_1646636256603_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-03-07 06:04:54,726 Stage-3 map = 0%,  reduce = 0%
2022-03-07 06:05:07,008 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
2022-03-07 06:05:16,566 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.93 sec
MapReduce Total cumulative CPU time: 3 seconds 930 msec
Ended Job = job_1646636256603_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.41 sec   HDFS Read: 19414 HDFS Write: 469 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.93 sec   HDFS Read: 11591 HDFS Write: 173 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 340 msec
OK
Time taken: 102.151 seconds

4. Sampling the bucketed table

hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status);
OK
Time taken: 0.281 seconds
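tablesample(bucket x out of y on col) keeps only the rows whose hashed column value falls in bucket x - 1 of y. A sketch of that selection rule, again with a stand-in hash function (an assumption; here len() is used in the example so the result is deterministic):

```python
# Sketch of TABLESAMPLE(BUCKET x OUT OF y ON col): keep the rows
# whose hash(col) % y == x - 1. hash_fn stands in for Hive's hash.
def tablesample(rows, x, y, key=lambda row: row, hash_fn=hash):
    return [row for row in rows if hash_fn(key(row)) % y == x - 1]

statuses = ["a", "bb", "ccc", "dddd"]
print(tablesample(statuses, 1, 2, hash_fn=len))  # ['bb', 'dddd']
print(tablesample(statuses, 2, 2, hash_fn=len))  # ['a', 'ccc']
```

This also explains why the query above can legitimately return zero rows: all of the table's goods_status values may simply hash into the other bucket.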


 

Source: tuomasi.blog.csdn.net. Author: 托马斯-酷涛. Copyright belongs to the original author; please contact the author for permission to reprint.

Original link: tuomasi.blog.csdn.net/article/details/123310024
