Sqoop: full sync of MySQL/Oracle data to Hive

赵延东的一亩三分地, published 2023/03/25 12:06:12
[Abstract] This article walks through using Sqoop to do a full sync of Oracle/MySQL data to Hive; the experiments use an Oracle database as the source. Two follow-up articles will cover in detail: 1. sqoop --incremental append, append-mode incremental sync to Hive; 2. sqoop --incremental --merge-key, merge-mode incremental sync to Hive.

This article walks through a full sync of Oracle/MySQL data to Hive with Sqoop; the experiments use an Oracle database as the source.
Two follow-up articles will cover in detail:
1、sqoop --incremental append: append-mode incremental sync to Hive
2、sqoop --incremental --merge-key: merge-mode incremental sync to Hive
Both of those articles have now been published.

一、Background

The sqoop import and export tools share a set of common options, shown in the table below:
[table image: options common to sqoop import and export, not reproduced here]
The import tool:
The import tool loads data from a structured storage system outside HDFS into the Hadoop platform for later analysis. Its basic options and their meanings are shown in the table below:
[table image: basic options of the import tool and their meanings, not reproduced here]
The cases below exercise these features. Since I currently only use import, this article tests only the import-related functionality; the export options are not listed, so please try those yourself.
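For quick reference, the options used throughout this article combine roughly as in the sketch below. The host, SID and credentials are placeholders for a test environment, not a definitive invocation:

# Minimal full-import template (a sketch; adjust the JDBC URL and credentials to your environment)
#   --connect        JDBC URL of the source database
#   --username / -P  source credentials (-P prompts for the password instead of putting it on the command line)
#   --table          source table to read
#   -m               number of map tasks (1 avoids needing --split-by)
#   --hive-import / --hive-database   load the result into the given Hive database
sqoop import \
  --connect jdbc:oracle:thin:@<host>:1521:<sid> \
  --username scott -P \
  --table INR_EMP \
  -m 1 \
  --hive-import --hive-database oracle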

二、Import experiments

1、Create and initialize a test table in Oracle, and create the Hive table

--connected as the scott user
create table inr_emp as select a.empno,
                               a.ename,
                               a.job,
                               a.mgr,
                               a.hiredate,
                               a.sal,
                               a.deptno,sysdate as etltime from emp a where job is not null;
select * from inr_emp;
EMPNO	ENAME	JOB	        MGR	    HIREDATE	SAL	DEPTNO	ETLTIME
7369	er	    CLERK	    7902	1980/12/17	800.00	20	2019/3/19 14:02:13
7499	ALLEN	SALESMAN	7698	1981/2/20	1600.00	30	2019/3/19 14:02:13
7521	WARD	SALESMAN	7698	1981/2/22	1250.00	30	2019/3/19 14:02:13
7566	JONES	MANAGER	    7839	1981/4/2	2975.00	20	2019/3/19 14:02:13
7654	MARTIN	SALESMAN	7698	1981/9/28	1250.00	30	2019/3/19 14:02:13
7698	BLAKE	MANAGER	    7839	1981/5/1	2850.00	30	2019/3/19 14:02:13
7782	CLARK	MANAGER	    7839	1981/6/9	2450.00	10	2019/3/19 14:02:13
7839	KING	PRESIDENT		    1981/11/17	5000.00	10	2019/3/19 14:02:13
7844	TURNER	SALESMAN	7698	1981/9/8	1500.00	30	2019/3/19 14:02:13
7876	ADAMS	CLERK	    7788	1987/5/23	1100.00	20	2019/3/19 14:02:13
7900	JAMES	CLERK	    7698	1981/12/3	950.00	30	2019/3/19 14:02:13
7902	FORD	ANALYST	    7566	1981/12/3	3000.00	20	2019/3/19 14:02:13
7934	sdf	    sdf	        7782	1982/1/23	1300.00	10	2019/3/19 14:02:13
    
    
--create the Hive table
[root@hadoop bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> use oracle;
OK
Time taken: 1.234 seconds
hive> create table INR_EMP
    > (
    >   empno    int,
    >   ename    string,
    >   job      string,
    >   mgr      int,
    >   hiredate DATE,
    >   sal      float,
    >   deptno   int,
    >   etltime  DATE
    > );
OK
Time taken: 0.63 seconds

2、Full import of all columns

[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --hive-import --hive-database oracle
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/12 18:28:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

Query the Hive table:

hive> select * from inr_emp;
OK
7369	er	CLERK	7902	NULL	800.0	20	NULL
7499	ALLEN	SALESMAN	7698	NULL	1600.0	30	NULL
7521	WARD	SALESMAN	7698	NULL	1250.0	30	NULL
7566	JONES	MANAGER	7839	NULL	2975.0	20	NULL
7654	MARTIN	SALESMAN	7698	NULL	1250.0	30	NULL
7698	BLAKE	MANAGER	7839	NULL	2850.0	30	NULL
7782	CLARK	MANAGER	7839	NULL	2450.0	10	NULL
7839	KING	PRESIDENT	NULL	NULL	5000.0	10	NULL
7844	TURNER	SALESMAN	7698	NULL	1500.0	30	NULL
7876	ADAMS	CLERK	7788	NULL	1100.0	20	NULL
7900	JAMES	CLERK	7698	NULL	950.0	30	NULL
7902	FORD	ANALYST	7566	NULL	3000.0	20	NULL
7934	sdf	sdf	7782	NULL	1300.0	10	NULL
Time taken: 3.103 seconds, Fetched: 13 row(s)
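All the date columns above came back NULL. One way to see why is to look at the file Sqoop wrote for the table; the path below assumes the default Hive warehouse location, so treat it as a sketch rather than the exact layout of your cluster:

# Inspect the raw rows Sqoop loaded (assumes the default /user/hive/warehouse layout)
hdfs dfs -cat /user/hive/warehouse/oracle.db/inr_emp/* | head -3
# each Oracle DATE shows up as a timestamp string such as 1980-12-17 00:00:00.0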

Sqoop exports an Oracle DATE as a full timestamp string with a time portion, which Hive's DATE type cannot parse from the text file, so all the date-related columns end up NULL. Change the Hive columns that correspond to the Oracle date columns to string and re-import:

hive> drop table inr_emp;
OK
Time taken: 2.483 seconds
hive> create table INR_EMP
    > (
    >   empno    int,
    >   ename    string,
    >   job      string,
    >   mgr      int,
    >   hiredate string,
    >   sal      float,
    >   deptno   int,
    >   etltime  string
    > );
OK
Time taken: 0.109 seconds

Run the same import again and check the result:

hive> select * from inr_emp;
OK
7369	er	CLERK	7902	1980-12-17 00:00:00.0	800.0	20	2019-03-19 14:02:13.0
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00.0	1600.0	30	2019-03-19 14:02:13.0
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00.0	1250.0	30	2019-03-19 14:02:13.0
7566	JONES	MANAGER	7839	1981-04-02 00:00:00.0	2975.0	20	2019-03-19 14:02:13.0
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00.0	1250.0	30	2019-03-19 14:02:13.0
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00.0	2850.0	30	2019-03-19 14:02:13.0
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00.0	2450.0	10	2019-03-19 14:02:13.0
7839	KING	PRESIDENT	NULL	1981-11-17 00:00:00.0	5000.0	10	2019-03-19 14:02:13.0
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00.0	1500.0	30	2019-03-19 14:02:13.0
7876	ADAMS	CLERK	7788	1987-05-23 00:00:00.0	1100.0	20	2019-03-19 14:02:13.0
7900	JAMES	CLERK	7698	1981-12-03 00:00:00.0	950.0	30	2019-03-19 14:02:13.0
7902	FORD	ANALYST	7566	1981-12-03 00:00:00.0	3000.0	20	2019-03-19 14:02:13.0
7934	sdf	sdf	7782	1982-01-23 00:00:00.0	1300.0	10	2019-03-19 14:02:13.0
Time taken: 0.369 seconds, Fetched: 13 row(s)

This time the data is correct.
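As an alternative to hand-editing the table definition, Sqoop can generate the string columns itself when it creates the Hive table. The sketch below uses --map-column-hive together with --create-hive-table and assumes the Hive table oracle.inr_emp has been dropped first:

# Let Sqoop create the Hive table and force the date columns to string
# (a sketch; fails if oracle.inr_emp already exists)
sqoop import \
  --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl \
  --username scott -P \
  --table INR_EMP \
  -m 1 \
  --hive-import --hive-database oracle \
  --create-hive-table \
  --map-column-hive HIREDATE=string,ETLTIME=string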

3、Full import of selected columns

First drop the Hive table inr_emp and recreate it:

hive> drop table inr_emp;
OK
Time taken: 0.205 seconds
hive> create table INR_EMP
    > (
    >   empno    int,
    >   ename    string,
    >   job      string,
    >   mgr      int,
    >   hiredate string,
    >   sal      float,
    >   deptno   int,
    >   etltime  string
    > );
OK
Time taken: 0.102 seconds

Then, in another session, import just a few of the columns:

[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --columns 'EMPNO,ENAME,SAL,ETLTIME' --hive-import --hive-database oracle
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/12 18:44:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/12 18:44:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/12 18:44:23 INFO tool.BaseSqoopTool: Using Hive-specific delimiters 

Query the Hive table:

hive> select * from inr_emp;
OK
7369	er	800	NULL	NULL	NULL	NULL	NULL
7499	ALLEN	1600	NULL	NULL	NULL	NULL	NULL
7521	WARD	1250	NULL	NULL	NULL	NULL	NULL
7566	JONES	2975	NULL	NULL	NULL	NULL	NULL
7654	MARTIN	1250	NULL	NULL	NULL	NULL	NULL
7698	BLAKE	2850	NULL	NULL	NULL	NULL	NULL
7782	CLARK	2450	NULL	NULL	NULL	NULL	NULL
7839	KING	5000	NULL	NULL	NULL	NULL	NULL
7844	TURNER	1500	NULL	NULL	NULL	NULL	NULL
7876	ADAMS	1100	NULL	NULL	NULL	NULL	NULL
7900	JAMES	950	NULL	NULL	NULL	NULL	NULL
7902	FORD	3000	NULL	NULL	NULL	NULL	NULL
7934	sdf	1300	NULL	NULL	NULL	NULL	NULL
Time taken: 0.188 seconds, Fetched: 13 row(s)

Only the selected columns were imported. Because the Hive table still has all eight columns, the four exported fields are read positionally: SAL lands under job, the ETLTIME timestamp cannot be cast to the int column mgr, and the remaining columns are simply NULL. What if we instead create the Hive table with only the source columns we need and then import just those columns?
Drop and recreate the Hive table:

hive> drop table inr_emp;
OK
Time taken: 0.152 seconds
hive> create table INR_EMP
    > (
    >   empno    int,
    >   ename    string,
    >   sal      float
    > );
OK
Time taken: 0.086 seconds

Re-import:

[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --columns 'EMPNO,ENAME,SAL,ETLTIME' --hive-import --hive-database oracle
...

Query the Hive table:

hive> select * from inr_emp;
OK
7369	er	800.0
7499	ALLEN	1600.0
7521	WARD	1250.0
7566	JONES	2975.0
7654	MARTIN	1250.0
7698	BLAKE	2850.0
7782	CLARK	2450.0
7839	KING	5000.0
7844	TURNER	1500.0
7876	ADAMS	1100.0
7900	JAMES	950.0
7902	FORD	3000.0
7934	sdf	1300.0
Time taken: 0.18 seconds, Fetched: 13 row(s)

The imported data is fine. So when building incremental Kylin cubes, I can create the Hive table with only the columns needed for the computation and then use Sqoop to load data into Hive, which keeps storage usage down. The follow-up articles mentioned at the beginning of this article cover incremental import.
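Along the same lines, if the downstream cube only needs part of the rows as well as part of the columns, --columns can be combined with --where to narrow the import further. The predicate below is purely illustrative, not from the original test:

# Import only the needed columns and rows (the WHERE predicate is an example)
sqoop import \
  --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl \
  --username scott -P \
  --table INR_EMP \
  --columns 'EMPNO,ENAME,SAL' \
  --where 'DEPTNO = 20' \
  -m 1 \
  --hive-import --hive-database oracle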
