Full sync of MySQL/Oracle data to Hive with Sqoop
This article walks through a full (non-incremental) sync of Oracle/MySQL data into Hive using Sqoop; the experiments below use an Oracle database as the example.
Two follow-up articles cover incremental loading in detail:
1. sqoop --incremental append: append-mode incremental sync to Hive
2. sqoop --incremental lastmodified with --merge-key: merge-mode incremental sync to Hive
Both follow-ups are already published.
I. Background
The sqoop import and export tools share some common options, as shown in the table below.
The import tool:
The import tool loads data from a structured storage system outside HDFS into the Hadoop platform for later analysis. Its basic options and their meanings are listed in the table below.
The following sections exercise these features through a series of examples. Since I currently only use import, this article tests the import-related features only; the export options are not listed here, and readers can try those on their own.
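Before the experiments, it helps to fix the overall shape of an import invocation in mind. The sketch below only assembles a generic command string from placeholder values (the host, credentials, and names are placeholders, not values from this article) and prints it instead of running it:

```shell
# Sketch: the general shape of a full-table sqoop import into Hive.
# Every value below is a placeholder -- substitute your own environment.
JDBC_URL="jdbc:oracle:thin:@<host>:1521:<sid>"   # or jdbc:mysql://<host>:3306/<db>
DB_USER="<user>"
SRC_TABLE="<SOURCE_TABLE>"
HIVE_DB="<hive_database>"

CMD="sqoop import \
--connect ${JDBC_URL} \
--username ${DB_USER} -P \
--table ${SRC_TABLE} \
-m 1 \
--hive-import --hive-database ${HIVE_DB}"

# Print instead of executing, so the sketch has no side effects.
echo "$CMD"
```

Using `-P` (prompt for the password) instead of `--password` keeps the password off the command line, which Sqoop itself warns about in the logs shown later in this article.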
II. Import experiments
1. Create and populate the test table in Oracle, and create the Hive table
-- connect as the scott user
create table inr_emp as select a.empno,
a.ename,
a.job,
a.mgr,
a.hiredate,
a.sal,
a.deptno,sysdate as etltime from emp a where job is not null;
select * from inr_emp;
EMPNO ENAME JOB MGR HIREDATE SAL DEPTNO ETLTIME
7369 er CLERK 7902 1980/12/17 800.00 20 2019/3/19 14:02:13
7499 ALLEN SALESMAN 7698 1981/2/20 1600.00 30 2019/3/19 14:02:13
7521 WARD SALESMAN 7698 1981/2/22 1250.00 30 2019/3/19 14:02:13
7566 JONES MANAGER 7839 1981/4/2 2975.00 20 2019/3/19 14:02:13
7654 MARTIN SALESMAN 7698 1981/9/28 1250.00 30 2019/3/19 14:02:13
7698 BLAKE MANAGER 7839 1981/5/1 2850.00 30 2019/3/19 14:02:13
7782 CLARK MANAGER 7839 1981/6/9 2450.00 10 2019/3/19 14:02:13
7839 KING PRESIDENT 1981/11/17 5000.00 10 2019/3/19 14:02:13
7844 TURNER SALESMAN 7698 1981/9/8 1500.00 30 2019/3/19 14:02:13
7876 ADAMS CLERK 7788 1987/5/23 1100.00 20 2019/3/19 14:02:13
7900 JAMES CLERK 7698 1981/12/3 950.00 30 2019/3/19 14:02:13
7902 FORD ANALYST 7566 1981/12/3 3000.00 20 2019/3/19 14:02:13
7934 sdf sdf 7782 1982/1/23 1300.00 10 2019/3/19 14:02:13
-- create the Hive table
[root@hadoop bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> use oracle;
OK
Time taken: 1.234 seconds
hive> create table INR_EMP
> (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate DATE,
> sal float,
> deptno int,
> etltime DATE
> );
OK
Time taken: 0.63 seconds
2. Full import, all columns
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --hive-import --hive-database oracle
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/12 18:28:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Query the Hive table:
hive> select * from inr_emp;
OK
7369 er CLERK 7902 NULL 800.0 20 NULL
7499 ALLEN SALESMAN 7698 NULL 1600.0 30 NULL
7521 WARD SALESMAN 7698 NULL 1250.0 30 NULL
7566 JONES MANAGER 7839 NULL 2975.0 20 NULL
7654 MARTIN SALESMAN 7698 NULL 1250.0 30 NULL
7698 BLAKE MANAGER 7839 NULL 2850.0 30 NULL
7782 CLARK MANAGER 7839 NULL 2450.0 10 NULL
7839 KING PRESIDENT NULL NULL 5000.0 10 NULL
7844 TURNER SALESMAN 7698 NULL 1500.0 30 NULL
7876 ADAMS CLERK 7788 NULL 1100.0 20 NULL
7900 JAMES CLERK 7698 NULL 950.0 30 NULL
7902 FORD ANALYST 7566 NULL 3000.0 20 NULL
7934 sdf sdf 7782 NULL 1300.0 10 NULL
Time taken: 3.103 seconds, Fetched: 13 row(s)
Notice that all the date/time values came through as NULL in the Hive table. Let's change the Hive columns that correspond to the Oracle time columns to string and re-import:
hive> drop table inr_emp;
OK
Time taken: 2.483 seconds
hive> create table INR_EMP
> (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal float,
> deptno int,
> etltime string
> );
OK
Time taken: 0.109 seconds
Run the same import again and check the result:
hive> select * from inr_emp;
OK
7369 er CLERK 7902 1980-12-17 00:00:00.0 800.0 20 2019-03-19 14:02:13.0
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00.0 1600.0 30 2019-03-19 14:02:13.0
7521 WARD SALESMAN 7698 1981-02-22 00:00:00.0 1250.0 30 2019-03-19 14:02:13.0
7566 JONES MANAGER 7839 1981-04-02 00:00:00.0 2975.0 20 2019-03-19 14:02:13.0
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00.0 1250.0 30 2019-03-19 14:02:13.0
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00.0 2850.0 30 2019-03-19 14:02:13.0
7782 CLARK MANAGER 7839 1981-06-09 00:00:00.0 2450.0 10 2019-03-19 14:02:13.0
7839 KING PRESIDENT NULL 1981-11-17 00:00:00.0 5000.0 10 2019-03-19 14:02:13.0
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00.0 1500.0 30 2019-03-19 14:02:13.0
7876 ADAMS CLERK 7788 1987-05-23 00:00:00.0 1100.0 20 2019-03-19 14:02:13.0
7900 JAMES CLERK 7698 1981-12-03 00:00:00.0 950.0 30 2019-03-19 14:02:13.0
7902 FORD ANALYST 7566 1981-12-03 00:00:00.0 3000.0 20 2019-03-19 14:02:13.0
7934 sdf sdf 7782 1982-01-23 00:00:00.0 1300.0 10 2019-03-19 14:02:13.0
Time taken: 0.369 seconds, Fetched: 13 row(s)
This time the data is correct.
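Instead of editing the Hive DDL by hand, Sqoop's per-column type overrides offer an alternative: `--map-column-java` forces a column to be read as a Java String, and `--map-column-hive` sets the type Sqoop uses when it creates the Hive table itself. This is only a sketch against the same example table; whether the resulting strings suit your downstream processing should be verified, and the command is printed here rather than executed:

```shell
# Sketch: per-column type overrides as an alternative to editing the Hive DDL.
# Connection details match the example table used in this article.
CMD="sqoop import \
--connect jdbc:oracle:thin:@192.168.1.6:1521:orcl \
--username scott -P \
--table INR_EMP -m 1 \
--hive-import --hive-database oracle \
--map-column-java HIREDATE=String,ETLTIME=String \
--map-column-hive HIREDATE=STRING,ETLTIME=STRING"

# Print the command so the sketch stays side-effect free.
echo "$CMD"
```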
3. Full import, selected columns
First drop the Hive table inr_emp and recreate it:
hive> drop table inr_emp;
OK
Time taken: 0.205 seconds
hive> create table INR_EMP
> (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal float,
> deptno int,
> etltime string
> );
OK
Time taken: 0.102 seconds
Then, in another session, import just a few selected columns:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --columns 'EMPNO,ENAME,SAL,ETLTIME' --hive-import --hive-database oracle
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/12 18:44:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/12 18:44:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/12 18:44:23 INFO tool.BaseSqoopTool: Using Hive-specific delimiters
Query the Hive table:
hive> select * from inr_emp;
OK
7369 er 800 NULL NULL NULL NULL NULL
7499 ALLEN 1600 NULL NULL NULL NULL NULL
7521 WARD 1250 NULL NULL NULL NULL NULL
7566 JONES 2975 NULL NULL NULL NULL NULL
7654 MARTIN 1250 NULL NULL NULL NULL NULL
7698 BLAKE 2850 NULL NULL NULL NULL NULL
7782 CLARK 2450 NULL NULL NULL NULL NULL
7839 KING 5000 NULL NULL NULL NULL NULL
7844 TURNER 1500 NULL NULL NULL NULL NULL
7876 ADAMS 1100 NULL NULL NULL NULL NULL
7900 JAMES 950 NULL NULL NULL NULL NULL
7902 FORD 3000 NULL NULL NULL NULL NULL
7934 sdf 1300 NULL NULL NULL NULL NULL
Time taken: 0.188 seconds, Fetched: 13 row(s)
The selected columns did come through and the rest are NULL, but note that the fields were loaded positionally: the SAL values (800, 1600, ...) landed in the third Hive column, job, and ETLTIME could not be parsed into the int mgr column. So what if we create the Hive table with only the source columns we need, and then import just those columns?
Drop and recreate the Hive table:
hive> drop table inr_emp;
OK
Time taken: 0.152 seconds
hive> create table INR_EMP
> (
> empno int,
> ename string,
> sal float
> );
OK
Time taken: 0.086 seconds
Re-import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_EMP -m 1 --columns 'EMPNO,ENAME,SAL,ETLTIME' --hive-import --hive-database oracle
...
Query the Hive table:
hive> select * from inr_emp;
OK
7369 er 800.0
7499 ALLEN 1600.0
7521 WARD 1250.0
7566 JONES 2975.0
7654 MARTIN 1250.0
7698 BLAKE 2850.0
7782 CLARK 2450.0
7839 KING 5000.0
7844 TURNER 1500.0
7876 ADAMS 1100.0
7900 JAMES 950.0
7902 FORD 3000.0
7934 sdf 1300.0
Time taken: 0.18 seconds, Fetched: 13 row(s)
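The pattern that worked here, a narrowed Hive table plus a matching --columns list, stays consistent only if the two are kept in the same order, since Sqoop loads fields positionally. A minimal sketch of driving the import from a single column list (printing only, with the same example connection as above):

```shell
# Sketch: keep the --columns list and the Hive table definition in sync.
# Because Sqoop loads fields positionally, the Hive columns must be declared
# in exactly the same order as the --columns argument.
COLS="EMPNO,ENAME,SAL"

CMD="sqoop import \
--connect jdbc:oracle:thin:@192.168.1.6:1521:orcl \
--username scott -P \
--table INR_EMP -m 1 \
--columns '${COLS}' \
--hive-import --hive-database oracle"

# Print the command so the sketch stays side-effect free.
echo "$CMD"
```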
The imported data is correct. So when building Kylin increments, we can create the Hive table with only the columns needed for computation and use Sqoop to load just those columns into Hive, reducing storage use. The next article covers incremental imports; the link was given at the beginning of this article.