Using Sqoop (Part 3): Exporting Data from HDFS to Oracle - hexel - ChinaUnix blog


Category: Hadoop

2014-04-17 11:39:01

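The sqoop export tool loads data from HDFS into an RDBMS. It works in one of three modes:

(1) insert mode: generates INSERT statements and executes them. This is the default.

(2) update mode: generates UPDATE statements that replace existing rows in the database. It is selected with --update-key.

(3) call mode: calls a stored procedure for each record. It is selected with --call.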

Common arguments

Argument                            Description
--connect <jdbc-uri>                Specify JDBC connect string
--connection-manager <class-name>   Specify connection manager class to use
--driver <class-name>               Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>          Override $HADOOP_MAPRED_HOME
--help                              Print usage instructions
--password-file                     Set path for a file containing the authentication password
-P                                  Read password from console
--password <password>               Set authentication password
--username <username>               Set authentication username
--verbose                           Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters

The common arguments were already described in the import article, so this post concentrates on the export arguments.

Export control arguments:

Argument                               Description
--direct                               Use direct export fast path
--export-dir <dir>                     HDFS source path for the export
-m,--num-mappers <n>                   Use n map tasks to export in parallel
--table <table-name>                   Table to populate
--call <stored-proc-name>              Stored Procedure to call
--update-key <col-name>                Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>                   Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>      The string to be interpreted as null for string columns
--input-null-non-string <null-string>  The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>   The table in which data will be staged before being inserted into the destination table.
--clear-staging-table                  Indicates that any data present in the staging table can be deleted.
--batch                                Use batch mode for underlying statement execution.

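Important options:

--input-null-string: if this option is omitted, the string "null" in a string column is interpreted as NULL. It is best to set it explicitly. For files written by Hive, which stores NULL as \N by default, use '\\N'.

--input-null-non-string: if this option is omitted, both the empty string and the string "null" in a non-string column are interpreted as NULL.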
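Before choosing the token, it can help to check how NULL is actually written in the input files. The following is a minimal sketch, assuming tab-separated input and the \N token. The function name count_null_tokens is made up for this illustration and is not part of Sqoop.

```shell
# count_null_tokens [FILE]: count the fields that are exactly \N in
# tab-separated input. Reads standard input when no file is given.
# (Hypothetical helper for this post, not a Sqoop command.)
count_null_tokens() {
  awk -F '\t' '
    { for (i = 1; i <= NF; i++) if ($i == "\\N") c++ }
    END { print c + 0 }
  ' "$@"
}
```

On a cluster the input would typically be piped in, for example: hdfs dfs -cat /user/root/sqoop_update/sqoop_update.csv | count_null_tokens. A count of 0 suggests the files use some other NULL representation.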

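An insert-mode export opens several connections and runs several transactions, one set per map task. Sqoop uses multi-row inserts: each INSERT statement carries up to 100 rows, and the transaction is committed after every 100 statements, that is, about every 10,000 rows per task.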
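To make the commit arithmetic concrete, here is a small sketch that uses the two figures quoted above (100 rows per statement, 100 statements per transaction). The variable and function names are made up for this illustration.

```shell
# Figures taken from the paragraph above.
ROWS_PER_STATEMENT=100
STATEMENTS_PER_TRANSACTION=100

# commits_needed ROWS: how many commits one map task issues for ROWS rows
# (rounded up, because the last partial batch is committed too).
commits_needed() {
  per_commit=$((ROWS_PER_STATEMENT * STATEMENTS_PER_TRANSACTION))
  echo $(( ($1 + per_commit - 1) / per_commit ))
}
```

For 25,000 rows, commits_needed 25000 prints 3. If the task fails after the second commit, 20,000 rows are already in the target table, which is the partial-load situation that a staging table is meant to avoid.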

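Because Sqoop by default inserts data in parallel through multiple independent transactions, an export can end with some of the data loaded and some not. To keep the target consistent, add --staging-table: the data is first loaded into an intermediate table and moved to the target table in a single transaction only after the whole export has succeeded. The staging table must be created beforehand with the same structure as the target table. It is best to add --clear-staging-table as well, so that any rows left in the staging table are deleted before the export starts. --staging-table cannot be used together with --direct. Also note that, unlike import, export only succeeds if the target table already exists in the RDBMS.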
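One way to create a staging table with the same columns as the target is an empty CREATE TABLE ... AS SELECT. The sketch below only prints the statement. The helper name staging_table_ddl is hypothetical, and note that this form copies column definitions but not primary keys or indexes.

```shell
# staging_table_ddl TARGET STAGING: print Oracle DDL that creates STAGING
# with the same columns as TARGET and no rows.
# (Hypothetical helper; it prints SQL and does not connect to a database.)
staging_table_ddl() {
  # WHERE 1=0 matches no rows, so only the structure is copied.
  printf 'CREATE TABLE %s AS SELECT * FROM %s WHERE 1=0;\n' "$2" "$1"
}
```

For the tables used in Example 1, staging_table_ddl F_917MT1 F_917MT2 prints the statement, which can then be run in SQL*Plus as the TRX user.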

Example 1: export a file from Hadoop into an Oracle database in insert mode.

Watch out for primary key constraints: duplicate records make the export fail.

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table F_917MT1 --fields-terminated-by '\t' \
--export-dir /user/hive/warehouse/gj.db/f_917mt/part-m-00001 -m 1 \
--staging-table F_917MT2 --clear-staging-table --input-null-string '\\N' --input-null-non-string '\\N'
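Passing --password on the command line leaves the password in the shell history and the process list. The common arguments table lists --password-file as an alternative, and a minimal sketch of that variant follows. The file location and both function names are assumptions made for this illustration, and the file:// prefix assumes the password file sits on the local filesystem of the machine that runs sqoop.

```shell
# Hypothetical location for the password file (not from the original post).
PASSWORD_FILE="${PASSWORD_FILE:-$HOME/.sqoop_trx_password}"

# write_password_file PASSWORD: store the password readable by the owner only.
write_password_file() {
  rm -f "$PASSWORD_FILE"
  # printf '%s' writes no trailing newline, which could otherwise end up
  # being treated as part of the password.
  ( umask 077; printf '%s' "$1" > "$PASSWORD_FILE" )
  chmod 400 "$PASSWORD_FILE"
}

# export_f917mt1: Example 1 with --password replaced by --password-file.
# Only defined here; call it on a host where sqoop and the cluster exist.
export_f917mt1() {
  sqoop export -D oracle.sessionTimeZone=CST \
    --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
    --username TRX --password-file "file://$PASSWORD_FILE" \
    --table F_917MT1 --fields-terminated-by '\t' \
    --export-dir /user/hive/warehouse/gj.db/f_917mt/part-m-00001 -m 1 \
    --staging-table F_917MT2 --clear-staging-table \
    --input-null-string '\\N' --input-null-non-string '\\N'
}
```

Typical use would be write_password_file followed by export_f917mt1. For interactive runs, -P prompts for the password on the console instead.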

Example 2: export with update combined with insert.

Contents of the file to be exported (tab-separated):

[hexel ~]#hdfs dfs -cat /user/root/sqoop_update/sqoop_update.csv
1 hx
2 lili
3 fengge

Create the table:

trx@HX> create table sqoop_update(id int,name varchar2(500));

Table created.

First, export the file into the database:

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table SQOOP_UPDATE --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 \
--input-null-string '\\N' --input-null-non-string '\\N'

trx@HX> select * from sqoop_update;

ID|NAME
----------|--------------------
1|hx
2|lili
3|fengge

trx@HX> alter table sqoop_update add constraint pk_id primary key (id);

Table altered.

Now export in update mode.

Change the input file so that it contains:

[hexel ~]#hdfs dfs -cat /user/root/sqoop_update/sqoop_update.csv
1 hx_123
2 lili_123
3 fengge_123
4 huanghe

Export in update mode, so that existing records are updated and missing ones are inserted (mind the case: the table name and the key column are given in upper case):

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table SQOOP_UPDATE --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 --update-key ID --update-mode allowinsert \
--input-null-string '\\N' --input-null-non-string '\\N'

This is equivalent to running the following statements:

update sqoop_update set name='hx_123' where id=1;
update sqoop_update set name='lili_123' where id=2;
update sqoop_update set name='fengge_123' where id=3;
insert into sqoop_update values(4,'huanghe');

trx@HX> select * from sqoop_update;

ID|NAME
----------|--------------------
1|hx_123
2|lili_123
3|fengge_123
4|huanghe

4 rows selected.
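The option table gives two values for --update-mode: updateonly and allowinsert. The sketch below does not call Sqoop. It imitates with awk, on local tab-separated files, what each mode does to a table keyed on its first column, on the assumption that updateonly skips input rows whose key is not already present. The function name simulate_export is made up for this illustration.

```shell
# simulate_export MODE TABLE_FILE INPUT_FILE
# MODE is "updateonly" or "allowinsert". Both files hold "id<TAB>name" rows,
# and the first column plays the role of --update-key.
# Assumes TABLE_FILE is non-empty (the NR == FNR idiom relies on that).
simulate_export() {
  awk -F '\t' -v OFS='|' -v mode="$1" '
    NR == FNR { name[$1] = $2; order[++n] = $1; next }        # load current table
    ($1 in name) { name[$1] = $2; next }                      # key found: update
    mode == "allowinsert" { name[$1] = $2; order[++n] = $1 }  # key missing: insert
    END { for (i = 1; i <= n; i++) print order[i], name[order[i]] }
  ' "$2" "$3"
}
```

With the three original rows as the table file and the four modified rows as input, allowinsert yields all four rows as in the query result above, while updateonly yields only rows 1 to 3, with row 4 dropped.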

Example 3: call a stored procedure for each record.

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--call PROC_SQOOP_UPDATE \
--username TRX --password trx --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 \
--input-null-string '\\N' --input-null-non-string '\\N'

I have not yet worked out how to write the stored procedure that processes the records passed to it.
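One possible shape for such a procedure is sketched below. It is untested and rests on an assumption: that Sqoop passes the fields of each record, in order, as the procedure's parameters. The parameter names p_id and p_name and the helper name proc_sqoop_update_ddl are made up for this illustration. The function only prints the PL/SQL.

```shell
# proc_sqoop_update_ddl: print PL/SQL for a possible PROC_SQOOP_UPDATE.
# Assumption: one parameter per field of the input record, in file order.
# Nothing is executed against a database here.
proc_sqoop_update_ddl() {
  cat <<'SQL'
CREATE OR REPLACE PROCEDURE PROC_SQOOP_UPDATE (
  p_id   IN NUMBER,
  p_name IN VARCHAR2
) AS
BEGIN
  -- Update the row when the id exists, otherwise insert it.
  MERGE INTO sqoop_update t
  USING (SELECT p_id AS id, p_name AS name FROM dual) s
  ON (t.id = s.id)
  WHEN MATCHED THEN UPDATE SET t.name = s.name
  WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
END;
/
SQL
}
```

The printed text can be pasted into SQL*Plus as the TRX user before running the export. The MERGE gives the procedure the same update-or-insert effect as Example 2, but any per-record logic could take its place.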
