Using the Sqoop Tool (3): Exporting Data from HDFS to Oracle - Larpenteur - ChinaUnix Blog

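Category: HADOOP

2014-04-20 11:43:09

Original article: Using the Sqoop Tool (3): Exporting Data from HDFS to Oracle. Author: hexel

The sqoop export tool loads data from HDFS into an RDBMS. It can work in three modes:

(1) insert mode: generates INSERT statements and executes them. This is the default.

(2) update mode: generates UPDATE statements that replace existing records in the database.

(3) call mode: calls a stored procedure for each record.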

Common arguments

Argument                  Description
--connect                 Specify JDBC connect string
--connection-manager      Specify connection manager class to use
--driver                  Manually specify JDBC driver class to use
--hadoop-mapred-home      Override $HADOOP_MAPRED_HOME
--help                    Print usage instructions
--password-file           Set path for a file containing the authentication password
-P                        Read password from console
--password                Set authentication password
--username                Set authentication username
--verbose                 Print more information while working
--connection-param-file   Optional properties file that provides connection parameters

The common arguments were already covered in the import part, so this post concentrates on the export arguments.

Export control arguments:

Argument                  Description
--direct                  Use direct export fast path
--export-dir              HDFS source path for the export
-m,--num-mappers          Use n map tasks to export in parallel
--table                   Table to populate
--call                    Stored Procedure to call
--update-key              Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode             Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string       The string to be interpreted as null for string columns
--input-null-non-string   The string to be interpreted as null for non-string columns
--staging-table           The table in which data will be staged before being inserted into the destination table.
--clear-staging-table     Indicates that any data present in the staging table can be deleted.
--batch                   Use batch mode for underlying statement execution.

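Important options:

--input-null-string: without this option, the string "null" in a string-type column is interpreted as NULL. It is best to set the option explicitly, for example to '\\N'.

--input-null-non-string: without this option, both the empty string and "null" are treated as NULL in non-string columns.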
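A small test file makes the two options easier to try out. The sketch below is an illustration and not part of the original post: the file name nulls.tsv and the record with id 5 are made up. It writes a tab-separated file whose second record carries the marker \N in the name field, and then shows that the single-quoted shell argument '\\N' reaches sqoop with both backslashes intact. The usual explanation for the doubled backslash is that Sqoop embeds the value in generated Java code, where \\ stands for one backslash; treat that as background that is not verified here.

```shell
# Hypothetical sample file: record 5 has no name, marked with \N.
# In a printf format string, \t is a tab and \\ is one literal backslash.
printf '4\thuanghe\n5\t\\N\n' > nulls.tsv

# Extract the name field of the second record to see what is really in the file.
null_field=$(sed -n '2p' nulls.tsv | cut -f2)
printf 'field in file: %s\n' "$null_field"

# Inside single quotes the shell passes both backslashes through unchanged.
sqoop_arg='\\N'
printf 'argument handed to sqoop: %s\n' "$sqoop_arg"
```

The file contains a single backslash followed by N, while the command-line argument contains two backslashes followed by N.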

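An insert-mode export actually opens several connections and runs several transactions. Sqoop uses multi-row inserts: each INSERT statement carries 100 rows, and the transaction is committed after every 100 statements, that is, after every 10,000 rows per task.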
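The arithmetic behind that batching can be sketched in a few lines. The two -D properties built here, sqoop.export.records.per.statement and sqoop.export.statements.per.transaction, come from the Sqoop user guide and not from this post; whether the Oracle connector in a given Sqoop version honours them is an assumption to check. The script only prints the command prefix and does not run sqoop.

```shell
# Defaults described above: 100 rows per INSERT, commit after 100 statements.
records_per_statement=100
statements_per_transaction=100

# Rows written by one map task between two commits.
rows_per_commit=$((records_per_statement * statements_per_transaction))
printf 'rows per commit: %s\n' "$rows_per_commit"

# Generic -D options go right after "sqoop export", before the tool options.
batch_opts="-Dsqoop.export.records.per.statement=${records_per_statement}"
batch_opts="${batch_opts} -Dsqoop.export.statements.per.transaction=${statements_per_transaction}"
printf 'sqoop export %s\n' "$batch_opts"
```

Lowering either value shortens each transaction at the cost of more round trips to the database.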

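Because Sqoop by default inserts data in parallel through several transactions, an export can end up partially applied, with some rows committed and others failed. To keep the result consistent, add --staging-table so that the data first passes through an intermediate table. This requires creating the staging table beforehand, with the same structure as the target table. It is best to also pass --clear-staging-table, which empties the staging table before the export. --staging-table cannot be combined with --direct, nor, according to the Sqoop user guide, with --update-key or --call. Unlike import, export succeeds only if the corresponding table already exists in the RDBMS.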
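One way to obtain a staging table with a matching structure is to copy the column definitions of the target table without its rows. The sketch below uses the table names from Example 1 (F_917MT1 as target, F_917MT2 as staging table); the script file name create_staging.sql is made up, and the DDL has to be run in SQL*Plus or a similar client by a user allowed to create tables. It is offered as an assumption about how the staging table could be prepared, not as the original author's procedure.

```shell
# Write the DDL to a script file; run it in SQL*Plus before the export.
cat > create_staging.sql <<'SQL'
-- Copy the column definitions of the target table, but none of its rows.
CREATE TABLE F_917MT2 AS SELECT * FROM F_917MT1 WHERE 1 = 0;
SQL

# Print the script for review.
cat create_staging.sql
```

CREATE TABLE ... AS SELECT does not copy the primary key of the target table, which is usually acceptable for a table that only holds data in transit.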

Example 1: use insert mode to export a file in Hadoop to the Oracle database.

Mind the primary key constraint: duplicate records make the export fail.

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table F_917MT1 --fields-terminated-by '\t' \
--export-dir /user/hive/warehouse/gj.db/f_917mt/part-m-00001 -m 1 \
--staging-table F_917MT2 --clear-staging-table --input-null-string '\\N' --input-null-non-string '\\N'

Example 2: export with update combined with insert.

Contents of the file to export:

[hexel ~]#hdfs dfs -cat /user/root/sqoop_update/sqoop_update.csv
1 hx
2 lili
3 fengge

Create the table:

trx@HX> create table sqoop_update(id int,name varchar2(500));

Table created.

First export the file into the database:

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table SQOOP_UPDATE --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 \
--input-null-string '\\N' --input-null-non-string '\\N'

trx@HX> select * from sqoop_update;

ID|NAME
----------|--------------------
1|hx
2|lili
3|fengge

trx@HX> alter table sqoop_update add constraint pk_id primary key (id);

Table altered.

Next, run an update export.

Modify the file to export so that it contains:

[hexel ~]#hdfs dfs -cat /user/root/sqoop_update/sqoop_update.csv
1 hx_123
2 lili_123
3 fengge_123
4 huanghe

Export in update mode: existing records are updated and records that do not exist yet are inserted (mind the letter case of the names):

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--username TRX --password trx --table SQOOP_UPDATE --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 --update-key ID --update-mode allowinsert \
--input-null-string '\\N' --input-null-non-string '\\N'

This is equivalent to running the following statements:

update sqoop_update set name='hx_123' where id=1;
update sqoop_update set name='lili_123' where id=2;
update sqoop_update set name='fengge_123' where id=3;
insert into sqoop_update values(4,'huanghe');

trx@HX> select * from sqoop_update;

ID|NAME
----------|--------------------
1|hx_123
2|lili_123
3|fengge_123
4|huanghe

4 rows selected.

Example 3: call a stored procedure for each record:

sqoop export -D oracle.sessionTimeZone=CST --connect jdbc:oracle:thin:@192.168.78.6:1521:hexel \
--call PROC_SQOOP_UPDATE \
--username TRX --password trx --fields-terminated-by '\t' \
--export-dir /user/root/sqoop_update/ -m 1 \
--input-null-string '\\N' --input-null-non-string '\\N'

For now I do not know how to write the stored procedure that handles the records passed in.
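One possible shape for such a procedure, as an untested sketch. It assumes that in call mode Sqoop passes the fields of each record to the procedure as positional arguments, so the procedure declares one IN parameter per column of the file. The parameter names p_id and p_name and the script file name proc_sqoop_update.sql are made up for illustration. The body upserts into the sqoop_update table from Example 2 with an Oracle MERGE statement.

```shell
# Write the PL/SQL to a script file; run it in SQL*Plus as the TRX user.
cat > proc_sqoop_update.sql <<'SQL'
-- Assumed contract: one IN parameter per field of the exported record, in file order.
CREATE OR REPLACE PROCEDURE PROC_SQOOP_UPDATE (
  p_id   IN NUMBER,
  p_name IN VARCHAR2
) AS
BEGIN
  -- Upsert: update the row when the id exists, insert it otherwise.
  MERGE INTO sqoop_update t
  USING (SELECT p_id AS id, p_name AS name FROM dual) s
  ON (t.id = s.id)
  WHEN MATCHED THEN UPDATE SET t.name = s.name
  WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
END;
/
SQL

# Print the script for review.
cat proc_sqoop_update.sql
```

If the assumption about argument passing holds, the export in Example 3 would then have the same effect as the allowinsert export in Example 2.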
