Chinaunix首页 | 论坛 | 博客
  • 博客访问: 6664410
  • 博文数量: 1005
  • 博客积分: 8199
  • 博客等级: 中将
  • 技术积分: 13071
  • 用 户 组: 普通用户
  • 注册时间: 2010-05-25 20:19
个人简介

脚踏实地、勇往直前!

文章分类

全部博文(1005)

文章存档

2020年(2)

2019年(93)

2018年(208)

2017年(81)

2016年(49)

2015年(50)

2014年(170)

2013年(52)

2012年(177)

2011年(93)

2010年(30)

分类: HADOOP

2014-10-30 15:42:28

       我们在创建表的时候可以指定external关键字创建外部表,外部表对应的文件存储在location指定的目录下,向该目录添加新文件的同时,该表也会读取到该文件(当然文件格式必须跟表定义的一致),删除外部表的同时并不会删除location指定目录下的文件.

1.查看hdfs系统目录/user/hadoop1/myfile下文件
[hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
Found 1 items
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt


2.创建外部表指向myfile目录下的文件
hive (hxl)> create external table tb_class_info_external
          > (id int,
          > class_name string,
          > createtime timestamp ,
          > modifytime timestamp)
          > ROW FORMAT DELIMITED
          > FIELDS TERMINATED BY '|'
          > location '/user/hadoop1/myfile';
OK
Time taken: 0.083 seconds

注意这里的location指向的是hdfs系统上的路径,而不是本地机器上的路径,这里表tb_class_info_external会读取myfile目录下的所有文件

3.查看外部表
hive (hxl)> select count(1) from tb_class_info_external;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201410300915_0009, Tracking URL =
Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker= -kill job_201410300915_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-30 15:25:10,652 Stage-1 map = 0%,  reduce = 0%
2014-10-30 15:25:12,664 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:13,671 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:14,682 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:15,690 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:16,697 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:17,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:18,710 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:19,718 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:20,725 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
2014-10-30 15:25:21,730 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
2014-10-30 15:25:22,737 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
MapReduce Total cumulative CPU time: 1 seconds 210 msec
Ended Job = job_201410300915_0009
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.21 sec   HDFS Read: 568052 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 210 msec
OK
10001
Time taken: 14.742 seconds

可以看到这里表记录数是10001,下面我们在myfile目录下添加另外一个文件tb_class_bak.txt

4.在myfile目录下添加文本
$hadoop fs -cp /user/hadoop1/myfile/tb_class.txt /user/hadoop1/myfile/tb_class_bak.txt


5.再次查询表记录数
hive (hxl)> select count(1) from tb_class_info_external;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201410300915_0010, Tracking URL =
Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker= -kill job_201410300915_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-30 15:32:02,275 Stage-1 map = 0%,  reduce = 0%
2014-10-30 15:32:04,286 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:05,292 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:06,300 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:07,306 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:08,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:09,319 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:10,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:11,331 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:12,338 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
2014-10-30 15:32:13,343 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
2014-10-30 15:32:14,350 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
MapReduce Total cumulative CPU time: 1 seconds 160 msec
Ended Job = job_201410300915_0010
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.16 sec   HDFS Read: 1135971 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 160 msec
OK
20002
Time taken: 14.665 seconds

可以看到记录数加倍了,那就说明表已经读取了新增加的文件.

6.删除表
hive (hxl)> drop table tb_class_info_external;
OK
Time taken: 1.7 seconds

表对应的文件并没有删除
[hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
Found 2 items
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-30 15:28 /user/hadoop1/myfile/tb_class_bak.txt


-----------------------------------------------外部分区表-------------------------------------

1.创建外部表目录
[flowrate@richinfo109 ~]$ hadoop fs -mkdir /tmp/bill/20161206
[flowrate@richinfo109 ~]$ hadoop fs -mkdir /tmp/bill/20161206/18

2.拷贝文件到外部表目录
hadoop fs -cp /hive/warehouse/richmail.db/t_part_usernumber_t1/statedate=20161206/provcode=18/b.txt /tmp/bill/20161206/18/b.txt

3.创建分区表
create external table t_part_ext_usernumber_t1
(
 usernumber string
)
partitioned by(statedate string,provcode string)
row format delimited
fields terminated by '|';

4.新添加分区,指定外部目录
alter table t_part_ext_usernumber_t1 add partition(statedate='20161206',provcode='18') location '/tmp/bill/20161206/18';

5.查看数据
select * from t_part_ext_usernumber_t1;


-- The End --

阅读(22331) | 评论(1) | 转发(0) |
给主人留下些什么吧!~~

CoderLeonard2016-08-19 10:11:17

感谢楼主的分享,看完之后受益匪浅!
请问楼主,在什么情况下应该使用外部表,什么情况下使用内部表呢?两者在性能上有什么区别么?我知道部分HQL不支持外部表,但是内部表会把HDFS上的数据mv到自己的hive目录下,会导致其他应用找不到所需的文件,可以改mv为cp操作么?