2012-02-24 18:35:39
Large-Scale Data Sorting on Hadoop
Hadoop TeraSort Benchmark Experiment
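Update:
At first we did not understand this clearly and simply ran the example from reference 1, so the descriptions in the early part of the test procedure below contain mistakes.
bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen <number-of-rows> terasort/<directory>
Generating the data takes only this one command. The first two teragen runs in the example below merely demonstrate the command. The line that actually generates the 100 MB of data used for sorting is:
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 1000000 terasort/100M-input
In addition, the sort here ran with only 1 reducer, the default, so it did not exercise TeraSort's strengths at all. This needs to be changed. The test only shows that TeraSort runs.
Please watch for errors and shortcomings in this article. Thank you!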
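Team leader: 万虎
Members: 牛庆亚, 宋思梦, 文滔, 胡海绅
A detailed analysis of Hadoop TeraSort will appear in a separate article, or will come from 韩旭红's group. To better understand the sorting program in the Hadoop examples, we ran a TeraSort test in a Hadoop environment.
Because we were running in a virtual machine, we settled on 100 MB of test data. We first tried sorting 1 GB and attempted it twice, but the machine hung during the sort each time. On the first attempt the sort had still not finished when we came back from a meal, and the machine had frozen. We finally sorted 100 MB of data, which ran successfully.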
References:
Hadoop TeraSort benchmark experiment
http://blog.csdn.net/zklth/article/details/6295517
Hadoop in the Eyes of a Tester series: TeraSort http://blog.csdn.net/leafy1980/article/details/6633828
Related material [not read in detail]:
Testing the scalability of Hadoop MapReduce:
Implementing Hadoop with MPI: Map/Reduce TeraSort http://emonkey.blog.sohu.com/166546157.html
Analysis of the TeraSort algorithm in Hadoop:
Hadoop's 1 TB sort, terasort: http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark:
Trie tree: http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html
Environment:
VMware virtual machine
Ubuntu 10.10
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) Client VM (build 21.0-b17, mixed mode)
hadoop-0.20.203.0
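The Hadoop installation directory is /home/apple/hadoop-0.20.203.0/
Below are the commands entered and the output produced over the whole TeraSort run.
(Note: in the original post, the terminal prompt and typed commands are shown in orange, explanatory text in blue, and Hadoop output in the default color.)
The whole procedure ran the following commands: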
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
cd hadoop-0.20.203.0/
bin/stop-all.sh
bin/hadoop namenode -format
bin/start-all.sh
bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 100000 terasort/100000-input
bin/hadoop fs -ls /user/apple/terasort/100000-input
bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 10 terasort/100000-input2
bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 1000000 terasort/100M-input
bin/hadoop jar hadoop-examples-0.20.203.0.jar terasort terasort/100M-input terasort/100M-output
bin/hadoop fs -ls terasort/100M-output
bin/hadoop jar hadoop-examples-0.20.203.0.jar teravalidate terasort/100M-output terasort/100M-validate
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Run log with some annotations:
Hadoop had already been running, so to avoid stray problems we first stop Hadoop, reformat the Hadoop NameNode, and then start Hadoop again.
apple@ubuntu:~/hadoop-0.20.203.0$ bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop namenode -format
11/11/06 04:15:10 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.203.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
11/11/06 04:15:10 INFO util.GSet: VM type = 32-bit
11/11/06 04:15:10 INFO util.GSet: 2% max memory = 19.33375 MB
11/11/06 04:15:10 INFO util.GSet: capacity = 2^22 = 4194304 entries
11/11/06 04:15:10 INFO util.GSet: recommended=4194304, actual=4194304
11/11/06 04:15:11 INFO namenode.FSNamesystem: fsOwner=apple
11/11/06 04:15:11 INFO namenode.FSNamesystem: supergroup=supergroup
11/11/06 04:15:11 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/11/06 04:15:11 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/11/06 04:15:11 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/11/06 04:15:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/11/06 04:15:11 INFO common.Storage: Image file of size 111 saved in 0 seconds.
11/11/06 04:15:11 INFO common.Storage: Storage directory /tmp/hadoop-apple/dfs/name has been successfully formatted.
11/11/06 04:15:11 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
apple@ubuntu:~/hadoop-0.20.203.0$ bin/start-all.sh
starting namenode, logging to /home/apple/hadoop-0.20.203.0/bin/../logs/hadoop-apple-namenode-ubuntu.out
localhost: starting datanode, logging to /home/apple/hadoop-0.20.203.0/bin/../logs/hadoop-apple-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/apple/hadoop-0.20.203.0/bin/../logs/hadoop-apple-secondarynamenode-ubuntu.out
starting jobtracker, logging to /home/apple/hadoop-0.20.203.0/bin/../logs/hadoop-apple-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/apple/hadoop-0.20.203.0/bin/../logs/hadoop-apple-tasktracker-ubuntu.out
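Generate the sort input data with TeraGen:
(1) The number after teragen is a row count. Each row is 100 bytes, so producing 1 TB of data takes 1 TB / 100 B = 10000000000 rows (a 1 followed by 10 zeros). The value 100000 used in this first run produces 10 MB; 100 MB requires 1000000 rows.
(2) The terasort directory that follows is a directory in the distributed file system. In our environment it is /user/apple/terasort, and Hadoop creates it automatically. (Thanks to 沈岩 for pointing this out.)
The directory name 100000-input is arbitrary. To keep the directories below easy to tell apart, we named our data directories 100000-input, 100000-input2, 100M-input, and 100M-output.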
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 100000 terasort/100000-input
Generating 100000 using 2 maps with step of 50000
11/11/06 04:33:37 INFO mapred.JobClient: Running job: job_201111060257_0017
11/11/06 04:33:38 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:34:32 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:34:38 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:34:47 INFO mapred.JobClient: Job complete: job_201111060257_0017
11/11/06 04:34:47 INFO mapred.JobClient: Counters: 15
11/11/06 04:34:47 INFO mapred.JobClient: Job Counters
11/11/06 04:34:47 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72824
11/11/06 04:34:47 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:34:47 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:34:47 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:34:47 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:34:47 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:34:47 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:34:47 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:34:47 INFO mapred.JobClient: Bytes Written=10000000
11/11/06 04:34:47 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:34:47 INFO mapred.JobClient: HDFS_BYTES_READ=164
11/11/06 04:34:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41782
11/11/06 04:34:47 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000
11/11/06 04:34:47 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:34:47 INFO mapred.JobClient: Map input records=100000
11/11/06 04:34:47 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:34:47 INFO mapred.JobClient: Map input bytes=100000
11/11/06 04:34:47 INFO mapred.JobClient: Map output records=100000
11/11/06 04:34:47 INFO mapred.JobClient: SPLIT_RAW_BYTES=164
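The byte counters in the log above follow directly from the fixed 100-byte row size. The arithmetic can be sketched roughly as follows; the function names are ours, chosen only for illustration, and sizes are treated as decimal units (1 MB = 10^6 B), which is what matches the 10000000-byte figure in the log.

```python
ROW_BYTES = 100  # each TeraGen row is 100 bytes


def rows_for_bytes(target_bytes: int) -> int:
    """Row count to pass to teragen for a target size in bytes."""
    return target_bytes // ROW_BYTES


def bytes_for_rows(rows: int) -> int:
    """Total bytes TeraGen writes for a given row count."""
    return rows * ROW_BYTES


if __name__ == "__main__":
    print(bytes_for_rows(100_000))      # bytes written by the first run
    print(rows_for_bytes(100 * 10**6))  # rows needed for 100 MB
    print(rows_for_bytes(10**12))       # rows needed for 1 TB
```

Under these assumptions, 100000 rows give 10 MB, and the 100 MB input used for the actual sort needs 1000000 rows.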
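The next command lists the generated directory to confirm that the data really was produced. It uses a distributed file system command, and the path is under /user/apple/terasort/ as noted above.
The result is two data files, each 5000000 B = 5 MB.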
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop fs -ls /user/apple/terasort/100000-input
Found 4 items
-rw-r--r-- 1 apple supergroup 0 2011-11-06 04:34 /user/apple/terasort/100000-input/_SUCCESS
drwxr-xr-x - apple supergroup 0 2011-11-06 04:33 /user/apple/terasort/100000-input/_logs
-rw-r--r-- 1 apple supergroup 5000000 2011-11-06 04:34 /user/apple/terasort/100000-input/part-00000
-rw-r--r-- 1 apple supergroup 5000000 2011-11-06 04:34 /user/apple/terasort/100000-input/part-00001
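The next run generates two 500 B files, 1000 B = 1 KB in total.
Each generated row is 100 B, so the argument 10 produces 10 rows, 1000 B in all; 100,000 rows come to 10,000,000 B = 10 MB.
teragen uses two maps to generate the data, and each map writes one file. For the 100,000-row run above, the two files total 10 MB, so each is 5 MB.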
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 10 terasort/100000-input2
Generating 10 using 2 maps with step of 5
11/11/06 04:37:59 INFO mapred.JobClient: Running job: job_201111060257_0018
11/11/06 04:38:00 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:38:25 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:38:32 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:38:37 INFO mapred.JobClient: Job complete: job_201111060257_0018
11/11/06 04:38:37 INFO mapred.JobClient: Counters: 15
11/11/06 04:38:37 INFO mapred.JobClient: Job Counters
11/11/06 04:38:37 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36091
11/11/06 04:38:37 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:38:37 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:38:37 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:38:37 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:38:37 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:38:37 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:38:37 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:38:37 INFO mapred.JobClient: Bytes Written=1000
11/11/06 04:38:37 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:38:37 INFO mapred.JobClient: HDFS_BYTES_READ=158
11/11/06 04:38:37 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41776
11/11/06 04:38:37 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1000
11/11/06 04:38:37 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:38:37 INFO mapred.JobClient: Map input records=10
11/11/06 04:38:37 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:38:37 INFO mapred.JobClient: Map input bytes=10
11/11/06 04:38:37 INFO mapred.JobClient: Map output records=10
11/11/06 04:38:37 INFO mapred.JobClient: SPLIT_RAW_BYTES=158
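The "using 2 maps with step of N" line in the teragen output suggests that the rows are divided evenly among the map tasks, each of which writes one part file. Below is a small sketch of the resulting file sizes. It assumes an even split with any remainder going to the last map; that remainder handling is our assumption and is not something these logs show, since every run here divides evenly.

```python
from typing import List

ROW_BYTES = 100  # each TeraGen row is 100 bytes


def part_file_sizes(total_rows: int, num_maps: int = 2) -> List[int]:
    """Approximate size in bytes of each part file written by teragen.

    Assumes each map gets total_rows // num_maps rows and the last map
    also takes any leftover rows (an assumption for illustration).
    """
    step = total_rows // num_maps
    rows_per_map = [step] * (num_maps - 1)
    rows_per_map.append(total_rows - step * (num_maps - 1))
    return [rows * ROW_BYTES for rows in rows_per_map]


if __name__ == "__main__":
    for total in (10, 100_000, 1_000_000):
        print(total, part_file_sizes(total))
```

For the three runs in this post (10, 100000, and 1000000 rows) the sketch gives two files of 500 B, 5 MB, and 50 MB each.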
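If we generated 1 GB of data, then with 64 MB blocks it would be split into 16 blocks, and running terasort would launch 16 map tasks. We generate 100 MB instead, and the output below shows, among other things, Launched map tasks=2.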
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar teragen 1000000 terasort/100M-input
Generating 1000000 using 2 maps with step of 500000
11/11/06 04:41:11 INFO mapred.JobClient: Running job: job_201111060257_0019
11/11/06 04:41:12 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:41:43 INFO mapred.JobClient: map 10% reduce 0%
11/11/06 04:41:58 INFO mapred.JobClient: map 11% reduce 0%
11/11/06 04:42:04 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:42:27 INFO mapred.JobClient: map 96% reduce 0%
11/11/06 04:42:33 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:42:45 INFO mapred.JobClient: Job complete: job_201111060257_0019
11/11/06 04:42:45 INFO mapred.JobClient: Counters: 15
11/11/06 04:42:45 INFO mapred.JobClient: Job Counters
11/11/06 04:42:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=121603
11/11/06 04:42:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:42:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:42:45 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:42:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:42:45 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:42:45 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:42:45 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:42:45 INFO mapred.JobClient: Bytes Written=100000000
11/11/06 04:42:45 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:42:45 INFO mapred.JobClient: HDFS_BYTES_READ=167
11/11/06 04:42:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41780
11/11/06 04:42:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
11/11/06 04:42:45 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:42:45 INFO mapred.JobClient: Map input records=1000000
11/11/06 04:42:45 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:42:45 INFO mapred.JobClient: Map input bytes=1000000
11/11/06 04:42:45 INFO mapred.JobClient: Map output records=1000000
11/11/06 04:42:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=167
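To relate input size to the number of terasort map tasks, a rough estimate is one map task per HDFS block of each input file. The sketch below assumes a block size of 64 MB (64 × 1024 × 1024 bytes) and ignores the finer points of Hadoop's split calculation, so its results are approximations. The function name is ours.

```python
from typing import List

BLOCK_BYTES = 64 * 1024 * 1024  # assumed HDFS block size (64 MB)


def estimated_map_tasks(file_sizes: List[int],
                        block_bytes: int = BLOCK_BYTES) -> int:
    """Estimate map tasks as the total number of blocks over all input files."""
    # Integer ceiling division: a partly filled block still counts as one block.
    return sum((size + block_bytes - 1) // block_bytes for size in file_sizes)


if __name__ == "__main__":
    # 100 MB written by 2 maps: two 50,000,000-byte files
    print(estimated_map_tasks([50_000_000, 50_000_000]))
    # 1 GB written by 2 maps: two 500,000,000-byte files
    print(estimated_map_tasks([500_000_000, 500_000_000]))
```

With two 50,000,000-byte files, each file fits in a single block, which is consistent with terasort launching 2 map tasks for the 100 MB input.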
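Run the terasort program. It launches 2 map tasks, and this is the step where the machine is most likely to hang. The timestamps in the output below show that sorting just 100 MB took about 15 minutes.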
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar terasort terasort/100M-input terasort/100M-output
11/11/06 04:44:24 INFO terasort.TeraSort: starting
11/11/06 04:44:26 INFO mapred.FileInputFormat: Total input paths to process : 2
11/11/06 04:44:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/06 04:44:30 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/11/06 04:44:30 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/11/06 04:44:32 INFO mapred.FileInputFormat: Total input paths to process : 2
11/11/06 04:44:34 INFO mapred.JobClient: Running job: job_201111060257_0020
11/11/06 04:44:35 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:49:00 INFO mapred.JobClient: map 1% reduce 0%
11/11/06 04:49:16 INFO mapred.JobClient: map 2% reduce 0%
11/11/06 04:49:19 INFO mapred.JobClient: map 4% reduce 0%
11/11/06 04:49:26 INFO mapred.JobClient: map 7% reduce 0%
11/11/06 04:49:27 INFO mapred.JobClient: map 8% reduce 0%
11/11/06 04:49:32 INFO mapred.JobClient: map 10% reduce 0%
11/11/06 04:49:39 INFO mapred.JobClient: map 11% reduce 0%
11/11/06 04:49:40 INFO mapred.JobClient: map 14% reduce 0%
11/11/06 04:49:49 INFO mapred.JobClient: map 16% reduce 0%
11/11/06 04:49:53 INFO mapred.JobClient: map 20% reduce 0%
11/11/06 04:49:55 INFO mapred.JobClient: map 23% reduce 0%
11/11/06 04:50:00 INFO mapred.JobClient: map 24% reduce 0%
11/11/06 04:50:01 INFO mapred.JobClient: map 26% reduce 0%
11/11/06 04:50:05 INFO mapred.JobClient: map 29% reduce 0%
11/11/06 04:50:08 INFO mapred.JobClient: map 30% reduce 0%
11/11/06 04:50:11 INFO mapred.JobClient: map 33% reduce 0%
11/11/06 04:50:16 INFO mapred.JobClient: map 35% reduce 0%
11/11/06 04:50:19 INFO mapred.JobClient: map 36% reduce 0%
11/11/06 04:50:22 INFO mapred.JobClient: map 38% reduce 0%
11/11/06 04:50:28 INFO mapred.JobClient: map 39% reduce 0%
11/11/06 04:51:31 INFO mapred.JobClient: map 41% reduce 0%
11/11/06 04:52:19 INFO mapred.JobClient: map 44% reduce 0%
11/11/06 04:52:27 INFO mapred.JobClient: map 51% reduce 0%
11/11/06 04:52:31 INFO mapred.JobClient: map 52% reduce 0%
11/11/06 04:52:34 INFO mapred.JobClient: map 55% reduce 0%
11/11/06 04:52:43 INFO mapred.JobClient: map 56% reduce 0%
11/11/06 04:53:01 INFO mapred.JobClient: map 57% reduce 0%
11/11/06 04:53:06 INFO mapred.JobClient: map 59% reduce 0%
11/11/06 04:53:10 INFO mapred.JobClient: map 60% reduce 0%
11/11/06 04:53:18 INFO mapred.JobClient: map 67% reduce 0%
11/11/06 04:54:59 INFO mapred.JobClient: map 69% reduce 0%
11/11/06 04:55:05 INFO mapred.JobClient: map 71% reduce 0%
11/11/06 04:55:30 INFO mapred.JobClient: map 86% reduce 0%
11/11/06 04:55:38 INFO mapred.JobClient: map 91% reduce 0%
11/11/06 04:55:48 INFO mapred.JobClient: map 92% reduce 0%
11/11/06 04:55:55 INFO mapred.JobClient: map 95% reduce 0%
11/11/06 04:56:00 INFO mapred.JobClient: map 96% reduce 0%
11/11/06 04:56:10 INFO mapred.JobClient: map 97% reduce 0%
11/11/06 04:56:19 INFO mapred.JobClient: map 99% reduce 0%
11/11/06 04:57:57 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:58:36 INFO mapred.JobClient: map 100% reduce 16%
11/11/06 04:58:41 INFO mapred.JobClient: map 100% reduce 33%
11/11/06 04:58:47 INFO mapred.JobClient: map 100% reduce 66%
11/11/06 04:58:50 INFO mapred.JobClient: map 100% reduce 68%
11/11/06 04:58:54 INFO mapred.JobClient: map 100% reduce 78%
11/11/06 04:59:04 INFO mapred.JobClient: map 100% reduce 80%
11/11/06 04:59:10 INFO mapred.JobClient: map 100% reduce 82%
11/11/06 04:59:16 INFO mapred.JobClient: map 100% reduce 89%
11/11/06 04:59:19 INFO mapred.JobClient: map 100% reduce 96%
11/11/06 04:59:22 INFO mapred.JobClient: map 100% reduce 99%
11/11/06 04:59:28 INFO mapred.JobClient: map 100% reduce 100%
11/11/06 04:59:40 INFO mapred.JobClient: Job complete: job_201111060257_0020
11/11/06 04:59:42 INFO mapred.JobClient: Counters: 26
11/11/06 04:59:42 INFO mapred.JobClient: Job Counters
11/11/06 04:59:42 INFO mapred.JobClient: Launched reduce tasks=1
11/11/06 04:59:42 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1485517
11/11/06 04:59:42 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:59:42 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:59:42 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:59:42 INFO mapred.JobClient: Data-local map tasks=2
11/11/06 04:59:42 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=143382
11/11/06 04:59:42 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:59:42 INFO mapred.JobClient: Bytes Read=100000000
11/11/06 04:59:42 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:59:42 INFO mapred.JobClient: Bytes Written=100000000
11/11/06 04:59:42 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:59:42 INFO mapred.JobClient: FILE_BYTES_READ=204000294
11/11/06 04:59:42 INFO mapred.JobClient: HDFS_BYTES_READ=100000232
11/11/06 04:59:42 INFO mapred.JobClient: FILE_BYTES_WRITTEN=306065543
11/11/06 04:59:42 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
11/11/06 04:59:42 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:59:42 INFO mapred.JobClient: Map output materialized bytes=102000012
11/11/06 04:59:42 INFO mapred.JobClient: Map input records=1000000
11/11/06 04:59:42 INFO mapred.JobClient: Reduce shuffle bytes=102000012
11/11/06 04:59:42 INFO mapred.JobClient: Spilled Records=3000000
11/11/06 04:59:42 INFO mapred.JobClient: Map output bytes=100000000
11/11/06 04:59:42 INFO mapred.JobClient: Map input bytes=100000000
11/11/06 04:59:42 INFO mapred.JobClient: Combine input records=0
11/11/06 04:59:42 INFO mapred.JobClient: SPLIT_RAW_BYTES=232
11/11/06 04:59:42 INFO mapred.JobClient: Reduce input records=1000000
11/11/06 04:59:42 INFO mapred.JobClient: Reduce input groups=1000000
11/11/06 04:59:42 INFO mapred.JobClient: Combine output records=0
11/11/06 04:59:42 INFO mapred.JobClient: Reduce output records=1000000
11/11/06 04:59:42 INFO mapred.JobClient: Map output records=1000000
11/11/06 04:59:42 INFO terasort.TeraSort: done
The sort has finished, and the output data is still 100 MB.
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop fs -ls terasort/100M-output
Found 3 items
-rw-r--r-- 1 apple supergroup 0 2011-11-06 04:59 /user/apple/terasort/100M-output/_SUCCESS
drwxr-xr-x - apple supergroup 0 2011-11-06 04:44 /user/apple/terasort/100M-output/_logs
-rw-r--r-- 1 apple supergroup 100000000 2011-11-06 04:58 /user/apple/terasort/100M-output/part-00000
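The running time of the sort can be read off the first and last timestamps of the terasort log ("terasort.TeraSort: starting" and "terasort.TeraSort: done"). Here is a small sketch of that calculation, assuming the log prefix is in yy/MM/dd HH:mm:ss form; the helper name is ours.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # e.g. "11/11/06 04:44:24"


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two timestamps copied from the Hadoop log prefix."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return int((t1 - t0).total_seconds())


if __name__ == "__main__":
    total = elapsed_seconds("11/11/06 04:44:24", "11/11/06 04:59:42")
    minutes, seconds = divmod(total, 60)
    print(f"terasort took {minutes} min {seconds} s")
```

For the timestamps above this comes to 15 minutes 18 seconds, in line with the roughly 15 minutes noted earlier.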
Run the TeraSort validation program (teravalidate):
apple@ubuntu:~/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar teravalidate terasort/100M-output terasort/100M-validate
11/11/06 06:53:22 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/06 06:53:26 INFO mapred.JobClient: Running job: job_201111060257_0021
11/11/06 06:53:27 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 06:54:20 INFO mapred.JobClient: map 17% reduce 0%
11/11/06 06:54:24 INFO mapred.JobClient: map 44% reduce 0%
11/11/06 06:54:27 INFO mapred.JobClient: map 55% reduce 0%
11/11/06 06:54:31 INFO mapred.JobClient: map 67% reduce 0%
11/11/06 06:54:37 INFO mapred.JobClient: map 76% reduce 0%
11/11/06 06:54:40 INFO mapred.JobClient: map 90% reduce 0%
11/11/06 06:54:43 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 06:55:07 INFO mapred.JobClient: map 100% reduce 100%
11/11/06 06:55:12 INFO mapred.JobClient: Job complete: job_201111060257_0021
11/11/06 06:55:13 INFO mapred.JobClient: Counters: 25
11/11/06 06:55:13 INFO mapred.JobClient: Job Counters
11/11/06 06:55:13 INFO mapred.JobClient: Launched reduce tasks=1
11/11/06 06:55:13 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=68933
11/11/06 06:55:13 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 06:55:13 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 06:55:13 INFO mapred.JobClient: Launched map tasks=1
11/11/06 06:55:13 INFO mapred.JobClient: Data-local map tasks=1
11/11/06 06:55:13 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=21792
11/11/06 06:55:13 INFO mapred.JobClient: File Input Format Counters
11/11/06 06:55:13 INFO mapred.JobClient: Bytes Read=100000000
11/11/06 06:55:13 INFO mapred.JobClient: File Output Format Counters
11/11/06 06:55:13 INFO mapred.JobClient: Bytes Written=0
11/11/06 06:55:13 INFO mapred.JobClient: FileSystemCounters
11/11/06 06:55:13 INFO mapred.JobClient: FILE_BYTES_READ=64
11/11/06 06:55:13 INFO mapred.JobClient: HDFS_BYTES_READ=100000117
11/11/06 06:55:13 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42241
11/11/06 06:55:13 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 06:55:13 INFO mapred.JobClient: Map output materialized bytes=64
11/11/06 06:55:13 INFO mapred.JobClient: Map input records=1000000
11/11/06 06:55:13 INFO mapred.JobClient: Reduce shuffle bytes=64
11/11/06 06:55:13 INFO mapred.JobClient: Spilled Records=4
11/11/06 06:55:13 INFO mapred.JobClient: Map output bytes=54
11/11/06 06:55:13 INFO mapred.JobClient: Map input bytes=100000000
11/11/06 06:55:13 INFO mapred.JobClient: Combine input records=0
11/11/06 06:55:13 INFO mapred.JobClient: SPLIT_RAW_BYTES=117
11/11/06 06:55:13 INFO mapred.JobClient: Reduce input records=2
11/11/06 06:55:13 INFO mapred.JobClient: Reduce input groups=2
11/11/06 06:55:13 INFO mapred.JobClient: Combine output records=0
11/11/06 06:55:13 INFO mapred.JobClient: Reduce output records=0
11/11/06 06:55:13 INFO mapred.JobClient: Map output records=2
apple@ubuntu:~/hadoop-0.20.203.0$