Category: HADOOP
2013-04-11 16:46:03
I have a rough grasp of how MapReduce works now, but I still didn't know where to start in practice. Fortunately, Hadoop ships with several jar files that can be used for testing:
[grid@hdnode1 hadoop-0.20.2]$ ll *.jar
-rw-rw-r-- 1 grid grid 6839 Feb 19 2010 hadoop-0.20.2-ant.jar
-rw-rw-r-- 1 grid grid 2689741 Feb 19 2010 hadoop-0.20.2-core.jar
-rw-rw-r-- 1 grid grid 142466 Feb 19 2010 hadoop-0.20.2-examples.jar
-rw-rw-r-- 1 grid grid 1563859 Feb 19 2010 hadoop-0.20.2-test.jar
-rw-rw-r-- 1 grid grid 69940 Feb 19 2010 hadoop-0.20.2-tools.jar
These five jar files each serve a different purpose. Below we'll use hadoop-0.20.2-examples.jar as our example (a trick I picked up from teacher tigerfish).
Now that we have a ready-made jar file, how do we run it? Simply invoke the hadoop command with the jar option, for example:
[grid@hdnode1 ~]$ hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
As the output above shows, hadoop-0.20.2-examples.jar supports quite a few programs. Let's try the last one, wordcount, which counts the words in the specified files. We already have suitable input: the two files we just uploaded to the jss directory in HDFS.
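wordcount is the classic "hello world" of MapReduce. As a rough sketch of what its mapper and reducer do (a simplified Python illustration of the algorithm, not the actual Java source inside the examples jar):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every whitespace-separated token.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the 1s collected for one word.
    return word, sum(counts)

def wordcount(lines):
    # Shuffle/sort phase, simulated: group mapper output by key.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    # One reducer call per distinct word.
    return dict(reducer(w, c) for w, c in groups.items())
```

The real job distributes the map calls across the cluster and shuffles intermediate pairs over the network, but the word-level logic is the same.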
Run the following command, writing the results to the jsscount directory:
[grid@hdnode1 ~]$ hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount jss jsscount
13/02/17 20:00:48 INFO input.FileInputFormat: Total input paths to process : 2
13/02/17 20:00:48 INFO mapred.JobClient: Running job: job_201302041636_0001
13/02/17 20:00:49 INFO mapred.JobClient: map 0% reduce 0%
13/02/17 20:00:58 INFO mapred.JobClient: map 50% reduce 0%
13/02/17 20:01:01 INFO mapred.JobClient: map 100% reduce 0%
13/02/17 20:01:10 INFO mapred.JobClient: map 100% reduce 100%
13/02/17 20:01:12 INFO mapred.JobClient: Job complete: job_201302041636_0001
13/02/17 20:01:13 INFO mapred.JobClient: Counters: 17
13/02/17 20:01:13 INFO mapred.JobClient: Job Counters
13/02/17 20:01:13 INFO mapred.JobClient: Launched reduce tasks=1
13/02/17 20:01:13 INFO mapred.JobClient: Launched map tasks=2
13/02/17 20:01:13 INFO mapred.JobClient: Data-local map tasks=2
13/02/17 20:01:13 INFO mapred.JobClient: FileSystemCounters
13/02/17 20:01:13 INFO mapred.JobClient: FILE_BYTES_READ=84
13/02/17 20:01:13 INFO mapred.JobClient: HDFS_BYTES_READ=42
13/02/17 20:01:13 INFO mapred.JobClient: FILE_BYTES_WRITTEN=238
13/02/17 20:01:13 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=35
13/02/17 20:01:13 INFO mapred.JobClient: Map-Reduce Framework
13/02/17 20:01:13 INFO mapred.JobClient: Reduce input groups=4
13/02/17 20:01:13 INFO mapred.JobClient: Combine output records=6
13/02/17 20:01:13 INFO mapred.JobClient: Map input records=2
13/02/17 20:01:13 INFO mapred.JobClient: Reduce shuffle bytes=90
13/02/17 20:01:13 INFO mapred.JobClient: Reduce output records=4
13/02/17 20:01:13 INFO mapred.JobClient: Spilled Records=12
13/02/17 20:01:13 INFO mapred.JobClient: Map output bytes=66
13/02/17 20:01:13 INFO mapred.JobClient: Combine input records=6
13/02/17 20:01:13 INFO mapred.JobClient: Map output records=6
13/02/17 20:01:13 INFO mapred.JobClient: Reduce input records=6
The execution details can wait; let's look at the results first:
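Incidentally, the counters in the log above are internally consistent, which makes a quick sanity check for any MapReduce run: with a combiner in place, every record the mappers emit feeds the combiner, every combiner output record is shuffled to a reducer, and wordcount's reducer emits exactly one record per distinct key. Checking with the values copied from the log:

```python
# Counter values copied from the job log above.
map_output_records     = 6
combine_input_records  = 6
combine_output_records = 6
reduce_input_records   = 6
reduce_input_groups    = 4
reduce_output_records  = 4

# Mapper output is exactly what the combiner consumes...
assert map_output_records == combine_input_records
# ...and combiner output is exactly what the reducers shuffle in.
assert combine_output_records == reduce_input_records
# wordcount emits one output record per distinct word (one per group).
assert reduce_output_records == reduce_input_groups
```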
[grid@hdnode1 ~]$ hadoop dfs -ls
Found 2 items
drwxr-xr-x - grid supergroup 0 2013-02-17 16:58 /user/grid/jss
drwxr-xr-x - grid supergroup 0 2013-02-17 20:01 /user/grid/jsscount
Sure enough, a jsscount directory has appeared. Let's see what's inside it:
[grid@hdnode1 ~]$ hadoop dfs -ls jsscount
Found 2 items
drwxr-xr-x - grid supergroup 0 2013-02-17 20:00 /user/grid/jsscount/_logs
-rw-r--r-- 3 grid supergroup 35 2013-02-17 20:01 /user/grid/jsscount/part-r-00000
One directory and one file. Ignore the directory for now and look at the file:
[grid@hdnode1 ~]$ hadoop dfs -cat jsscount/part-r-00000
Hello 2
Junsansi 2
says: 1
world 1
At a glance, the result looks correct.
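The two input files themselves aren't shown in this post, but the counters and the final output pin them down fairly tightly: 2 map input records (one line per file), 6 map output records (words), and 4 distinct words. One pair of inputs consistent with all of that — purely a hypothetical reconstruction, the real files may differ — reproduces the part-r-00000 output exactly:

```python
from collections import Counter

# Hypothetical reconstruction of the two HDFS input files; the actual
# contents are not shown in the post, but these match every counter.
files = ["Hello Junsansi", "Junsansi says: Hello world"]

words = [w for line in files for w in line.split()]
counts = Counter(words)

assert len(files) == 2    # Map input records = 2
assert len(words) == 6    # Map output records = 6
assert len(counts) == 4   # Reduce input groups / output records = 4
assert counts == {"Hello": 2, "Junsansi": 2, "says:": 1, "world": 1}
```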