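1.1 Installing Hadoop on a single machine
Test platform: Linux 2.6, Hadoop 1.0.3, JDK 1.6
Step 1. Install and configure ssh
Hadoop uses ssh to communicate between its daemons, so passwordless login has to be set up first.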
$ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
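$ssh localhost
This should now log in without a password. [Make sure the local machine can log in to itself without a password.]
Step 2. Install Java
Step 3. Download and install Hadoop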
$wget
$tar zxf hadoop-1.0.3.tar.gz
$echo $JAVA_HOME
$vim conf/hadoop-env.sh
Change export JAVA_HOME=/opt/taobao/java so that it points to the appropriate Java directory.
Edit the configuration files
Edit conf/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Edit conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Edit conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
Format the Hadoop NameNode:
$bin/hadoop namenode -format
Start the daemons:
$bin/start-all.sh
Verify that they are running:
$jps
11107 JobTracker
11005 SecondaryNameNode
11433 Jps
10841 DataNode
11239 TaskTracker
10707 NameNode
PS: I still need to find out what each of these configuration files in conf is actually for.
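As far as I understand it: core-site.xml holds settings shared by all of Hadoop (here fs.default.name, the URI of the default filesystem, i.e. the NameNode address), hdfs-site.xml holds HDFS settings (here dfs.replication, the number of copies kept of each block), and mapred-site.xml holds MapReduce settings (here mapred.job.tracker, the JobTracker's host and port). Each file is just a list of name/value properties. A small sketch for listing them follows; `read_properties` is a helper name I made up, and the XML string is a stand-in for the contents of the real file.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}

# Stand-in for the contents of conf/core-site.xml (assumed layout).
core_site = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

props = read_properties(core_site)
print(props)
```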
1.2 The Apache access log format
XXX.24.XXX.XXX - - [07/Jun/2012:23:59:59 +0800] "GET /api/cli.api.php?act=l&input_k=ip&input_v=XXX.24.XXX.XXX&&&vm=vm& HTTP/1.1" 200 31 "-" "Python-urllib/2.6"
XXX.24.XXX.XXX - - [07/Jun/2012:23:59:59 +0800] "GET /api/cli.api.php?act=l&input_k=ip&input_v=XXX.24.XXX.XXX&output=g HTTP/1.1" 200 47 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
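To make the layout concrete, here is a minimal sketch that splits one such line on whitespace and labels a few of the resulting fields. The function name `split_access_line` and the dictionary keys are my own labels for illustration, and the sketch assumes the line starts with the client address.

```python
def split_access_line(line):
    """Split one access-log line on whitespace and label a few fields.

    Assumes the line starts with the client address; the key names are
    illustrative only.
    """
    fields = line.split()
    return {
        "client": fields[0],
        "method": fields[5].lstrip('"'),  # field 5 looks like '"GET'
        "url": fields[6],                 # the request path and query string
        "status": fields[8],
    }

sample = ('XXX.24.XXX.XXX - - [07/Jun/2012:23:59:59 +0800] '
          '"GET /api/cli.api.php?act=l&input_k=ip&input_v=XXX.24.XXX.XXX&&&vm=vm& HTTP/1.1" '
          '200 31 "-" "Python-urllib/2.6"')

info = split_access_line(sample)
print(info["url"])
```

The point to remember is that the request URL is the field at index 6 once the line is split on whitespace.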
1.3 Requirement
The goal is to extract the prototype of each API call. For the first entry in the example above, the output should be:
/api/cli.api.php?act=&input_k=&input_v=&&&vm=&
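In other words, strip whatever sits between each = and the following &. [The parameter values vary from request to request.]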
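A minimal sketch of that stripping step on its own, so it can be tried without Hadoop. The names `VALUE` and `api_prototype` are made up for this example; the pattern lazily matches everything after an = up to the next & or the end of the string.

```python
import re

# Everything after an '=' up to (not including) the next '&' or end of string.
VALUE = re.compile(r'(?<==).*?(?=&|$)')

def api_prototype(url):
    """Return the URL with every parameter value removed."""
    return VALUE.sub('', url)

url = '/api/cli.api.php?act=l&input_k=ip&input_v=XXX.24.XXX.XXX&&&vm=vm&'
print(api_prototype(url))
# prints /api/cli.api.php?act=&input_k=&input_v=&&&vm=&
```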
1.4 Writing the map-reduce scripts
mapper.py is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import re

# Matches a parameter value: everything after '=' up to the next '&' or end of string.
p = re.compile(r'(?<=\=).*?(?=&|$)', re.I)

def read_input(file):
    # Split each input line on whitespace.
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        # words[6] is the request URL; emit "<prototype><TAB>1".
        print("%s%s%d" % (p.sub("", words[6]), separator, 1))

if __name__ == "__main__":
    main()
reducer.py is as follows:
#!/usr/bin/env python
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    # Split each "<key><TAB><count>" line into [key, count].
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby only merges adjacent lines with the same key, so the input must be sorted by key.
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for _, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # Skip groups whose count is not a number.
            pass

if __name__ == "__main__":
    main()
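To check that mapper.py and reducer.py work together, you can run:
$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort -k1,1 | /home/hadoop/reducer.py
PS: The sort in the pipeline stands in for a step that Hadoop does by itself: the framework sorts the mapper output by key before handing it to the reducer.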
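The sorting matters because the reducer relies on `itertools.groupby`, which only merges adjacent items with the same key. The sketch below shows the difference on a few made-up (key, count) pairs; `count_by_key` is a hypothetical helper that mirrors the summing loop in the reducer.

```python
from itertools import groupby
from operator import itemgetter

def count_by_key(pairs):
    """Sum counts for each run of adjacent equal keys."""
    return [(key, sum(n for _, n in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

# Made-up mapper output in arrival order: "/a" appears twice, but not adjacently.
pairs = [("/a", 1), ("/b", 1), ("/a", 1)]

unsorted_result = count_by_key(pairs)
sorted_result = count_by_key(sorted(pairs))

print(unsorted_result)  # "/a" is reported twice, once per run
print(sorted_result)    # "/a" is reported once with the full count
```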
1.5 Running the job
hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar -jobconf io.sort.mb=1024 -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input /user/yaofang.zjl/gutenberg -output gutenberg-out
1.6 Problems encountered
The Apache log is roughly 500 MB. The map phase failed with:
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
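When the log file is small, the job runs normally.
I do not know the cause yet and will keep looking. If you know what is going on, please email me:
1.7 How to solve it?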
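I have not confirmed the cause. One thing worth ruling out, offered as an assumption rather than a diagnosis: a large log is more likely to contain a blank or truncated line, and mapper.py indexes words[6] unconditionally, so such a line would raise IndexError and the script would exit with an error, which Hadoop Streaming reports as a failed map task. The stderr log of the failed task attempt should show whether this is what happened. Below is a minimal sketch of a mapper core that skips short lines instead of crashing; `prototypes` and `URL_FIELD` are names made up for this example.

```python
import re

VALUE = re.compile(r'(?<==).*?(?=&|$)')
URL_FIELD = 6  # index of the request URL after splitting on whitespace

def prototypes(lines, separator='\t'):
    """Yield 'prototype<TAB>1' for each well-formed line, skipping short ones."""
    for line in lines:
        words = line.split()
        if len(words) <= URL_FIELD:
            continue  # blank or truncated line: skip instead of raising IndexError
        yield "%s%s%d" % (VALUE.sub("", words[URL_FIELD]), separator, 1)

# A blank line, a truncated line, and one complete line from the log above.
sample_lines = [
    '',
    'XXX.24.XXX.XXX - - [07/Jun/2012:23:59:59',
    'XXX.24.XXX.XXX - - [07/Jun/2012:23:59:59 +0800] '
    '"GET /api/cli.api.php?act=l&input_k=ip&input_v=XXX.24.XXX.XXX&&&vm=vm& HTTP/1.1" '
    '200 31 "-" "Python-urllib/2.6"',
]

records = list(prototypes(sample_lines))
print(records)
```

Used as a streaming mapper, the same generator would be fed `sys.stdin` and each yielded record printed on its own line.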