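A reduce task runs to a certain stage and then hangs.
On the node running the reduce task, look under logs/userlogs/ for the directory of the corresponding job and task, and open the syslog file inside.
It contains: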
2011-10-21 14:52:07,483 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201110210917_0234_r_000007_1 copy failed: attempt_201110210917_0234_m_000011_0 from sjz134.uniqlcik.com
2011-10-21 14:52:07,484 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
    at sun.net.(HttpClient.java:394)
    at sun.net.(HttpClient.java:529)
    at sun.net.(HttpClient.java:233)
    at sun.net.(HttpClient.java:306)
    at sun.net.(HttpClient.java:323)
    at sun.net.(HttpURLConnection.java:975)
    at sun.net.(HttpURLConnection.java:916)
    at sun.net.(HttpURLConnection.java:841)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
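Cause:
This happens because one tasktracker cannot copy the map output held by another tasktracker.
The reduce shuffle phase fetches and copies the mappers' output records over the network (HTTP).
Each reduce task has a background thread, GetMapCompletionEvents, which receives the list of completed tasks passed along in the heartbeat (from the JobTracker) and saves the data locations relevant to this reduce task into mapLocations. The entries in mapLocations are filtered and deduplicated (the same location may, for various reasons, be delivered more than once) and then stored in the scheduledCopies collection. Several copier threads (5 by default) then copy the data in parallel over HTTP.
Fetching over HTTP requires a hostname. About hostnames:
Note on the slaves file: by default, the names written in it are used only by the master namenode, for ssh, when it starts the slaves. They are not used anywhere else. The heartbeats between tasktracker and jobtracker, and between datanode and namenode, carry each node's own hostname, not the name written in the slaves file.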
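The scheduling described above can be sketched roughly as follows. This is a simplified illustration, not Hadoop's ReduceTask code: the class ShuffleSketch, its methods, and the location strings are all hypothetical, and the HTTP fetch is replaced by a stand-in so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the shuffle scheduling; names are illustrative only.
class ShuffleSketch {
    // Number of parallel copier threads (5 is the default mentioned in the text).
    static final int COPIER_THREADS = 5;

    // Drop duplicate map-output locations while keeping arrival order.
    static List<String> dedupe(List<String> mapLocations) {
        return new ArrayList<>(new LinkedHashSet<>(mapLocations));
    }

    // Stand-in for the HTTP fetch; a real copier would connect to the
    // remote tasktracker by hostname here, which is where the timeout occurs.
    static String copy(String location) {
        return "copied:" + location;
    }

    // Copy every scheduled location using a fixed pool of copier threads.
    static List<String> copyAll(List<String> scheduledCopies) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(COPIER_THREADS);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String location : scheduledCopies) {
                Callable<String> task = () -> copy(location);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // waits for each copy to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder locations; the first one is delivered twice.
        List<String> mapLocations = Arrays.asList(
            "host-a/m_000011", "host-b/m_000012", "host-a/m_000011");
        List<String> scheduledCopies = dedupe(mapLocations);
        System.out.println(copyAll(scheduledCopies));
    }
}
```

If the hostname in a location cannot be reached from the reducer's node, the real fetch fails at the point marked in copy, which matches the ConnectException in the log.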
TaskTracker (DataNode) will send to the JobTracker (NameNode) status messages regularly, which contain its hostname. Consequently, when a Map or Reduce task obtains the addresses of the TaskTrackers (DataNodes) from the JobTracker (NameNode), e.g., for copying the Map output or reading a HDFS block, it will get the hostnames specified in the status messages and talk to the TaskTrackers (DataNodes) using those hostnames.
How can tasktracker and jobtracker, and datanode and namenode, be made to use the names written in the slaves file?
The answer is in the source, /usr/local/hadoop/src/mapred/org/apache/hadoop/mapred/TaskTracker.java:
    localFs = FileSystem.getLocal(fConf);
    if (fConf.get("slave.host.name") != null) {
      this.localHostname = fConf.get("slave.host.name");
    }
    if (localHostname == null) {
      this.localHostname =
        DNS.getDefaultHost
        (fConf.get("mapred.tasktracker.dns.interface","default"),
         fConf.get("mapred.tasktracker.dns.nameserver","default"));
    }
The Hadoop wiki FAQ also covers this:
When writing a New InputFormat, what is the format for the array of string returned by InputSplit#getLocations()?
It appears that DatanodeID.getHost() is the standard place to retrieve this name, and the machineName variable, populated in DataNode.java#startDataNode, is where the name is first set. The first method attempted is to get "slave.host.name" from the configuration; if that is not available, DNS.getDefaultHost is used instead.
So the hostname a node should use can be specified in mapred-site.xml and hdfs-site.xml.
You can bypass all of Hadoop's efforts to automatically figure out the slave's host name by specifying the slave.host.name parameter in the configuration files. If that is set, Hadoop will just take your word for it and use the name you provide.
<property>
  <name>slave.host.name</name>
  <value>hdp-datanode145</value>
</property>
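The lookup order from the TaskTracker excerpt and the FAQ (configured slave.host.name first, DNS-derived default second) can be sketched as below. This is only an illustration under stated assumptions: the class SlaveHostNameSketch and the method resolveHostname are hypothetical, a plain Map stands in for Hadoop's configuration object, and a Supplier stands in for DNS.getDefaultHost.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the hostname lookup order; not Hadoop code.
class SlaveHostNameSketch {
    // Return the configured slave.host.name if present; otherwise fall back
    // to whatever the detection step (DNS lookup in Hadoop) would report.
    static String resolveHostname(Map<String, String> conf,
                                  Supplier<String> dnsDefault) {
        String configured = conf.get("slave.host.name");
        if (configured != null) {
            return configured; // the configured name is taken as given
        }
        return dnsDefault.get(); // only consulted when nothing is configured
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // "detected-by-dns" is a placeholder for the auto-detected name.
        Supplier<String> dnsDefault = () -> "detected-by-dns";

        System.out.println(resolveHostname(conf, dnsDefault));
        conf.put("slave.host.name", "hdp-datanode145");
        System.out.println(resolveHostname(conf, dnsDefault));
    }
}
```

The name set this way is the one other nodes will use to reach this node, so it has to be resolvable and reachable from every tasktracker that runs reducers.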
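Note: each server must still be able to resolve its own hostname. The daemon resolves the local hostname at startup, and the host name recorded in the log shows that the machine's own hostname is still used: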
STARTUP_MSG: host = xxx.xxx.com/172.18.6.143
If the hostname cannot be resolved, startup reports an error:
STARTUP_MSG: host = java.net.UnknownHostException:
This can be tested with Groovy:
groovy -e "print InetAddress.getLocalHost().getHostName()"
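Where Groovy is not installed, the same check can be written in plain Java. The sketch below is a hypothetical helper (LocalHostCheck and describeLocalHost are illustrative names); its output is formatted to resemble the STARTUP_MSG host line, but whether it matches Hadoop's own startup code exactly is an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper that checks whether this machine can resolve its own hostname.
class LocalHostCheck {
    // Returns "hostname/address" on success, or a description of the failure.
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return local.getHostName() + "/" + local.getHostAddress();
        } catch (UnknownHostException e) {
            // The local hostname has no usable entry in DNS or the hosts file.
            return "java.net.UnknownHostException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("host = " + describeLocalHost());
    }
}
```

Running it on each node before starting the daemons shows quickly which machines cannot resolve their own name.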
References:
http://blog.sina.com.cn/s/blog_61ef49250100uul8.html
http://blog.csdn.net/baggioss/article/details/5462593
http://western-skies.blogspot.com/2010/11/fix-for-exceeded-maxfaileduniquefetches.html (may require a proxy to access)