Configuring Nutch as a Site Search Engine, Step by Step (Ubuntu)
(2007-08-29 17:33)
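Category: jsp+java
Download:
The latest version of Nutch can be downloaded from the official Apache page http://www.apache.org/dyn/closer.cgi/lucene/nutch/. The latest version at the time of writing is nutch-0.9, about 65 MB.
Unpack the archive; the launcher is in bin/ and can be used directly. Nutch is an open-source project written in Java, so a JDK (Java 1.4.x or later) must be installed for it to run, and also if you want to modify Nutch. Set the environment variable NUTCH_JAVA_HOME to the installation directory of the Java virtual machine.
In addition, Apache Tomcat 4.x or later must be installed.
Finally, for good results you need at least 1 GB of free disk space and a reasonably fast network connection.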
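The Java prerequisite can be checked before going any further. The following is a minimal sketch for a POSIX shell; the function name check_java_home and the JDK path in the usage comment are illustrative assumptions, not something Nutch provides.

```shell
# check_java_home: hypothetical helper, not part of Nutch.
# Succeeds only if NUTCH_JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${NUTCH_JAVA_HOME:-}" ]; then
        echo "NUTCH_JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$NUTCH_JAVA_HOME/bin/java" ]; then
        echo "no executable java under $NUTCH_JAVA_HOME/bin" >&2
        return 1
    fi
    echo "using Java from $NUTCH_JAVA_HOME"
}

# Example usage (the JDK path is an assumption; substitute your own):
#   export NUTCH_JAVA_HOME=/usr/lib/jvm/java-6-sun
#   check_java_home
```

Putting the export line into ~/.bashrc keeps the variable set across terminal sessions.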
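In the Nutch installation directory, create a text file named myurl and write into it the top-level URL of the site to crawl, i.e. the start page of the crawl.
Taking the site I want to crawl as an example, enter:
Note: keep the trailing "/" consistent with the entry in conf/crawl-urlfilter.txt.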
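Creating the seed file from the shell might look like the sketch below. The URL http://www.example.com/ is a placeholder of mine, not the author's start page; the check on the trailing "/" mirrors the note above.

```shell
# Write the crawl start page into the seed file myurl.
# http://www.example.com/ is a placeholder; replace it with your own start page.
seed_url='http://www.example.com/'

# Warn if the trailing "/" is missing, since the URL filter rule expects it.
case "$seed_url" in
    */) ;;
    *)  echo "warning: $seed_url has no trailing /" >&2 ;;
esac

printf '%s\n' "$seed_url" > myurl
cat myurl
```

Run it from the Nutch installation directory so that myurl ends up where the crawl command expects it.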
Edit the configuration file crawl-urlfilter.txt
Edit conf/crawl-urlfilter.txt and replace the MY.DOMAIN.NAME part with the domain (address) you want to crawl. That is, change
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*sdau.edu.cn/
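To see which URLs such a rule lets through, the pattern can be tried out with grep -E. This is only an approximation: Nutch evaluates the rule with Java regular expressions, and the leading "+" is Nutch's "accept" marker rather than part of the pattern. The helper name url_accepted and the test URLs are my own.

```shell
# The rule from conf/crawl-urlfilter.txt without its leading "+".
pattern='^http://([a-z0-9]*\.)*sdau.edu.cn/'

# url_accepted: hypothetical helper that approximates the filter with grep -E.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in \
    'http://www.sdau.edu.cn/' \
    'http://weekly.sdau.edu.cn/html2006/' \
    'http://www.sdau.edu.cn' \
    'http://www.example.com/'
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Under this approximation the first two URLs are accepted and the last two rejected. The third one fails only because it lacks the trailing "/", which is why the seed file and the filter rule should agree on it.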
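Run the crawl command to fetch the site content
·-dir dirnames sets the directory in which the fetched pages are stored.
·-depth depth sets the link depth of the crawl.
·-delay delay sets the delay between accesses to different hosts, in seconds.
·-threads threads sets the number of threads to start.
Change the current working directory to the Nutch installation directory and run the following command line: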
bin/nutch crawl myurl -dir mydir -depth 2 -threads 4 >&logs/logs1.log
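In the arguments of the command above, myurl is the file we just created, which holds the URLs to crawl; dir specifies the directory where the crawled content is stored; depth is the crawl depth, counted from the site's top-level URL; and threads specifies the number of concurrent threads. The trailing >&logs/logs1.log saves the displayed output to the file logs1.log so that the run can be analyzed afterwards.
1. If mydir already exists before the run, the crawl fails with the error: mydir already exist. Delete the directory first, or specify a different directory for the crawled pages.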
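The check for an existing output directory can be folded into a small wrapper. The sketch below is an assumption-laden convenience, not part of Nutch: safe_crawl and the NUTCH_CMD override are names I made up, and the crawl arguments are copied from the command above. It uses "> file 2>&1", which captures both output streams like the ">&" form but also works in shells other than bash and csh.

```shell
# safe_crawl SEED_FILE OUT_DIR
# Hypothetical wrapper, not part of Nutch. NUTCH_CMD is an assumed override
# for the launcher path and defaults to bin/nutch.
safe_crawl() {
    seed_file=$1
    out_dir=$2
    if [ -e "$out_dir" ]; then
        echo "$out_dir already exists; remove it or choose another directory" >&2
        return 1
    fi
    # Make sure the log directory exists before redirecting into it.
    mkdir -p logs
    # "> file 2>&1" sends both stdout and stderr to the log file.
    "${NUTCH_CMD:-bin/nutch}" crawl "$seed_file" -dir "$out_dir" \
        -depth 2 -threads 4 > logs/logs1.log 2>&1
}

# Example usage, run from the Nutch installation directory:
#   safe_crawl myurl mydir
```

The wrapper refuses to start rather than deleting the directory itself, so an earlier crawl is never removed by accident.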
Edit conf/nutch-site.xml
If this agent is not configured, the crawl fails with the error: Agent name not configured!
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
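Before starting a long crawl, it can be worth confirming that the agent name really is set. A rough sketch, assuming the file is laid out as in the listing above, with each <name> line immediately followed by its <value> line; has_agent_name is a hypothetical helper, and a real XML parser would be more robust.

```shell
# has_agent_name FILE: hypothetical helper, not part of Nutch.
# Assumes <name> and <value> sit on adjacent lines, as in the listing above.
# Succeeds only if http.agent.name is present with a non-empty value.
has_agent_name() {
    grep -A1 '<name>http.agent.name</name>' "$1" \
        | grep -q '<value>[^<][^<]*</value>'
}

# Example usage:
#   has_agent_name conf/nutch-site.xml || echo "set http.agent.name first"
```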
4. Running in Tomcat and viewing the results (deployment succeeded on Windows, but always fails on Linux)
Once the crawl has succeeded, the web application can be deployed on Tomcat.
Copy nutch-0.9.war to the webapps directory under the Tomcat directory.
Edit /webapps/nutch/WEB-INF/classes/nutch-site.xml:
Replace
<nutch-conf>
</nutch-conf>
with
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>Your_crawl_dir_path</value>
</property>
</nutch-conf>
Your_crawl_dir_path is the folder in which the pages were saved during the crawl; in my case it is /usr/local/nutch-0.9/bin/mydir
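Filling in the placeholder can also be scripted with sed. A minimal sketch; set_searcher_dir, the template file name, and the paths in the usage comment are my own assumptions.

```shell
# set_searcher_dir TEMPLATE CRAWL_DIR
# Hypothetical helper: prints TEMPLATE with the Your_crawl_dir_path
# placeholder replaced by CRAWL_DIR.
# Assumes CRAWL_DIR contains no "|", "&" or backslash characters.
set_searcher_dir() {
    sed "s|Your_crawl_dir_path|$2|" "$1"
}

# Example usage (file names and path are placeholders):
#   set_searcher_dir nutch-site.xml.template /path/to/mydir > nutch-site.xml
```

Using an absolute path for the crawl directory avoids depending on the directory Tomcat happens to be started from.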
Finally, open http://localhost:8080/nutch-0.9 in a browser.
Search for: 机构设置 ("organizational structure")
Shandong Agricultural University
... School Overview   Organizational Structure   Admissions and Employment ... College ...
http://www.sdau.edu.cn/sdau2005/department.html (cached) (explain) (anchors) (more from www.sdau.edu.cn)
Shandong Agricultural University
... School Overview   Organizational Structure   Admissions and Employment ... School History ...
http://www.sdau.edu.cn/sdau2005/gk3.html (cached) (explain) (anchors) (more from www.sdau.edu.cn)
Shandong Agricultural University Discipline Development Seminar: Summary of Speeches (5)
... experience in building degree programs ... main experience of the colleges in discipline development ...
http://weekly.sdau.edu.cn/html2006/2006/xbzl/2007_13_29_6940.htm (cached) (explain) (anchors) (more from weekly.sdau.edu.cn)
Shandong Agricultural University Quality Course Development
... the body that receives applications, accepting submissions from each province ... quality course development ...
http://jpkc.sdau.edu.cn/2004-5-12.html (cached) (explain) (anchors)
Electronic Edition Article List: Shandong Agricultural University News Welcomes You
... July 5: the administration Party committee was rated ... strengthening academic conduct and current work ...
http://weekly.sdau.edu.cn/html2006/2006/xxyw/index.htm (cached) (explain) (anchors) (more from weekly.sdau.edu.cn)
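At this point Tomcat may display garbled characters; see my blog, zhongzhouxian.cublog.cn, for how to fix garbled text in Tomcat.
If you run into problems, leave a comment. I pieced this together from many sources online myself.
My own setup still has a few small errors, and I would also like to implement scheduled indexing, so advice from experts is welcome.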


