(2) Crawl strategy: how to design a simple web scraping engine - tomqq

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

/*
 * Download a URL resource and write it into d_file.
 */
public static void getResource(String s_url, String d_file) {
    try {
        URL url = new URL(s_url);
        HttpURLConnection httpurl = (HttpURLConnection) url.openConnection();
        httpurl.connect();
        // try-with-resources closes both streams even if the copy fails midway
        try (BufferedInputStream in = new BufferedInputStream(httpurl.getInputStream());
             FileOutputStream out = new FileOutputStream(d_file)) {
            byte[] buf = new byte[1024];
            int size;
            while ((size = in.read(buf)) != -1) {
                out.write(buf, 0, size);
            }
        }
    } catch (Exception e) {
        System.out.println("Error url: " + s_url);
        // Append the failed URL to the error log
        try (BufferedWriter rw = new BufferedWriter(new FileWriter("e:\\jitapu\\error.txt", true))) {
            rw.write(s_url + "\n");
        } catch (Exception e2) {
            e2.printStackTrace();
        }
        e.printStackTrace();
    }
}

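What remains is designing the crawl strategy, and that depends on the specific site. In theory, you need to find one or several URLs from which the whole site can be traversed; if no such URL exists, no search engine's crawler can do it either. My approach: start from the entry URL and use regular expressions to parse the secondary pages level by level, saving each intermediate step to a file, until you finally reach the target you want. The work is mostly analyzing the HTML and defining the matching rules for it. Every site is different, so I won't go into detail here.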
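The level-by-level approach described above can be sketched roughly as follows. This is only a minimal illustration under assumptions, not the actual engine: the class name `LevelCrawlSketch`, the `fetch` function, the `level-N.txt` file names, and the `example.com` pages and rules are all hypothetical. Each level has its own regex rule whose `group(1)` is taken as the link to the next level, and the URL list of every level is written to a file as the intermediate result.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LevelCrawlSketch {

    // Apply one level's rule to a page; group(1) of the rule is assumed to be the link.
    static List<String> extractLinks(String pageUrl, String html, Pattern rule) {
        List<String> links = new ArrayList<>();
        Matcher m = rule.matcher(html);
        while (m.find()) {
            // Resolve relative links against the page they were found on.
            links.add(URI.create(pageUrl).resolve(m.group(1).trim()).toString());
        }
        return links;
    }

    // Walk the site one level at a time. rules.get(i) extracts the links that
    // lead from level i to level i + 1. Each level's URL list is saved to
    // workDir/level-N.txt (hypothetical naming) as the intermediate result.
    static List<String> crawl(String entryUrl, List<Pattern> rules,
                              Function<String, String> fetch, Path workDir) throws IOException {
        Files.createDirectories(workDir);
        List<String> current = List.of(entryUrl);
        for (int level = 0; level < rules.size(); level++) {
            Set<String> next = new LinkedHashSet<>(); // drops duplicates, keeps order
            for (String url : current) {
                String html = fetch.apply(url);
                if (html == null) {
                    continue; // page could not be fetched; skip it
                }
                next.addAll(extractLinks(url, html, rules.get(level)));
            }
            current = new ArrayList<>(next);
            Files.write(workDir.resolve("level-" + (level + 1) + ".txt"),
                        current, StandardCharsets.UTF_8);
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in for a real site, so the sketch runs without a network.
        Map<String, String> site = Map.of(
            "http://example.com/",
            "<a href=\"/cat/1.html\">c1</a><a href=\"/cat/2.html\">c2</a>",
            "http://example.com/cat/1.html",
            "<a class=\"item\" href=\"/item/a.html\">a</a>",
            "http://example.com/cat/2.html",
            "<a class=\"item\" href=\"/item/b.html\">b</a>");
        List<Pattern> rules = List.of(
            Pattern.compile("<a href=\"(/cat/.*?)\""),
            Pattern.compile("<a class=\"item\" href=\"(.*?)\""));
        Path dir = Files.createTempDirectory("crawl");
        System.out.println(crawl("http://example.com/", rules, site::get, dir));
    }
}
```

In a real run, `fetch` would be replaced by something that downloads the page and returns its HTML, and the rules would have to be written by hand for each site after inspecting its pages.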
The regex looks something like this:

// The literal HTML tags of this pattern are missing from the source; only the wildcards remain.
Pattern p = Pattern.compile(" .*? .*?(.*?) ");
Matcher m = p.matcher(sb.toString());
while (m.find()) {
    content = m.group(1).trim();
}
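For a concrete picture of this kind of rule, here is a small self-contained sketch. The tags, class names, and sample HTML are made up for illustration and are not the pattern used in the post. It skips a title block with a non-greedy `.*?`, captures the body of a content block in `group(1)`, and uses `Pattern.DOTALL` so that `.` also matches line breaks. Unlike the loop above, which keeps only the last match, it collects every match into a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentRegexSketch {

    // Hypothetical rule: skip the title div, then capture the content div's body.
    static final Pattern CONTENT = Pattern.compile(
        "<div class=\"title\">.*?</div>.*?<div class=\"content\">(.*?)</div>",
        Pattern.DOTALL);

    // Return the trimmed body of every content block found in the page.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CONTENT.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"title\">T1</div>\n<div class=\"content\">\n  first body\n</div>\n"
          + "<div class=\"title\">T2</div>\n<div class=\"content\">second body</div>";
        System.out.println(extract(html)); // prints [first body, second body]
    }
}
```

Regex rules like this are brittle: nested tags of the same name or a small change in the site's markup will break them, so each rule needs to be checked against real pages from the target site.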
