
MANIAC

Time gone by, rushing waters, a wandering life, burning dreams, a vast and boundless sea of people, and me, drifting through it...
  maniac.cublog.cn

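About the author
When I was little I always thought I was different from everyone else. It turns out that once you grow up, everyone is the same: you either get busy living or get busy dying. Those are the only two choices.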

QQ    396002399
MSN   00ahui@gmail.com
EMAIL 00ahui@gmail.com

A Big Bug in Heritrix?
I tested Heritrix's ARCWriter and ARCReader, and ran into a serious problem when reading Chinese content back from an ARC file.

I defined the page and the HTTP header:

final String PAGE = "<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></HTML>";
final String CONTENT = "HTTP/1.1 200 OK\r\n"
+ "Content-Type: text/html\r\n\r\n" + PAGE;

and then wrote it to an ARC file in a loop.

But there are problems when reading it back.
I used ARCRecord.dump to dump the content to the console, and got this:

HTTP/1.1 200 OK
Content-Type: text/html

<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></H

The last 4 bytes ("TML>") disappear.

I also used ARCRecord.dump(OutputStream) and ArchiveRecord.read to dump the content to a file, and got the same problem.
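One possible cause worth checking, offered as an assumption and not verified against Heritrix itself: the test code passes CONTENT.length() as the record length, which is a count of characters, while the buffer holds CONTENT.getBytes(), which is a count of bytes in the platform default charset. The page contains four Chinese characters. In GBK (a likely default on a Chinese Windows machine, which is also an assumption) each takes 2 bytes, so the byte array is 4 bytes longer than the character count. The sketch below is plain Java with no Heritrix calls; the class and method names are made up for illustration, and the Chinese characters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LengthCheck {
    // Same page as in the post; the four escapes are its Chinese characters.
    static final String PAGE = "<HTML><HEAD></HEAD><BODY> TEST test "
            + "\u6d4b\u8bd5\u4e2d\u6587 </BODY></HTML>";
    static final String CONTENT = "HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/html\r\n\r\n" + PAGE;

    // How many more bytes than chars the string needs in the given charset.
    static int extraBytes(String s, Charset cs) {
        return s.getBytes(cs).length - s.length();
    }

    // Keep only the first s.length() bytes, as a reader that trusts a
    // char-count length would, and decode what is left.
    static String truncateToCharCount(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(Arrays.copyOf(bytes, s.length()), cs);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 extra bytes: "
                + extraBytes(CONTENT, StandardCharsets.UTF_8));
        // GBK ships with standard JDK builds but is not guaranteed everywhere.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK extra bytes: " + extraBytes(CONTENT, gbk));
            System.out.println(truncateToCharCount(CONTENT, gbk));
        }
    }
}
```

If GBK is available, the truncated output ends in `</BODY></H`, the same tail as the dump above. That is consistent with a char-count length being used as a byte count, though it does not rule out a problem inside ARCReader.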

My code:
------------------------
    public static void testARCWriter() throws IOException {
        final AtomicInteger SERIAL_NO = new AtomicInteger();
        final File[] ARC_DIRs = { new File("d:/tmp/arc1")};
        final String PREFIX = "TMP";
        final boolean COMPRESS = true;

        final String URL = "http://192.168.0.31/test.html";
        final String TYPE = "text/html";
        final String HOST = "192.168.0.31";
        final long DATE = new Date().getTime();
        final String PAGE = "<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></HTML>";
        final String CONTENT = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n\r\n" + PAGE;

        ARCWriter aw = new ARCWriter(SERIAL_NO, Arrays.asList(ARC_DIRs),
                PREFIX, COMPRESS, DEFAULT_MAX_ARC_FILE_SIZE);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(CONTENT.getBytes());

        // write first record
        aw.write(URL, TYPE, HOST, DATE, CONTENT.length(), baos);

        for (int i = 0; i < 30; i++) {
            long start = aw.getPosition();
            aw.write(URL, TYPE, HOST, DATE, CONTENT.length(), baos);
            long end = aw.getPosition();
            System.out.println("record " + i + " --> file:"
                    + aw.getFile().getAbsolutePath() + "\t offset: " + start
                    + "\t size:" + (end - start));
        }

        aw.close();
    }
       
    // NOTE: change the file name and the record offset below
    // to match the output of your own testARCWriter() run.
   
    public static void testARCReader() throws IOException {
        final String arcFile = "d:\\tmp\\arc1\\TMP-20070912062413-00000.arc.gz";
        ARCReader reader = ARCReaderFactory.get(new URL("file:////" + arcFile));
        ARCRecord r = (ARCRecord) reader.get(309);
        System.out.println(r.getBodyOffset());
        System.out.println(r.getHeader().getDate());
        System.out.println(r.getHeader().getLength());
        System.out.println(r.getHeader().getOffset());
        System.out.println(r.getHeader().getMimetype());
        System.out.println(r.getHeader().getUrl());
        // r.dumpHttpHeader();
        //r.skipHttpHeader();
        r.dump();
        r.close();
/*
Dumping to a file instead gives the same problem.
But when I add several more r.dump() calls after the first one, each call
returns one more character, and the whole content can eventually be dumped
before an exception is thrown.
*/

     }
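If the length argument of ARCWriter.write is meant to be a byte count (an assumption; the post does not confirm the contract), a safer pattern is to encode the content once with an explicit charset and take the length from the buffer itself. The sketch below is plain Java; RecordBuffer and encode are hypothetical names, and the Heritrix call appears only as a comment because it has not been tested here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RecordBuffer {
    // Encode the record content once, with an explicit charset, so the
    // result does not depend on the platform default encoding.
    static ByteArrayOutputStream encode(String content, Charset cs) {
        byte[] bytes = content.getBytes(cs);
        ByteArrayOutputStream baos = new ByteArrayOutputStream(bytes.length);
        baos.write(bytes, 0, bytes.length);
        return baos;
    }

    public static void main(String[] args) {
        // The four escapes are the Chinese characters from the test page.
        String content = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n\r\n"
                + "<HTML><HEAD></HEAD><BODY> TEST test "
                + "\u6d4b\u8bd5\u4e2d\u6587 </BODY></HTML>";

        ByteArrayOutputStream baos = encode(content, StandardCharsets.UTF_8);
        int recordLength = baos.size(); // bytes actually in the buffer

        System.out.println("chars=" + content.length()
                + " bytes=" + recordLength);

        // In the writer test above, the call would then become (unverified):
        // aw.write(URL, TYPE, HOST, DATE, recordLength, baos);
    }
}
```

Taking the length from baos.size() keeps the declared length and the written bytes in step whatever charset is chosen.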

 TAG Heritrix ARC archive
Posted: 2007-09-12, modified: 2007-09-13 08:22, 529 views, 2 comments


Comments
User: site visitor  Time: 2007-10-25 20:48:22  IP address: 194.62.232.★
Are there other file formats apart from ARC?

User: nkmaniac  Time: 2007-10-26 20:43:51  IP address: 60.28.145.★
Yes, there are many open source tools you can use.
Search for them on SourceForge or openopen!
