|
I tested ARCWriter and ARCReader from Heritrix, and I ran into a serious problem when reading Chinese content back from an ARC file.
I defined the page and HTTP header like this:
final String PAGE = "<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></HTML>";
final String CONTENT = "HTTP/1.1 200 OK\r\n"
+ "Content-Type: text/html\r\n\r\n" + PAGE;
and then wrote it to an ARC file in a loop.
But there are problems when reading it back.
I used ARCRecord.dump() to dump the content to the console, and got this:
HTTP/1.1 200 OK
Content-Type: text/html
<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></H
The last 4 bytes ("TML>") are missing.
I also used ARCRecord.dump(OutputStream) and ArchiveRecord.read to dump the content to a file, and got the same problem.
My code:
------------------------
public static void testARCWriter() throws IOException {
    final AtomicInteger SERIAL_NO = new AtomicInteger();
    final File[] ARC_DIRs = { new File("d:/tmp/arc1") };
    final String PREFIX = "TMP";
    final boolean COMPRESS = true;

    final String URL = "http://192.168.0.31/test.html";
    final String TYPE = "text/html";
    final String HOST = "192.168.0.31";
    final long DATE = new Date().getTime();
    final String PAGE = "<HTML><HEAD></HEAD><BODY> TEST test 测试中文 </BODY></HTML>";
    final String CONTENT = "HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/html\r\n\r\n" + PAGE;

    ARCWriter aw = new ARCWriter(SERIAL_NO, Arrays.asList(ARC_DIRs),
            PREFIX, COMPRESS, DEFAULT_MAX_ARC_FILE_SIZE);

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    baos.write(CONTENT.getBytes());

    // write first record
    aw.write(URL, TYPE, HOST, DATE, CONTENT.length(), baos);

    for (int i = 0; i < 30; i++) {
        long start = aw.getPosition();
        aw.write(URL, TYPE, HOST, DATE, CONTENT.length(), baos);
        long end = aw.getPosition();
        System.out.println("record " + i
                + " --> file:" + aw.getFile().getAbsolutePath()
                + "\t offset: " + start
                + "\t size:" + (end - start));
    }

    aw.close();
}

// NOTE: change the file name and offset below to match your own output.
public static void testARCReader() throws IOException {
    final String arcFile = "d:\\tmp\\arc1\\TMP-20070912062413-00000.arc.gz";
    ARCReader reader = ARCReaderFactory.get(new URL("file:////" + arcFile));
    ARCRecord r = (ARCRecord) reader.get(309);
    System.out.println(r.getBodyOffset());
    System.out.println(r.getHeader().getDate());
    System.out.println(r.getHeader().getLength());
    System.out.println(r.getHeader().getOffset());
    System.out.println(r.getHeader().getMimetype());
    System.out.println(r.getHeader().getUrl());
    // r.dumpHttpHeader();
    // r.skipHttpHeader();
    r.dump();
    r.close();
    /*
     Dumping to a file instead gives the same problem.
     But when I add several more r.dump() calls after the first r.dump(),
     I get one more char each time, and the whole content can be dumped
     until an exception happens.
    */
}
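
One thing that may be worth ruling out (a guess, not a confirmed diagnosis): the writer code above passes CONTENT.length() as the record length, which counts characters, while baos holds the bytes produced by CONTENT.getBytes(). For text with Chinese characters these two numbers differ. The standalone sketch below uses only the JDK and no Heritrix classes. The class name LengthCheck and its helper methods are made up for illustration. It compares the character count with the byte count in the platform default charset and in UTF-8, where each of the four Chinese characters takes three bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Heritrix.
class LengthCheck {
    // Same page as in the post; the Chinese text is written with Unicode
    // escapes so the source file encoding does not matter.
    static final String PAGE =
            "<HTML><HEAD></HEAD><BODY> TEST test \u6d4b\u8bd5\u4e2d\u6587 </BODY></HTML>";
    static final String CONTENT = "HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/html\r\n\r\n" + PAGE;

    // Number of UTF-16 chars, which is what String.length() returns.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes after encoding with the given charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("chars:                 " + charCount(CONTENT));
        System.out.println("bytes (" + def + "): " + byteCount(CONTENT, def));
        System.out.println("bytes (UTF-8):         "
                + byteCount(CONTENT, StandardCharsets.UTF_8));
    }
}
```

If the default-charset byte count turns out to be exactly 4 larger than the char count, that would match the 4 missing bytes, and passing the byte count (for example baos.size()) as the length might be worth trying. Whether ARCWriter expects the length in bytes is an assumption here and should be checked against the Heritrix documentation.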
|