Converting HTML to XML - flyoversky - ChinaUnix blog
Converting HTML to XML

Category: Java

2005-11-11 17:26:41

The conversion is done with NekoHTML.

NekoHTML download address:

Source code:

html2xml.java

import org.w3c.dom.Node;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.html.HTMLDocument;
import org.xml.sax.InputSource;
import org.apache.html.dom.HTMLDocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import java.util.Properties;
import java.util.Calendar;
import java.io.File;
import java.io.InputStreamReader;
import java.io.InputStream;
import java.io.FileReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class html2xml {
    // Entry point: html2xml path fromfile [outputfile]
    public static void main(String[] args) {
        if (args != null && args.length >= 2) {
            try {
                String path = args[0];
                // "true" means path is a local file; anything else means it is a URL
                boolean fromFile = Boolean.parseBoolean(args[1]);
                // Fall back to a timestamped name when no output file is given
                String outputFile = args.length > 2 ? args[2] : getFileName();
                html2xml h2x = new html2xml();
                DocumentFragment df = h2x.getSourceNode(path, fromFile);
                if (df == null) {
                    System.out.println("path must not be empty");
                    return;
                }
                File file = new File(outputFile);
                if (file.exists())
                    file.delete();
                h2x.genXmlFile(df, file);
                System.out.println("generate " + file.getCanonicalPath() + " successfully!");
            } catch (Exception e) {
                e.printStackTrace();
            }
        } else {
            System.out.println("usage:html2xml path fromfile [outputfile]");
            System.out.println("html2xml <url> false D:/tempfile.xml");
            System.out.println("html2xml D:/htmlfile.htm true D:/tempfile.xml");
            System.out.println("--");
        }
    }

    // Serializes the DOM node to the given file as GB2312-encoded XML.
    // The XML declaration is omitted, so readers of the file must already know its encoding.
    public void genXmlFile(Node output, File file) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        Properties props = new Properties();
        props.setProperty("encoding", "GB2312");
        props.setProperty("method", "xml");
        props.setProperty("omit-xml-declaration", "yes");
        transformer.setOutputProperties(props);
        // try-with-resources closes the stream even if the transform fails
        try (java.io.FileOutputStream fos = new java.io.FileOutputStream(file)) {
            transformer.transform(new DOMSource(output), new StreamResult(fos));
        }
    }

    // Parses HTML from a local file (fromfile == true) or from a URL into a DOM fragment.
    // Returns null when path is null or blank.
    public DocumentFragment getSourceNode(String path, boolean fromfile) throws Exception {
        if (path == null || path.trim().equals("")) {
            return null;
        }
        DOMFragmentParser parser = new DOMFragmentParser();
        HTMLDocument document = new HTMLDocumentImpl();
        DocumentFragment fragment = document.createDocumentFragment();

        if (fromfile) {
            // FileReader decodes the file with the platform default charset
            try (FileReader fr = new FileReader(new File(path))) {
                parser.parse(new InputSource(fr), fragment);
            }
        } else {
            HttpURLConnection con = (HttpURLConnection) new URL(path).openConnection();
            // The page is assumed to be GBK-encoded
            try (InputStream inputs = con.getInputStream();
                 InputStreamReader isr = new InputStreamReader(inputs, "GBK")) {
                parser.parse(new InputSource(isr), fragment);
            } finally {
                con.disconnect();
            }
        }
        return fragment;
    }

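    // Builds a default output name: "tmp" followed by yyyyMMddHHmmssSSS, each field zero-padded.
    // Calendar.MONTH is zero-based, hence the "< 9" test and the "+ 1" below.
    public static String getFileName() throws Exception{
        Calendar c=Calendar.getInstance();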
        String name="tmp"+c.get(Calendar.YEAR)+(c.get(Calendar.MONTH)<9?"0":"")+
                (c.get(Calendar.MONTH)+1)+(c.get(Calendar.DAY_OF_MONTH)<10?"0":"")+
                c.get(Calendar.DAY_OF_MONTH)+(c.get(Calendar.HOUR_OF_DAY)<10?"0":"")+
                c.get(Calendar.HOUR_OF_DAY)+(c.get(Calendar.MINUTE)<10?"0":"")+
                c.get(Calendar.MINUTE)+(c.get(Calendar.SECOND)<10?"0":"")+
                c.get(Calendar.SECOND)+(c.get(Calendar.MILLISECOND)<10?"0":"")+
                (c.get(Calendar.MILLISECOND)<100?"0":"")+c.get(Calendar.MILLISECOND);
        return name;
    }
}
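
The serialization step in genXmlFile can be tried in isolation, without NekoHTML on the classpath. The following is a minimal sketch, not part of the original program: the class name FragmentToXmlSketch and the helpers toXml and sampleFragment are made up for illustration, and the fragment is built by hand with the JDK's DOM API instead of being parsed from HTML. It writes to a string rather than a file, so the encoding property is left out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, for illustration only
class FragmentToXmlSketch {

    // Serializes a DOM node with an identity transform, using the same
    // "method" and "omit-xml-declaration" settings as genXmlFile
    static String toXml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Builds a small fragment by hand; in html2xml the fragment comes from the HTML parser
    static DocumentFragment sampleFragment() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element p = doc.createElement("P");
        p.setAttribute("class", "demo");
        p.appendChild(doc.createTextNode("a < b"));
        fragment.appendChild(p);
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(sampleFragment()));
    }
}
```

The text node's "<" is escaped on output, which is what makes the result usable as XML even though the in-memory text is unescaped.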

Directory layout:

html2xml
├─classes
├─lib
└─src

In the html2xml directory, create a batch file run.bat with the following content: java -cp "./lib/nekohtml.jar;./lib/xercesImpl.jar;./lib/xml-apis.jar;./lib/commons-logging.jar;./lib/commons-discovery.jar;./lib/saaj.jar;./classes" html2xml %1 %2 %3

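Usage:

1. Converting a file. On the command line, enter: run D:\test.html true D:\test.xml. The first argument is the HTML file to convert, the second is a flag that must be "true" for file conversion, and the third is the output XML file. If the third argument is omitted, the output is written to a file with a generated name (tmp followed by a timestamp).

2. Converting from a web address. On the command line, enter: run <URL> false D:\sina.xml. The first argument is the address of the page, the second must be set to false, and the third is the same as above.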
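
The generated default name from item 1 can also be produced with a date pattern instead of manual zero-padding. This is a sketch of an alternative, not the code the program uses: the class TimestampNameSketch and the method fileNameFor are hypothetical names, and the pattern is assumed to match what getFileName builds (year, month, day, hour, minute, second, three-digit milliseconds).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper class, for illustration only
class TimestampNameSketch {

    // Returns "tmp" + yyyyMMddHHmmssSSS for the given instant, in the default time zone.
    // Locale.ROOT keeps the digits ASCII regardless of the platform locale.
    static String fileNameFor(Date when) {
        return "tmp" + new SimpleDateFormat("yyyyMMddHHmmssSSS", Locale.ROOT).format(when);
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(new Date()));
    }
}
```

A new SimpleDateFormat is created on each call because the class is not thread-safe; for a command-line tool the cost is negligible.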

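Because NekoHTML is quite fault-tolerant, the conversion succeeds in most cases. Two problems are worth noting. First, an error occurs when the target page contains more than one html tag. Second, and this also turned up in actual use, an attribute written as attribute='a'b' cannot be parsed, although markup that badly written is rare.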

NekoHTML itself has to be downloaded separately.
