2008-05-04 19:57:57

Build a SAX parser with XML Schemas

SAX (Simple API for XML) parsers are convenient tools for parsing XML, but they are hard to develop and maintain. This article shows how to use the information in an XML Schema to generate skeleton source code for a SAX parser, and how to complete that skeleton so it parses your XML.

A Simple API for XML (SAX) parser is an invaluable tool for parsing XML files, especially large input files that cannot be loaded into main memory. A SAX parser also helps when you have a slow input stream, such as an Internet connection, and need to process bytes as soon as they arrive instead of waiting for the complete input. As a bonus, a well-designed SAX parser is generally faster than processing a DOM (Document Object Model) tree: you need only one pass over the XML data, as opposed to the two passes needed with a DOM tree (one to build the tree and one to do the processing).

Unfortunately, a SAX parser can be difficult to develop because of its event-driven nature. In this article, I create a source code generator that helps you develop a SAX parser easily. Note: I don't explain SAX in detail here; see Resources below for some excellent references.

SAX reviewed

SAX is a standard API that parses an XML input stream, such as a file or network connection, and triggers events in an event-handler class. Many different SAX parser implementations are available for Java. In my examples here, I use Xerces from the Apache XML Project, one of the most popular parser implementations. Listings 1 and 2 below show an XML file and a SAX event handler, respectively. (You can download all source code and examples for this article from Resources.)

Listing 1.
Example XML

    <!-- Element names are those used by the handler in Listing 2;
         the value of the company name attribute is not preserved. -->
    <company name="...">
      <employees>
        <employee>
          <name>
            <first>John</first>
            <last>Dole</last>
          </name>
          <office>1-50</office>
          <telephone>123456</telephone>
        </employee>
        <employee>
          <name>
            <first>Jane</first>
            <last>Dole</last>
          </name>
          <office>1-51</office>
          <telephone>123457</telephone>
        </employee>
      </employees>
    </company>

Listing 2. SAX handler

    public void startElement(java.lang.String uri,
                             java.lang.String localName,
                             java.lang.String qName,
                             Attributes attributes) throws SAXException {
        // Discard text collected for the previous element.
        text.reset();
        if (qName.equals("company")) {
            String name = attributes.getValue("name");
            String header = "Employee Listing For " + name;
            System.out.println(header);
            System.out.println();
        }
    }

    public void endElement(java.lang.String uri,
                           java.lang.String localName,
                           java.lang.String qName) throws SAXException {
        // Remember each simple element's text until the employee is complete.
        if (qName.equals("first")) {
            firstName = getText();
        }
        if (qName.equals("last")) {
            lastName = getText();
        }
        if (qName.equals("office")) {
            office = getText();
        }
        if (qName.equals("telephone")) {
            telephone = getText();
        }
        if (qName.equals("employee")) {
            System.out.println(office + "\t " + firstName + "\t" + lastName + "\t" + telephone);
        }
    }

The SAX handler above merely prints the XML file's data to the standard output device. It prints a header line containing the company name, followed by tab-delimited employee data.

As you can see from Listing 2, parsing even a simple XML file can produce a significant amount of source code. SAX's event-driven (as opposed to document-driven) nature also makes the source code difficult to maintain and debug, because you must be constantly aware of the parser's state when writing SAX code. Writing a SAX parser for complex document definitions can prove even more demanding; see Resources for challenging real-life examples. We must reduce the work involved in writing an event-handler structure so we have more time to work on the actual processing.

XML Schemas

To lighten our workload, we can automate most of the process of writing the event-handler structure.
Luckily, the computer already knows the format of the XML file we will parse; the format is defined in a computer-readable DTD (document type definition) or in an XML Schema. I explore ways to use this knowledge to generate source code that removes the sting from SAX parser development. For this article, I rely on XML Schemas only. Though younger than DTDs, the XML Schema standard will probably replace DTDs in the future. You can easily convert your existing DTD files to XML Schemas with the help of some simple tools.

The first step towards building our code generator is to load the information contained in the XML Schema into a memory model. For this article, I use a simple memory model that defines only the XML entity and attribute names, as well as the entities' relationships to each other. This custom model eases the code generation process. My simplified memory model consists of two classes: Element and Elements. The former stores information for one entity, and the latter manages a list of entities.

Next, we need a mechanism that populates the memory model from an XML Schema. Because an XML Schema is also an XML file, you can use a SAX parser to parse the XML Schema and populate the memory model. In this case, a SAX parser is a good choice: you need to handle events only for the entity and attribute definitions you're interested in, and you ignore extra information by letting the unneeded SAX events pass without handling them. See Resources for the XML Schema parser's full source code. Once we load the XML Schema information into memory, we can start generating source code for our new SAX parser.
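The loading step described above can be sketched roughly as follows. This is not the article's Element/Elements code (that is only available from Resources); the class names, fields, and the load() method are hypothetical, and the sketch assumes a schema whose element declarations are nested inline. It uses the SAX parser bundled with the JDK rather than a separately installed Xerces.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout; the article's own model classes are not shown in the text.
class SchemaModelSketch {

    /** One XML entity: its name, its attribute names, and its child entities. */
    static final class Element {
        final String name;
        final List<String> attributes = new ArrayList<>();
        final List<Element> children = new ArrayList<>();

        Element(String name) { this.name = name; }

        boolean hasChildren() { return !children.isEmpty(); }
    }

    /** Builds the model from xs:element and xs:attribute declarations; ignores everything else. */
    static final class SchemaHandler extends DefaultHandler {
        private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";
        private final Deque<Element> open = new ArrayDeque<>();
        Element root;

        // Assumption: a ref="..." declaration is recorded by name only, not resolved.
        private static String nameOrRef(Attributes atts) {
            String name = atts.getValue("name");
            return name != null ? name : atts.getValue("ref");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (!XSD_NS.equals(uri)) return;
            if (localName.equals("element")) {
                Element element = new Element(nameOrRef(atts));
                if (open.isEmpty()) {
                    if (root == null) root = element;   // first top-level declaration
                } else {
                    open.peek().children.add(element);  // nested declaration
                }
                open.push(element);
            } else if (localName.equals("attribute") && !open.isEmpty()) {
                open.peek().attributes.add(nameOrRef(atts));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Every xs:element start pushed exactly one entry, so this pop is balanced.
            if (XSD_NS.equals(uri) && localName.equals("element")) open.pop();
        }
    }

    /** Parses schema text and returns the first top-level element declaration. */
    static Element load(String schemaText) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the XSD namespace
            SchemaHandler handler = new SchemaHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(schemaText)), handler);
            return handler.root;
        } catch (Exception e) {
            throw new IllegalStateException("could not read schema", e);
        }
    }

    public static void main(String[] args) {
        // A tiny made-up schema, not the article's example1.xsd.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='company'><xs:complexType><xs:sequence>"
                + "<xs:element name='employees' type='xs:string'/>"
                + "</xs:sequence><xs:attribute name='name' type='xs:string'/>"
                + "</xs:complexType></xs:element></xs:schema>";
        Element root = load(xsd);
        System.out.println(root.name + " " + root.attributes + " children=" + root.children.size());
    }
}
```

Events for xs:complexType, xs:sequence, and anything else simply fall through, which is the "let unneeded events pass" idea from the paragraph above.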
Source code templates

To generate the SAX parser's source code, I use a text-based template engine, which lets me easily insert the memory model's information into source code templates. My favorite text-based template engine is Velocity from Apache's Jakarta project. You can easily change my source code templates to suit your needs; doing so requires only a text editor and a basic knowledge of Velocity's syntax.

My SAX parser source code templates generate a separate event handler, or Java class, for each complex XML entity. I define a complex entity as one that might contain other XML entities. Methods inside the complex entities' event handlers handle simple entities, that is, entities that contain only text content and/or attributes. Because the code is separated into multiple classes, you can more easily find the right place to insert custom source code. The separate event handlers also make the code easier to maintain, should any bugs occur later.

The first source code template is for the class that handles events for complex XML entities. It creates methods for each child entity as well as temporary storage for XML attributes:

Listing 3.
Event handler template

    package ${package};

    // JDK Classes
    import java.util.*;
    import java.io.*;

    // Xerces Classes
    import org.xml.sax.*;
    import org.apache.xerces.parsers.*;
    import org.xml.sax.helpers.DefaultHandler;

    public class ${element.Name}handler extends DefaultHandler {
        private CharArrayWriter text = new CharArrayWriter ();
        private Stack path;
        private Map params;
        private DefaultHandler parent;
        private SAXParser parser;

        public ${element.Name}handler(Stack path, Map params, Attributes attributes,
                                      SAXParser parser, DefaultHandler parent) throws SAXException {
            this.path = path;
            this.params = params;
            this.parent = parent;
            this.parser = parser;
            start(attributes);
        }

    ## Some code omitted

        public void endElement(java.lang.String uri,
                               java.lang.String localName,
                               java.lang.String qName) throws SAXException {
            if (qName.equals("${element.Name}")) {
                // This entity is finished: hand control back to the parent handler.
                end();
                path.pop();
                parser.setContentHandler (parent);
            }
    #foreach ($child in $element.Children)
    #if ($child.hasChildren())
    #else
            if (qName.equals("${child.Name}")) end${child.Name} ();
    #end
    #end
        }

The second class template is the entry point for the SAX parser and is responsible for initialization tasks and for calling the root element's handler:

Listing 4. The parser template

    ## Some code omitted

        public void startElement(java.lang.String uri,
                                 java.lang.String localName,
                                 java.lang.String qName,
                                 Attributes attributes) throws SAXException {
            if (qName.equals("${elements.RootElement.Name}")) {
                DefaultHandler handler =
                    new ${elements.RootElement.Name}handler(path,params,attributes,parser,this);
                path.push ("${elements.RootElement.Name}");
                parser.setContentHandler (handler);
            }
        }

    ## Some code omitted

The controller class

Now we simply put everything together in a controller class (download the class's source code from Resources). A controller class handles the process's logic (see the Model-View-Controller, or MVC, model).
Called Generator, the controller class requires two command-line parameters. The first parameter indicates the XML Schema to use, and the second gives the output classes' package name. Generator then loads the XML Schema into memory and executes the source code templates.

With the Generator class, you can easily create a SAX parser. To illustrate how to use the SAX generator, let's create a SAX parser for Listing 1's XML. I include that listing's XML Schema (example1.xsd) in Resources, as well as the SAX generator's source and binary versions. Before you use the SAX generator's prepackaged binary version, read the readme.txt file for usage directions and the required external jar libraries. Also, make sure you correctly set your $JAVA_HOME environment variable. Now you can use generate.bat (for Windows machines) or generate.sh (for Unix/Linux machines) to start the SAX generator. To create a SAX parser for example1.xsd, execute one of the following on the command line:

For Windows:

    generate examples\example-1.xsd com.mycompany.package

For Unix/Linux:

    ./generate.sh examples/example-1.xsd com.mycompany.package

The first parameter indicates the XML Schema the program should use to build the SAX parser; the second parameter indicates the Java package name for the new classes.

This process gives you a set of new classes that form the basis of a new SAX parser. They are located in your SAX generator's output/ subdirectory. Assuming you used example1.xsd, you will have classes called CompanyHandler, EmployeesHandler, EmployeeHandler, and NameHandler.
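To make the generation step more concrete, here is a rough sketch of how one handler skeleton could be produced for a complex entity. It is an illustration only: it swaps Velocity for plain string building so that it runs with nothing but the JDK, and the names GeneratorSketch, handlerClassName(), and renderHandler() are hypothetical, as is the package name passed in. The capitalized class names follow the ones listed above (CompanyHandler, NameHandler), and the end<child>() method names follow the template in Listing 3.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the template step; the real generator uses Velocity templates.
class GeneratorSketch {

    /** Maps an element name such as "company" to a class name such as "CompanyHandler". */
    static String handlerClassName(String elementName) {
        return Character.toUpperCase(elementName.charAt(0)) + elementName.substring(1) + "Handler";
    }

    /**
     * Renders an empty handler class for one complex entity. Each simple child
     * gets an empty end<child>() method for the developer to fill in later.
     */
    static String renderHandler(String packageName, String elementName, List<String> simpleChildren) {
        StringBuilder src = new StringBuilder();
        src.append("package ").append(packageName).append(";\n\n");
        src.append("import org.xml.sax.Attributes;\n");
        src.append("import org.xml.sax.helpers.DefaultHandler;\n\n");
        src.append("public class ").append(handlerClassName(elementName))
           .append(" extends DefaultHandler {\n");
        src.append("    public void start(Attributes attributes) {}\n");
        src.append("    public void end() {}\n");
        for (String child : simpleChildren) {
            src.append("    public void end").append(child).append("() {}\n");
        }
        src.append("}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        // Package name comes from the command line, mirroring the generator's second parameter.
        String packageName = args.length > 0 ? args[0] : "com.example.parser";
        System.out.print(renderHandler(packageName, "name", Arrays.asList("first", "last")));
    }
}
```

A template engine earns its keep once the skeleton grows beyond a few lines, because the template then still looks like the Java file it produces.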
Use the generated SAX parser

To use the generated SAX parser, you must create an instance of the entry-point class, named Parser by default, and call its parse() method. Listing 5 shows you how to initiate the SAX parser:

Listing 5. Initiate the SAX parser

    public static void main (String[] args) throws Exception {
        Parser parser = new Parser();
        // The XML file to parse is given as the first command-line argument.
        FileInputStream fis = new FileInputStream (args[0]);
        parser.parse (fis);
    }

At this stage, the newly generated classes do nothing; we must write implementations for the relevant methods. We must write an implementation for the CompanyHandler class to print the company heading. Currently, the CompanyHandler class has only empty methods.

The SAX parser calls this handler's start() method when it encounters the <company> element; the end() method executes when the closing </company> is parsed. The startEmployees() method executes when the parser enters an <employees> element. In this case, we want to print the company name when the <company> element starts, so we must add code to the start() method. Note that the SAX generator has already declared local variables for the entity's attributes. After we add code to print the header line, the method looks like this:

Listing 6. Print the company header

    public void start (Attributes attributes) throws SAXException {
        String Name = attributes.getValue("Name");
        System.out.println ("Employee Listing For " + Name);
        System.out.println ();
    }

Before we can print the employee information, we first must handle the <name> entity.
Because this entity is complex, it has a separate handler class (NameHandler) and needs some way to communicate name information back to the EmployeeHandler class. For this purpose, I created a global map object, called params, which allows you to pass information from one handler to another. The Parser class automatically creates this map for you; to use the map, you simply need to add some information to it.

Now we need to add the text data enclosed in the XML elements to the params map. To access an XML element's text content, use the getText() method. A utility method, getText() returns the text enclosed in an entity with leading and trailing white-space characters removed. The following code snippet adds the text from <first> and <last> to the params map:

Listing 7. Add information to global params

    public void endfirst () throws SAXException {
        params.put ("firstname",getText());
    }

    public void endlast () throws SAXException {
        params.put ("lastname",getText());
    }

Now we can extract this information from the EmployeeHandler class. Because we must wait until the entire <employee> entity has been parsed (remember the <name> tag has to be handled before we print the employee information), we must add the following code to the end() method:

Listing 8.
The employee handler

    public void end () throws SAXException {
        // Take the name parts that NameHandler left in the shared map.
        String firstname = params.remove ("firstname").toString();
        String lastname = params.remove ("lastname").toString();
        System.out.println (office + "\t " + firstname + "\t" + lastname + "\t" + telephone);
    }

If you need to do some processing that depends on the context in which an element occurs, you can use the Path object to find the current entity's context. Path is a stack containing the path of element names from the root element to the current element. Don't alter this stack directly, as doing so puts the SAX parser into an invalid state; treat it as read-only information.

Now you can recompile the classes and run the new program. You will see the same output as Listing 2's example, with much less work and more readable source code.

Generate a simple parser

Because a computer can discover an XML file's structure by parsing the XML Schema, a Generator class can go a long way towards creating a SAX parser's structure. In this article, you learned how to create an easy-to-use skeleton SAX parser with a SAX code generator, and also saw how to use this parser to parse XML files. The code generator saves you hours of SAX parser development time, and its structure gives you more readable and maintainable source code.

You have also received a set of source code templates that you can modify to meet your specific needs. With these templates, you can create any SAX structure. Of course, you can also apply these templates to problems other than SAX; I'd like to hear about any new and interesting ways in which you use the code generator.
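As a closing illustration, the hand-off between handlers that the generated classes rely on can be sketched in a self-contained form. This is a simplified approximation, not the generated code: the class names and the parse() helper are hypothetical, the outer handler covers both the list and the employee level, and it uses the JDK's XMLReader in place of the Xerces SAXParser class imported by the templates. What it keeps from the article is the pattern itself: a child handler takes over on <name>, stores text in the shared params map, and gives control back to its parent when </name> arrives.

```java
import java.io.CharArrayWriter;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, hand-written approximation of two generated handlers.
class HandoffSketch {

    /** Handles everything inside <name>, then returns control to the parent handler. */
    static final class NameHandler extends DefaultHandler {
        private final CharArrayWriter text = new CharArrayWriter();
        private final Map<String, String> params;
        private final XMLReader reader;
        private final ContentHandler parent;

        NameHandler(Map<String, String> params, XMLReader reader, ContentHandler parent) {
            this.params = params;
            this.reader = reader;
            this.parent = parent;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.reset(); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.write(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (qName.equals("first")) {
                params.put("firstname", value);
            } else if (qName.equals("last")) {
                params.put("lastname", value);
            } else if (qName.equals("name")) {
                reader.setContentHandler(parent); // </name>: hand control back
            }
        }
    }

    /** Outer handler: delegates <name> and collects one line per <employee>. */
    static final class EmployeeListHandler extends DefaultHandler {
        final List<String> lines = new ArrayList<>();
        private final Map<String, String> params = new HashMap<>();
        private final XMLReader reader;

        EmployeeListHandler(XMLReader reader) { this.reader = reader; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("name")) {
                reader.setContentHandler(new NameHandler(params, reader, this));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("employee")) {
                // The name parts were left in the shared map by NameHandler.
                lines.add(params.remove("firstname") + "\t" + params.remove("lastname"));
            }
        }
    }

    /** Parses XML text and returns one "first<TAB>last" line per employee. */
    static List<String> parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EmployeeListHandler root = new EmployeeListHandler(reader);
            reader.setContentHandler(root);
            reader.parse(new InputSource(new StringReader(xml)));
            return root.lines;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<employees><employee><name><first>John</first><last>Dole</last></name>"
                + "</employee></employees>";
        for (String line : parse(xml)) {
            System.out.println(line);
        }
    }
}
```

Because SAX lets an application register a different content handler in the middle of a parse, each handler only needs to know about its own entity and its parent, which is what keeps the generated classes small.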

