Chinaunix首页 | 论坛 | 博客
  • 博客访问: 181349
  • 博文数量: 12
  • 博客积分: 3000
  • 博客等级: 中校
  • 技术积分: 240
  • 用 户 组: 普通用户
  • 注册时间: 2007-08-04 15:43
文章分类

全部博文(12)

文章存档

2011年(1)

2009年(1)

2008年(10)

我的朋友

分类: 数据库开发技术

2009-08-31 20:57:59

Project Name Research on Web Data Mining

Organizer & Number: National Huaqiao University (China) , 07HZR27 

My task XML Data Model Extraction

Introduction of Project: This project is organized by National Huaqiao University , which is directed by my teacher, Miss Lei ,who taught me the course of Introduction to XML. This program covers research on application, theory and some supporting approaches of Web Data Mining. I was invited to program team because of a excellent Curriculum design about getting DTD from XML Document and all processes are automatic.

Publication : A New Binary-tree-based Algorithm for XML Data Model Extraction  基于标记二叉树的XML数据模式提取算法 (雷庆,熊汉琛)(Computer Engineering and Design 2009.8)

Description of my task: XML Data Model Extraction Since the increasing use of XML for data storage on Web, XML data model extraction is important to this program. My work is to create a good algorithm to getting XML data model from any XML document, and if this problem is solved well,it is will be very useful for data clustering and data classification. So this task play a critical role in the program. At that time, there are some famous algorithm , such as “Learning Document Type Descriptors from XML Document Collections ”(Minos Garofalakis )and “ Exploiting local similarity for efficient indexing of paths in graph structured data” published by Kaushik R at 18th Int’l Conference. on Data Engineering (ICDE). However, the former one is based on regular expression ,and the system will be under a heavy burden。The later one translate NFA to DFA for XML Data reduction based on DataGuide theory .Both of them are too complex ,and there is not enough time for finishing them.

Completion of my task: At the end of project, I successed in putting forward a new algorithm based on tag tree for XML data mode .The test result proved that the algorithm can get a tree structure efficiently from any XML document, and generate a DTD document at the same time, even though the step relationship of elements in xml document is very complex. Miss Lei is very satisfied with my work, and help me finished a thesis “A New Binary-tree-based Algorithm for XML Data Model Extraction”

Detailed report: 1. General idea: When get data model from any XML document ,there are following three main step: 1) Recognize all the element tags: my solution is to get all element tags (including begin tags and end tags) by using a simple DFA , and then put them into a list, nodes of which represent all tags; 2) Analyse the step relationship between tags :numbering list of tags in order,and then analyze step relationship between all tags according to matching number of each node; 3) Get the simplest document-tag-tree: Based on the relationship got during last step, we can make use of binary tree to present this relationship of all tags and get a document-tag-tree of a XML document ,and then get the simplest one by deleting some redundant element tags. 2. Main Algorithm: 2.1 Analyzing step relationship of all tags The fig1 is a sample of XML document, and we will put all elements into a list, and each node of list represent a element tags .At last we will get a document-tags-tree by analysis of step relationship of tags:                            fig1. record1.xml

At first, we should get all element tags of XML document and create a list of tags, and then add attributes of “Num” and “Match” into each node of list by numbering nodes of list and matching; fig 2 is the result of above process

                   fig 2 tags list of record1.xml

1) Identifying root element:we get get a conclusion easily by observe fig 2 that the root element of document has the biggist of attribute of “Mach”. Therefore , we can find that the “ ” is the root element of record1.xml. 2) Getting the the step relationship of no-root elements:to all no-root elements, we can use get step relationship of them by using following way: considering a node A in list, we can traverse all nodes of list,and search for a node B whose “Num” is the biggest among all node whose “Mach” is bigger than A.Match,and A is nested in B。We can find that is nested in, andis nested i.Code can be simply written following to finish above this step: B=head ; MaxNum=-1; while (B) { if ((B->MatchNum>A->MatchNum)&&(tB->Num>MaxNum)} MaxNum=B->Num; B=B->next; } We can get the relationship of record1.xml by analyzing “Num” and “ Match” of each begin tags according to way said above. Fig 3 demonstrate the process of analysis and we get a document-tags-tree of record1.xml as Fig 4 show

f ig 3 :analysis of step relationship of begin tags in record1.xml

fig 4 document-tags-tree record1.xml

2.2 Getting the simplest document-tags-tree Since there is no repeat tags and optional tags in record1.xml,so the fig 4 is the simplest document-tags-tree of record1.xml. But repeat tags and optional tags is allowed in XML document, is that case, we can not get the simplest document-tags-tree of a XML document with repeat tags and optional tags in above way. At last, I solve this problem as following discussion:

fig 5 record2.xml

fig 6 record2.

record2.xml ( fig 5) is a sample XML document with repeat tags and optional tags, and we can get document-tags-tree (but not the simplest one) of record2.xm l ( fig 6).We can compare each node with their right-child node ,if they are different, there are not repeat tags in document-tags-tree, and the final document-tags-tree is the simplest one. If they are same, we should discuss more cases, : 1) consider node A and its right-child node A’(they are same), if the tree with A as root and the tree with A’ as root is also completely same, we can judge A and A’ are repeat tags, so we can delete A’ to delete redundancy. 2) consider node B and its right-child node B’(they are same), if there are some differences between the tree with B as root and the tree with B’ as root. The document-tags-tree of record2.xml ( fig 6) is just a example, they have different left-child node “name” and “ID”. In this case , we should compare all child nodes, and find all nodes nested in B’ but not in B and add them as child nodes of B at reasonable position before deleting B’. We can success in deleting redundancy of document-tags-tree of record2.xm l in the way discussed above (( fig 7)

fig 7 the simplest document-tags-tree of record2.xm l

3.Test Result:

fig 9 record3.xml

fig 10 test result DTD of record3.xml

test result of XML document with repeat tags and optional tags

fig 11 record4.xml

                          fig 12 test result DTD of record4.xml
阅读(692) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~