Category: Big Data

2015-06-24 09:47:51

http://baojie.org/blog/2014/06/16/nlp-parser/

Feature summary table

 
| Features | Satisfied by | Note |
|---|---|---|
| Web-scale parsing: for both training and parsing, should handle TB-scale or larger text volumes efficiently | Link, MiniPar, Malt, DeSR, MST, pfp, MBSP | Linear-time parsing is generally possible with dependency parsing; parallelism support is also important |
| Potential support for both statistical and knowledge-based parsing | Link, NLTK, Malt, DepParse, MBSP | |
| High accuracy | Stanford, Collins and Bikel, Berkeley, Charniak-Johnson, RASP, Malt, Link, DeSR, MST, pfp, Senna | |
| Active development | Stanford, Berkeley, Link, NLTK, Malt, DeSR, pfp, MBSP, OpenNLP, Senna | |
| Production-friendly license | Link, NLTK, RASP, Malt, DepParse, OpenNLP | Some GPL-licensed parsers can still be used in production as a web service without open-sourcing the other parts |
| Good documentation | Stanford, Link, NLTK, Malt, DeSR, MBSP, OpenNLP | |
| Code reusability: easy-to-use API or easy-to-understand code | Stanford, Link, NLTK, MiniPar, DeSR, DepParse, pfp, MBSP, Senna | |

 

Detailed comparison

This table is wide; click the print or pdf button at the top of the page to see the full table.

Parser | Internationalization | Feature Summary | Links | Active Project
Stanford Parser


  • Constituency and dependency
  • Java, with Python and Ruby interfaces
  • GPL license
  • By Chris Manning et al
English, Chinese, German, Arabic, Italian, Bulgarian, and Portuguese
  • Part of 
  • It is a package of three kinds of parsers: a PCFG (probabilistic context-free grammar) parser, a lexicalized dependency parser, and a lexicalized PCFG parser
  • Parsing accuracy ranks consistently high in surveys
  • Good documentation
  • The PCFG parser is based on the CKY algorithm
  • However, the dependency parser has O(n^4) complexity, which is much worse than other linear-time O(n) dependency parsers
  • Homepage 
  • Download 
  • Online test 
  • Javadoc 
Yes (frequent releases)
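The CKY algorithm mentioned above can be illustrated with a minimal sketch. It assumes a toy PCFG in Chomsky normal form; the grammar, the probabilities and the function name `cky_parse` are made up for illustration and are not taken from the Stanford Parser.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical, for illustration only).
# Binary rules: (left_child, right_child) -> [(parent, probability)]
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
# Lexical rules: word -> [(tag, probability)]
LEXICAL = {
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "cat": [("N", 0.5)],
    "sees": [("V", 1.0)],
    "she": [("NP", 0.4)],
}


def cky_parse(words):
    """Return (probability, tree) of the best parse rooted in S, or None."""
    n = len(words)
    # chart[(i, j)][label] = (best probability, best tree) for words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for tag, prob in LEXICAL.get(word, []):
            chart[(i, i + 1)][tag] = (prob, (tag, word))
    # Fill the chart bottom-up, from short spans to long ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for left, (lp, ltree) in chart[(i, k)].items():
                    for right, (rp, rtree) in chart[(k, j)].items():
                        for parent, rule_p in BINARY.get((left, right), []):
                            prob = rule_p * lp * rp
                            best = chart[(i, j)].get(parent)
                            if best is None or prob > best[0]:
                                chart[(i, j)][parent] = (prob, (parent, ltree, rtree))
    return chart[(0, n)].get("S")


if __name__ == "__main__":
    print(cky_parse("she sees the dog".split()))
```

The three nested loops over span length, start position and split point give the cubic cost in sentence length that is typical of chart-based constituency parsing.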
Collins and Bikel Parser


  • Constituency parser
  • Java
  • Free for research
  • By Dan Bikel (UPenn) and Michael Collins (Columbia)
English, Chinese, Arabic
  • It is an improvement of Collins parser
  • Based on the CYK algorithm
  • Lexicalized PCFG
  • state-of-the-art performance for English
  • Homepage: 
  • Download 
  • Javadoc 
No (since 2008)
Berkeley parser


  • Constituency parser
  • Java
  • GPL
  • Slav Petrov and Dan Klein
English, Bulgarian, Arabic, Chinese, French, German
  • Based on hierarchical coarse-to-fine parsing, where a sequence of grammars is considered
  • No need for language-specific adaptations; uses an automatically induced PCFG
  • state-of-the-art performance for English on the Penn Treebank
  • Project homepage 
  • Online test 
(infrequent changes)
Charniak-Johnson Parser


  • Constituency parser
  • C
  • Eugene Charniak (Brown Univ) and Mark Johnson
English
  • Based on discriminative reranking and dynamic programming
  • Lexicalized n-best PCFG: for each sentence, it constructs a set of 50-best parses using a heuristic coarse-to-fine generative parser
  • Estimates the reranker feature weights using MaxEnt, Averaged Perceptron, etc.
  • State of the art performance on English
  • Current C-J parser (2011):
  • Original (2005) Charniak parser 
Yes (infrequent changes)
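The n-best reranking idea above can be sketched as follows, assuming the averaged perceptron as the weight estimator. The feature names, the toy n-best lists and the function names are hypothetical; the actual Charniak-Johnson feature set and training code are not shown in this post.

```python
from collections import defaultdict

# Toy n-best lists (hypothetical): each candidate parse is a feature dict,
# listed in the order the generative parser ranked it; gold is the index
# of the correct parse.
TOY_NBEST = [
    ([{"log_prob": -1.0, "pp_attach_noun": 1.0},
      {"log_prob": -1.2, "pp_attach_verb": 1.0}], 1),
    ([{"log_prob": -2.0, "pp_attach_verb": 1.0},
      {"log_prob": -2.1, "pp_attach_noun": 1.0}], 0),
]


def score(weights, features):
    """Linear model score of one candidate parse."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(weights, candidates):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_averaged_perceptron(nbest_lists, epochs=5):
    """Learn reranker weights; returns the average of all intermediate weights."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for candidates, gold in nbest_lists:
            guess = rerank(weights, candidates)
            if guess != gold:
                # Move weights toward the gold parse, away from the wrong guess.
                for name, value in candidates[gold].items():
                    weights[name] += value
                for name, value in candidates[guess].items():
                    weights[name] -= value
            # Naive averaging: accumulate the full weight vector every step.
            for name, value in weights.items():
                totals[name] += value
            steps += 1
    return {name: total / steps for name, total in totals.items()}


if __name__ == "__main__":
    learned = train_averaged_perceptron(TOY_NBEST)
    for candidates, gold in TOY_NBEST:
        print(rerank(learned, candidates), gold)
```

The generative parser's own ranking (index 0 in each list) is wrong for the first toy sentence; the reranker only has to choose among the candidates it is handed, which is why it can afford arbitrary global features.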
Link Grammar Parser


  • Dependency parser
  • C, with bindings for Ruby, Python, Perl, Java and OCaml
  • BSD license
  • Davy Temperley, John Lafferty and Daniel Sleator (CMU)
  • Dom Lachowicz, Linas Vepstas (AbiWord)
Persian, Arabic, Chinese, German, Russian
  • Based on lexicons of link grammar (similar to IBM Watson’s English slot grammar parser). It has 70k+ words
  • Produce both dependencies (labelled links connecting pairs of words) and constituents (Penn tree-bank style phrase tree)
  • Performance is comparable to the Stanford PCFG parsing model, and is 3+ times faster than the Stanford lexicalized model.
  • 10+ extensions, including FrameNet-style framing, reference (anaphora) resolution and natural language generation
  • However, it is grammar-rigid and may fail when a sentence is grammatically incomplete or non-compliant
  • Very good documentation
  • Original CMU page: 
  • Project page:  part of
  • Online test: 
  • SVN: 
  • API: 
  • Documentation: 
Yes (frequent releases)
NLTK Parser


  • Constituency and dependency
  • Python
  • Apache License
  • Steven Bird
English, German, Chinese, Japanese
  • Very good documentation, various books available. Widely adopted in education and web application development
  • Very easy to use, clean API interface
  • Part of a complete set of NLP tools covering major NLP needs
  • Constituency parser with PCFG
  • Dependency parser using a shift-reduce algorithm, based on CFG
  • However, its parser implementation is less optimized
  • Project homepage: 
  • Source code: 
  • Book: 
  • Book: 
Yes (very active)
MiniPar


  • Dependency parser
  • C and Lisp, with Java binding in GATE
  • free of charge for non-commercial use
  • Dekang Lin
English
  • One of the early dependency parsers
  • After 15+ years, it is slightly worse than state-of-the-art parsers
  • Code is small and easy to extend
  • Its dependency scheme may be useful in designing a new parser
  • Homepage and download
No (since 1994)
RASP


  • C and Common Lisp
  • Constituency and dependency
  • LGPL
  • John Carroll et al (Sussex and Cambridge)
English
  • RASP = Robust Accurate Statistical Parsing
  • fully domain-independent automated training
  • integration of statistical techniques and incremental grammar rule induction
  • state-of-the-art performance
  • Homepage:
  • Download: 
Yes (infrequent releases)
MaltParser


  • Dependency parser
  • Java, with Python binding in NLTK
  • Johan Hall, Jens Nilsson and Joakim Nivre
English, French, Swedish
  • Shift-reduce algorithm (automaton-based)
  • Inductive dependency parsing that learns from a treebank
  • Very fast: linear time parsing
  • State-of-the-art performance on accuracy
  • Project home 
  • Javadoc 
Yes
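To make the shift-reduce idea above concrete, here is a minimal sketch of the arc-standard transition system, one common variant of transition-based dependency parsing. It is not MaltParser's code: the function names are hypothetical, and a gold-tree oracle stands in for the classifier that a real parser would learn from a treebank.

```python
from collections import deque


def arc_standard_parse(words, choose_action):
    """Run arc-standard transitions; returns (head, dependent) index pairs.

    Word indices start at 1; index 0 is an artificial ROOT node.
    """
    stack = [0]
    buffer = deque(range(1, len(words) + 1))
    arcs = []
    # Every word is shifted once and reduced once: 2n transitions in total.
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            stack.append(buffer.popleft())
        elif action == "LEFT-ARC" and len(stack) > 2:
            dependent = stack.pop(-2)  # second-from-top depends on top
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dependent = stack.pop()  # top depends on second-from-top
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"illegal action {action!r}")
    return arcs


def make_gold_oracle(gold_heads):
    """Stand-in for a trained classifier: reads actions off a gold tree.

    gold_heads maps each dependent index to its head index. The pending
    check below is deliberately naive and rescans the tree at each step.
    """
    def choose_action(stack, buffer, arcs):
        attached = {dep for _, dep in arcs}
        if len(stack) > 2 and gold_heads[stack[-2]] == stack[-1]:
            return "LEFT-ARC"
        if len(stack) > 1 and gold_heads[stack[-1]] == stack[-2]:
            pending = [d for d, h in gold_heads.items()
                       if h == stack[-1] and d not in attached]
            if not pending:  # only reduce once all its dependents are attached
                return "RIGHT-ARC"
        return "SHIFT"
    return choose_action


if __name__ == "__main__":
    sentence = ["She", "sees", "the", "dog"]
    gold = {1: 2, 2: 0, 3: 4, 4: 2}
    print(arc_standard_parse(sentence, make_gold_oracle(gold)))
```

The linear-time claim comes from the transition loop: with a constant-time classifier in place of the oracle, a sentence of n words needs exactly 2n transitions.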
DeSR


  • Dependency parser
  • C++ with Python binding
  • GPL
  • Giuseppe Attardi
Italian, English, French, and 10+ others
  • Part of the 
  • shift-reduce dependency parser, can handle non-projective dependencies
  • deterministic parsing, very fast (linear time)
  • fully labeled dependency trees
  • training with Multi Layer Perceptron, Averaged Perceptron, Maximum Entropy, SVM, memory-based learning using TiMBL
  •  on English labeled dependency parsing
  • Project homepage 
  • Code 
  • SVN: 
  • API: 
  • Online test: 
Yes
MSTParser


  • Dependency parser
  • Java
  • Jason Baldridge and Ryan McDonald (UPenn)
English, Chinese and 10+ other languages
  • MST = Maximum Spanning Tree, based on a graph algorithm
  • Support online learning
  • State-of-the-art performance, comparable to MaltParser
  • Outperforms MaltParser on longer dependencies, but is typically slower
  • Project homepage
  • SVN 
No (since 2007)
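The maximum-spanning-tree view of dependency parsing can be sketched with the Chu-Liu/Edmonds algorithm, a standard way to find the best directed tree over arc scores. This is a naive recursive version for illustration, not MSTParser's implementation; the function names and the toy scores are assumptions, and every non-root word is assumed to have at least one incoming arc.

```python
def find_cycle(heads):
    """Return a list of nodes forming a cycle in a {dep: head} map, or None."""
    for start in heads:
        path, node = [], start
        while node in heads and node not in path:
            path.append(node)
            node = heads[node]
        if node in path:
            return path[path.index(node):]
    return None


def max_spanning_arborescence(scores, root=0):
    """scores[(head, dep)] = arc score; nodes are ints. Returns {dep: head}."""
    nodes = {n for arc in scores for n in arc}
    # Step 1: greedily pick the best-scoring head for every non-root node.
    heads = {}
    for dep in nodes - {root}:
        candidates = [h for h in nodes if (h, dep) in scores]
        heads[dep] = max(candidates, key=lambda h: scores[(h, dep)])
    cycle = find_cycle(heads)
    if cycle is None:
        return heads
    # Step 2: contract the cycle into one fresh node and rescore its arcs.
    in_cycle = set(cycle)
    merged = max(nodes) + 1
    new_scores, origin = {}, {}
    for (h, d), s in scores.items():
        if h in in_cycle and d in in_cycle:
            continue
        if d in in_cycle:
            # Entering the cycle at d replaces d's current cycle arc.
            key, value = (h, merged), s - scores[(heads[d], d)]
        elif h in in_cycle:
            key, value = (merged, d), s
        else:
            key, value = (h, d), s
        if key not in new_scores or value > new_scores[key]:
            new_scores[key] = value
            origin[key] = (h, d)
    contracted = max_spanning_arborescence(new_scores, root)
    # Step 3: expand the contracted node back into the original arcs.
    result = {}
    for d, h in contracted.items():
        orig_head, orig_dep = origin[(h, d)]
        result[orig_dep] = orig_head
    for d in cycle:
        result.setdefault(d, heads[d])  # keep cycle arcs except the broken one
    return result


if __name__ == "__main__":
    # 0 = ROOT, 1 = John, 2 = saw, 3 = Mary (toy scores, for illustration)
    toy = {(0, 1): 9, (0, 2): 10, (0, 3): 9, (1, 2): 20, (2, 1): 30,
           (2, 3): 30, (3, 2): 0, (1, 3): 3, (3, 1): 11}
    print(max_spanning_arborescence(toy))
```

Because every word-pair arc is scored independently and the tree is chosen globally, this style of parser is not restricted to short, local attachments, which fits the note above about longer dependencies.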
DepParse


  • Dependency parser
  • Python
  • MIT License
  • Leif Johnson (UT Austin)
English
  • maximum spanning tree (MST) parser and a stack-based, shift-reduce parser
  • support data parallelism on multicore machines
  • performance has not been evaluated
  • Self-contained, easy to extend
  • Project homepage 
  • Source 
No
pfp


  • Constituency parser
  • C++ and Python
  • GPL
  • Erik Frey, Norman Casagrande et al (Wavii Inc)
English
  • pfp — pretty fast statistical parser
  • Using PCFG grammar and CYK algorithm
  • 3-4x faster than the Stanford parser, and uses 5-8x less resident memory
  • Thread-safe/multi-core support
  • Homepage 
Yes 
MBSP


  • Shallow (dependency) parsing
  • Python
  • GPL and Commercial
English
  • Memory-Based Shallow Parser, based on the TiMBL and MBT memory-based learning applications
  • No need for manual pattern or grammar definition
  • Client-server architecture
  • Performs shallow parsing
  • Share an API with Pattern
  • Can be used together with DeSR and NLTK
  • Homepage 
Yes
OpenNLP Parser


  • Constituency parser
  • Java
  • Apache License (An Apache project)
English
  • A chunking parser (relatively simple)
  • Can be used with UIMA
  • Project homepage 
  • Source SVN 
Yes
Senna


  • Constituency parser
  • C
  • a non-commercial license
English
  • Uses deep learning
  • Very small code (3500 lines)
  • syntactic parsing
  • State-of-the-art performance
  • Pro