Category: LINUX

2012-03-19 08:16:11

Queequeg, A Tiny English Grammar Checker
$Id: index-e.html,v 1.2 2003/08/07 00:04:40 euske Exp $


What's It?

Queequeg is a tiny English grammar checker for non-native speakers who are not used to verb conjugation and number agreement. We especially focus on people who are writing academic papers or business documents, where thorough checking is required. We aim to reduce this laborious work with automated checking. Queequeg is named after a character in Herman Melville's masterpiece.

Sample Run

Suppose you wrote the following sentences:
Paraphrases plays an important role in the variety and complexity
of natural language documents. However, they add to the difficulty
of natural language processing. Here we describe a procedure for
obtaining paraphrases from news articles. Articles derived from
different newspapers can contain paraphrases if it indeed report
the same event on the same day. We exploit these two feature by
using Named Entity recognition. Our approach is based on the
assumption that named entities are preserved across
paraphrases. We applied our method to articles of two domains and
obtained notable example.

Queequeg (command name: qq) prints the following results for the above document:

$ qq -Wall sample.txt
-- sample.txt
sample.txt:0: (S:Paraphrases) (V:plays) an important ... (number disagreement between "paraphrases" and "plays")
sample.txt:0: ... variety and (complexity) of natural ...
sample.txt:2: ... difficulty of (natural language) processing .
sample.txt:4: ... paraphrases if (S:it) indeed (V:report) the same ... (number disagreement between "it" and "report")
sample.txt:5: We exploit (Det:these two) (N:feature) by using ... ("feature" should be in plural form)
sample.txt:5: ... by using (Named Entity recognition) .
sample.txt:8: ... and obtained (notable example) . (an article needed, or should be plural)

Different types of errors are shown in different colors. A number displayed at the beginning of each line is the line number in a file.
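
Since each warning line in the sample output starts with the file name and a number separated by colons, the output is easy to post-process. The following is a minimal sketch and not part of Queequeg: the name parse_warning and the regular expression are assumptions inferred only from the sample output above.

```python
import re

# Pattern inferred from the sample output: "<file>:<number>: <message>".
WARNING_RE = re.compile(r"^(?P<file>[^:]+):(?P<num>\d+): (?P<msg>.*)$")


def parse_warning(line):
    """Return (file, number, message) for a warning line, or None."""
    m = WARNING_RE.match(line)
    if m is None:
        return None  # e.g. the "-- sample.txt" header line
    return m.group("file"), int(m.group("num")), m.group("msg")


if __name__ == "__main__":
    sample = "sample.txt:5: ... by using (Named Entity recognition) ."
    print(parse_warning(sample))
```

A wrapper script could use this to group warnings by file or to jump to the reported position in an editor.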

Currently Queequeg recognizes the following document formats: plain text, LaTeX and HTML.

Download and Install

Prerequisite

  • Python: The program is written in Python. You need version 2.3 or newer.
  • WordNet: WordNet is a free online thesaurus developed by Prof. George Miller. (Queequeg uses only the dictionary files included in the WordNet distribution package, so you don't need to install the binaries.)
  • Python-cdb (optional): Used for dictionary access. Queequeg works faster with this library.

Download of the Program

Download the archive file from the following page. (about 60 kbytes)

  • (the newest version: 0.91)
Installation
  1. Extract the archive file in an appropriate directory (e.g. /usr/local/queequeg-0.9).
  2. Extract the WordNet package somewhere.
  3. Build a system dictionary. At the top directory of Queequeg, type:
    $ make dict WORDNET=/src/wordnet/dict
    where the variable WORDNET should be the pathname of the dict/ directory in the WordNet package.
    (Note: If you're using a Debian package, the dictionaries are put in /usr/share/wordnet.)
    If the Python-cdb module is installed, a CDB-type dictionary file dict.cdb is generated. Otherwise dict.txt is generated.
  4. The main program is qq. Add its directory to your shell's search path, or create a symbolic link to qq in a directory such as /usr/local/bin. (It tries to find a dictionary file located in the same directory.)
How to Use

Just feed Queequeg a file you want to check (command name: qq). It recognizes the document format automatically based on the file extension (.tex, .html or .htm).
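
The extension-based detection can be sketched as follows. This is an illustration rather than Queequeg's actual code: the function name guess_format, the format labels, the lowercasing of the extension, and the fallback to plain text for any other extension are all assumptions.

```python
import os

# Extension-to-format table taken from the text above. Any other
# extension is assumed here to fall back to plain text.
FORMATS = {".tex": "latex", ".html": "html", ".htm": "html"}


def guess_format(pathname):
    """Guess the document format from the file extension."""
    ext = os.path.splitext(pathname)[1].lower()
    return FORMATS.get(ext, "plain")


if __name__ == "__main__":
    for name in ("paper.tex", "index.htm", "sample.txt"):
        print(name, guess_format(name))
```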

Queequeg issues warnings for the following types of grammatical errors:

  • GREEN ... Number disagreement between a noun group and its determiner. (e.g. "three desk", "a cups")
  • YELLOW ... Number disagreement between a subject and a verb. (e.g. "he drink a coffee.", "I wrote a book which make me rich.")
  • RED ... An article is required. (e.g. "this is pen.")
    (Note: Since this check is rather error-prone and verbose, it is disabled by default. Give the -Wall option to enable it.)

qq also accepts the following command line options:

Option        Feature
-v            Verbose mode. Displays the names of errors.
-q            Quiet mode. Does not display file names.
-p            Force it to recognize all files as plain text format. In plain text format, paragraphs are separated by an empty line.
-l            Force it to recognize all files as HTML format.
-t            Force it to recognize all files as LaTeX format.
-s pathname   Specify the pathname of a system dictionary (dict.txt or dict.cdb). By default, it tries to find a dictionary file located in the same directory.
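
The paragraph rule for plain text (paragraphs separated by an empty line) can be sketched like this. It is a hypothetical illustration, not Queequeg's tokenizer: the name split_paragraphs is made up, and treating whitespace-only lines as empty is an assumption.

```python
def split_paragraphs(text):
    """Split plain text into paragraphs separated by empty lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Non-empty line: it belongs to the current paragraph.
            current.append(line.strip())
        elif current:
            # Empty line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    print(split_paragraphs("first line\nsecond line\n\nnext paragraph\n"))
```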

The following options are for debugging purposes:

Option               Feature
-D debuglevel        Specify the debug level as an integer.
-S stage             Specify the stage up to which processing is performed. The default is grammar (check grammatical errors). Other acceptable values are token (tokenize input files), sentence (split sentences), pos0 (POS tagging phase 1), and pos1 (POS tagging phase 2).
-W type1,type2,...   Specify which types of errors should be checked. Acceptable values are sv1 (a subject and a verb separated by a prepositional phrase), sv2 (a subject and a verb placed adjacently), sv3 (a subject and a verb in "there-be" type syntax), det (determiner requirement), and plural (number of nouns). Values should be separated with commas. The default is sv1,sv2,sv3,plural. The value all is also accepted and specifies every type of error.
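
As a rough illustration of how a -W argument expands, here is a minimal sketch. The check names and the default come from the option description above. The function name parse_warning_types and the decision to reject unknown names with an error are assumptions, not Queequeg's documented behavior.

```python
# Check names and default taken from the -W description.
ALL_TYPES = ["sv1", "sv2", "sv3", "det", "plural"]
DEFAULT_TYPES = ["sv1", "sv2", "sv3", "plural"]


def parse_warning_types(value=None):
    """Turn a -W argument such as "sv1,det" into a list of check names."""
    if value is None:
        return list(DEFAULT_TYPES)  # no -W option given
    names = [v.strip() for v in value.split(",") if v.strip()]
    if "all" in names:
        return list(ALL_TYPES)  # -Wall enables every check
    unknown = [n for n in names if n not in ALL_TYPES]
    if unknown:
        # Assumption: unknown names are treated as an error here.
        raise ValueError("unknown warning type: " + ", ".join(unknown))
    return names


if __name__ == "__main__":
    print(parse_warning_types("sv1,det"))
```

Under this reading, -Wall differs from the default only by also enabling det, the determiner check.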

Why Do I Get Unreasonable Results?

The current version of Queequeg reports many false positives, that is, warnings about text that is actually correct.

For example, the phrase "my paper clip" looks like a single noun phrase. However, an error is reported because it can also be read as "my paper clip[s]" (a subject followed by a verb), where the final "s" is missing. Likewise, the noun phrase "three additional links" generates a number disagreement warning, because the system dictionary file contains a singular noun "links".

Determiner checking tends to generate more false positives, because Queequeg doesn't know whether a target noun is a mass noun or not. Normally, material names such as "meat" or "water", or abstract nouns such as "information", need not take any article. However, WordNet doesn't have this kind of information. (Some dictionaries like COMLEX do have this, but I didn't use them because they cannot be freely distributed.)

Bugs and TODOs
  • Improve accuracy. Paranoia mode and normal mode should be separated.
  • Must support user dictionaries.
  • Warning for unknown words.
  • Change warning colors on terminal.
  • Ispell-like interface on Emacsen. Support Tkinter too.
  • Port to Windows.
  • Support collocation.
  • setup.py.
  • Source code comments.
  • Make it faster.

Queequeg identifies grammatical errors with pattern recognition based on simple finite automata (i.e. regexps) and unification of features assigned to each portion of an expression. It doesn't fully parse a sentence, in order to gain speed and coverage. The core part of checking is done in constraint.py and unification.py.
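
To give a feel for what unification of features means, here is a minimal sketch of the general technique. It does not reproduce constraint.py or unification.py: the names unify, agrees, and FEATURES, and the feature values assigned to each word, are hypothetical and deliberately simplified.

```python
def unify(a, b):
    """Unify two feature dicts; return the merged dict, or None on conflict."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None  # the two expressions disagree on this feature
        result[key] = value
    return result


# Hypothetical, simplified feature assignments for a few words.
FEATURES = {
    "paraphrases": {"number": "plural"},
    "it": {"number": "singular", "person": 3},
    "plays": {"number": "singular", "person": 3},
    "report": {"number": "plural"},
}


def agrees(subject, verb):
    """True if the features of the subject and the verb can be unified."""
    return unify(FEATURES[subject], FEATURES[verb]) is not None


if __name__ == "__main__":
    print(agrees("paraphrases", "plays"))
    print(agrees("it", "report"))
```

With these made-up features, both pairs from the sample document ("Paraphrases plays", "it ... report") fail to unify, which corresponds to a number disagreement warning.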

POS tagging is performed in two phases. First it looks up dictionaries and obtains multiple candidates for each word (sentence.py, dictionary.py), then it tries to fix several tags using regexp-based pattern matching (postagfix.py).
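
The two phases can be sketched roughly as below. Everything here is an assumption made for illustration: the mini dictionary, the default tag for unknown words, and the single fix rule are invented, and the actual rules in postagfix.py are not shown in this document.

```python
import re

# Hypothetical mini dictionary: each word maps to all of its candidate tags.
DICTIONARY = {
    "my": ["PRP$"],
    "the": ["DT"],
    "paper": ["NN", "VB"],
    "clip": ["NN", "VB"],
}


def lookup(words):
    """Phase 1: collect every candidate tag for each word."""
    # Assumption: unknown words are treated as singular nouns.
    return [(w, list(DICTIONARY.get(w.lower(), ["NN"]))) for w in words]


def fix_tags(tagged):
    """Phase 2: apply one regexp-based fix to the candidate tags.

    Illustrative rule only: a word right after a determiner or a
    possessive pronoun is assumed not to be a verb, so its verb
    candidates are dropped.
    """
    fixed = []
    prev_tags = []
    for word, tags in tagged:
        if prev_tags and all(re.fullmatch(r"DT|PRP\$", t) for t in prev_tags):
            non_verbs = [t for t in tags if not t.startswith("VB")]
            tags = non_verbs or tags
        fixed.append((word, tags))
        prev_tags = tags
    return fixed


if __name__ == "__main__":
    print(fix_tags(lookup(["my", "paper", "clip"])))
```

In this toy run "paper" is narrowed to a noun, but "clip" keeps both a noun and a verb candidate. An ambiguity of this kind is what can lead to a false positive for "my paper clip".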

We used a modified version of the Penn Treebank tagset. It is extended with tags for plural forms of pronouns (PRPS) and determiners (DTS), so that Queequeg can identify the number of a noun group by looking at the POS tag assigned to each noun.
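
A small sketch of how number can be read off the tags: NN, NNS, NNP and NNPS are standard Penn Treebank noun tags, and PRPS and DTS are the extensions mentioned above. The table, the function names, and the choice to leave every other tag unspecified are assumptions made for illustration.

```python
# Number implied by a tag. Tags not listed here (e.g. CD, JJ, plain DT)
# are treated as unspecified in this sketch.
TAG_NUMBER = {
    "NN": "singular",
    "NNP": "singular",
    "NNS": "plural",
    "NNPS": "plural",
    "PRPS": "plural",
    "DTS": "plural",
}


def number_of(tag):
    """Return "singular", "plural", or None if the tag carries no number."""
    return TAG_NUMBER.get(tag)


def has_number_conflict(tagged_group):
    """True if a tagged noun group mixes singular and plural tags."""
    numbers = {number_of(tag) for _, tag in tagged_group} - {None}
    return len(numbers) > 1


if __name__ == "__main__":
    group = [("these", "DTS"), ("two", "CD"), ("feature", "NN")]
    print(has_number_conflict(group))
```

The group in the usage example mirrors "these two feature" from the sample run, where a plural determiner meets a singular noun.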

Unlike other natural language systems, Queequeg cannot assume that a given sentence is grammatical. This decreases the accuracy of POS tagging.

Queequeg comes with ABSOLUTELY NO WARRANTY. This software is distributed under the GNU General Public License.

Author

We need more testers! Feel free to send us any comments or bug reports.
