tpop(3): Design and Implementation
The design of the data structures is the central decision in the creation of a program. Once the data structures are laid out, the algorithms tend to fall into place, and the coding is comparatively easy.
The choice of programming language is relatively unimportant to the overall design. Program design can certainly be colored by a language but is not usually dominated by it.
The problem we have chosen is unusual, but in basic form it is typical of many programs: some data comes in, some data goes out, and the processing depends on a little ingenuity.
Specifically, we're going to generate random English text that reads well. If we emit random letters or random words, the result will be nonsense. For example, a program that randomly selects letters might produce this:
xptmxgn xusaja afqnzgxl lhidlwcd rjdjuvpydrlwnjy
which is not very convincing. If we weight the letters by their frequency of appearance in English text, we might get this:
idtefoae tcs trder jcii ofdslnqetacp t ola
which isn't a great deal better. Words chosen from the dictionary at random don't make much more sense:
polydactyl equatorial splashily jowl verandah circumscribe
For better results, we need a statistical model with more structure, such as the frequency of appearance of whole phrases. But where can we find such statistics?
We could grab a large body of English and study it in detail, but there is an easier and more entertaining approach. The key observation is that we can use any existing text to construct a statistical model of the language as used in that text, and from that generate random text that has similar statistics to the original.
3.1 The Markov Chain Algorithm
An elegant way to do this sort of processing is a technique called a Markov chain algorithm. If we imagine the input as a sequence of overlapping phrases, the algorithm divides each phrase into two parts, a multi-word prefix and a single suffix word that follows the prefix. A Markov chain algorithm emits output phrases by randomly choosing the suffix that follows the prefix, according to the statistics of (in our case) the original text. Three-word phrases work well; a two-word prefix is used to select the suffix word:
set w1 and w2 to the first two words in the text
print w1 and w2
loop:
    randomly choose w3, one of the successors of prefix w1 w2 in the text
    print w3
    replace w1 and w2 by w2 and w3
repeat loop
Our program will read a piece of English text and use a Markov chain algorithm to generate new text based on the frequency of appearance of phrases of a fixed length. The number of words in the prefix is a parameter.
3.2 Data Structure Alternatives
How much input do we intend to deal with? How fast must the program run? It seems reasonable to ask our program to read in a whole book, so we should be prepared for input sizes of n = 100,000 words or more. The output will be hundreds or perhaps thousands of words, and the program should run in a few seconds instead of minutes. With 100,000 words of input text, n is fairly large, so the algorithms can't be too simplistic if we want the program to be fast.
The Markov algorithm must see all the input before it can begin to generate output, so it must store the entire input in some form. One possibility is to read the whole input and store it in a long string, but we clearly want the input broken down into words. If we store it as an array of pointers to words, output generation is simple: to produce each word, scan the input text to see what possible suffix words follow the prefix that was just emitted, and then choose one at random. However, that means scanning all 100,000 input words for each word we generate; 1,000 words of output means hundreds of millions of string comparisons, which will not be fast.
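To see why, here is a minimal, self-contained sketch of this approach in C. The function name scan_suffix and the tiny sample input are our own inventions, and the loop in main is a direct translation of the pseudocode in the previous section.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* scan the whole input for successors of the pair (w1, w2)
       and return one at random; NULL if the pair never occurs */
    char *scan_suffix(char *words[], int nwords, char *w1, char *w2)
    {
        int i, nmatch = 0;
        char *suffix = NULL;

        for (i = 0; i < nwords - 2; i++)
            if (strcmp(words[i], w1) == 0 && strcmp(words[i+1], w2) == 0)
                if (rand() % ++nmatch == 0)  /* keep k-th match with probability 1/k */
                    suffix = words[i+2];
        return suffix;
    }

    int main(void)
    {
        char *words[] = { "the", "cat", "sat", "on", "the", "cat", "ran" };
        int nwords = sizeof(words) / sizeof(words[0]);
        char *w1 = words[0], *w2 = words[1], *w3;
        int i;

        printf("%s %s", w1, w2);
        for (i = 0; i < 10; i++) {
            if ((w3 = scan_suffix(words, nwords, w1, w2)) == NULL)
                break;                       /* no successor: stop */
            printf(" %s", w3);
            w1 = w2;                         /* slide the prefix window */
            w2 = w3;
        }
        printf("\n");
        return 0;
    }

Every call to scan_suffix touches all n input words, so 1,000 output words from a 100,000-word input means on the order of 10^8 string comparisons.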
Another possibility is to store only unique input words, together with a list of where they appear in the input so that we can locate successor words more quickly. We could use a hash table like the one in Chapter 2, but that version doesn't directly address the needs of the Markov algorithm, which must quickly locate all the suffixes of a given prefix.
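For concreteness, here is one way such a table entry might be declared; the layout and names are illustrative only:

    typedef struct Word Word;
    struct Word {              /* one unique input word */
        char *text;            /* the word itself */
        int  *pos;             /* growable array of input positions */
        int   npos;            /* number of occurrences recorded */
        Word *next;            /* next entry on this hash chain */
    };

Even with this table, locating the suffixes of a two-word prefix means walking one word's position list and checking its neighbors, which the structure described next avoids.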
We need a data structure that better represents a prefix and its associated suffixes. The program will have two passes, an input pass that builds the data structure representing the phrases, and an output pass that uses the data structure to generate the random output. In both passes, we need to look up a prefix quickly: in the input pass to update its suffixes, and in the output pass to select at random from the possible suffixes. This suggests a hash table whose keys are prefixes and whose values are the sets of suffixes for the corresponding prefixes.
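Here is a sketch of how such a table might be declared in C, assuming chained hashing; the names State, Suffix, NPREF, and NHASH are our own choices, not dictated by the design:

    enum {
        NPREF = 2,     /* number of words in a prefix */
        NHASH = 4093   /* size of the state hash table */
    };

    typedef struct State  State;
    typedef struct Suffix Suffix;

    struct State {               /* a prefix and its suffixes */
        char   *pref[NPREF];     /* the prefix words */
        Suffix *suf;             /* list of suffix occurrences */
        State  *next;            /* next State on this hash chain */
    };

    struct Suffix {              /* one suffix occurrence */
        char   *word;            /* the suffix word */
        Suffix *next;            /* next Suffix in the list */
    };

    State *statetab[NHASH];      /* hash table of states, keyed by prefix */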
For purposes of description, we'll assume a two-word prefix, so each output word is based on the pair of words that precede it. The number of words in the prefix doesn't affect the design and the programs should handle any prefix length, but selecting a number makes the discussion concrete. The prefix and the set of all its possible suffixes we'll call a state, which is standard terminology for Markov algorithms.
Given a prefix, we need to store all the suffixes that follow it so we can access them later. The suffixes are unordered and added one at a time. We don't know how many there will be, so we need a data structure that grows easily and efficiently, such as a list or a dynamic array. When we are generating output, we need to be able to choose one suffix at random from the set of suffixes associated with a particular prefix. Items are never deleted.
In summary, each state comprises a prefix and a list of suffixes. This information is stored in a hash table, with prefix as key. Each prefix is a fixed-size set of words. If a suffix occurs more than once for a given prefix, each occurrence will be included separately in the list.
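To make this concrete, here is a sketch of the two suffix operations, written against the State and Suffix declarations sketched earlier; the function names are ours. Adding prepends to the list, since order doesn't matter, and choosing walks the list once, keeping the k-th item seen with probability 1/k, which yields a uniform choice without knowing the list length in advance.

    #include <stdlib.h>

    /* addsuffix: record one occurrence of word after sp's prefix */
    void addsuffix(State *sp, char *word)
    {
        Suffix *suf = malloc(sizeof(Suffix));  /* error check omitted */
        suf->word = word;        /* caller owns the string */
        suf->next = sp->suf;     /* prepend: order is irrelevant */
        sp->suf = suf;
    }

    /* choosesuffix: return a uniformly random suffix of sp, or NULL */
    char *choosesuffix(State *sp)
    {
        Suffix *suf;
        char *w = NULL;
        int n = 0;

        for (suf = sp->suf; suf != NULL; suf = suf->next)
            if (rand() % ++n == 0)  /* probability 1/n for the n-th item */
                w = suf->word;
        return w;
    }

Because duplicate occurrences are kept separately in the list, a suffix that appears often after a given prefix is chosen proportionally often.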