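Today I was reading about regular expressions and came across a small tokenizer in the documentation, something like the front end of an interpreter or compiler. It is very elegant.
Once again I am impressed by how quick and convenient Python is. Python can do anything!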
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+*\/\-]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]'),        # Skip over spaces and tabs
    ]
    # Build one master regex: each pattern becomes a named group, joined with '|'
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    print(tok_regex)
    # Bound match() method of the compiled pattern (a callable, not a match object)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup  # name of the named group that matched
        # print(mo.groups(), typ)  # uncomment to inspect each match while debugging
        if typ == 'NEWLINE':
            # pos is still the index of the newline character itself (it is
            # updated below), so columns on the following line are counted
            # from that newline and start at 1
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val  # a keyword: use the keyword itself as the token type
            # yield hands back one token and resumes here on the next iteration
            yield Token(typ, val, line, mo.start() - line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        # Matching stopped early: a character such as '&' fits no pattern
        # (%r formats with repr(), %s with str())
        raise RuntimeError('Unexpected character %r on line %d' % (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

# tokenize() is a generator function, so iterate over it to get the tokens
for token in tokenize(statements):
    print(token)
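The heart of the tokenizer is the master regex: every token pattern is wrapped in a named group, the groups are joined with `|`, and after a match `lastgroup` tells us which alternative fired. Here is a minimal sketch of just that mechanism, using a reduced two-token specification. The names `spec`, `master` and `classify` are made up for illustration and are not part of the code above.

```python
import re

# Reduced, illustrative specification: only two token kinds.
spec = [
    ('NUMBER', r'\d+(?:\.\d*)?'),  # integer or decimal number
    ('ID',     r'[A-Za-z]+'),      # identifier
]

# Each pattern becomes a named group; '|' tries the alternatives left to right.
master = re.compile('|'.join('(?P<%s>%s)' % pair for pair in spec))

def classify(text):
    """Return (group name, matched text) for the token at the start of text, or None."""
    mo = master.match(text)
    if mo is None:
        return None
    return mo.lastgroup, mo.group()

print(classify('42 apples'))
print(classify('total'))
print(classify('&oops'))  # no pattern matches '&', so this is None
```

A `None` result here corresponds to the situation in which `tokenize` falls out of its loop early and raises `RuntimeError`.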
With this code as a starting point, we can go on to do context-aware syntax analysis, report errors, and more.
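As one possible example of such a check, the sketch below walks a token stream and reports unbalanced IF/ENDIF pairs. It is only an illustration under a few assumptions: the function name `check_if_blocks` and the error messages are invented here, and the sample tokens are built by hand so the block runs on its own. In practice the output of `tokenize()` could be passed in directly.

```python
import collections

# Same shape as the Token tuple used by tokenize().
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def check_if_blocks(tokens):
    """Return a list of error messages for unbalanced IF/ENDIF tokens."""
    errors = []
    open_ifs = []  # stack of line numbers of IFs not yet closed
    for tok in tokens:
        if tok.typ == 'IF':
            open_ifs.append(tok.line)
        elif tok.typ == 'ENDIF':
            if open_ifs:
                open_ifs.pop()
            else:
                errors.append('ENDIF without IF on line %d' % tok.line)
    for line in open_ifs:
        errors.append('IF on line %d has no ENDIF' % line)
    return errors

# Hand-built token stream with one ENDIF too many.
sample = [
    Token('IF', 'IF', 2, 5),
    Token('ID', 'quantity', 2, 8),
    Token('THEN', 'THEN', 2, 17),
    Token('ENDIF', 'ENDIF', 3, 5),
    Token('ENDIF', 'ENDIF', 4, 5),
]
print(check_if_blocks(sample))
```

Because the check only looks at `typ` and `line`, it stays independent of how the tokens were produced.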