python – Parsing a Penn syntax tree to extract its grammar rules
Published: 2020-12-20 12:01:08 | Category: Python | Source: compiled from the web
I have a Penn syntax tree, and I want to recursively extract all the rules that the tree contains.
    (ROOT
    (S
       (NP (NN Carnac) (DT the) (NN Magnificent))
       (VP (VBD gave) (NP ((DT a) (NN talk))))
    )
    )

My goal is to obtain grammar rules like the following:

    ROOT --> S
    S --> NP VP
    NP --> NN
    ...

As I said, I need to do this recursively, without the NLTK package, any other module, or regular expressions. This is what I have so far. The parameter tree is the Penn tree split on every space.

    def extract_rules(tree):
        tree = tree[1:-1]
        print("\n\n")

        if len(tree) == 0:
            return

        root_node = tree[0]
        print("Current Root: "+root_node)

        remaining_tree = tree[1:]
        right_side = []

        temp_tree = list(remaining_tree)
        print("remaining_tree: ", remaining_tree)
        symbol = remaining_tree.pop(0)

        print("Symbol: "+symbol)

        if symbol not in ["(", ")"]:
            print("CASE: No Brackets")
            print("Rule: "+root_node+" --> "+str(symbol))

            right_side.append(symbol)

        elif symbol == "(":
            print("CASE: Opening Bracket")
            print("Temp Tree: ", temp_tree)
            cursubtree_end = bracket_depth(temp_tree)
            print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
            cursubtree_start = temp_tree.index(symbol)

            cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
            print("Subtree: ", cursubtree)

            rnode = extract_rules(cursubtree)
            if rnode:
                right_side.append(rnode)
                print("Rule: "+root_node+" --> "+str(rnode))

        print(right_side)
        return root_node


    def bracket_depth(tree):
        counter = 0
        position = 0
        subtree = []

        for i, char in enumerate(tree):
            if char == "(":
                counter = counter + 1
            if char == ")":
                counter = counter - 1
            if counter == 0 and i != 0:
                counter = i
                position = i
                break

        subtree = tree[0:position+1]

        return position

Currently it works for the first subtree of S, but none of the other subtrees are parsed recursively. I would be glad of any help.

Solution
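I would be inclined to keep this as simple as possible rather than try to reinvent the parsing modules you are currently not allowed to use. Something like: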
    string = '''
        (ROOT
            (S
                (NP (NN Carnac) (DT the) (NN Magnificent))
                (VP (VBD gave) (NP (DT a) (NN talk)))
            )
        )
    '''

    def is_symbol_char(character):
        '''
        Predicate to test if a character is valid
        for use in a symbol, extend as needed.
        '''

        return character.isalpha() or character in '-=$!?.'

    def tokenize(characters):
        '''
        Process characters into a nested structure. The original string
        '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
        '''

        tokens = []

        while characters:
            character = characters.pop(0)

            if character.isspace():
                pass  # nothing to do, ignore it

            elif character == '(':
                # signals start of recursive analysis (push)
                characters, result = tokenize(characters)
                tokens.append(result)

            elif character == ')':
                # signals end of recursive analysis (pop)
                break

            elif is_symbol_char(character):
                # if it looks like a symbol, collect all
                # subsequent symbol characters
                symbol = ''

                while is_symbol_char(character):
                    symbol += character
                    character = characters.pop(0)

                # push unused non-symbol character back onto characters
                characters.insert(0, character)

                tokens.append(symbol)

        # Return whatever tokens we collected and any characters left over
        return characters, tokens

    def extract_rules(tokens):
        ''' Recursively walk tokenized data extracting rules. '''

        head, *tail = tokens

        print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

        for token in tail:  # recurse
            if isinstance(token, list):
                extract_rules(token)

    characters, tokens = tokenize(list(string))

    # After a successful tokenization, all the characters should be consumed
    assert not characters, "Didn't consume all the input!"

    print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

    extract_rules(tokens[0])

OUTPUT

    Tokens:

    ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

    Rules:

    ROOT --> S
    S --> NP VP
    NP --> NN DT NN
    NN --> Carnac
    DT --> the
    NN --> Magnificent
    VP --> VBD NP
    VBD --> gave
    NP --> DT NN
    DT --> a
    NN --> talk

NOTE

I changed the original tree because this clause:

    (NP ((DT a) (NN talk)))

seemed incorrect: it produces an empty node in the syntax tree graphers available on the web. So I simplified it to:

    (NP (DT a) (NN talk))

Adjust as needed.

(Editor: 李大同)
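The extract_rules function in the answer prints each rule as it finds it, and the answer sidesteps the unlabeled bracket in the asker's original tree by editing the input. If the rules are needed as data, or the original tree has to be processed unchanged, one possible variation is sketched below. It is not part of the original answer: the names parse and collect_rules are hypothetical, and the decision to splice the children of an unlabeled bracket into its parent (one level only) is an assumption about what the asker wants. The tokenizer uses only str.replace and str.split, so it stays within the "no modules, no regular expressions" constraint.

```python
def parse(text):
    """Turn a bracketed tree string into nested lists,
    e.g. '(DT the)' becomes ['DT', 'the']."""
    # Pad parentheses with spaces so split() separates them from symbols.
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    if not tokens or tokens[0] != '(':
        raise ValueError("tree must start with '('")
    position = 0

    def parse_node():
        nonlocal position
        node = []
        position += 1  # skip the opening '('
        while tokens[position] != ')':
            if tokens[position] == '(':
                node.append(parse_node())  # nested subtree
            else:
                node.append(tokens[position])  # label or word
                position += 1
        position += 1  # skip the closing ')'
        return node

    tree = parse_node()
    if position != len(tokens):
        raise ValueError("unexpected input after the end of the tree")
    return tree


def collect_rules(node, rules=None):
    """Return (head, children) pairs in the order the answer prints them."""
    if rules is None:
        rules = []
    head, *children = node

    # Assumption: an unlabeled bracket such as ((DT a) (NN talk)) shows up
    # as a list whose first element is itself a list. Its children are
    # spliced into the parent (one level only) instead of forming a node.
    flat = []
    for child in children:
        if isinstance(child, list) and child and isinstance(child[0], list):
            flat.extend(child)
        else:
            flat.append(child)

    rules.append((head, [c[0] if isinstance(c, list) else c for c in flat]))
    for child in flat:
        if isinstance(child, list):
            collect_rules(child, rules)
    return rules


# The asker's original tree, including the unlabeled bracket.
original = ("(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) "
            "(VP (VBD gave) (NP ((DT a) (NN talk))))))")

for head, children in collect_rules(parse(original)):
    print(head, '-->', *children)
```

Because the rules come back as a list of pairs, they can be de-duplicated or counted afterwards. Unbalanced input is not handled gracefully in this sketch: a missing closing bracket surfaces as an IndexError.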