java - Word co-occurrence in sentences
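I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.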
Example sentences:

"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted,as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214,ValueMania,has been accepted by the Chief."

I want to produce the output below. I should be able to supply three starting words as program arguments: "Chief,accepted,Proposal"

Chief accepted Proposal 5
Chief accepted Proposal has 3
Chief accepted Proposal has been 3
...
...
for all combinations.

I know the number of combinations could be huge.

I searched online but could not find anything. I wrote some code, but I cannot get my head around it. Maybe someone who knows the domain will know.

```java
// Fragment from the calling method: counts how often each word occurs
// within each individual sentence (not across sentences).
ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();
try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            if (null == key) {
                break; // the rest of the array is unused padding
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

// Returns the distinct words in first-seen order; unused slots stay null.
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];
    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;
    for (int i = 1; i < keys.length; i++) {
        // Compare only against the slots filled so far.
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }
        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}
```

Can someone help with the coding?

Solution
You can apply a standard information-retrieval data structure here, namely an inverted index. Here is how you would do it.
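Consider your original sentences. Number them with integer identifiers, like so:

1. Proposal 201 has been accepted by the Chief today.
2. Proposal 214 and 221 are accepted,as per recent Chief decision
3. This proposal has been accepted by the Chief.
4. Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.
5. Proposal 214,ValueMania,has been accepted by the Chief.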
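For every pair of words you encounter in a sentence, add it to an inverted index that maps the pair to a set (a collection of unique items) of sentence identifiers. A sentence of N words has N-choose-2 pairs.

An appropriate Java data structure would be Map<String, Map<String, Set<Integer>>>. Order each pair alphabetically, so that the pair "has" and "Proposal" occurs only as ("has", "Proposal") and never as ("Proposal", "has").

This map will contain the following:

```
"has", "Proposal"      --> Set(1,5)
"accepted", "Proposal" --> Set(1,2,5)
"accepted", "has"      --> Set(1,3,5)
etc.
```

For example, the word pair "has" and "Proposal" has the set (1,5), meaning the two words were found together in sentences 1 and 5.

Now suppose you want the number of co-occurrences of the words in the list "accepted", "has" and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here ends up being the set (1,5). Its size is 2, meaning there are two sentences that contain "accepted", "has" and "Proposal".

To generate all pairs, simply iterate over the map as needed. To generate all word tuples of size N, you will need to iterate and use recursion as needed.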
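The pair index and the intersection step described above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the answer: the class and method names (CooccurrenceIndex, words, add, lookup, sentencesContaining) are made up, words are split on any non-alphanumeric character, matching is case-sensitive (so "Proposal" and "proposal" are different words, which is what the example sets above imply), and pairs are ordered with String.compareTo rather than strictly alphabetically. Any consistent ordering works, as long as indexing and lookup use the same one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CooccurrenceIndex {
    // first word -> second word -> ids of sentences containing both words
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    /** Distinct words of a sentence in sorted order (case-sensitive: an assumption). */
    static List<String> words(String sentence) {
        Set<String> unique = new TreeSet<>();
        for (String w : sentence.split("[^A-Za-z0-9]+")) {
            if (!w.isEmpty()) {
                unique.add(w);
            }
        }
        return new ArrayList<>(unique);
    }

    /** Registers every word pair of the sentence under the given sentence id. */
    void add(int id, String sentence) {
        List<String> ws = words(sentence);
        for (int i = 0; i < ws.size(); i++) {
            for (int j = i + 1; j < ws.size(); j++) {
                // ws is sorted, so ws.get(i) always precedes ws.get(j)
                index.computeIfAbsent(ws.get(i), k -> new HashMap<>())
                     .computeIfAbsent(ws.get(j), k -> new HashSet<>())
                     .add(id);
            }
        }
    }

    /** Sentence ids for one pair, in whichever order the two words are given. */
    Set<Integer> lookup(String a, String b) {
        boolean inOrder = a.compareTo(b) <= 0;
        String first = inOrder ? a : b;
        String second = inOrder ? b : a;
        Map<String, Set<Integer>> inner =
                index.getOrDefault(first, Collections.emptyMap());
        return inner.getOrDefault(second, Collections.emptySet());
    }

    /** Intersects the sets of all pairs drawn from the query (two or more distinct words). */
    Set<Integer> sentencesContaining(List<String> query) {
        Set<Integer> result = null;
        for (int i = 0; i < query.size(); i++) {
            for (int j = i + 1; j < query.size(); j++) {
                Set<Integer> ids = lookup(query.get(i), query.get(j));
                if (result == null) {
                    result = new HashSet<>(ids);
                } else {
                    result.retainAll(ids);
                }
            }
        }
        return result == null ? new HashSet<>() : result;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted,as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214,ValueMania,has been accepted by the Chief."
        };
        CooccurrenceIndex idx = new CooccurrenceIndex();
        for (int i = 0; i < sentences.length; i++) {
            idx.add(i + 1, sentences[i]); // ids start at 1, as in the numbering above
        }
        Set<Integer> hits = idx.sentencesContaining(Arrays.asList("accepted", "has", "Proposal"));
        System.out.println("Sentences: " + new TreeSet<>(hits) + ", count: " + hits.size());
    }
}
```

With the five example sentences, the query "accepted", "has", "Proposal" should come back with sentences 1 and 5, matching the sets listed above. If the counts should ignore case, as the expected output in the question seems to suggest, one option would be to lowercase both the sentences and the query words before indexing and lookup.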