加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 大数据 > 正文

FP-Tree思想与实现

发布时间:2020-12-14 03:54:20 所属栏目:大数据 来源:网络整理
导读:在关联规则挖掘领域最经典的算法法是Apriori,其致命的缺点是需要多次扫描事务数据库。于是人们提出了各种裁剪(prune)数据集的方法以减少I/O开支,韩嘉炜老师的FP-Tree算法就是其中非常高效的一种。 支持度和置信度 严格地说Apriori和FP-Tree都是寻找频繁

在关联规则挖掘领域最经典的算法法是Apriori,其致命的缺点是需要多次扫描事务数据库。于是人们提出了各种裁剪(prune)数据集的方法以减少I/O开支,韩嘉炜老师的FP-Tree算法就是其中非常高效的一种。

支持度和置信度

严格地说Apriori和FP-Tree都是寻找频繁项集的算法,频繁项集就是所谓的“支持度”比较高的项集,下面解释一下支持度和置信度的概念。

设事务数据库为:

复制代码

A  E  F  G

A  F  G

A  B  E  F  G

E  F  G

复制代码

则{A,F,G}的支持度数为3,支持度为3/4。

{F,G}的支持度数为4,支持度为4/4。

{A}的支持度数为3,支持度为3/4。

{A}=>{F,G}的置信度为:{A,G}的支持度数 除以?{A}的支持度数,即3/3

强关联规则挖掘是在满足一定支持度的情况下寻找置信度达到阈值的所有模式。

FP-Tree算法

我们举个例子来详细讲解FP-Tree算法的完整实现。

事务数据库如下,一行表示一条购物记录:

复制代码

牛奶,鸡蛋,面包,薯片

鸡蛋,爆米花,薯片,啤酒

鸡蛋,面包,薯片

牛奶,鸡蛋,面包,爆米花,薯片,啤酒

牛奶,面包,啤酒

鸡蛋,面包,啤酒

牛奶,面包,薯片

牛奶,鸡蛋,面包,黄油,薯片

牛奶,鸡蛋,黄油,薯片
我们的目的是要找出哪些商品总是相伴出现的,比如人们买薯片的时候通常也会买鸡蛋,则[薯片,鸡蛋]就是一条频繁模式(frequent pattern)。

FP-Tree算法第一步:扫描事务数据库,每项商品按频数递减排序,并删除频数小于最小支持度MinSup的商品。(第一次扫描数据库)

薯片:7鸡蛋:7面包:7牛奶:6啤酒:4? ? ? ? ? ? ? ? ? ? ? ?(这里我们令MinSup=3)

以上结果就是频繁1项集,记为F1。

第二步:对于每一条购买记录,按照F1中的顺序重新排序。(第二次也是最后一次扫描数据库)

复制代码

薯片,鸡蛋,面包,牛奶

薯片,啤酒

薯片,面包

薯片,牛奶,啤酒

面包,啤酒

鸡蛋,牛奶
第三步:把第二步得到的各条记录插入到FP-Tree中。刚开始时后缀模式为空。

插入每一条(薯片,牛奶)之后

插入第二条记录(薯片,啤酒)

插入第三条记录(面包,sans-serif; line-height:20px">

估计你也知道怎么插了,最终生成的FP-Tree是:

上图中左边的那一叫做表头项,树中相同名称的节点要链接起来,链表的第一个元素就是表头项里的元素。

如果FP-Tree为空(只含一个虚的root节点),则FP-Growth函数返回。

此时输出表头项的每一项+postModel,支持度为表头项中对应项的计数。

第四步:从FP-Tree中找出频繁项。

遍历表头项中的每一项(我们拿“牛奶:6”为例),对于各项都执行以下(1)到(5)的操作:

(1)从FP-Tree中找到所有的“牛奶”节点,向上遍历它的祖先节点,得到4条路径:

复制代码

薯片:7,鸡蛋:6,牛奶:1

薯片:6,面包:4,牛奶:3

薯片:7,面包:1,牛奶:1

面包:1
对于每一条路径上的节点,其count都设置为牛奶的count

1,鸡蛋:3,鸡蛋:3,面包:3,牛奶:1,面包: 因为每一项末尾都是牛奶,可以把牛奶去掉,得到条件模式基(Conditional Pattern Base,CPB),此时的后缀模式是:(牛奶)。

(2)我们把上面的结果当作原始的事务数据库,返回到第3步,递归迭代运行。

没讲清楚,你可以参考这篇博客,直接看核心代码吧:

[java]? view plain copy
  1. public?void?FPGrowth(List<List<String>>?transRecords,??
  2. ????????List<String>?postPattern,Context?context)?throws?IOException,?InterruptedException?{??
  3. ????//?构建项头表,同时也是频繁1项集??
  4. ????ArrayList<TreeNode>?HeaderTable?=?buildHeaderTable(transRecords);??
  5. //?构建FP-Tree??
  6. ????TreeNode?treeRoot?=?buildFPTree(transRecords,?HeaderTable);??
  7. //?如果FP-Tree为空则返回??
  8. ????if?(treeRoot.getChildren()==null?||?treeRoot.getChildren().size()?==?0)??
  9. ????????return;??
  10. ????//输出项头表的每一项+postPattern??
  11. ????if(postPattern!=null){??
  12. ????????for?(TreeNode?header?:?HeaderTable)?{??
  13. ????????????String?outStr=header.getName();??
  14. ????????????int?count=header.getCount();??
  15. ????????????for?(String?ele?:?postPattern)??
  16. ????????????????outStr+="t"?+?ele;??
  17. ????????????context.write(new?IntWritable(count),?new?Text(outStr));??
  18. ????????}??
  19. ????}??
  20. //?找到项头表的每一项的条件模式基,进入递归迭代??
  21. for?(TreeNode?header?:?HeaderTable)?{??
  22. ????????//?后缀模式增加一项??
  23. ????????List<String>?newPostPattern?=?new?LinkedList<String>();??
  24. ????????newPostPattern.add(header.getName());??
  25. if?(postPattern?!=?null)??
  26. ????????????newPostPattern.addAll(postPattern);??
  27. ????????//?寻找header的条件模式基CPB,放入newTransRecords中??
  28. ????????List<List<String>>?newTransRecords?=?new?LinkedList<List<String>>();??
  29. ????????TreeNode?backnode?=?header.getNextHomonym();??
  30. while?(backnode?!=?null)?{??
  31. int?counter?=?backnode.getCount();??
  32. ????????????List<String>?prenodes?=?new?ArrayList<String>();??
  33. ????????????TreeNode?parent?=?backnode;??
  34. ????????????//?遍历backnode的祖先节点,放到prenodes中??
  35. while?((parent?=?parent.getParent()).getName()?!=?null)?{??
  36. ????????????????prenodes.add(parent.getName());??
  37. ????????????}??
  38. while?(counter--?>?0)?{??
  39. ????????????????newTransRecords.add(prenodes);??
  40. ????????????}??
  41. ????????????backnode?=?backnode.getNextHomonym();??
  42. //?递归迭代??
  43. ????????FPGrowth(newTransRecords,?newPostPattern,context);??
  44. }??

对于FP-Tree已经是单枝的情况,就没有必要再递归调用FPGrowth了,直接输出整条路径上所有节点的各种组合+postModel就可了。例如当FP-Tree为:

我们直接输出:

3  A+postModel

3  B+postModel

3  A+B+postModel

就可以了。

如何按照上面代码里的做法,是先输出:

然后把B插入到postModel的头部,重新建立一个FP-Tree,这时Tree中只含A,于是输出

3  A+(B+postModel)

两种方法结果是一样的,但毕竟重新建立FP-Tree计算量大些。

Java实现

FP树节点定义

copy
    package?fptree;??
  1. ????
  2. import?java.util.ArrayList;??
  3. import?java.util.List;??
  4. ????
  5. class?TreeNode?implements?Comparable<TreeNode>?{??
  6. private?String?name;?//?节点名称??
  7. private?int?count;?//?计数??
  8. private?TreeNode?parent;?//?父节点??
  9. private?List<TreeNode>?children;?//?子节点??
  10. private?TreeNode?nextHomonym;?//?下一个同名节点??
  11. public?TreeNode()?{??
  12. ????}??
  13. public?TreeNode(String?name)?{??
  14. this.name?=?name;??
  15. public?String?getName()?{??
  16. return?name;??
  17. void?setName(String?name)?{??
  18. int?getCount()?{??
  19. return?count;??
  20. void?setCount(int?count)?{??
  21. this.count?=?count;??
  22. public?TreeNode?getParent()?{??
  23. return?parent;??
  24. void?setParent(TreeNode?parent)?{??
  25. this.parent?=?parent;??
  26. public?List<TreeNode>?getChildren()?{??
  27. return?children;??
  28. void?addChild(TreeNode?child)?{??
  29. if?(this.getChildren()?==? ????????????List<TreeNode>?list?=?new?ArrayList<TreeNode>();??
  30. ????????????list.add(child);??
  31. this.setChildren(list);??
  32. ????????}?else?{??
  33. this.getChildren().add(child);??
  34. ????????}??
  35. public?TreeNode?findChild(String?name)?{??
  36. ????????List<TreeNode>?children?=?this.getChildren();??
  37. if?(children?!=?for?(TreeNode?child?:?children)?{??
  38. ????????????????if?(child.getName().equals(name))?{??
  39. ????????????????????return?child;??
  40. ????????????????}??
  41. return?null;??
  42. void?setChildren(List<TreeNode>?children)?{??
  43. this.children?=?children;??
  44. void?printChildrenName()?{??
  45. ????????????????System.out.print(child.getName()?+?"?");??
  46. ????????}?else?{??
  47. ????????????System.out.print("null");??
  48. public?TreeNode?getNextHomonym()?{??
  49. return?nextHomonym;??
  50. void?setNextHomonym(TreeNode?nextHomonym)?{??
  51. this.nextHomonym?=?nextHomonym;??
  52. void?countIncrement(int?n)?{??
  53. this.count?+=?n;??
  54. ????@Override??
  55. int?compareTo(TreeNode?arg0)?{??
  56. //?TODO?Auto-generated?method?stub??
  57. int?count0?=?arg0.getCount();??
  58. //?跟默认的比较大小相反,导致调用Arrays.sort()时是按降序排列??
  59. return?count0?-?this.count;??
  60. }??

挖掘频繁模式
copy
    ???
  1. import?java.io.BufferedReader;??
  2. import?java.io.FileReader;??
  3. import?java.io.IOException;??
  4. import?java.util.ArrayList;??
  5. import?java.util.Collections;??
  6. import?java.util.Comparator;??
  7. import?java.util.HashMap;??
  8. import?java.util.LinkedList;??
  9. import?java.util.List;??
  10. import?java.util.Map;??
  11. import?java.util.Map.Entry;??
  12. import?java.util.Set;??
  13. ???
  14. class?FPTree?{??
  15. int?minSuport;??
  16. int?getMinSuport()?{??
  17. return?minSuport;??
  18. void?setMinSuport(int?minSuport)?{??
  19. this.minSuport?=?minSuport;??
  20. //?从若干个文件中读入Transaction?Record??
  21. public?List<List<String>>?readTransRocords(String...?filenames)?{??
  22. ????????List<List<String>>?transaction?=?null;??
  23. if?(filenames.length?>?0)?{??
  24. ????????????transaction?=?for?(String?filename?:?filenames)?{??
  25. try?{??
  26. ????????????????????FileReader?fr?=?new?FileReader(filename);??
  27. ????????????????????BufferedReader?br?=?new?BufferedReader(fr);??
  28. try?{??
  29. ????????????????????????String?line;??
  30. ????????????????????????List<String>?record;??
  31. ????????????????????????while?((line?=?br.readLine())?!=? ????????????????????????????if(line.trim().length()>0){??
  32. ????????????????????????????????String?str[]?=?line.split(",");??
  33. ????????????????????????????????record?=? ????????????????????????????????for?(String?w?:?str)??
  34. ????????????????????????????????????record.add(w);??
  35. ????????????????????????????????transaction.add(record);??
  36. ????????????????????????????}??
  37. ????????????????????????}??
  38. ????????????????????}?finally?{??
  39. ????????????????????????br.close();??
  40. ????????????????????}??
  41. ????????????????}?catch?(IOException?ex)?{??
  42. ????????????????????System.out.println("Read?transaction?records?failed."??
  43. ????????????????????????????+?ex.getMessage());??
  44. ????????????????????System.exit(1);??
  45. return?transaction;??
  46. //?FP-Growth算法??
  47. ????????????List<String>?postPattern)?{??
  48. ????????ArrayList<TreeNode>?HeaderTable?=?buildHeaderTable(transRecords);??
  49. ????????TreeNode?treeRoot?=?buildFPTree(transRecords,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????System.out.print(header.getCount()?+?"t"?+?header.getName());??
  50. for?(String?ele?:?postPattern)??
  51. ????????????????????System.out.print("t"?+?ele);??
  52. ????????????????System.out.println();??
  53. //?找到项头表的每一项的条件模式基,进入递归迭代??
  54. ????????????//?后缀模式增加一项??
  55. ????????????List<String>?newPostPattern?=?new?LinkedList<String>();??
  56. ????????????newPostPattern.add(header.getName());??
  57. null)??
  58. ????????????????newPostPattern.addAll(postPattern);??
  59. //?寻找header的条件模式基CPB,放入newTransRecords中??
  60. ????????????List<List<String>>?newTransRecords?=?new?LinkedList<List<String>>();??
  61. ????????????TreeNode?backnode?=?header.getNextHomonym();??
  62. int?counter?=?backnode.getCount();??
  63. ????????????????List<String>?prenodes?=?new?ArrayList<String>();??
  64. ????????????????TreeNode?parent?=?backnode;??
  65. ????????????????//?遍历backnode的祖先节点,放到prenodes中??
  66. ????????????????????prenodes.add(parent.getName());??
  67. ???????????????? ????????????????????newTransRecords.add(prenodes);??
  68. ????????????????}??
  69. ????????????????backnode?=?backnode.getNextHomonym();??
  70. //?递归迭代??
  71. ????????????FPGrowth(newTransRecords,?newPostPattern);??
  72. public?ArrayList<TreeNode>?buildHeaderTable(List<List<String>>?transRecords)?{??
  73. ????????ArrayList<TreeNode>?F1?=?if?(transRecords.size()?>? ????????????F1?=?new?ArrayList<TreeNode>();??
  74. ????????????Map<String,?TreeNode>?map?=?new?HashMap<String,?TreeNode>();??
  75. //?计算事务数据库中各项的支持度??
  76. for?(List<String>?record?:?transRecords)?{??
  77. for?(String?item?:?record)?{??
  78. ????????????????????if?(!map.keySet().contains(item))?{??
  79. ????????????????????????TreeNode?node?=?new?TreeNode(item);??
  80. ????????????????????????node.setCount(1);??
  81. ????????????????????????map.put(item,?node);??
  82. ????????????????????}? ????????????????????????map.get(item).countIncrement( ????????????????????}??
  83. //?把支持度大于(或等于)minSup的项加入到F1中??
  84. ????????????Set<String>?names?=?map.keySet();??
  85. for?(String?name?:?names)?{??
  86. ????????????????TreeNode?tnode?=?map.get(name);??
  87. if?(tnode.getCount()?>=?minSuport)?{??
  88. ????????????????????F1.add(tnode);??
  89. ????????????Collections.sort(F1);??
  90. return?F1;??
  91. //?构建FP-Tree??
  92. public?TreeNode?buildFPTree(List<List<String>>?transRecords,248); line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????ArrayList<TreeNode>?F1)?{??
  93. ????????TreeNode?root?=?new?TreeNode();?//?创建树的根节点??
  94. for?(List<String>?transRecord?:?transRecords)?{??
  95. ????????????LinkedList<String>?record?=?sortByF1(transRecord,?F1);??
  96. ????????????TreeNode?subTreeRoot?=?root;??
  97. ????????????TreeNode?tmpRoot?=?if?(root.getChildren()?!=?while?(!record.isEmpty()??
  98. ????????????????????????&&?(tmpRoot?=?subTreeRoot.findChild(record.peek()))?!=? ????????????????????tmpRoot.countIncrement( ????????????????????subTreeRoot?=?tmpRoot;??
  99. ????????????????????record.poll();??
  100. ????????????addNodes(subTreeRoot,?record,?F1);??
  101. return?root;??
  102. //?把交易记录按项的频繁程序降序排列??
  103. public?LinkedList<String>?sortByF1(List<String>?transRecord,??
  104. ????????????ArrayList<TreeNode>?F1)?{??
  105. ????????Map<String,?Integer>?map?=?for?(String?item?:?transRecord)?{??
  106. //?由于F1已经是按降序排列的,??
  107. for?(int?i?=?0;?i?<?F1.size();?i++)?{??
  108. ????????????????TreeNode?tnode?=?F1.get(i);??
  109. if?(tnode.getName().equals(item))?{??
  110. ????????????????????map.put(item,?i);??
  111. ????????ArrayList<Entry<String,?Integer>>?al?=?new?ArrayList<Entry<String,?Integer>>(??
  112. ????????????????map.entrySet());??
  113. ????????Collections.sort(al,?new?Comparator<Map.Entry<String,?Integer>>()?{??
  114. ????????????int?compare(Entry<String,?Integer>?arg0,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????????Entry<String,?Integer>?arg1)?{??
  115. ????????????????//?降序排列??
  116. return?arg0.getValue()?-?arg1.getValue();??
  117. ????????});??
  118. ????????LinkedList<String>?rest?=?for?(Entry<String,?Integer>?entry?:?al)?{??
  119. ????????????rest.add(entry.getKey());??
  120. return?rest;??
  121. //?把record作为ancestor的后代插入树中??
  122. void?addNodes(TreeNode?ancestor,?LinkedList<String>?record,153); background-color:inherit; font-weight:bold">if?(record.size()?>?while?(record.size()?>? ????????????????String?item?=?record.poll();??
  123. ????????????????TreeNode?leafnode?=? ????????????????leafnode.setCount( ????????????????leafnode.setParent(ancestor);??
  124. ????????????????ancestor.addChild(leafnode);??
  125. for?(TreeNode?f1?:?F1)?{??
  126. if?(f1.getName().equals(item))?{??
  127. while?(f1.getNextHomonym()?!=? ????????????????????????????f1?=?f1.getNextHomonym();??
  128. ????????????????????????f1.setNextHomonym(leafnode);??
  129. break;??
  130. ????????????????addNodes(leafnode,153); background-color:inherit; font-weight:bold">static?void?main(String[]?args)?{??
  131. ????????FPTree?fptree?=?new?FPTree();??
  132. ????????fptree.setMinSuport(3);??
  133. ????????List<List<String>>?transRecords?=?fptree??
  134. ????????????????.readTransRocords("/home/orisun/test/market");??
  135. ????????fptree.FPGrowth(transRecords,153); background-color:inherit; font-weight:bold">null);??
  136. 输入文件

    复制代码

    牛奶,鸡蛋,面包,薯片
    鸡蛋,爆米花,薯片,啤酒
    鸡蛋,面包,薯片
    牛奶,鸡蛋,面包,爆米花,薯片,啤酒
    牛奶,面包,啤酒
    鸡蛋,面包,啤酒
    牛奶,面包,薯片
    牛奶,鸡蛋,面包,黄油,薯片
    牛奶,鸡蛋,黄油,薯片
    输出

    复制代码

    6    薯片    鸡蛋
    5    薯片    面包
    5    鸡蛋    面包
    4    薯片    鸡蛋    面包
    5    薯片    牛奶
    5    面包    牛奶
    4    鸡蛋    牛奶
    4    薯片    面包    牛奶
    4    薯片    鸡蛋    牛奶
    3    面包    鸡蛋    牛奶
    3    薯片    面包    鸡蛋    牛奶
    3    鸡蛋    啤酒
    3    面包    啤酒

    复制代码

    用Hadoop来实现

    在上面的代码我们把整个事务数据库放在一个List<List<String>>里面传给FPGrowth,在实际中这是不可取的,因为内存不可能容下整个事务数据库,我们可能需要从关系关系数据库中一条一条地读入来建立FP-Tree。但无论如何?FP-Tree是肯定需要放在内存中的,但内存如果容不下怎么办?另外FPGrowth仍然是非常耗时的,你想提高速度怎么办?解决办法:分而治之,并行计算。

    我们把原始事务数据库分成N部分,在N个节点上并行地进行FPGrowth挖掘,最后把关联规则汇总到一起就可以了。关键问题是怎么“划分”才会不遗露任何一条关联规则呢?参见这篇博客。这里为了达到并行计算的目的,采用了一种“冗余”的划分方法,即各部分的并集大于原来的集合。这种方法最终求出来的关联规则也是有冗余的,比如在节点1上得到一条规则(6:啤酒,尿布),在节点2上得到一条规则(3:尿布,啤酒),显然节点2上的这条规则是冗余的,需要采用后续步骤把冗余的规则去掉。

    代码:

    Record.java

    copy
    import?java.io.DataInput;??
  1. import?java.io.DataOutput;??
  2. import?java.util.Collections;??
  3. import?java.util.LinkedList;??
  4. import?org.apache.hadoop.io.WritableComparable;??
  5. class?Record?implements?WritableComparable<Record>{??
  6. ???????
  7. ????LinkedList<String>?list;??
  8. public?Record(){??
  9. ????????list=public?Record(String[]?arr){??
  10. for(int?i=0;i<arr.length;i++)??
  11. ????????????list.add(arr[i]);??
  12. public?String?toString(){??
  13. ????????String?str=list.get(0);??
  14. 1;i<list.size();i++)??
  15. ????????????str+="t"+list.get(i);??
  16. return?str;??
  17. void?readFields(DataInput?in)?throws?IOException?{??
  18. ????????list.clear();??
  19. ????????String?line=in.readUTF();??
  20. ????????String?[]arr=line.split("s+");??
  21. 0;i<arr.length;i++)??
  22. ????????????list.add(arr[i]);??
  23. ????@Override??
  24. void?write(DataOutput?out)?throws?IOException?{??
  25. ????????out.writeUTF(this.toString());??
  26. int?compareTo(Record?obj)?{??
  27. ????????Collections.sort(list);??
  28. ????????Collections.sort(obj.list);??
  29. this.toString().compareTo(obj.toString());??
  30. DC_FPTree.java

    copy
    import?java.io.IOException;??
  1. import?java.io.InputStreamReader;??
  2. import?java.util.BitSet;??
  3. import?java.util.Comparator;??
  4. import?java.util.HashMap;??
  5. import?java.util.Map;??
  6. import?java.util.Map.Entry;??
  7. import?java.util.Set;??
  8. import?org.apache.hadoop.conf.Configuration;??
  9. import?org.apache.hadoop.conf.Configured;??
  10. import?org.apache.hadoop.fs.FSDataInputStream;??
  11. import?org.apache.hadoop.fs.FileSystem;??
  12. import?org.apache.hadoop.fs.Path;??
  13. import?org.apache.hadoop.io.IntWritable;??
  14. import?org.apache.hadoop.io.LongWritable;??
  15. import?org.apache.hadoop.io.Text;??
  16. import?org.apache.hadoop.mapreduce.Job;??
  17. import?org.apache.hadoop.mapreduce.Mapper;??
  18. import?org.apache.hadoop.mapreduce.Reducer;??
  19. import?org.apache.hadoop.mapreduce.lib.input.FileInputFormat;??
  20. import?org.apache.hadoop.mapreduce.lib.input.TextInputFormat;??
  21. import?org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;??
  22. import?org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;??
  23. import?org.apache.hadoop.util.Tool;??
  24. import?org.apache.hadoop.util.ToolRunner;??
  25. class?DC_FPTree?extends?Configured?implements?Tool?{??
  26. final?int?GroupNum?=?10;??
  27. int?minSuport=6;??
  28. class?GroupMapper?extends??
  29. ????????????Mapper<LongWritable,?Text,?IntWritable,?Record>?{??
  30. ????????List<String>?freq?=?new?LinkedList<String>();?//?频繁1项集??
  31. ????????List<List<String>>?freq_group?=?new?LinkedList<List<String>>();?//?分组后的频繁1项集??
  32. ????????void?setup(Context?context)?//?从文件读入频繁1项集??
  33. ????????????FileSystem?fs?=?FileSystem.get(context.getConfiguration());??
  34. ????????????Path?freqFile?=?new?Path("/user/orisun/input/F1");??
  35. ????????????FSDataInputStream?in?=?fs.open(freqFile);??
  36. ????????????InputStreamReader?isr?=?new?InputStreamReader(in);??
  37. ????????????BufferedReader?br?=?new?BufferedReader(isr);??
  38. ????????????????String?line;??
  39. ????????????????????String[]?str?=?line.split("s+");??
  40. ????????????????????String?word?=?str[0];??
  41. ????????????????????freq.add(word);??
  42. ????????????}?finally?{??
  43. ????????????????br.close();??
  44. //?对频繁1项集进行分组??
  45. ????????????Collections.shuffle(freq);?//?打乱顺序??
  46. int?cap?=?freq.size()?/?GroupNum;?//?每段分为一组??
  47. 0;?i?<?GroupNum;?i++)?{??
  48. ????????????????List<String>?list?=?int?j?=?0;?j?<?cap;?j++)?{??
  49. ????????????????????list.add(freq.get(i?*?cap?+?j));??
  50. ????????????????freq_group.add(list);??
  51. int?remainder?=?freq.size()?%?GroupNum;??
  52. int?base?=?GroupNum?*?cap;??
  53. 0;?i?<?remainder;?i++)?{??
  54. ????????????????freq_group.get(i).add(freq.get(base?+?i));??
  55. ????????void?map(LongWritable?key,?Text?value,?Context?context)??
  56. ????????????String[]?arr?=?value.toString().split("s+");??
  57. ????????????Record?record?=?new?Record(arr);??
  58. ????????????LinkedList<String>?list?=?record.list;??
  59. ????????????BitSet?bs=new?BitSet(freq_group.size());??
  60. ????????????bs.clear();??
  61. while?(record.list.size()?>? ????????????????String?item?=?list.peekLast();?//?取出record的最后一项??
  62. 0;??
  63. for?(;?i?<?freq_group.size();?i++)?{??
  64. if(bs.get(i))??
  65. ????????????????????????continue;??
  66. if?(freq_group.get(i).contains(item))?{??
  67. ????????????????????????bs.set(i);??
  68. if(i<freq_group.size()){?????//找到了??
  69. ????????????????????context.write(new?IntWritable(i),?record);????
  70. ????????????????record.list.pollLast();??
  71. class?FPReducer?extends?Reducer<IntWritable,Record,IntWritable,Text>{??
  72. void?reduce(IntWritable?key,Iterable<Record>?values,Context?context) ????????????List<List<String>>?trans=while(values.iterator().hasNext()){??
  73. ????????????????Record?record=values.iterator().next();??
  74. ????????????????LinkedList<String>?list=for(String?ele:record.list)??
  75. ????????????????????list.add(ele);??
  76. ????????????????trans.add(list);??
  77. ????????????FPGrowth(trans,153); background-color:inherit; font-weight:bold">null,context);??
  78. //?FP-Growth算法??
  79. ????????????List<String>?postPattern,?InterruptedException?{??
  80. //?构建项头表,同时也是频繁1项集??
  81. ????????ArrayList<TreeNode>?HeaderTable?=?buildHeaderTable(transRecords);??
  82. ????????TreeNode?treeRoot?=?buildFPTree(transRecords,?HeaderTable);??
  83. //?如果FP-Tree为空则返回??
  84. 0)??
  85. return;??
  86. //输出项头表的每一项+postPattern??
  87. null){??
  88. ????????????????String?outStr=header.getName();??
  89. int?count=header.getCount();??
  90. ????????????????????outStr+="t"?+?ele;??
  91. ????????????????context.write(new?Text(outStr));??
  92. ????????????ArrayList<TreeNode>?F1?=? ????????????????F1?=? ????????????????Map<String,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????????????????TreeNode?node?=? ????????????????????????????node.setCount( ????????????????????????????map.put(item,248); line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????????????}? ????????????????????????????map.get(item).countIncrement( ????????????????Set<String>?names?=?map.keySet();??
  93. ????????????????????TreeNode?tnode?=?map.get(name);??
  94. ????????????????????????F1.add(tnode);??
  95. ????????????????Collections.sort(F1);??
  96. ????????????}? ????????????????ArrayList<TreeNode>?F1)?{??
  97. ????????????TreeNode?root?=? ????????????????LinkedList<String>?record?=?sortByF1(transRecord,248); line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????TreeNode?subTreeRoot?=?root;??
  98. ????????????????TreeNode?tmpRoot?=? ????????????????????????????&&?(tmpRoot?=?subTreeRoot.findChild(record.peek()))?!=? ????????????????????????tmpRoot.countIncrement( ????????????????????????subTreeRoot?=?tmpRoot;??
  99. ????????????????????????record.poll();??
  100. ????????????????addNodes(subTreeRoot,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????ArrayList<TreeNode>?F1)?{??
  101. ????????????????????TreeNode?tnode?=?F1.get(i);??
  102. ????????????????????????map.put(item,248); line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????ArrayList<Entry<String,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????????????map.entrySet());??
  103. ????????????Collections.sort(al,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ???????????????? ????????????????????????Entry<String,248); line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ???????????????????? ????????????});??
  104. ????????????LinkedList<String>?rest?=? ????????????????rest.add(entry.getKey());??
  105. ????????????????????String?item?=?record.poll();??
  106. ????????????????????TreeNode?leafnode?=? ????????????????????leafnode.setCount( ????????????????????leafnode.setParent(ancestor);??
  107. ????????????????????ancestor.addChild(leafnode);??
  108. ???????????????????????????? ????????????????????????????????f1?=?f1.getNextHomonym();??
  109. ????????????????????????????}??
  110. ????????????????????????????f1.setNextHomonym(leafnode);??
  111. ????????????????????????}??
  112. ????????????????????addNodes(leafnode,108); list-style:decimal-leading-zero outside; color:inherit; line-height:17.600000381469727px; margin:0px!important; padding:0px 3px 0px 10px!important"> ???????
  113. class?InverseMapper? ????????????String?[]arr=value.toString().split("s+");??
  114. int?count=Integer.parseInt(arr[0]);??
  115. ????????????Record?record=new?Record();??
  116. 1;i<arr.length;i++){??
  117. ????????????????record.list.add(arr[i]);??
  118. ????????????context.write(record,153); background-color:inherit; font-weight:bold">new?IntWritable(count));??
  119. class?MaxReducer?extends?Reducer<Record,Record>{??
  120. void?reduce(Record?key,Iterable<IntWritable>?values,153); background-color:inherit; font-weight:bold">int?max=-1;??
  121. for(IntWritable?value:values){??
  122. int?i=value.get();??
  123. if(i>max)??
  124. ????????????????????max=i;??
  125. new?IntWritable(max),?key);??
  126. int?run(String[]?arg0)?throws?Exception?{??
  127. ????????Configuration?conf=getConf();??
  128. ????????conf.set("mapred.task.timeout",?"6000000");??
  129. ????????Job?job=new?Job(conf);??
  130. ????????job.setJarByClass(DC_FPTree.class);??
  131. ????????FileSystem?fs=FileSystem.get(getConf());??
  132. ???????????
  133. ????????FileInputFormat.setInputPaths(job,?"/user/orisun/input/data");??
  134. ????????Path?outDir=new?Path("/user/orisun/output");??
  135. ????????fs.delete(outDir,true);??
  136. ????????FileOutputFormat.setOutputPath(job,?outDir);??
  137. ???????????
  138. ????????job.setMapperClass(GroupMapper. ????????job.setReducerClass(FPReducer.class);??
  139. ????????job.setInputFormatClass(TextInputFormat. ????????job.setOutputFormatClass(TextOutputFormat. ????????job.setMapOutputKeyClass(IntWritable. ????????job.setMapOutputValueClass(Record. ????????job.setOutputKeyClass(IntWritable. ????????job.setOutputKeyClass(Text.boolean?success=job.waitForCompletion(true);??
  140. ????????job=new?Job(conf);??
  141. ????????job.setJarByClass(DC_FPTree."/user/orisun/output/part-r-*");??
  142. ????????Path?outDir2=new?Path("/user/orisun/output2");??
  143. ????????fs.delete(outDir2,?outDir2);??
  144. ????????job.setMapperClass(InverseMapper. ????????job.setReducerClass(MaxReducer.//job.setNumReduceTasks(0);??
  145. ????????job.setInputFormatClass(TextInputFormat. ????????job.setOutputFormatClass(TextOutputFormat. ????????job.setMapOutputKeyClass(Record. ????????job.setMapOutputValueClass(IntWritable. ????????job.setOutputKeyClass(IntWritable. ????????job.setOutputKeyClass(Record. ????????success?|=?job.waitForCompletion(return?success?0:1;??
  146. void?main(String[]?args)?throws?Exception{??
  147. int?res=ToolRunner.run(new?Configuration(),153); background-color:inherit; font-weight:bold">new?DC_FPTree(),?args);??
  148. ????????System.exit(res);??
  149. }??

结束语

在实践中,关联规则挖掘可能并不像人们期望的那么有用。一方面是因为支持度置信度框架会产生过多的规则,并不是每一个规则都是有用的。另一方面大部分的关联规则并不像“啤酒与尿布”这种经典故事这么普遍。关联规则分析是需要技巧的,有时需要用更严格的统计学知识来控制规则的增殖。 

原文来自:博客园(华夏35度)http://www.cnblogs.com/zhangchaoyang 作者:Orisun

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读