The most classic algorithm in association rule mining is Apriori, whose fatal drawback is that it must scan the transaction database many times. Various methods of pruning the dataset have therefore been proposed to reduce I/O cost, and Jiawei Han's FP-Tree algorithm is a particularly efficient one.
Support and Confidence
Strictly speaking, Apriori and FP-Tree are both algorithms for finding frequent itemsets, i.e. itemsets whose "support" is relatively high. Let us first explain the notions of support and confidence.
Suppose the transaction database is:
A E F G
A F G
A B E F G
E F G
Then the support count of {A,F,G} is 3, and its support is 3/4.
The support count of {F,G} is 4, and its support is 4/4.
The support count of {A} is 3, and its support is 3/4.
The confidence of {A}=>{F,G} is the support count of {A,F,G} divided by the support count of {A}, i.e. 3/3.
Strong association rule mining finds, subject to a minimum support, all patterns whose confidence reaches a given threshold.
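To make the two definitions concrete, here is a minimal, self-contained Java sketch (the class and method names are mine, not part of the original code) that computes the support counts and the confidence of {A}=>{F,G} for the four-transaction database above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportDemo {

    // Count the transactions that contain every item of the given itemset.
    static int supportCount(List<Set<String>> db, Set<String> itemset) {
        int n = 0;
        for (Set<String> t : db)
            if (t.containsAll(itemset))
                n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
                new HashSet<>(Arrays.asList("A", "E", "F", "G")),
                new HashSet<>(Arrays.asList("A", "F", "G")),
                new HashSet<>(Arrays.asList("A", "B", "E", "F", "G")),
                new HashSet<>(Arrays.asList("E", "F", "G")));
        Set<String> a = new HashSet<>(Arrays.asList("A"));
        Set<String> afg = new HashSet<>(Arrays.asList("A", "F", "G"));
        // support count of {A,F,G} is 3, so its support is 3/4
        System.out.println(supportCount(db, afg));
        // confidence({A} => {F,G}) = supportCount({A,F,G}) / supportCount({A})
        System.out.println((double) supportCount(db, afg) / supportCount(db, a));
    }
}

Running it prints a support count of 3 and a confidence of 1.0, matching the hand calculation.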
The FP-Tree Algorithm
Let us walk through a complete implementation of the FP-Tree algorithm with an example.
The transaction database is given below; each line is one shopping record:
milk,eggs,bread,chips
eggs,popcorn,chips,beer
eggs,bread,chips
milk,eggs,bread,popcorn,chips,beer
milk,bread,beer
eggs,bread,beer
milk,bread,chips
milk,eggs,bread,butter,chips
milk,eggs,butter,chips
Our goal is to find out which goods tend to appear together; for instance, people who buy chips usually also buy eggs, so [chips, eggs] is a frequent pattern.
Step 1 of the FP-Tree algorithm: scan the transaction database, sort the items by descending frequency, and delete items whose frequency is below the minimum support MinSup. (This is the first scan of the database.)
chips:7  eggs:7  bread:7  milk:6  beer:4            (here we set MinSup = 3)
The result above is the frequent 1-itemset, denoted F1.
Step 2: reorder each shopping record according to the order of F1, dropping the infrequent items. (This is the second, and last, scan of the database.) A code sketch of both scans follows the re-sorted records below.
chips,eggs,bread,milk
chips,eggs,beer
chips,eggs,bread
chips,eggs,bread,milk,beer
bread,milk,beer
eggs,bread,beer
chips,bread,milk
chips,eggs,bread,milk
chips,eggs,milk
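The two scans can be sketched as follows (a minimal illustration; the method names are my own, and the author's full implementation appears later as buildHeaderTable and sortByF1):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScanDemo {

    // First scan: count item frequencies, keep items occurring at least
    // minSup times, and order them by descending frequency (this is F1).
    // Ties (chips/eggs/bread all have count 7) are broken arbitrarily here.
    static List<String> buildF1(List<List<String>> db, int minSup) {
        Map<String, Integer> cnt = new HashMap<>();
        for (List<String> t : db)
            for (String item : t)
                cnt.merge(item, 1, Integer::sum);
        List<String> f1 = new ArrayList<>();
        for (Map.Entry<String, Integer> e : cnt.entrySet())
            if (e.getValue() >= minSup)
                f1.add(e.getKey());
        f1.sort((x, y) -> cnt.get(y) - cnt.get(x));
        return f1;
    }

    // Second scan: drop infrequent items from a record and reorder the
    // remaining ones according to their position in F1.
    static List<String> reorder(List<String> record, List<String> f1) {
        List<String> out = new ArrayList<>();
        for (String item : f1)
            if (record.contains(item))
                out.add(item);
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
                Arrays.asList("milk", "eggs", "bread", "chips"),
                Arrays.asList("eggs", "popcorn", "chips", "beer"),
                Arrays.asList("eggs", "bread", "chips"));
        List<String> f1 = buildF1(db, 2); // minSup = 2 for this tiny sample
        System.out.println(f1);
        for (List<String> t : db)
            System.out.println(reorder(t, f1));
    }
}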
Step 3: insert each of the records from step 2 into the FP-Tree. Initially the suffix pattern is empty.
After inserting the first record (chips, eggs, bread, milk): [figure omitted]
After inserting the second record (chips, eggs, beer): [figure omitted]
After inserting the third record (chips, eggs, bread): [figure omitted]
You can probably guess how the remaining insertions go; the final FP-Tree is: [figure omitted]
The column on the left of the figure above is called the header table. Nodes with the same name in the tree are linked together, and the first element of each linked list is the corresponding header-table entry.
If the FP-Tree is empty (containing only a virtual root node), the FP-Growth function returns.
At that point it outputs each header-table item + postPattern, with support equal to the item's count recorded in the header table.
Step 4: extract the frequent patterns from the FP-Tree.
Traverse the entries of the header table (we take "milk:6" as the example) and perform the following operations on each of them:
(1) Find all the "milk" nodes in the FP-Tree and walk up through their ancestor nodes, obtaining 4 paths:
chips:7, eggs:6, milk:1
chips:7, eggs:6, bread:4, milk:3
chips:7, bread:1, milk:1
bread:1, milk:1
On each path, set every node's count to the count of the milk node on that path:
chips:1, eggs:1, milk:1
chips:3, eggs:3, bread:3, milk:3
chips:1, bread:1, milk:1
bread:1, milk:1
Since every path now ends in milk, milk can be dropped, which yields the conditional pattern base (CPB):
chips:1, eggs:1
chips:3, eggs:3, bread:3
chips:1, bread:1
bread:1
The suffix pattern at this point is (milk).
(2) Treat the result above as a new transaction database, go back to step 3, and iterate recursively.
If the explanation above is not clear enough, you can consult the blog post referenced in the original; let us look directly at the core code:
public void FPGrowth(List<List<String>> transRecords,
        List<String> postPattern, Context context) throws IOException,
        InterruptedException {
    // Build the header table (which is also the frequent 1-itemset)
    ArrayList<TreeNode> HeaderTable = buildHeaderTable(transRecords);
    // Build the FP-Tree
    TreeNode treeRoot = buildFPTree(transRecords, HeaderTable);
    // Return if the FP-Tree is empty
    if (treeRoot.getChildren() == null || treeRoot.getChildren().size() == 0)
        return;
    // Output each header-table item + postPattern
    if (postPattern != null) {
        for (TreeNode header : HeaderTable) {
            String outStr = header.getName();
            int count = header.getCount();
            for (String ele : postPattern)
                outStr += "\t" + ele;
            context.write(new IntWritable(count), new Text(outStr));
        }
    }
    // Find each header-table item's conditional pattern base and recurse
    for (TreeNode header : HeaderTable) {
        // Extend the suffix pattern with this item
        List<String> newPostPattern = new LinkedList<String>();
        newPostPattern.add(header.getName());
        if (postPattern != null)
            newPostPattern.addAll(postPattern);
        // Collect the conditional pattern base (CPB) into newTransRecords
        List<List<String>> newTransRecords = new LinkedList<List<String>>();
        TreeNode backnode = header.getNextHomonym();
        while (backnode != null) {
            int counter = backnode.getCount();
            List<String> prenodes = new ArrayList<String>();
            TreeNode parent = backnode;
            // Walk up through the ancestors of backnode
            while ((parent = parent.getParent()).getName() != null) {
                prenodes.add(parent.getName());
            }
            // Add the path once per unit of backnode's count
            while (counter-- > 0) {
                newTransRecords.add(prenodes);
            }
            backnode = backnode.getNextHomonym();
        }
        // Recurse
        FPGrowth(newTransRecords, newPostPattern, context);
    }
}
When the FP-Tree has degenerated into a single path, there is no need to call FPGrowth recursively; we can directly output every combination of the nodes on the path together with postPattern. For example, when the FP-Tree is:
[Figure omitted: a single-path tree A:3 -> B:3]
we directly output:
3 A+postPattern
3 B+postPattern
3 A+B+postPattern
and we are done.
If we instead follow the code above, it first outputs 3 A+postPattern and 3 B+postPattern,
then inserts B at the head of postPattern and rebuilds an FP-Tree, which by then contains only A, and outputs
3 A+(B+postPattern)
The two approaches give the same result, but rebuilding the FP-Tree costs extra computation. A short sketch of the single-path shortcut follows.
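Here is a minimal sketch of that single-path shortcut (it is not part of the author's code; names and structure are my own). It enumerates every non-empty subset of the path with a bitmask:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SinglePathDemo {

    // For a single-path FP-Tree whose nodes all carry the same count,
    // every non-empty subset of the path, combined with the suffix
    // pattern, is frequent with exactly that count. (In general the
    // support of a subset is the minimum count over its chosen nodes.)
    static void emitCombinations(List<String> path, int count,
            List<String> postPattern) {
        int n = path.size();
        for (int mask = 1; mask < (1 << n); mask++) { // all non-empty subsets
            List<String> pattern = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0)
                    pattern.add(path.get(i));
            pattern.addAll(postPattern);
            System.out.println(count + " " + String.join("+", pattern));
        }
    }

    public static void main(String[] args) {
        // The example from the text: a single path A:3 -> B:3.
        emitCombinations(Arrays.asList("A", "B"), 3,
                Collections.<String> emptyList());
    }
}

For the path A:3 -> B:3 with an empty suffix it prints exactly the three lines 3 A, 3 B and 3 A+B.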
Java Implementation
The FP-Tree node definition:
package fptree;

import java.util.ArrayList;
import java.util.List;

class TreeNode implements Comparable<TreeNode> {

    private String name; // node name (the item)
    private int count; // support count
    private TreeNode parent; // parent node
    private List<TreeNode> children; // child nodes
    private TreeNode nextHomonym; // next node with the same name

    public TreeNode() {
    }

    public TreeNode(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public TreeNode getParent() {
        return parent;
    }

    public void setParent(TreeNode parent) {
        this.parent = parent;
    }

    public List<TreeNode> getChildren() {
        return children;
    }

    public void setChildren(List<TreeNode> children) {
        this.children = children;
    }

    public void addChild(TreeNode child) {
        if (this.getChildren() == null) {
            List<TreeNode> list = new ArrayList<TreeNode>();
            list.add(child);
            this.setChildren(list);
        } else {
            this.getChildren().add(child);
        }
    }

    public TreeNode findChild(String name) {
        List<TreeNode> children = this.getChildren();
        if (children != null) {
            for (TreeNode child : children) {
                if (child.getName().equals(name)) {
                    return child;
                }
            }
        }
        return null;
    }

    public void printChildrenName() {
        List<TreeNode> children = this.getChildren();
        if (children != null) {
            for (TreeNode child : children) {
                System.out.print(child.getName() + " ");
            }
        } else {
            System.out.print("null");
        }
    }

    public TreeNode getNextHomonym() {
        return nextHomonym;
    }

    public void setNextHomonym(TreeNode nextHomonym) {
        this.nextHomonym = nextHomonym;
    }

    public void countIncrement(int n) {
        this.count += n;
    }

    @Override
    public int compareTo(TreeNode arg0) {
        int count0 = arg0.getCount();
        // Reverse of the natural order, so sorting yields descending counts
        return count0 - this.count;
    }
}
Mining frequent patterns:
package fptree;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

class FPTree {

    private int minSuport;

    public int getMinSuport() {
        return minSuport;
    }

    public void setMinSuport(int minSuport) {
        this.minSuport = minSuport;
    }

    // Read the transaction records from one or more files
    public List<List<String>> readTransRocords(String... filenames) {
        List<List<String>> transaction = null;
        if (filenames.length > 0) {
            transaction = new LinkedList<List<String>>();
            for (String filename : filenames) {
                try {
                    FileReader fr = new FileReader(filename);
                    BufferedReader br = new BufferedReader(fr);
                    try {
                        String line;
                        List<String> record;
                        while ((line = br.readLine()) != null) {
                            if (line.trim().length() > 0) {
                                String str[] = line.split(",");
                                record = new LinkedList<String>();
                                for (String w : str)
                                    record.add(w);
                                transaction.add(record);
                            }
                        }
                    } finally {
                        br.close();
                    }
                } catch (IOException ex) {
                    System.out.println("Read transaction records failed."
                            + ex.getMessage());
                    System.exit(1);
                }
            }
        }
        return transaction;
    }

    // The FP-Growth algorithm
    public void FPGrowth(List<List<String>> transRecords,
            List<String> postPattern) {
        // Build the header table (which is also the frequent 1-itemset)
        ArrayList<TreeNode> HeaderTable = buildHeaderTable(transRecords);
        // Build the FP-Tree
        TreeNode treeRoot = buildFPTree(transRecords, HeaderTable);
        // Return if the FP-Tree is empty
        if (treeRoot.getChildren() == null
                || treeRoot.getChildren().size() == 0)
            return;
        // Output each header-table item + postPattern
        if (postPattern != null) {
            for (TreeNode header : HeaderTable) {
                System.out.print(header.getCount() + "\t" + header.getName());
                for (String ele : postPattern)
                    System.out.print("\t" + ele);
                System.out.println();
            }
        }
        // Find each header-table item's conditional pattern base and recurse
        for (TreeNode header : HeaderTable) {
            // Extend the suffix pattern with this item
            List<String> newPostPattern = new LinkedList<String>();
            newPostPattern.add(header.getName());
            if (postPattern != null)
                newPostPattern.addAll(postPattern);
            // Collect the conditional pattern base (CPB) into newTransRecords
            List<List<String>> newTransRecords = new LinkedList<List<String>>();
            TreeNode backnode = header.getNextHomonym();
            while (backnode != null) {
                int counter = backnode.getCount();
                List<String> prenodes = new ArrayList<String>();
                TreeNode parent = backnode;
                // Walk up through the ancestors of backnode
                while ((parent = parent.getParent()).getName() != null) {
                    prenodes.add(parent.getName());
                }
                // Add the path once per unit of backnode's count
                while (counter-- > 0) {
                    newTransRecords.add(prenodes);
                }
                backnode = backnode.getNextHomonym();
            }
            // Recurse
            FPGrowth(newTransRecords, newPostPattern);
        }
    }

    // Build the header table; it is also the frequent 1-itemset
    public ArrayList<TreeNode> buildHeaderTable(List<List<String>> transRecords) {
        ArrayList<TreeNode> F1 = null;
        if (transRecords.size() > 0) {
            F1 = new ArrayList<TreeNode>();
            Map<String, TreeNode> map = new HashMap<String, TreeNode>();
            // Count the support of each item in the transaction database
            for (List<String> record : transRecords) {
                for (String item : record) {
                    if (!map.keySet().contains(item)) {
                        TreeNode node = new TreeNode(item);
                        node.setCount(1);
                        map.put(item, node);
                    } else {
                        map.get(item).countIncrement(1);
                    }
                }
            }
            // Keep only the items whose support count is at least minSup
            Set<String> names = map.keySet();
            for (String name : names) {
                TreeNode tnode = map.get(name);
                if (tnode.getCount() >= minSuport) {
                    F1.add(tnode);
                }
            }
            Collections.sort(F1); // descending by count
        }
        return F1;
    }

    // Build the FP-Tree
    public TreeNode buildFPTree(List<List<String>> transRecords,
            ArrayList<TreeNode> F1) {
        TreeNode root = new TreeNode(); // the virtual root node
        for (List<String> transRecord : transRecords) {
            LinkedList<String> record = sortByF1(transRecord, F1);
            TreeNode subTreeRoot = root;
            TreeNode tmpRoot = null;
            // Follow the existing path while it matches, incrementing counts
            if (root.getChildren() != null) {
                while (!record.isEmpty()
                        && (tmpRoot = subTreeRoot.findChild(record.peek())) != null) {
                    tmpRoot.countIncrement(1);
                    subTreeRoot = tmpRoot;
                    record.poll();
                }
            }
            // Attach the remaining items as new nodes
            addNodes(subTreeRoot, record, F1);
        }
        return root;
    }

    // Sort a transaction record by descending item frequency,
    // dropping items that are not in F1
    public LinkedList<String> sortByF1(List<String> transRecord,
            ArrayList<TreeNode> F1) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (String item : transRecord) {
            // F1 is already in descending order,
            // so an item's index in F1 serves as its sort key
            for (int i = 0; i < F1.size(); i++) {
                TreeNode tnode = F1.get(i);
                if (tnode.getName().equals(item)) {
                    map.put(item, i);
                }
            }
        }
        ArrayList<Entry<String, Integer>> al = new ArrayList<Entry<String, Integer>>(
                map.entrySet());
        Collections.sort(al, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Entry<String, Integer> arg0,
                    Entry<String, Integer> arg1) {
                return arg0.getValue() - arg1.getValue();
            }
        });
        LinkedList<String> rest = new LinkedList<String>();
        for (Entry<String, Integer> entry : al) {
            rest.add(entry.getKey());
        }
        return rest;
    }

    // Insert record into the tree as descendants of ancestor
    public void addNodes(TreeNode ancestor, LinkedList<String> record,
            ArrayList<TreeNode> F1) {
        if (record.size() > 0) {
            String item = record.poll();
            TreeNode leafnode = new TreeNode(item);
            leafnode.setCount(1);
            leafnode.setParent(ancestor);
            ancestor.addChild(leafnode);
            // Append leafnode to the homonym chain of its header-table entry
            for (TreeNode f1 : F1) {
                if (f1.getName().equals(item)) {
                    while (f1.getNextHomonym() != null) {
                        f1 = f1.getNextHomonym();
                    }
                    f1.setNextHomonym(leafnode);
                    break;
                }
            }
            addNodes(leafnode, record, F1);
        }
    }

    public static void main(String[] args) {
        FPTree fptree = new FPTree();
        fptree.setMinSuport(3);
        List<List<String>> transRecords = fptree
                .readTransRocords("/home/orisun/test/market");
        fptree.FPGrowth(transRecords, null);
    }
}
Input file
milk,eggs,bread,chips
eggs,popcorn,chips,beer
eggs,bread,chips
milk,eggs,bread,popcorn,chips,beer
milk,bread,beer
eggs,bread,beer
milk,bread,chips
milk,eggs,bread,butter,chips
milk,eggs,butter,chips
Output
6 chips eggs
5 chips bread
5 eggs bread
4 chips eggs bread
5 chips milk
5 bread milk
4 eggs milk
4 chips bread milk
4 chips eggs milk
3 bread eggs milk
3 chips bread eggs milk
3 eggs beer
3 bread beer
Implementing it with Hadoop
In the code above we pass the whole transaction database to FPGrowth in a List<List<String>>. In practice this is not feasible, since memory cannot hold the entire transaction database, and we may have to read the records one by one from a relational database to build the FP-Tree. Either way the FP-Tree itself must live in memory, so what do we do when even that does not fit? Moreover, FPGrowth is still very time-consuming; how can it be sped up? The answer is divide and conquer with parallel computation.
We split the original transaction database into N parts and run FPGrowth on N nodes in parallel, then collect the association rules together. The key question is how to partition the data so that no association rule is missed (see the blog post referenced in the original). To enable parallelism, a "redundant" partitioning is used here: the union of the parts is larger than the original set. As a result the mined rules are also redundant; for example, node 1 may produce the rule (6: beer, diapers) while node 2 produces (3: diapers, beer), and the rule from node 2 is clearly redundant, so a follow-up step is needed to remove such duplicates.
The code:
Record.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.LinkedList;

import org.apache.hadoop.io.WritableComparable;

class Record implements WritableComparable<Record> {

    LinkedList<String> list;

    public Record() {
        list = new LinkedList<String>();
    }

    public Record(String[] arr) {
        list = new LinkedList<String>();
        for (int i = 0; i < arr.length; i++)
            list.add(arr[i]);
    }

    @Override
    public String toString() {
        String str = list.get(0);
        for (int i = 1; i < list.size(); i++)
            str += "\t" + list.get(i);
        return str;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        list.clear();
        String line = in.readUTF();
        String[] arr = line.split("\\s+");
        for (int i = 0; i < arr.length; i++)
            list.add(arr[i]);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.toString());
    }

    @Override
    public int compareTo(Record obj) {
        // Sort both item lists first so that records containing the same
        // items compare as equal regardless of item order
        Collections.sort(list);
        Collections.sort(obj.list);
        return this.toString().compareTo(obj.toString());
    }
}
DC_FPTree.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.BitSet;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DC_FPTree extends Configured implements Tool {

    private static final int GroupNum = 10; // number of groups to split F1 into
    private static final int minSuport = 6;

    public static class GroupMapper extends
            Mapper<LongWritable, Text, IntWritable, Record> {

        List<String> freq = new LinkedList<String>(); // frequent 1-itemset
        List<List<String>> freq_group = new LinkedList<List<String>>(); // F1 split into groups

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            // Read the frequent 1-itemset from a file
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path freqFile = new Path("/user/orisun/input/F1");
            FSDataInputStream in = fs.open(freqFile);
            InputStreamReader isr = new InputStreamReader(in);
            BufferedReader br = new BufferedReader(isr);
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] str = line.split("\\s+");
                    String word = str[0];
                    freq.add(word);
                }
            } finally {
                br.close();
            }
            // Split the frequent 1-itemset into GroupNum groups
            Collections.shuffle(freq);
            int cap = freq.size() / GroupNum;
            for (int i = 0; i < GroupNum; i++) {
                List<String> list = new LinkedList<String>();
                for (int j = 0; j < cap; j++) {
                    list.add(freq.get(i * cap + j));
                }
                freq_group.add(list);
            }
            int remainder = freq.size() % GroupNum;
            int base = GroupNum * cap;
            for (int i = 0; i < remainder; i++) {
                freq_group.get(i).add(freq.get(base + i));
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] arr = value.toString().split("\\s+");
            Record record = new Record(arr);
            LinkedList<String> list = record.list;
            BitSet bs = new BitSet(freq_group.size());
            bs.clear();
            // Emit each prefix of the record to the group of its last item,
            // at most once per group (context.write serializes the record
            // immediately, so mutating it afterwards is safe)
            while (record.list.size() > 0) {
                String item = list.peekLast();
                int i = 0;
                for (; i < freq_group.size(); i++) {
                    if (bs.get(i))
                        continue;
                    if (freq_group.get(i).contains(item)) {
                        bs.set(i);
                        break;
                    }
                }
                if (i < freq_group.size()) {
                    context.write(new IntWritable(i), record);
                }
                record.list.pollLast();
            }
        }
    }

    public static class FPReducer extends
            Reducer<IntWritable, Record, IntWritable, Text> {
        @Override
        public void reduce(IntWritable key, Iterable<Record> values,
                Context context) throws IOException, InterruptedException {
            List<List<String>> trans = new LinkedList<List<String>>();
            while (values.iterator().hasNext()) {
                Record record = values.iterator().next();
                LinkedList<String> list = new LinkedList<String>();
                for (String ele : record.list)
                    list.add(ele);
                trans.add(list);
            }
            FPGrowth(trans, null, context);
        }

        // FPGrowth and its helpers buildHeaderTable, buildFPTree, sortByF1
        // and addNodes go here; they are identical to the versions shown
        // above, with FPGrowth being the Context-writing variant from the
        // core code earlier, so they are omitted here.
    }

    // Second job: swap (count, pattern) to (pattern, count) so that
    // identical patterns meet in the same reducer
    public static class InverseMapper extends
            Mapper<LongWritable, Text, Record, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] arr = value.toString().split("\\s+");
            int count = Integer.parseInt(arr[0]);
            Record record = new Record();
            for (int i = 1; i < arr.length; i++) {
                record.list.add(arr[i]);
            }
            context.write(record, new IntWritable(count));
        }
    }

    // Keep only the maximum count for each pattern, removing redundant rules
    public static class MaxReducer extends
            Reducer<Record, IntWritable, IntWritable, Record> {
        @Override
        public void reduce(Record key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int max = -1;
            for (IntWritable value : values) {
                int i = value.get();
                if (i > max)
                    max = i;
            }
            context.write(new IntWritable(max), key);
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        conf.set("mapred.task.timeout", "6000000");

        // First job: group the data and run FP-Growth on each group
        Job job = new Job(conf);
        job.setJarByClass(DC_FPTree.class);
        FileSystem fs = FileSystem.get(getConf());

        FileInputFormat.setInputPaths(job, "/user/orisun/input/data");
        Path outDir = new Path("/user/orisun/output");
        fs.delete(outDir, true);
        FileOutputFormat.setOutputPath(job, outDir);

        job.setMapperClass(GroupMapper.class);
        job.setReducerClass(FPReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Record.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        boolean success = job.waitForCompletion(true);

        // Second job: remove the redundant rules
        job = new Job(conf);
        job.setJarByClass(DC_FPTree.class);

        FileInputFormat.setInputPaths(job, "/user/orisun/output/part-r-*");
        Path outDir2 = new Path("/user/orisun/output2");
        fs.delete(outDir2, true);
        FileOutputFormat.setOutputPath(job, outDir2);

        job.setMapperClass(InverseMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Record.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Record.class);

        success |= job.waitForCompletion(true);

        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DC_FPTree(), args);
        System.exit(res);
    }
}
Closing Remarks
In practice, association rule mining may not be as useful as people expect. For one thing, the support-confidence framework generates far too many rules, and not every rule is useful. For another, most association rules are not nearly as universal as the classic "beer and diapers" story suggests. Association rule analysis takes skill, and sometimes more rigorous statistical methods are needed to keep the rules from proliferating.
Original article: 博客园 (华夏35度), http://www.cnblogs.com/zhangchaoyang. Author: Orisun.