关联规则挖掘之apriori算法
发布时间:2020-12-14 02:29:40 所属栏目:大数据 来源:网络整理
导读:前言: 众所周知,关联规则挖掘是数据挖掘中重要的一部分,如著名的啤酒和尿布的问题。今天要学习的是经典的关联规则挖掘算法——Apriori算法 一、算法的基本原理 由k项频繁集去导出k+1项频繁集。 二、算法流程 1.扫描事务数据库,找出1项集,并根据最小支持
前言: 众所周知,关联规则挖掘是数据挖掘中重要的一部分,如著名的啤酒和尿布的问题。今天要学习的是经典的关联规则挖掘算法——Apriori算法 一、算法的基本原理 由k项频繁集去导出k+1项频繁集。 二、算法流程 1.扫描事务数据库,找出1项集,并根据最小支持度计数,剪枝得出频繁1项集。k=1. 2.由频繁k项集进行连接步操作,形成候选的k+1项集,并扫描数据库,得出每一项的支持度计数,并根据最小支持度计数,剪枝得到频繁k+1项集。 迭代的进行第2步直到频繁k项集是空的。 3.由频繁项集构造关联规则。先列出所有可能的关联规则,然后计算相应的置信度,最终筛选出满足最小置信度的强关联规则。 三、算法实现 测试数据集:data.txt 1,7,15,44,49 2,1,19 3,19 4,3,4,18,35,44 5,2,9,23 6,14,21,44 7,12,31,36,48 8,27,28 9,28 10,35 11,23,24,40,41,43 12,20,43,48 13,49 14,19,26 15,5,22,39 16,16,32,45 17,6,10,22 18,23 19,11,37,45 20,35 21,8,47 22,34,39,44 23,13,19 24,38 25,48 26,14 27,43 28,14 29,42 30,35 31,23 32,25,38 33,46 34,43 35,17,29,47 36,36 37,26,44 38,30,45,47 39,46 40,26 41,14 42,36 43,28 44,40 45,32 46,44 47,8 48,46 49,33,42 50,28,39 51,28 52,19 53,34 54,46 55,45 56,49 57,46 58,19 59,46 60,40 61,39 62,43 63,28 64,40 65,44 66,28 67,36 68,48 69,35 70,16 71,49 72,30 73,30 74,36 75,49 76,38 77,31 78,26 79,43 80,42 81,32 82,45 83,41 84,36 85,48 86,45 87,48 88,44 89,43 90,46 91,47,48 92,48 93,35 94,45 95,45 96,30 97,19 98,1 99,37 100,40 101,31 102,43 103,48 104,48 105,45 106,48 107,27 108,42 109,35 110,49 111,48 112,44 113,35 114,46 115,41 116,28 117,33 118,41 119,49 120,44 121,46 122,35 123,48 124,45 125,27 126,22 127,42,46,47 128,41 129,23 130,42 131,44 132,48 133,47 134,47 135,36 136,11 137,35 138,42 139,46 140,38,49 141,46 142,25 143,22 144,48 145,47 146,15 147,22 148,46 149,41 150,39 151,46 152,42 153,46 154,45 155,15 156,49 157,37 158,46 159,48 160,35 161,35 162,49 163,17 164,42 165,48 166,44 167,42 168,9 169,42 170,28 171,32 172,43 173,42 174,31 175,42 176,48 177,43 178,26 179,45 180,46 181,36 182,43 183,37 184,31 185,29 186,46 187,46 188,48 189,28 190,43 191,24 192,28 193,43 194,28 195,22 196,22 197,47 198,22 199,46 200,30 201,42 202,40 203,36 204,41 205,32 206,45 207,48 208,45 209,28 210,37 211,48 212,45 213,42 214,41 215,44 216,45 217,45 218,23 219,35 220,46 221,46 222,47 223,36 224,47 225,22 226,30 227,49 228,45 229,45 230,48 231,28 232,44 233,39 234,46 235,48 236,42 237,22 238,25 239,48 240,35 241,12 242,45 243,48 244,47 245,35 246,9 247,45 248,22 249,48 250,42 251,35 252,3 253,27 254,19 255,35 256,48 257,41 258,35 259,49 260,46 261,6 262,31 263,40 264,46 265,43 266,44 267,49 268,24 269,48 270,27 271,39 272,49 273,48 274,42 275,42 276,43 277,48 278,32 279,44 280,45 281,45 282,44 283,29 284,46 285,39 286,19 287,42 288,47 289,36 290,28 291,28 292,49 293,44 294,41 295,42 296,24 297,46 298,29 299,47 300,45 301,43 302,46 303,46 304,34 305,42 306,36 307,42 308,36 309,45 310,48 311,22 312,45 313,45 314,37 315,9 316,45 317,46 318,47 319,49 320,34 321,49 322,28 323,19 324,46 325,19 326,45 327,48 328,24 329,44 330,13 331,48 332,46 333,46 334,46 335,29 336,45 337,44 338,47 339,47 340,49 341,41 342,37 343,26 344,44 345,46 346,30 347,47 348,42 349,45 350,47 351,35 352,45 353,31 354,36 355,47 356,41 357,44 358,35 359,16 360,37 361,31 362,25 363,44 364,28 365,22 366,22 367,29 368,45 369,34 370,28 371,49 372,48,49 373,43 374,9 375,28 376,47 377,41 378,41 379,48 380,47 381,49 382,49 383,47 384,29 385,21 386,41 387,44 388,25 389,9 390,47 391,35 392,45 393,40 394,15 395,48 396,46 397,49 398,45 399,49 400,48 401,2 402,49 403,10 404,36 405,46 406,22 407,42 408,46 409,48 410,42 411,28 412,28 413,21 414,31 415,49 416,42 417,47 418,30 419,42 420,31 421,49 422,45 423,32 424,13 425,44 426,24 427,26 428,43 429,14 430,47 431,42 432,47 433,47 434,46 435,47 436,39 437,24 438,32 439,47 440,19 441,40 442,46 443,42 444,45 445,42 446,47 447,36 448,40 449,23 450,33 451,47 452,49 453,35 454,28 455,36 456,28 457,34 458,47 459,44 460,45 461,16 462,49 463,44 464,32 465,49 466,49 467,41 468,46 469,48 470,47 471,45 472,13 473,47 474,32 475,46 476,41 477,49 478,47 479,35 480,44 481,46 482,9 483,48 484,19 485,42 486,34 487,43 488,22 489,31 490,47 491,47 492,31 493,44 494,38 495,47 496,48 497,28 498,47 499,44 500,46 501,39 502,44 503,36 504,18 505,45 506,20 507,30 508,32 509,8 510,46 511,37 512,37 513,23 514,32 515,44 516,46 517,47 518,45 519,35 520,48 521,44 522,24 523,23 524,45 525,39 526,45 527,46 528,42 529,30 530,49 531,16 532,48 533,28 534,18 535,49 536,48 537,28 538,35 539,48 540,42 541,45 542,43 543,45 544,48 545,14 546,38 547,40 548,47 549,38 550,35 551,22 552,34 553,9 554,35 555,37 556,41 557,44 558,43 559,47 560,45 561,42 562,43 563,47 564,42 565,46 566,35 567,40 568,34 569,44 570,28 571,32 572,22 573,41 574,47 575,22 576,7 577,49 578,15 579,46 580,42 581,31 582,18 583,22 584,28 585,42 586,44 587,28 588,47 589,22 590,37 591,45 592,42 593,41 594,31 595,48 596,43 597,19 598,35 599,35 600,48 601,19 602,49 603,27 604,13 605,26 606,35 607,19 608,22 609,19 610,39 611,34 612,44 613,47 614,45 615,36 616,46 617,10 618,39 619,45 620,44 621,30 622,49 623,32 624,31 625,45 626,32 627,48 628,40 629,43 630,47 631,38 632,46 633,35 634,47 635,43 636,35 637,49 638,48 639,47 640,33 641,42 642,46 643,47 644,46 645,45 646,42 647,44 648,39 649,49 650,44 651,48 652,48 653,46 654,44 655,9 656,22 657,42 658,28 659,5 660,42 661,49 662,37 663,47 664,46 665,42 666,48 667,28 668,49 669,40 670,48 671,45 672,25 673,42 674,45 675,46 676,42 677,42 678,48 679,49 680,48 681,49 682,49 683,45 684,43 685,22 686,28 687,46 688,45 689,39 690,23 691,43 692,47 693,19 694,45 695,44 696,37 697,29 698,45 699,45 700,49 701,45 702,35 703,46 704,44 705,46 706,45 707,40 708,43 709,49 710,48 711,24 712,32 713,32 714,44 715,35 716,45 717,48 718,9 719,19 720,31 721,33 722,44 723,34 724,36 725,42 726,29 727,37 728,27 729,39 730,28 731,44 732,22 733,46 734,45 735,44 736,30 737,30 738,23 739,22 740,35 741,47 742,48 743,35 744,10 745,15 746,49 747,28 748,19 749,21 750,48 751,45 752,33 753,3 754,40 755,49 756,45 757,43 758,44 759,47 760,29 761,46 762,20 763,32 764,26 765,20 766,48 767,45 768,46 769,28 770,41 771,29 772,26 773,28 774,29 775,45 776,47 777,39 778,47 779,49 780,48 781,45 782,47 783,48 784,46 785,36 786,47 787,33 788,46 789,35 790,1 791,46 792,27 793,39 794,27 795,43 796,16 797,48 798,44 799,47 800,31 801,43 802,42 803,44 804,47 805,46 806,45 807,44 808,48 809,19 810,49 811,30 812,47 813,48 814,22 815,9 816,45 817,40 818,19 819,26 820,42 821,46 822,40 823,28 824,42 825,22 826,44 827,45 828,48 829,39 830,36 831,49 832,40 833,49 834,9 835,36 836,48 837,18 838,34 839,42 840,36 841,49 842,43 843,49 844,41 845,34 846,41 847,42 848,45 849,46 850,19 851,45 852,9 853,39 854,39 855,32 856,47 857,48 858,15 859,46 860,22 861,35 862,48 863,35 864,41 865,22 866,49 867,44 868,41 869,43 870,25 871,42 872,47 873,20 874,44 875,35 876,42 877,18 878,27 879,24 880,47 881,46 882,45 883,43 884,47 885,32 886,46 887,43 888,35 889,31 890,44 891,35 892,30 893,44 894,43 895,49 896,46 897,43 898,36 899,27 900,22 901,48 902,43 903,38 904,40 905,43 906,49 907,31 908,44 909,30 910,48 911,48 912,21 913,41 914,42 915,31 916,42 917,15 918,45 919,43 920,40 921,19 922,44 923,39 924,47 925,36 926,46 927,44 928,9 929,20 930,22 931,26 932,42 933,47 934,28 935,39 936,32 937,36 938,45 939,42 940,44 941,44 942,43 943,48 944,46 945,35 946,48 947,41 948,44 949,39 950,16 951,46 952,43 953,38 954,47 955,43 956,16 957,42 958,44 959,28 960,43 961,19 962,34 963,45 964,29 965,49 966,48 967,49 968,31 969,45 970,45 971,41 972,45 973,45 974,36 975,22 976,40 977,41 978,28 979,45 980,28 981,13 982,41 983,48 984,35 985,36 986,45 987,9 988,39 989,46 990,34 991,46 992,0 993,13 994,44 995,45 996,46 997,32 998,33 999,35 1000,47 package second; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.util.HashMap; import java.util.HashSet; import java.util.Iterator; import java.util.Map; import java.util.Set; /* * date:2015-06-10 * by wenbaoli * */ public class Apriori { private float min_sup;//minimum support private float min_conf;//minimum confidence private Map<Integer,Set<String>> TransDataBase;//transaction database private Integer DBnum; private Map<Integer,Map<Set<String>,Float>> freqItermSet;//frequent iterm set,from 1 to k... private Map<Set<String>,Set<Map<Set<String>,Float>>> associationRules;//the final associate rules public Apriori(Map<Integer,Set<String>> DB,float minSup,float minConf){ this.TransDataBase = DB; this.min_conf = minConf; this.min_sup = minSup; this.DBnum = DB.size(); freqItermSet =new HashMap<Integer,Float>>(); associationRules = new HashMap<Set<String>,Float>>>(); } /** * @param args */ public static void main(String[] args) { // TODO Auto-generated method stub //initial String file = "data/data.txt"; String delimeter = ","; //load data Map<Integer,Set<String>> data = new HashMap<Integer,Set<String>>(); int num = 0; try { File mFile = new File(file); FileReader fr = new FileReader(mFile); BufferedReader br = new BufferedReader(fr); String line; while((line=br.readLine())!= null) { line = line.trim(); String[] str = line.split(delimeter); // int i = Integer.parseInt(str[0]); Set<String> set = new HashSet<String>(); for (int i = 1; i < str.length; i++) { set.add(str[i].trim()); } num++; data.put(num,set); } br.close(); } catch(IOException ex) { ex.printStackTrace(); } Apriori ap = new Apriori(data,0.005f,0.6f); ap.findAllFreqItermSet(); ap.findAssociationRules(); } public void findAllFreqItermSet(){ //找出频繁一项集 Map<Set<String>,Float> f1 = new HashMap<Set<String>,Float>(); Map<Set<String>,Integer> OneItermSet = new HashMap<Set<String>,Integer>(); Iterator<Map.Entry<Integer,Set<String>>> it = this.TransDataBase.entrySet().iterator(); while(it.hasNext()){ Map.Entry<Integer,Set<String>> entry = it.next(); Set<String> itermSet = entry.getValue(); for(String iterm:itermSet){ Set<String> key = new HashSet<String>(); key.add(iterm); if(!OneItermSet.containsKey(key)){ OneItermSet.put(key,1); }else{ int value = 1 + OneItermSet.get(key); OneItermSet.put(key,value); } } } //找出支持度大于minSup的频繁一项集 Iterator<Map.Entry<Set<String>,Integer>> iter = OneItermSet.entrySet().iterator(); while(iter.hasNext()){ Map.Entry<Set<String>,Integer> entry = iter.next(); //计算支持度 Float support = new Float(entry.getValue().toString())/new Float(this.DBnum); if(support >= this.min_sup){ f1.put(entry.getKey(),support); } } System.out.println("频繁1" + "项集:" + f1);//打印频繁1-项集 System.out.println("-------------------------------------------"); this.freqItermSet.put(1,f1); //由频繁k项集得到频繁k+1项集 int k = 2; while(true){ Set<Set<String>> candFreItermSets = this.apriori_gen(k,this.freqItermSet.get(k-1).keySet()); Map<Set<String>,Float> freqKItermSetMap = this.getFreqKItermSet(k,candFreItermSets); if(!freqKItermSetMap.isEmpty()){ this.freqItermSet.put(k,freqKItermSetMap); } else { break; } System.out.println("频繁" + k + "项集:" + freqKItermSetMap); System.out.println("-------------------------------------------"); k++; } } public Map<Set<String>,Float> getFreqKItermSet(int k,Set<Set<String>> candFreItermSets) { Map<Set<String>,Integer> candFreqKItermSetMap = new HashMap<Set<String>,Integer>(); //扫描事物数据库 Iterator<Map.Entry<Integer,Set<String>>> it = this.TransDataBase.entrySet().iterator(); //统计支持度计数 while (it.hasNext()){ Map.Entry<Integer,Set<String>> entry = it.next(); Iterator<Set<String>> iter = candFreItermSets.iterator(); while(iter.hasNext()){ Set<String> set = iter.next(); if(entry.getValue().containsAll(set)){ if(!candFreqKItermSetMap.containsKey(set)){ candFreqKItermSetMap.put(set,1); }else { int value = 1+ candFreqKItermSetMap.get(set); candFreqKItermSetMap.put(set,value); } } } } Iterator<Map.Entry<Set<String>,Integer>> iter = candFreqKItermSetMap.entrySet().iterator(); Map<Set<String>,Float> freqKIntermSet = new HashMap<Set<String>,Float>(); while(iter.hasNext()){ Map.Entry<Set<String>,Integer> entry = iter.next(); float support = new Float(entry.getValue().toString())/this.DBnum; if(support < this.min_sup){ iter.remove(); } else { freqKIntermSet.put(entry.getKey(),support); } } return freqKIntermSet; } public Set<Set<String>> apriori_gen(int k,Set<Set<String>> fk){ Set<Set<String>> ck = new HashSet<Set<String>>(); Iterator<Set<String>> it1 = fk.iterator(); while (it1.hasNext()) { Set<String> itermSet1 = it1.next(); Iterator<Set<String>> it2 = fk.iterator(); while (it2.hasNext()) { Set<String> itermSet2 = it2.next(); if(!itermSet1.equals(itermSet2)) { //连接步 Set<String> commIterms = new HashSet<String>(); commIterms.addAll(itermSet1); commIterms.retainAll(itermSet2); if(commIterms.size() == (k - 2)){ Set<String> candIterms = new HashSet<String>(); candIterms.addAll(itermSet1); candIterms.removeAll(itermSet2); candIterms.addAll(itermSet2); //剪枝步骤 if(!this.has_infrequent_subset(candIterms,fk)){ ck.add(candIterms); } } } } } System.out.println(ck.size()); return ck; } public boolean has_infrequent_subset(Set<String> set,Set<Set<String>> fk){ Set<Set<String>> subItermSet = new HashSet<Set<String>>(); Iterator<String> itr = set.iterator(); while(itr.hasNext()){ //深拷贝 Set<String> subItem = new HashSet<String>(); Iterator<String> it = set.iterator(); while(it.hasNext()){ subItem.add(it.next()); } //去掉一个项后为k-1子集 subItem.remove(itr.next()); subItermSet.add(subItem); } Iterator<Set<String>> it = subItermSet.iterator(); while(it.hasNext()){ if(!fk.contains(it.next())){ return true; } } return false; } public void findAssociationRulesTemp(){ } public void findAssociationRules(){ Iterator<Map.Entry<Integer,Float>>> it = this.freqItermSet.entrySet().iterator(); while (it.hasNext()) { Map.Entry<Integer,Float>> entry = it.next(); for (Set<String> itemSet : entry.getValue().keySet()) { int n = itemSet.size() / 2; for (int i = 1; i <= n; i++) { Set<Set<String>> subset = this.getProperSubset(i,itemSet); for(Set<String> conditionSet:subset){ Set<String> conclusionSet = new HashSet<String>(); conclusionSet.addAll(itemSet); conclusionSet.removeAll(conditionSet); int s1 = conditionSet.size(); int s2 = conclusionSet.size(); float supF = this.freqItermSet.get(s1).get(conditionSet); float supS = this.freqItermSet.get(s2).get(conclusionSet); float sup = this.freqItermSet.get(s1+s2).get(itemSet); float conf1 = sup/supF; float conf2 = sup/supS; if(conf1 >= this.min_conf){ if(this.associationRules.get(conditionSet) == null){ Set<Map<Set<String>,Float>> conclusionSetSet = new HashSet<Map<Set<String>,Float>>(); Map<Set<String>,Float> sets = new HashMap<Set<String>,Float>(); sets.put(conclusionSet,conf1); conclusionSetSet.add(sets); this.associationRules.put(conditionSet,conclusionSetSet); } else { Map<Set<String>,conf1); associationRules.get(conditionSet).add(sets); } } if(conf2 >= this.min_conf){ if(this.associationRules.get(conditionSet) == null){ Set<Map<Set<String>,conf2); conclusionSetSet.add(sets); this.associationRules.put(conditionSet,conf2); associationRules.get(conditionSet).add(sets); } } } } } } System.out.println("关联规则(强规则):" + associationRules); } private Set<Set<String>> getProperSubset(int i,Set<String> itemSet) { Set<Set<String>> subset = new HashSet<Set<String>>(); if(itemSet.size() <= 1){ return null; }else if(itemSet.size() == 2){ for(String s: itemSet){ Set<String> set = new HashSet<String>(); set.add(s); if(!subset.contains(s)){ subset.add(set); } } return subset; }else { Iterator<String> it = itemSet.iterator(); String s = it.next(); Set<String> temp = new HashSet<String>(itemSet); temp.remove(s); //包含s的子集 Set<Set<String>> subset0 = new HashSet<Set<String>>(); subset0 = this.getProperSubset(i-1,temp); subset.addAll(subset0); //不包含s的子集 Set<Set<String>> subset1 = new HashSet<Set<String>>(); subset1 = this.getProperSubset(i,temp); subset.addAll(subset1); return subset; } } } (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |