加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 大数据 > 正文

数据相似性检测算法

发布时间:2020-12-14 04:12:40 所属栏目:大数据 来源:网络整理
导读:?数据相似性检测算法 1、引言 "数据同步算法研究"一文研究了在网络上高效同步数据的方法,其中有个前提是文件A和B非常相似,即两者之间存在大量相同的数据。如果两个文件相似性很低,虽然这种方法依然可以正常工作,但数据同步性能却不会得到提高,甚至会有
?数据相似性检测算法

1、引言
  "数据同步算法研究"一文研究了在网络上高效同步数据的方法,其中有个前提是文件A和B非常相似,即两者之间存在大量相同的数据。如果两个文件相似性很低,虽然这种方法依然可以正常工作,但数据同步性能却不会得到提高,甚至会有所降低。因为会产生部分元数据和网络通信消耗,这在两个文件完全不相关时尤为明显。因此,同步数据前需要计算种子文件(seed file)与目标文件之间的相似性,如果相似性大于指定阈值(通常应大于50%)则应用该数据同步算法,否则接传输文件即可。如此,可使得数据同步算法则具有较好的自适应性,在数据具有不同相似性的情形下均可进行高性能的数据同步。另外,在数据相似性检测的基础之上,可对于相似性高的数据进行数据编码处理(如Delta编码),通过一个文件给另一个文件编码的方式进行数据压缩,这是一种基于相似数据检测与编码的重复数据删除技术。
2、相似性计算
  Unix diff对文档进行逐行对比来检测相似文件,它采用经典的LCS(Longest Common Subsequence,最长公共子串)算法,运用动态规划方法来计算相似性。LCS的含义是同时包含在字符串里的一个最长字符序列,LCS的长度作为这两个字符串相似性的度量。Diff算法以整行作为"字符"来计算最长公共子串,性能上比字符级的LCS算法快很多。这种方法效率很低,而且只适用文本文件的相似比较,不能直接适用于二进制文件。

  目前通常的做法是将文件相似性问题转换为集合相似性问题,如基于shingle的计算方法和基于bloom filter的计算方法,这两种方法都可适用于任何格式的数据文件。这种方式的核心思想是为每个文件提取组特征值,以特征值集合来计算相似性,从而降低计算复杂性来提高性能。shingle用特征值交集来计算相似性会导致高计算和空间开销,bloom filter技术在计算开销和匹配精度上更具势。Bloom filter所定义的集合元素是文件按照CDC(content-defined chunking)算法所切分数据块的指纹值,其相似性定义如下:
?????????????????????????? ?|fingerprints(f1) ∩ fingerprints(f2)|
  Sim(f1,f2) = ---------------------------------------------?? (公式1)
?????????????????????????? ?|fingerprints(f1) ∪ fingerprints(f2)|

  另外一种方法,是将二进制文件进行切块,使用数据块指纹来表示数据块,然后将数据块映射为"字符",再应用LCS算法寻找最大公共子串并计算出相似度。其相似性定义如下:
  ????????????????????? 2 * length(LCS(fingerprints(f1),fingerprints(f2)))
??? Sim(f1,f2) = ------------------------------------------------------------------ (公式2)
????????????????????????? ?length(fingerprints(f1)) + length(fingerprints(f2))

  上面两种相似性算法中均采用数据切分技术,数据块可以是定长或变长。为了相似性计算的精确性,实现中采用以数据块长度作为权的加权计算方法。

3、Bloom filter算法
  该文件相似性计算流程如下:
????? (1) 采用CDC算法将文件切分成数据块集,并为每个数据块计算MD5指纹;
????? (2) 计算两个指纹集合的交集和并集,通过hashtable来实现;
????? (3) 按照公式1计算文件相似性,考虑重复数据块和数据块长度来提高计算精确度。
????? 详细参见附录bsim源码中的file_chunk,chunk_file_process和similarity_detect函数实现。

4、LCS算法
  该文件相似性计算流程如下:
????(1) 采用CDC算法将文件切分成数据块集,并为每个数据块计算MD5指纹;
??? (2) 将MD5指纹串映射为"字符",则文件转换为"字符串"表示;
??? (3) 应用LCS算法计算出最长公共子串,并计算其加权长度;
??? (4) 按照公式2计算文件相似性,考虑重复数据块和数据块长度来提高计算精确度。
??? 详细参见附录bsim源码中的file_chunk,chunk_file_process,LCS和similarity_detect函数实现。

5、算法分析比较
 两种算法都对文件进行切分操作,假设文件f1切为m个块,文件f2切分成n个块。Bloom filter算法没有考虑数据块顺序,因此在相似性精确度方面要低于LCS算法,其时间和空间复杂性都是O(m + n)。相反,LCS算法考虑了数据块顺序问题,相似性度量相对精确,然而其时间和空间复杂性是O(mn),这个大大限制了应用规模。综合来看,Bloom filter算法精确度比LCS算法要低,但计算消耗要小很多,性能和适用性非常好。LCS比较适合精确的文件相似性计算,这些文件往往比较小,50MB以内比较合适。对于重复数据删除和网络数据同步来说,消重效果和性能与数据块顺序性无关,因此Bloom filter算法计算的数据相似性更适用,性能也更高。

附录:bsim.c源码
(完整源码请参见deduputil源码)

[cpp]? view plain copy print ?
  1. /*?Copyright?(C)?2010?Aigui?Liu?
  2. ?*?
  3. ?*?This?program?is?free?software;?you?can?redistribute?it?and/or?modify?
  4. ?*?it?under?the?terms?of?the?GNU?General?Public?License?as?published?by?
  5. ?*?the?Free?Software?Foundation;?either?version?3?of?the?License,?or?
  6. ?*?(at?your?option)?any?later?version.?
  7. ?*?
  8. ?*?This?program?is?distributed?in?the?hope?that?it?will?be?useful,?
  9. ?*?but?WITHOUT?ANY?WARRANTY;?without?even?the?implied?warranty?of?
  10. ?*?MERCHANTABILITY?or?FITNESS?FOR?A?PARTICULAR?PURPOSE.??See?the?
  11. ?*?GNU?General?Public?License?for?more?details.?
  12. ?*?You?should?have?received?a?copy?of?the?GNU?General?Public?License?along?
  13. ?*?with?this?program;?if?not,?visit?the?http://fsf.org?website.?
  14. ?*/??
  15. #include?<stdio.h>??
  16. #include?<stdlib.h>??
  17. #include?<string.h>??
  18. #include?<sys/types.h>??
  19. #include?<sys/stat.h>??
  20. #include?<fcntl.h>??
  21. #include?<unistd.h>??
  22. #include?"hashtable.h"??
  23. #include?"sync.h"??
  24. #define?NEITHER???????0??
  25. #define?UP????????????1??
  26. #define?LEFT??????????2??
  27. #define?UP_AND_LEFT???3??
  28. #define?MAX(x,?y)?(((x)?>?(y))???(x)?:?(y))??
  29. #define?MIN(x,?y)?(((x)?<?(y))???(x)?:?(y))??
  30. #define?MD5_LEN?17??
  31. enum?{??
  32. ????FILE1?=?0,??
  33. ????FILE2??
  34. };??
  35. ????LCS_NOT?=?0,108); list-style:decimal-leading-zero outside; line-height:18px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????LCS_YES??
  36. typedef?struct?{??
  37. ????uint32_t?nr1;??
  38. ????uint32_t?nr2;??
  39. ????uint32_t?len;??
  40. }?hash_entry;??
  41. struct?{??
  42. ????char?**str;??
  43. }?lcs_entry;??
  44. static?uint32_t?sim_union?=?0;??
  45. static?uint32_t?sim_intersect?=?0;??
  46. static?void?usage()??
  47. {??
  48. ????fprintf(stderr,?"Usage:?bsim?FILE1?FILE2?CHUNK_ALGO?LCS/n/n");??
  49. ????fprintf(stderr,?"Similarity?detect?between?FILE1?and?FILE2?based?on?block?level./n");??
  50. "CHUNK_ALGO:/n");??
  51. "??FSP?-?fixed-size?partition/n");??
  52. "??CDC?-?content-defined?chunking/n");??
  53. "??SBC?-?slide?block?chunking/n/n");??
  54. "LCS:/n");??
  55. "??LCS_NOT?-?do?not?use?LCS(longest?lommon?subsequence)?algorithms/n");??
  56. "??LCS_YES?-?use?LCS?algorithms/n/n");??
  57. "Report?bugs?to?<Aigui.Liu@gmail.com>./n");??
  58. }??
  59. static?int?parse_arg(char?*argname)??
  60. {??
  61. ????if?(0?==?strcmp(argname,?"FSP"))??
  62. ????????return?CHUNK_FSP;??
  63. else?"CDC"))??
  64. return?CHUNK_CDC;??
  65. "SBC"))??
  66. return?CHUNK_SBC;??
  67. "LCS_NOT"))??
  68. return?LCS_NOT;??
  69. "LCS_YES"))??
  70. return?LCS_YES;??
  71. else??
  72. return?-1;??
  73. }??
  74. char?**alloc_2d_array(int?row,?int?col)??
  75. ????????int?i;??
  76. ????????char?*p,?**pp;??
  77. ????????p?=?(char?*)malloc(row?*?col?*?sizeof(char));??
  78. ????????pp?=?(char?**)malloc(row?*?char?*));??
  79. if?(p?==?NULL?||?pp?==?NULL)??
  80. ????????????????return?NULL;??
  81. for?(i?=?0;?i?<?row;?i++)?{??
  82. ????????????????pp[i]?=?p?+?col?*?i;??
  83. ????????}??
  84. ????????return?pp;??
  85. void?free_2d_array(char?**str)??
  86. ????free(str[0]);??
  87. ????free(str);??
  88. void?show_md5_hex(unsigned?char?md5_checksum[16])??
  89. for?(i?=?0;?i?<?16;?i++)?{??
  90. ????????????????printf("%02x",?md5_checksum[i]);??
  91. ????????}??
  92. ????????printf("/n");??
  93. int?chunk_file_process(char?*chunk_file,?hashtable?*htab,87); background-color:inherit; font-weight:bold">int?which,87); background-color:inherit; font-weight:bold">int?sim_algo,?lcs_entry?*le)??
  94. ????int?fd,?i,?ret?=?0;??
  95. ????ssize_t?rwsize;??
  96. ????chunk_file_header?chunk_file_hdr;??
  97. ????chunk_block_entry?chunk_bentry;??
  98. ????hash_entry?*he?=?NULL;??
  99. ????/*?parse?chunk?file?*/??
  100. ????fd?=?open(chunk_file,?O_RDONLY);??
  101. if?(-1?==?fd)?{??
  102. ????}??
  103. ????rwsize?=?read(fd,?&chunk_file_hdr,?CHUNK_FILE_HEADER_SZ);??
  104. if?(rwsize?!=?CHUNK_FILE_HEADER_SZ)?{??
  105. ????????ret?=?-1;??
  106. goto?_CHUNK_FILE_PROCESS_EXIT;??
  107. ????}??
  108. if?(sim_algo?==?LCS_YES)?{??
  109. ????????le->str?=?alloc_2d_array(chunk_file_hdr.block_nr,?MD5_LEN);??
  110. if?(le->str?==?NULL)?{??
  111. ????????????ret?=?-1;??
  112. ???????????? ????????le->len?=?chunk_file_hdr.block_nr;??
  113. for(i?=?0;?i?<?chunk_file_hdr.block_nr;?i++)?{??
  114. ????????rwsize?=?read(fd,?&chunk_bentry,?CHUNK_BLOCK_ENTRY_SZ);??
  115. if?(rwsize?!=?CHUNK_BLOCK_ENTRY_SZ)?{??
  116. ????????he?=?(hash_entry?*)hash_value((void?*)chunk_bentry.md5,?htab);??
  117. if?(he?==?NULL)?{??
  118. ????????????he?=?(hash_entry?*)malloc(sizeof(hash_entry));??
  119. ????????????he->nr1?=?he->nr2?=?0;??
  120. ????????????he->len?=?chunk_bentry.len;??
  121. ????????(which?==?FILE1)???he->nr1++?:?he->nr2++;??
  122. ????????/*?insert?or?update?hash?entry?*/??
  123. ????????hash_insert((void?*)strdup(chunk_bentry.md5),?(void?*)he,153); background-color:inherit; font-weight:bold">if?(sim_algo?==?LCS_YES)?{??
  124. ????????????memcpy(le->str[i],?chunk_bentry.md5,?MD5_LEN);??
  125. _CHUNK_FILE_PROCESS_EXIT:??
  126. ????close(fd);??
  127. ????return?ret;??
  128. uint32_t?LCS(char**?a,87); background-color:inherit; font-weight:bold">int?n,87); background-color:inherit; font-weight:bold">char**?b,87); background-color:inherit; font-weight:bold">int?m,?hashtable?*htab)???
  129. int**?S;??
  130. int**?R;??
  131. int?ii;??
  132. int?jj;??
  133. int?pos;??
  134. ????????uint32_t?len?=?0;??
  135. /*?Memory?allocation?*/??
  136. ????????S?=?(int?**)malloc(?(n+1)?*?int?*)?);??
  137. ????????R?=?(int?*)?);??
  138. if?(S?==?NULL?||?R?==?NULL)?{??
  139. ????????perror("malloc?for?S?and?R?in?LCS");??
  140. ????????exit(0);??
  141. for(ii?=?0;?ii?<=?n;?++ii)?{??
  142. ????????????????S[ii]?=?(int*)?malloc(?(m+1)?*?int)?);??
  143. ????????????????R[ii]?=?(int)?);??
  144. if?(S[ii]?==?NULL?||?R[ii]?==?NULL)?{??
  145. ????????????perror("malloc?for?S[ii]?and?R[ii]?in?LCS");??
  146. ????????????exit(0);??
  147. /*?It?is?important?to?use?<=,?not?<.??The?next?two?for-loops?are?initialization?*/??
  148. for(ii?=?0;?ii?<=?n;?++ii)?{??
  149. ????????????????S[ii][0]?=?0;??
  150. ????????????????R[ii][0]?=?UP;??
  151. for(jj?=?0;?jj?<=?m;?++jj)?{??
  152. ????????????????S[0][jj]?=?0;??
  153. ????????????????R[0][jj]?=?LEFT;??
  154. ????????/*?This?is?the?main?dynamic?programming?loop?that?computes?the?score?and?*/??
  155. /*?backtracking?arrays.?*/??
  156. for(ii?=?1;?ii?<=?n;?++ii)?{??
  157. ????????????????for(jj?=?1;?jj?<=?m;?++jj)?{??
  158. ????????????????????????if?(strcmp(a[ii-1],?b[jj-1])?==?0)?{??
  159. ????????????????????????????????S[ii][jj]?=?S[ii-1][jj-1]?+?1;??
  160. ????????????????????????????????R[ii][jj]?=?UP_AND_LEFT;??
  161. ????????????????????????}??
  162. else?{??
  163. ????????????????????????????????S[ii][jj]?=?S[ii-1][jj-1]?+?0;??
  164. ????????????????????????????????R[ii][jj]?=?NEITHER;??
  165. if(?S[ii-1][jj]?>=?S[ii][jj]?)?{??
  166. ????????????????????????????????S[ii][jj]?=?S[ii-1][jj];??
  167. ????????????????????????????????R[ii][jj]?=?UP;??
  168. if(?S[ii][jj-1]?>=?S[ii][jj]?)?{??
  169. ????????????????????????????????S[ii][jj]?=?S[ii][jj-1];??
  170. ????????????????????????????????R[ii][jj]?=?LEFT;??
  171. ????????????????}??
  172. /*?The?length?of?the?longest?substring?is?S[n][m]?*/??
  173. ????????ii?=?n;??
  174. ????????jj?=?m;??
  175. ????????pos?=?S[ii][jj];??
  176. /*?Trace?the?backtracking?matrix.?*/??
  177. while(?ii?>?0?||?jj?>?0?)?{??
  178. if(?R[ii][jj]?==?UP_AND_LEFT?)?{??
  179. ????????????????????????ii--;??
  180. ????????????????????????jj--;??
  181. ????????????????????????//lcs[pos--]?=?a[ii];??
  182. ????????????he?=?(hash_entry?*)hash_value((void?*)a[ii],108); list-style:decimal-leading-zero outside; color:inherit; line-height:18px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????????????len?+=?((he?==?NULL)???0:?he->len);??
  183. if(?R[ii][jj]?==?UP?)?{??
  184. ????????????????????????ii--;??
  185. ????????????????}??
  186. if(?R[ii][jj]?==?LEFT?)?{??
  187. ????????????????????????jj--;??
  188. for(ii?=?0;?ii?<=?n;?++ii?)?{??
  189. ????????????????free(S[ii]);??
  190. ????????????????free(R[ii]);??
  191. ????????free(S);??
  192. ????????free(R);??
  193. return?len;??
  194. int?hash_callback(void?*key,?void?*data)??
  195. ????hash_entry?*he?=?(hash_entry?*)data;??
  196. ????sim_union?+=?(he->len?*?(he->nr1?+?he->nr2));??
  197. ????sim_intersect?+=?(he->len?*?MIN(he->nr1,?he->nr2));??
  198. float?similarity_detect(hashtable?*htab,87); background-color:inherit; font-weight:bold">char?**str1,87); background-color:inherit; font-weight:bold">char?**str2,87); background-color:inherit; font-weight:bold">int?sim_algo)??
  199. ????uint32_t?lcs_len?=?0;??
  200. ????hash_for_each_do(htab,?hash_callback);??
  201. ????????lcs_len?=?LCS(str1,?n,?str2,?m,?htab);??
  202. return?lcs_len?*?2.0?/?sim_union;??
  203. ????}?else?{?/*?LCS_NOT?*/??
  204. return?sim_intersect?*?2.0?/?sim_union;??
  205. int?main(int?argc,87); background-color:inherit; font-weight:bold">char?*argv[])??
  206. int?chunk_algo?=?CHUNK_CDC;??
  207. int?sim_algo?=?LCS_NOT;??
  208. char?*file1?=?NULL;??
  209. char?*file2?=?NULL;??
  210. ????lcs_entry?le1,?le2;??
  211. char?tmpname[NAME_MAX_SZ]?=?{0};??
  212. char?template[]?=?"deduputil_bsim_XXXXXX";??
  213. ????hashtable?*htab?=?NULL;??
  214. int?ret?=?0;??
  215. if?(argc?<?5)?{??
  216. ????????usage();??
  217. return?-1;??
  218. /*?parse?chunk?algorithms?*/??
  219. ????file1?=?argv[1];??
  220. ????file2?=?argv[2];??
  221. ????chunk_algo?=?parse_arg(argv[3]);??
  222. ????sim_algo?=?parse_arg(argv[4]);??
  223. if?(chunk_algo?==?-1?||?sim_algo?==?-1)?{??
  224. ????????usage();??
  225. ????htab?=?create_hashtable(HASHTABLE_BUCKET_SZ);??
  226. if?(htab?==?NULL)?{??
  227. ????????fprintf(stderr,?"create?hashtabke?failed/n");??
  228. /*?chunk?file1?and?file2?into?blocks?*/??
  229. ????sprintf(tmpname,?"/tmp/%s_%d",?mktemp(template),?getpid());??
  230. ????ret?=?file_chunk(file1,?tmpname,?chunk_algo);??
  231. if?(0?!=?ret)?{??
  232. ????????fprintf(stderr,?"chunk?%s?failed/n",?file1);??
  233. goto?_BENCODE_EXIT;??
  234. ????le1.str?=?NULL;??
  235. ????ret?=?chunk_file_process(tmpname,?htab,?FILE1,?sim_algo,?&le1);??
  236. if?(ret?!=?0)?{??
  237. "pasre?%s?failed/n",108); list-style:decimal-leading-zero outside; color:inherit; line-height:18px; margin:0px!important; padding:0px 3px 0px 10px!important"> ????ret?=?file_chunk(file2,?chunk_algo);??
  238. if?(0?!=?ret){??
  239. goto?_BENCODE_EXIT;??
  240. ????le2.str?=?NULL;??
  241. ????ret?=?chunk_file_process(tmpname,?FILE2,?&le2);??
  242. if?(ret?!=?0)?{??
  243. "similarity?=?%.4f/n",?similarity_detect(htab,?le1.str,?le1.len,?le2.str,?le2.len,?sim_algo));??
  244. _BENCODE_EXIT:??
  245. ????unlink(tmpname);??
  246. ????hash_free(htab);??
  247. if?(le1.str)?free_2d_array(le1.str);??
  248. if?(le2.str)?free_2d_array(le2.str);??
  249. return?ret;??
  250. }??

转载于: http://blog.csdn.net/liuben/article/details/5870297

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读