Data Similarity Detection Algorithms
1. Introduction

The article "Data Synchronization Algorithm Research" studied how to synchronize data efficiently over a network, under the premise that files A and B are highly similar, i.e., that they share a large amount of identical data. If the two files have low similarity, the method still works correctly, but synchronization performance does not improve and may even degrade, because the algorithm introduces metadata and network communication overhead; this is most pronounced when the two files are completely unrelated. Therefore, before synchronizing, the similarity between the seed file and the target file should be computed: if it exceeds a given threshold (typically 50%), apply the synchronization algorithm; otherwise, simply transfer the file directly. This makes the synchronization algorithm adaptive, delivering high-performance synchronization across data of varying similarity. Furthermore, building on similarity detection, highly similar data can be encoded (e.g., with delta encoding), compressing one file by encoding it against another; this is a deduplication technique based on resemblance detection and encoding.

2. Computing Similarity

Unix diff detects similar files by comparing documents line by line. It uses the classic LCS (Longest Common Subsequence) algorithm, computed with dynamic programming, to measure similarity. An LCS is a longest sequence of characters contained in both strings, and its length serves as the similarity measure for the two strings. diff treats each whole line as one "character" when computing the LCS, which is much faster than a character-level LCS. Even so, this approach is inefficient, applies only to comparing text files, and cannot be used directly on binary files.
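For reference, the dynamic-programming idea behind diff's LCS can be sketched in a few lines of C. This is an illustrative character-level helper, not part of bsim; it computes only the LCS length, collapsing the DP table to a single row (the MAXB cap on the second string's length is a sketch-level simplification):

```c
#include <assert.h>
#include <string.h>

#define MAXB 256    /* sketch-level cap on the length of b */

/* Length of the longest common subsequence of a and b.
 * Classic O(n*m) DP, kept to one row of the table. */
int lcs_length(const char *a, const char *b)
{
    int n = (int)strlen(a), m = (int)strlen(b);
    int S[MAXB + 1];
    int ii, jj;

    assert(m <= MAXB);
    for (jj = 0; jj <= m; jj++)
        S[jj] = 0;
    for (ii = 1; ii <= n; ii++) {
        int diag = 0;                 /* S[ii-1][jj-1] from the previous row */
        for (jj = 1; jj <= m; jj++) {
            int tmp = S[jj];          /* save S[ii-1][jj] before overwriting */
            if (a[ii - 1] == b[jj - 1])
                S[jj] = diag + 1;     /* extend a common subsequence */
            else
                S[jj] = (S[jj] >= S[jj - 1]) ? S[jj] : S[jj - 1];
            diag = tmp;
        }
    }
    return S[m];                      /* LCS length of the full strings */
}
```

diff applies the same recurrence with whole lines in place of characters, which is why it is so much faster than a character-level comparison.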
The usual approach today converts the file-similarity problem into a set-similarity problem, as in the shingle-based and Bloom-filter-based methods, both of which apply to data files of any format. The core idea is to extract a set of feature values from each file and compute similarity over the feature sets, reducing computational complexity and thereby improving performance. Computing similarity from the intersection of shingle feature sets incurs high computation and space overhead, while the Bloom filter technique has the advantage in both computation cost and matching precision. In the Bloom filter approach, the set elements are the fingerprints of the data blocks produced by splitting the file with a CDC (content-defined chunking) algorithm, and similarity is defined as:

    Sim(f1, f2) = |fingerprints(f1) ∩ fingerprints(f2)| / |fingerprints(f1) ∪ fingerprints(f2)|    (Equation 1)
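Equation 1 can be illustrated with a small, hash-free C sketch. The chunk_t type, its field names, and the quadratic membership scan are all illustrative choices (bsim itself uses a hashtable and also handles duplicate fingerprints); here each fingerprint is assumed unique within a file, and each one is weighted by its block length:

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *fp;   /* chunk fingerprint, e.g. an MD5 hex string */
    unsigned len;     /* chunk length in bytes, used as weight */
} chunk_t;

static int contains(const chunk_t *set, int n, const char *fp)
{
    int i;
    for (i = 0; i < n; i++)
        if (strcmp(set[i].fp, fp) == 0)
            return 1;
    return 0;
}

/* Weighted Jaccard similarity per Equation 1: each fingerprint
 * counts with its chunk length. Assumes fingerprints within one
 * file are unique (duplicate blocks already collapsed). */
double jaccard_sim(const chunk_t *f1, int n1, const chunk_t *f2, int n2)
{
    unsigned inter = 0, uni = 0;
    int i;

    for (i = 0; i < n1; i++) {
        uni += f1[i].len;
        if (contains(f2, n2, f1[i].fp))
            inter += f1[i].len;       /* shared weight, counted once */
    }
    for (i = 0; i < n2; i++)
        if (!contains(f1, n1, f2[i].fp))
            uni += f2[i].len;         /* weight unique to f2 */
    return uni ? (double)inter / uni : 0.0;
}
```

For example, with f1 = {a:4, b:2} and f2 = {a:4, c:2}, the shared weight is 4 out of a union weight of 8, giving a similarity of 0.5.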
Another method splits the binary file into blocks, represents each block by its fingerprint, maps the fingerprints to "characters", and then applies the LCS algorithm to find the longest common subsequence and compute a similarity:

    Sim(f1, f2) = 2 × length(LCS(fingerprints(f1), fingerprints(f2))) / (length(fingerprints(f1)) + length(fingerprints(f2)))    (Equation 2)
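Equation 2 can be sketched by running the same DP over fingerprint strings instead of characters. lcs_blocks and lcs_sim are illustrative names, and this version is unweighted (bsim's own LCS function additionally weights each matched block by its length); the MAXBLOCKS cap merely keeps the sketch allocation-free:

```c
#include <assert.h>
#include <string.h>

#define MAXBLOCKS 64   /* sketch-level cap on blocks per file */

/* LCS length over two sequences of fingerprint strings: each data
 * block, represented by its fingerprint, plays the role of one
 * "character". Same one-row DP as the character-level version. */
int lcs_blocks(const char **a, int n, const char **b, int m)
{
    int S[MAXBLOCKS + 1];
    int ii, jj;

    assert(m <= MAXBLOCKS);
    for (jj = 0; jj <= m; jj++)
        S[jj] = 0;
    for (ii = 1; ii <= n; ii++) {
        int diag = 0;
        for (jj = 1; jj <= m; jj++) {
            int tmp = S[jj];
            if (strcmp(a[ii - 1], b[jj - 1]) == 0)
                S[jj] = diag + 1;     /* same block in both files */
            else
                S[jj] = (S[jj] >= S[jj - 1]) ? S[jj] : S[jj - 1];
            diag = tmp;
        }
    }
    return S[m];
}

/* Equation 2, unweighted: Sim = 2 * LCS / (n + m). */
double lcs_sim(const char **a, int n, const char **b, int m)
{
    return (n + m) ? 2.0 * lcs_blocks(a, n, b, m) / (n + m) : 0.0;
}
```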
Both similarity algorithms above rely on data chunking; the blocks may be fixed-size or variable-size. For more precise similarity values, the implementation uses a weighted calculation with each block's length as its weight.
3. Bloom Filter Algorithm

The file similarity computation proceeds as follows:
    (1) Split each file into a set of blocks with the CDC algorithm and compute an MD5 fingerprint for each block;
    (2) Compute the intersection and union of the two fingerprint sets, implemented with a hashtable;
    (3) Compute the file similarity per Equation 1, taking duplicate blocks and block lengths into account to improve precision.
    For details, see the file_chunk, chunk_file_process, and similarity_detect functions in the appended bsim source.
4. LCS Algorithm

The file similarity computation proceeds as follows:
    (1) Split each file into a set of blocks with the CDC algorithm and compute an MD5 fingerprint for each block;
    (2) Map each MD5 fingerprint to a "character", so that each file becomes a "string";
    (3) Apply the LCS algorithm to find the longest common subsequence and compute its weighted length;
    (4) Compute the file similarity per Equation 2, taking duplicate blocks and block lengths into account to improve precision.
    For details, see the file_chunk, chunk_file_process, LCS, and similarity_detect functions in the appended bsim source.
5. Analysis and Comparison

Both algorithms chunk the files; suppose file f1 is split into m blocks and f2 into n blocks. The Bloom filter algorithm ignores block order, so its similarity estimate is less precise than the LCS algorithm's, but its time and space complexity are both O(m + n). The LCS algorithm, by contrast, takes block order into account and measures similarity more precisely, but its O(mn) time and space complexity severely limits the scale at which it can be applied. Overall, the Bloom filter algorithm trades some precision for a much lower computational cost, giving it very good performance and applicability. LCS is better suited to precise similarity computation on relatively small files, up to roughly 50 MB. For deduplication and network data synchronization, the dedup ratio and performance do not depend on block order, so the similarity computed by the Bloom filter algorithm is the better fit, and it performs better as well.
Appendix: bsim.c source (for the complete source, see the deduputil project)
/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, visit the http://fsf.org website.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include "hashtable.h"
#include "sync.h"

#define NEITHER       0
#define UP            1
#define LEFT          2
#define UP_AND_LEFT   3

#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))

#define MD5_LEN 17

enum {
    FILE1 = 0,
    FILE2
};

enum {
    LCS_NOT = 0,
    LCS_YES
};

typedef struct {
    uint32_t nr1;   /* number of occurrences of this block in FILE1 */
    uint32_t nr2;   /* number of occurrences of this block in FILE2 */
    uint32_t len;   /* block length, used as similarity weight */
} hash_entry;

typedef struct {
    uint32_t len;   /* number of blocks */
    char **str;     /* block fingerprints, one "character" per block */
} lcs_entry;

static uint32_t sim_union = 0;
static uint32_t sim_intersect = 0;

static void usage()
{
    fprintf(stderr, "Usage: bsim FILE1 FILE2 CHUNK_ALGO LCS\n\n");
    fprintf(stderr, "Similarity detect between FILE1 and FILE2 based on block level.\n");
    fprintf(stderr, "CHUNK_ALGO:\n");
    fprintf(stderr, "  FSP - fixed-size partition\n");
    fprintf(stderr, "  CDC - content-defined chunking\n");
    fprintf(stderr, "  SBC - slide block chunking\n\n");
    fprintf(stderr, "LCS:\n");
    fprintf(stderr, "  LCS_NOT - do not use LCS (longest common subsequence) algorithms\n");
    fprintf(stderr, "  LCS_YES - use LCS algorithms\n\n");
    fprintf(stderr, "Report bugs to <Aigui.Liu@gmail.com>.\n");
}

static int parse_arg(char *argname)
{
    if (0 == strcmp(argname, "FSP"))
        return CHUNK_FSP;
    else if (0 == strcmp(argname, "CDC"))
        return CHUNK_CDC;
    else if (0 == strcmp(argname, "SBC"))
        return CHUNK_SBC;
    else if (0 == strcmp(argname, "LCS_NOT"))
        return LCS_NOT;
    else if (0 == strcmp(argname, "LCS_YES"))
        return LCS_YES;
    else
        return -1;
}

char **alloc_2d_array(int row, int col)
{
    int i;
    char *p, **pp;

    p = (char *)malloc(row * col * sizeof(char));
    pp = (char **)malloc(row * sizeof(char *));
    if (p == NULL || pp == NULL)
        return NULL;

    for (i = 0; i < row; i++) {
        pp[i] = p + col * i;
    }
    return pp;
}

void free_2d_array(char **str)
{
    free(str[0]);
    free(str);
}

void show_md5_hex(unsigned char md5_checksum[16])
{
    int i;

    for (i = 0; i < 16; i++) {
        printf("%02x", md5_checksum[i]);
    }
    printf("\n");
}

int chunk_file_process(char *chunk_file, hashtable *htab, int which, int sim_algo, lcs_entry *le)
{
    int fd, i, ret = 0;
    ssize_t rwsize;
    chunk_file_header chunk_file_hdr;
    chunk_block_entry chunk_bentry;
    hash_entry *he = NULL;

    /* parse the chunk file produced by file_chunk */
    fd = open(chunk_file, O_RDONLY);
    if (-1 == fd) {
        return -1;
    }

    rwsize = read(fd, &chunk_file_hdr, CHUNK_FILE_HEADER_SZ);
    if (rwsize != CHUNK_FILE_HEADER_SZ) {
        ret = -1;
        goto _CHUNK_FILE_PROCESS_EXIT;
    }

    if (sim_algo == LCS_YES) {
        le->str = alloc_2d_array(chunk_file_hdr.block_nr, MD5_LEN);
        if (le->str == NULL) {
            ret = -1;
            goto _CHUNK_FILE_PROCESS_EXIT;
        }
        le->len = chunk_file_hdr.block_nr;
    }

    for (i = 0; i < chunk_file_hdr.block_nr; i++) {
        rwsize = read(fd, &chunk_bentry, CHUNK_BLOCK_ENTRY_SZ);
        if (rwsize != CHUNK_BLOCK_ENTRY_SZ) {
            ret = -1;
            goto _CHUNK_FILE_PROCESS_EXIT;
        }

        he = (hash_entry *)hash_value((void *)chunk_bentry.md5, htab);
        if (he == NULL) {
            he = (hash_entry *)malloc(sizeof(hash_entry));
            he->nr1 = he->nr2 = 0;
            he->len = chunk_bentry.len;
        }
        (which == FILE1) ? he->nr1++ : he->nr2++;
        /* insert or update the entry for this fingerprint */
        hash_insert((void *)strdup(chunk_bentry.md5), (void *)he, htab);
        if (sim_algo == LCS_YES) {
            memcpy(le->str[i], chunk_bentry.md5, MD5_LEN);
        }
    }

_CHUNK_FILE_PROCESS_EXIT:
    close(fd);
    return ret;
}

uint32_t LCS(char **a, int n, char **b, int m, hashtable *htab)
{
    int **S;
    int **R;
    int ii;
    int jj;
    int pos;
    uint32_t len = 0;
    hash_entry *he = NULL;

    /* Memory allocation */
    S = (int **)malloc((n + 1) * sizeof(int *));
    R = (int **)malloc((n + 1) * sizeof(int *));
    if (S == NULL || R == NULL) {
        perror("malloc for S and R in LCS");
        exit(0);
    }
    for (ii = 0; ii <= n; ++ii) {
        S[ii] = (int *)malloc((m + 1) * sizeof(int));
        R[ii] = (int *)malloc((m + 1) * sizeof(int));
        if (S[ii] == NULL || R[ii] == NULL) {
            perror("malloc for S[ii] and R[ii] in LCS");
            exit(0);
        }
    }

    /* It is important to use <=, not <.  The next two for-loops are initialization */
    for (ii = 0; ii <= n; ++ii) {
        S[ii][0] = 0;
        R[ii][0] = UP;
    }
    for (jj = 0; jj <= m; ++jj) {
        S[0][jj] = 0;
        R[0][jj] = LEFT;
    }

    /* This is the main dynamic programming loop that computes the score and
     * backtracking arrays. */
    for (ii = 1; ii <= n; ++ii) {
        for (jj = 1; jj <= m; ++jj) {
            if (strcmp(a[ii - 1], b[jj - 1]) == 0) {
                S[ii][jj] = S[ii - 1][jj - 1] + 1;
                R[ii][jj] = UP_AND_LEFT;
            } else {
                S[ii][jj] = S[ii - 1][jj - 1] + 0;
                R[ii][jj] = NEITHER;
            }
            if (S[ii - 1][jj] >= S[ii][jj]) {
                S[ii][jj] = S[ii - 1][jj];
                R[ii][jj] = UP;
            }
            if (S[ii][jj - 1] >= S[ii][jj]) {
                S[ii][jj] = S[ii][jj - 1];
                R[ii][jj] = LEFT;
            }
        }
    }

    /* The length of the longest subsequence is S[n][m] */
    ii = n;
    jj = m;
    pos = S[ii][jj];

    /* Trace the backtracking matrix. */
    while (ii > 0 || jj > 0) {
        if (R[ii][jj] == UP_AND_LEFT) {
            ii--;
            jj--;
            /* weight the matched block by its length */
            he = (hash_entry *)hash_value((void *)a[ii], htab);
            len += ((he == NULL) ? 0 : he->len);
        } else if (R[ii][jj] == UP) {
            ii--;
        } else if (R[ii][jj] == LEFT) {
            jj--;
        }
    }

    for (ii = 0; ii <= n; ++ii) {
        free(S[ii]);
        free(R[ii]);
    }
    free(S);
    free(R);

    return len;
}

int hash_callback(void *key, void *data)
{
    hash_entry *he = (hash_entry *)data;

    sim_union += (he->len * (he->nr1 + he->nr2));
    sim_intersect += (he->len * MIN(he->nr1, he->nr2));
    return 0;
}

float similarity_detect(hashtable *htab, char **str1, int n, char **str2, int m, int sim_algo)
{
    uint32_t lcs_len = 0;

    hash_for_each_do(htab, hash_callback);
    if (sim_algo == LCS_YES) {
        lcs_len = LCS(str1, n, str2, m, htab);
        return lcs_len * 2.0 / sim_union;
    } else { /* LCS_NOT */
        return sim_intersect * 2.0 / sim_union;
    }
}

int main(int argc, char *argv[])
{
    int chunk_algo = CHUNK_CDC;
    int sim_algo = LCS_NOT;
    char *file1 = NULL;
    char *file2 = NULL;
    lcs_entry le1, le2;
    char tmpname[NAME_MAX_SZ] = {0};
    char template[] = "deduputil_bsim_XXXXXX";
    hashtable *htab = NULL;
    int ret = 0;

    if (argc < 5) {
        usage();
        return -1;
    }

    /* parse chunk algorithms */
    file1 = argv[1];
    file2 = argv[2];
    chunk_algo = parse_arg(argv[3]);
    sim_algo = parse_arg(argv[4]);
    if (chunk_algo == -1 || sim_algo == -1) {
        usage();
        return -1;
    }

    htab = create_hashtable(HASHTABLE_BUCKET_SZ);
    if (htab == NULL) {
        fprintf(stderr, "create hashtable failed\n");
        return -1;
    }

    /* chunk file1 and file2 into blocks */
    le1.str = le2.str = NULL;
    sprintf(tmpname, "/tmp/%s_%d", mktemp(template), getpid());
    ret = file_chunk(file1, tmpname, chunk_algo);
    if (0 != ret) {
        fprintf(stderr, "chunk %s failed\n", file1);
        goto _BENCODE_EXIT;
    }

    ret = chunk_file_process(tmpname, htab, FILE1, sim_algo, &le1);
    if (ret != 0) {
        fprintf(stderr, "parse %s failed\n", file1);
        goto _BENCODE_EXIT;
    }

    ret = file_chunk(file2, tmpname, chunk_algo);
    if (0 != ret) {
        fprintf(stderr, "chunk %s failed\n", file2);
        goto _BENCODE_EXIT;
    }

    ret = chunk_file_process(tmpname, htab, FILE2, sim_algo, &le2);
    if (ret != 0) {
        fprintf(stderr, "parse %s failed\n", file2);
        goto _BENCODE_EXIT;
    }

    printf("similarity = %.4f\n", similarity_detect(htab, le1.str, le1.len, le2.str, le2.len, sim_algo));

_BENCODE_EXIT:
    unlink(tmpname);
    hash_free(htab);
    if (le1.str) free_2d_array(le1.str);
    if (le2.str) free_2d_array(le2.str);

    return ret;
}
Reprinted from: http://blog.csdn.net/liuben/article/details/5870297