【E2LSH源码分析】E2LSH源码综述及主要数据结构
上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://www.voidcn.com/article/p-cxnxpivy-z.html),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基础。 1、代码概况 E2LSH的核心代码可以分为3部分:
其他代码说明:
2、主要的数据结构 (1)RNearNeighborStructureT(LocalitySensitiveHashing.h中定义)——R near neighbor数据结构。该结构包含构建数据结构的参数、哈希函数族gi的描述、结构中数据点的索引和用于存储哈希桶的L个哈希表的指针。 typedef struct _RNearNeighborStructT { IntT dimension; // dimension of points. IntT parameterK; // parameter K of the algorithm. IntT parameterL; // parameter L of the algorithm. RealT parameterW; // parameter W of the algorithm. IntT parameterT; // parameter T of the algorithm. RealT parameterR; // parameter R of the algorithm. RealT parameterR2; // = parameterR^2 // Whether to use <u> hash functions instead of usual <g> // functions. When this flag is set to TRUE,<u> functions are // generated (which are roughly k/2-tuples of LSH),and a <g> // function is a pair of 2 different <u> functions. BooleanT useUfunctions; // the number of tuples of hash functions used (= # of rows of // <lshFunctions>). When useUfunctions == FALSE,this field is equal // to parameterL,otherwise,to <m>,the number of <u> hash // functions (in this case,parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2 IntT nHFTuples; // How many LSH functions each of the tuple has (it is <k> when // useUfunctions == FALSE,and <k/2> when useUfunctions == TRUE). IntT hfTuplesLength; // number of points in the data set Int32T nPoints; // The array of pointers to the points that are contained in the // structure. Some types of this structure (of UHashStructureT,// actually) use indeces in this array to refer to points (as // opposed to using pointers). PPointT *points; // The size of the array <points> Int32T pointsArraySize; // If <reportingResult> == FALSE,no points are reported back in a // <get*> function. In particular any point that is found in the // bucket is considered to be outside the R-ball of the query point // (the distance is still computed). If <reportingResult> == TRUE,// then the structure behaves normally. BooleanT reportingResult; // This table stores the LSH functions. There are <nHFTuples> rows // of <hfTuplesLength> LSH functions. LSHFunctionT **lshFunctions; // Precomputed hashes of each of the <nHFTuples> of <u> functions // (to be used by the bucket hashing module). Uns32T **precomputedHashesOfULSHs; // The set of non-empty buckets (which are hashed using // PUHashStructureT). PUHashStructureT *hashedBuckets; // *** // The following vectors are used only for temporary operations // within this R-NN structure during a query operation. // *** // This vector is used to store the values of hash functions <u> // (<hfTuplesLength>-tuple of LSH fuctions). One <g> function is a concatenation // of <nHFTuples> of <u> LSH functions. Uns32T **pointULSHVectors; // A vector of length <dimension> to store the reduced point (point // with coordinates divided by <parameterR>). RealT *reducedPoint; // This vector is used for storing marked points in a query // operation (for computing distances to a point at most once). If // markedPoints[i]=TRUE then point <i> was examined already. BooleanT *markedPoints; // This vector stored the indeces in the vector <markedPoints> of all // TRUE entries. Int32T *markedPointsIndeces; // the size of <markedPoints> and of <markedPointsIndeces> IntT sizeMarkedPoints; } RNearNeighborStructT,*PRNearNeighborStructT; (2)UHashStructureT(BucketHashing.h中定义)——该结构定义了用于映射哈希桶的哈希表。通过链表方式解决冲突的问题。 主要有两种哈希表类型:HT_LINKED_LIST 和HT_HYBRID_CHAINS。 HT_LINKED_LIST对应哈希表的链表版本:对于每一个哈希表,我们存储一个指向由桶(buckets)构成的单向链表(singly-linked list)的指针;对于每一个桶,我们存储该桶的指纹h2(·)和指向由数据点构成的单向链表的指针。 HT_HYBRID_CHAINS对应混合存储的哈希表:该方案会对数据集P中的所有数据点建立索引,并且存储所有桶(all buckets with values h2(·))和数据点。在哈希表Hi中,我们在索引l中保存混合存储表Y的一些索引ek,其中ek标识了链表k的起始位置;对于一个链表(chain),分别存储着链中第一个桶的h2(·)值和桶中所有数据点的索引,链中第二个桶的h2(·)值和桶中所有数据点的索引,以此类推。 两种哈希表都具有指向哈希函数h1(·)和h2(·)的指针。 typedef struct _UHashStructureT { // The type of the hash table (can take values HT_*). when // <typeHT>=HT_LINKED_LIST,chains&buckets are linked lists. when // <typeHT>=HT_PACKED,chains&buckets are static arrays. when // <typeHT>=HT_STATISTICS,chains are static arrays and buckets only // count # of elements. when <typeHT>=HT_HYBRID_CHAINS,a chain is // a "hybrid" array that contains both the buckets and the points // (the an element of the chain array is of type // <HybridChainEntryT>). all chains are conglamerated in the same // array <hybridChainsStorage>. IntT typeHT; // The array containing the hash slots of the universal hashing. union _hashTableT { PGBucketT *llHashTable; PackedGBucketT **packedHashTable; LinkPackedGBucketT **linkHashTable; PHybridChainEntryT *hybridHashTable; } hashTable; // The sizes of each of the chains of the hashtable (used only when // typeHT=HT_PACKED or HT_STATISTICS. IntT *chainSizes; union _bucketPoints{ PPointT *pointsArray; PointsListEntryT *pointsList; } bucketPoints; HybridChainEntryT *hybridChainsStorage; // The size of hashTable. Int32T hashTableSize; // Number of elements(buckets) stored in the hash table in total (that // is the number of non-empty buckets). Int32T nHashedBuckets; Int32T nHashedPoints; // Unused (but allocated) instances of the corresponding // structs. May be reused if needed (instead of allocated new // memory). PGBucketT unusedPGBuckets; PBucketEntryT unusedPBucketEntrys; Uns32T prime; // the prime used for the universal hash functions. IntT hashedDataLength;// the number of IntT's in an element from U (U is the set of values to hash). // The hash functions used for the universal hashing. // The main hash function (that defines the index // of the slot in the table). // The type of the hash function is: h_{a}(k) = ((acdot k)mod p)mod hashTableSize. Uns32T *mainHashA; // Control hash functions: used to compute/check the <controlValue>s // of <GBucket>s. // The type of the hash function is: h_{a}(k) = (acdot k)mod p Uns32T *controlHash1; } UHashStructureT,*PUHashStructureT; (3)RNNParametersT(LocalitySensitiveHashing.h中定义)——包含构建RNearNeighborStructureT数据结构的必要参数的结构体。 typedef struct _RNNParametersT { RealT parameterR; // parameter R of the algorithm. RealT successProbability; // the success probability 1-delta IntT dimension; // dimension of points. RealT parameterR2; // = parameterR^2 // Whether to use <u> hash functions instead of usual <g> // functions. When this flag is set to TRUE,and a <g> // function is a pair of 2 different <u> functions. BooleanT useUfunctions; IntT parameterK; // parameter K of the algorithm. // parameter M (# of independent tuples of LSH functions) // if useUfunctions==TRUE,parameterL = parameterM * (parameterM - 1) / 2 // if useUfunctions==FALSE,parameterL = parameterM IntT parameterM; IntT parameterL; // parameter L of the algorithm. RealT parameterW; // parameter W of the algorithm. IntT parameterT; // parameter T of the algorithm. // The type of the hash table used for storing the buckets (of the // same <g> function). IntT typeHT; } RNNParametersT,*PRNNParametersT; (4)PPoint(Geometry.h中定义)——用于存储数据点的结构体。该结构包含数据的坐标(coordinates),数据点范数的平方和该数据点在数据集P的下标索引。 typedef struct _PointT { IntT index; // the index of this point in the dataset list of points RealT *coordinates; RealT sqrLength; // the square of the length of the vector } PointT,*PPointT; 转载请注明作者及文章出处:http://www.voidcn.com/article/p-hhrafddr-z.html E2LSH源代码下载地址:http://download.csdn.net/detail/jasonding1354/7704277 参考资料: 1、M.Datar,N.Immorlica,P.Indyk,and V.Mirrokni,“Locality-SensitiveHashing Scheme Based on p-Stable Distributions,”Proc.Symp. ComputationalGeometry,2004. 2、A.Andoni,P.Indyk.E2lsh:Exact Euclidean locality-sensitive hashing.http://web.mit.edu/andoni/www/LSH/.2004. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |