加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 大数据 > 正文

【E2LSH源码分析】E2LSH源码综述及主要数据结构

发布时间:2020-12-14 03:07:13 所属栏目:大数据 来源:网络整理
导读:上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://www.voidcn.com/article/p-cxnxpivy-z.html),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基

上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://www.voidcn.com/article/p-cxnxpivy-z.html),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基础。


1、代码概况

E2LSH的核心代码可以分为3部分:

  • LocalitySensitiveHashing.cpp——主要包含基于LSH的RNN(R-near neighbor)数据结构。其主要功能是根据参数构建数据结构进行查询数据对象的功能;
  • BucketHashing.cpp——主要包含对于哈希桶的普通哈希表。其主要功能是构建哈希表,添加哈希桶到表中和查询哈希桶;
  • SelfTuning.cpp——包含计算RNN数据结构最佳参数的函数。

其他代码说明:

  • Geometry.h——包含对数据点的定义(数据类型PPoint);
  • NearNeighbors.cpp,NearNeighbors.h——包含E2LSH核心代码的函数接口;
  • Random.cpp,Random.h——包含伪随机数产生器;
  • BasicDefinitions.h——通用的类型定义和宏定义;
  • Utils.cpp,Utils.h——包含一些通用的函数(如复制向量)。

2、主要的数据结构

(1)RNearNeighborStructureT(LocalitySensitiveHashing.h中定义)——R near neighbor数据结构。该结构包含构建数据结构的参数、哈希函数族gi的描述、结构中数据点的索引和用于存储哈希桶的L个哈希表的指针。

typedef struct _RNearNeighborStructT {
  IntT dimension; // dimension of points.
  IntT parameterK; // parameter K of the algorithm.
  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.
  RealT parameterR; // parameter R of the algorithm.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE,<u> functions are
  // generated (which are roughly k/2-tuples of LSH),and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  // the number of tuples of hash functions used (= # of rows of
  // <lshFunctions>). When useUfunctions == FALSE,this field is equal
  // to parameterL,otherwise,to <m>,the number of <u> hash
  // functions (in this case,parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2
  IntT nHFTuples;
  // How many LSH functions each of the tuple has (it is <k> when
  // useUfunctions == FALSE,and <k/2> when useUfunctions == TRUE).
  IntT hfTuplesLength;

  // number of points in the data set
  Int32T nPoints;

  // The array of pointers to the points that are contained in the
  // structure. Some types of this structure (of UHashStructureT,// actually) use indeces in this array to refer to points (as
  // opposed to using pointers).
  PPointT *points;

  // The size of the array <points>
  Int32T pointsArraySize;

  // If <reportingResult> == FALSE,no points are reported back in a
  // <get*> function. In particular any point that is found in the
  // bucket is considered to be outside the R-ball of the query point
  // (the distance is still computed).  If <reportingResult> == TRUE,// then the structure behaves normally.
  BooleanT reportingResult;
  
  // This table stores the LSH functions. There are <nHFTuples> rows
  // of <hfTuplesLength> LSH functions.
  LSHFunctionT **lshFunctions;

  // Precomputed hashes of each of the <nHFTuples> of <u> functions
  // (to be used by the bucket hashing module).
  Uns32T **precomputedHashesOfULSHs;

  // The set of non-empty buckets (which are hashed using
  // PUHashStructureT).
  PUHashStructureT *hashedBuckets;

  // ***
  // The following vectors are used only for temporary operations
  // within this R-NN structure during a query operation.
  // ***

  // This vector is used to store the values of hash functions <u>
  // (<hfTuplesLength>-tuple of LSH fuctions). One <g> function is a concatenation
  // of <nHFTuples> of <u> LSH functions.
  Uns32T **pointULSHVectors;
  
  // A vector of length <dimension> to store the reduced point (point
  // with coordinates divided by <parameterR>).
  RealT *reducedPoint;

  // This vector is used for storing marked points in a query
  // operation (for computing distances to a point at most once). If
  // markedPoints[i]=TRUE then point <i> was examined already.
  BooleanT *markedPoints;
  // This vector stored the indeces in the vector <markedPoints> of all
  // TRUE entries.
  Int32T *markedPointsIndeces;
  // the size of <markedPoints> and of <markedPointsIndeces>
  IntT sizeMarkedPoints;
} RNearNeighborStructT,*PRNearNeighborStructT;

(2)UHashStructureT(BucketHashing.h中定义)——该结构定义了用于映射哈希桶的哈希表。通过链表方式解决冲突的问题。

主要有两种哈希表类型:HT_LINKED_LISTHT_HYBRID_CHAINS

HT_LINKED_LIST对应哈希表的链表版本:对于每一个哈希表,我们存储一个指向由桶(buckets)构成的单向链表(singly-linked list)的指针;对于每一个桶,我们存储该桶的指纹h2(·)和指向由数据点构成的单向链表的指针

HT_HYBRID_CHAINS对应混合存储的哈希表:该方案会对数据集P中的所有数据点建立索引,并且存储所有桶(all buckets with values h2(·))和数据点。在哈希表Hi中,我们在索引l中保存混合存储表Y的一些索引ek,其中ek标识了链表k的起始位置;对于一个链表(chain),分别存储着链中第一个桶的h2(·)值和桶中所有数据点的索引,链中第二个桶的h2(·)值和桶中所有数据点的索引,以此类推。

两种哈希表都具有指向哈希函数h1(·)和h2(·)的指针。

typedef struct _UHashStructureT {
  // The type of the hash table (can take values HT_*). when
  // <typeHT>=HT_LINKED_LIST,chains&buckets are linked lists. when
  // <typeHT>=HT_PACKED,chains&buckets are static arrays. when
  // <typeHT>=HT_STATISTICS,chains are static arrays and buckets only
  // count # of elements.  when <typeHT>=HT_HYBRID_CHAINS,a chain is
  // a "hybrid" array that contains both the buckets and the points
  // (the an element of the chain array is of type
  // <HybridChainEntryT>). all chains are conglamerated in the same
  // array <hybridChainsStorage>.
  IntT typeHT;

  // The array containing the hash slots of the universal hashing.
  union _hashTableT {
    PGBucketT *llHashTable;
    PackedGBucketT **packedHashTable;
    LinkPackedGBucketT **linkHashTable;
    PHybridChainEntryT *hybridHashTable;
  } hashTable;

  // The sizes of each of the chains of the hashtable (used only when
  // typeHT=HT_PACKED or HT_STATISTICS.
  IntT *chainSizes;

  union _bucketPoints{
    PPointT *pointsArray;
    PointsListEntryT *pointsList;
  } bucketPoints;

  HybridChainEntryT *hybridChainsStorage;

  // The size of hashTable.
  Int32T hashTableSize;

  // Number of elements(buckets) stored in the hash table in total (that
  // is the number of non-empty buckets).
  Int32T nHashedBuckets;

  Int32T nHashedPoints;

  // Unused (but allocated) instances of the corresponding
  // structs. May be reused if needed (instead of allocated new
  // memory).
  PGBucketT unusedPGBuckets;
  PBucketEntryT unusedPBucketEntrys;

  Uns32T prime; // the prime used for the universal hash functions.
  IntT hashedDataLength;// the number of IntT's in an element from U (U is the set of values to hash).

  // The hash functions used for the universal hashing.  

  // The main hash function (that defines the index
  // of the slot in the table).
  // The type of the hash function is: h_{a}(k) = ((acdot k)mod p)mod hashTableSize.
  Uns32T *mainHashA;

  // Control hash functions: used to compute/check the <controlValue>s
  // of <GBucket>s.
  // The type of the hash function is: h_{a}(k) = (acdot k)mod p
  Uns32T *controlHash1;
} UHashStructureT,*PUHashStructureT;

(3)RNNParametersT(LocalitySensitiveHashing.h中定义)——包含构建RNearNeighborStructureT数据结构的必要参数的结构体。

typedef struct _RNNParametersT {
  RealT parameterR; // parameter R of the algorithm.
  RealT successProbability; // the success probability 1-delta
  IntT dimension; // dimension of points.
  RealT parameterR2; // = parameterR^2

  // Whether to use <u> hash functions instead of usual <g>
  // functions. When this flag is set to TRUE,and a <g>
  // function is a pair of 2 different <u> functions.
  BooleanT useUfunctions;

  IntT parameterK; // parameter K of the algorithm.
  
  // parameter M (# of independent tuples of LSH functions)
  // if useUfunctions==TRUE,parameterL = parameterM * (parameterM - 1) / 2
  // if useUfunctions==FALSE,parameterL = parameterM
  IntT parameterM;

  IntT parameterL; // parameter L of the algorithm.
  RealT parameterW; // parameter W of the algorithm.
  IntT parameterT; // parameter T of the algorithm.

  // The type of the hash table used for storing the buckets (of the
  // same <g> function).
  IntT typeHT;
} RNNParametersT,*PRNNParametersT;

(4)PPoint(Geometry.h中定义)——用于存储数据点的结构体。该结构包含数据的坐标(coordinates),数据点范数的平方和该数据点在数据集P的下标索引。

typedef struct _PointT {
  IntT index; // the index of this point in the dataset list of points
  RealT *coordinates;
  RealT sqrLength; // the square of the length of the vector
} PointT,*PPointT;

转载请注明作者及文章出处:http://www.voidcn.com/article/p-hhrafddr-z.html

E2LSH源代码下载地址:http://download.csdn.net/detail/jasonding1354/7704277

参考资料:

1、M.Datar,N.Immorlica,P.Indyk,and V.Mirrokni,“Locality-SensitiveHashing Scheme Based on p-Stable Distributions,”Proc.Symp. ComputationalGeometry,2004.

2、A.Andoni,P.Indyk.E2lsh:Exact Euclidean locality-sensitive hashing.http://web.mit.edu/andoni/www/LSH/.2004.

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读