PostgreSQL的9.6黑科技绽放算法索引

发布时间：2020-12-13 18:17:00 所属栏目：百科来源：网络整理

导读：《PostgreSQL的9.6黑科技绽放算法索引》要点：本文介绍了PostgreSQL的9.6黑科技绽放算法索引，希望对您有用。如果有疑问，可以联系我们。 PostgreSQL 确实是学术界和工业界的璀璨明珠,它总是喜欢将学术界的一些玩意工业化,这次的bloom又是一个代表. 在PG很

《PostgreSQL的9.6黑科技绽放算法索引》要点：
本文介绍了PostgreSQL的9.6黑科技绽放算法索引，希望对您有用。如果有疑问，可以联系我们。

PostgreSQL 确实是学术界和工业界的璀璨明珠,它总是喜欢将学术界的一些玩意工业化,这次的bloom又是一个代表.

在PG很多的地方都能看到学术的影子,好比pgbench支持产生泊松分布,高斯分布的随机值.

bloom filter是一个有损过滤器,使用有限的比特位存储一些唯一值集合所发生的bits.

通过这些bits可以满足这样的场景需求,给定一个值,判断这个值是否属于这个集合.

例如

create table test(c1 int);
insert into test select trunc(random()*100000) from generate_series(1,10000);

使用所有的 test.c1 值,通过bloom filter算法生成一个值val.

然后给定一个值例如 100,判断100是否在test.c1中.

select * from test where c1=100;

通过bloom filter可以快速得到,不必要遍历表来得到.

判断办法是使用100和val以及统一的bloom算法.

可能得到的成果是true or false.

true表现100在这里面,false表现100不在这里面.

必须注意,由于bloom filter是有损过滤器,并且真的不必定为真,但是假的必定为假.

PostgreSQL 9.6使用custom access methods接口定义了一个索引接口bloom,使用到它的特性：真的不必定为真,但是假的必定为假.

目前已实现的场景是,支持=查询,但是这个=会包括一些假的值,所以需要recheck.反过来,它如果要支持<>也是很方便的,并且不需要recheck.

使用PostgreSQL 函数接口也能实现bloom过滤器.

bloom必要m个bit位.

添加元素时,必要k个hash函数,通过每一个hash和传入的值计算得到另一个值([0,m]).

得到的值用于设置对应的bit位为1.

例子

创立一个类型,存储bloom.

CREATE TYPE dumbloom AS (
m integer,-- bit 位数
k integer,-- hash 函数数量
-- Our bit array is actually an array of integers
bits integer[] -- bit
);

创立一个空的bloom,设置false值异常设置为TRUE的概率p,设置期望存储多少个唯一值n .

CREATE FUNCTION dumbloom_empty (
-- 2% false probability
p float8 DEFAULT 0.02,
-- 100k expected uniques
n integer DEFAULT 100000
) RETURNS dumbloom AS
$$
DECLARE
m integer;
k integer;
i integer;
bits integer[];
BEGIN
-- Getting m and k from n and p is some math sorcery
-- See: https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
m := abs(ceil(n * ln(p) / (ln(2) ^ 2)))::integer;
k := round(ln(2) * m / n)::integer;
bits := NULL;
-- Initialize all bits to 0
FOR i in 1 .. ceil(m / 32.0) LOOP
bits := array_append(bits,0);
END LOOP;
RETURN (m,k,bits)::dumbloom;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

创立一个指纹函数,存储使用K个哈希函数得到的值,存入数组.

PostgreSQL的9.6黑科技绽放算法索引

CREATE FUNCTION dumbloom_fingerprint (
b dumbloom,
item text
) RETURNS integer[] AS
$$
DECLARE
h1 bigint;
h2 bigint;
i integer;
fingerprint integer[];
BEGIN
h1 := abs(hashtext(upper(item)));
-- If lower(item) and upper(item) are the same,h1 and h2 will be identical too,
-- let's add some random salt
h2 := abs(hashtext(lower(item) || 'yo dawg!'));
finger := NULL;
FOR i IN 1 .. b.k LOOP
-- This combinatorial approach works just as well as using k independent
-- hash functions,but is obviously much faster
-- See: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
fingerprint := array_append(fingerprint,((h1 + i * h2) % b.m)::integer);
END LOOP;
RETURN fingerprint;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

添加元素的函数

同样也是设置对应的bit为1

CREATE FUNCTION dumbloom_add (
b dumbloom,
item text,
) RETURNS dumbloom AS
$$
DECLARE
i integer;
idx integer;
BEGIN
IF b IS NULL THEN
b := dumbloom_empty(); -- 生成空bloom
END IF;
FOREACH i IN ARRAY dumbloom_fingerprint(b,item) LOOP -- 设置k个哈希发生的值对应的bit位为1
-- Postgres uses 1-indexing,hence the + 1 here
idx := i / 32 + 1;
b.bits[idx] := b.bits[idx] | (1 << (i % 32));
END LOOP;
RETURN b;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

是否包括某元素

PostgreSQL的9.6黑科技绽放算法索引

CREATE FUNCTION dumbloom_contains (
b dumbloom,
item text
) RETURNS boolean AS
$$
DECLARE
i integer;
idx integer;
BEGIN
IF b IS NULL THEN
RETURN FALSE;
END IF;
FOREACH i IN ARRAY dumbloom_fingerprint(b,item) LOOP -- 计算k个哈希发生的值,判断是否有非1的bit,有则返回false,如果全部为1则返回true.
idx := i / 32 + 1;
IF NOT (b.bits[idx] & (1 << (i % 32)))::boolean THEN
RETURN FALSE;
END IF;
END LOOP;
RETURN TRUE;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

测试

CREATE TABLE t (
users dumbloom
);
INSERT INTO t VALUES (dumbloom_empty());
UPDATE t SET users = dumbloom_add(users,'usmanm');
UPDATE t SET users = dumbloom_add(users,'billyg');
UPDATE t SET users = dumbloom_add(users,'pipeline');
-- This first three will return true
SELECT dumbloom_contains(users,'usmanm') FROM t;
SELECT dumbloom_contains(users,'billyg') FROM t;
SELECT dumbloom_contains(users,'pipeline') FROM t;
-- This will return false because we never added 'unknown' to the Bloom filter
SELECT dumbloom_contains(users,'unknown') FROM t;

以上例子来自pipelinedb文档,用C语言写成聚合函数,可以实现流式计算.

https://www.pipelinedb.com/blog/making-postgres-bloom

接下来是PostgreSQL 9.6的例子,9.6是将它做成了索引,而不是聚合.

(如果有朋友想使用pipelinedb的bloom聚合,可以看看PIPELINEDB的代码,PORT过来)

postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test select trunc(100000000*(random())) from generate_series(1,100000000);
INSERT 0 100000000
postgres=# create index idx_test_id on test using bloom(id);
CREATE INDEX
postgres=# select * from test limit 10;
id
----------
16567697
17257165
78384532
96331329
62449166
3965065
80439767
54772860
34960167
30594730
(10 rows)
位图扫
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where id=16567697;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=946080.00..946082.03 rows=2 width=4) (actual time=524.545..561.168 rows=3 loops=1)
Output: id
Recheck Cond: (test.id = 16567697)
Rows Removed by Index Recheck: 30870
Heap Blocks: exact=29846
Buffers: shared hit=225925
-> Bitmap Index Scan on idx_test_id (cost=0.00..946080.00 rows=2 width=0) (actual time=517.448..517.448 rows=30873 loops=1)
Index Cond: (test.id = 16567697)
Buffers: shared hit=196079
Planning time: 0.084 ms
Execution time: 561.535 ms
(11 rows)
全表扫
postgres=# explain (analyze,buffers) select * from test where id=16567697;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on public.test (cost=0.00..1692478.00 rows=2 width=4) (actual time=0.017..8270.536 rows=3 loops=1)
Output: id
Filter: (test.id = 16567697)
Rows Removed by Filter: 99999997
Buffers: shared hit=442478
Planning time: 0.077 ms
Execution time: 8270.564 ms
(7 rows)

多个字段测试,16个字段,任意测试组合,速度和recheck有关,recheck越少越好.

不仅仅支持and还支持or,只是OR的条件是bitmap的,会慢一些.

postgres=# create table test1(c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int,c10 int,c11 int,c12 int,c13 int,c14 int,c15 int,c16 int);
CREATE TABLE
postgres=# insert into test1 select i,i+1,i-1,i+2,i-2,i+3,i-3,i+4,i-4,i+5,i-5,i+6,i-6,i+7,i-7,i+8 from (select trunc(100000000*(random())) i from generate_series(1,10000000)) t;
INSERT 0 10000000
postgres=# create index idx_test1_1 on test1 using bloom (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15);
CREATE INDEX
postgres=# dt+ test1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------+-------+----------+--------+-------------
public | test1 | table | postgres | 888 MB |
(1 row)
postgres=# di+ idx_test1_1
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------+-------+----------+-------+--------+-------------
public | idx_test1_1 | index | postgres | test1 | 153 MB |
(1 row)
postgres=# select * from test1 limit 10;
c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 | c11 | c12 | c13 | c14 | c15 | c16
----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------
68747916 | 68747917 | 68747915 | 68747918 | 68747914 | 68747919 | 68747913 | 68747920 | 68747912 | 68747921 | 68747911 | 68747922 | 68747910 | 68747923 | 68747909 | 68747924
36630121 | 36630122 | 36630120 | 36630123 | 36630119 | 36630124 | 36630118 | 36630125 | 36630117 | 36630126 | 36630116 | 36630127 | 36630115 | 36630128 | 36630114 | 36630129
72139701 | 72139702 | 72139700 | 72139703 | 72139699 | 72139704 | 72139698 | 72139705 | 72139697 | 72139706 | 72139696 | 72139707 | 72139695 | 72139708 | 72139694 | 72139709
35950519 | 35950520 | 35950518 | 35950521 | 35950517 | 35950522 | 35950516 | 35950523 | 35950515 | 35950524 | 35950514 | 35950525 | 35950513 | 35950526 | 35950512 | 35950527
15285103 | 15285104 | 15285102 | 15285105 | 15285101 | 15285106 | 15285100 | 15285107 | 15285099 | 15285108 | 15285098 | 15285109 | 15285097 | 15285110 | 15285096 | 15285111
43537916 | 43537917 | 43537915 | 43537918 | 43537914 | 43537919 | 43537913 | 43537920 | 43537912 | 43537921 | 43537911 | 43537922 | 43537910 | 43537923 | 43537909 | 43537924
38702018 | 38702019 | 38702017 | 38702020 | 38702016 | 38702021 | 38702015 | 38702022 | 38702014 | 38702023 | 38702013 | 38702024 | 38702012 | 38702025 | 38702011 | 38702026
59069936 | 59069937 | 59069935 | 59069938 | 59069934 | 59069939 | 59069933 | 59069940 | 59069932 | 59069941 | 59069931 | 59069942 | 59069930 | 59069943 | 59069929 | 59069944
6608034 | 6608035 | 6608033 | 6608036 | 6608032 | 6608037 | 6608031 | 6608038 | 6608030 | 6608039 | 6608029 | 6608040 | 6608028 | 6608041 | 6608027 | 6608042
35486917 | 35486918 | 35486916 | 35486919 | 35486915 | 35486920 | 35486914 | 35486921 | 35486913 | 35486922 | 35486912 | 35486923 | 35486911 | 35486924 | 35486910 | 35486925
(10 rows)

任意字段组合查询都能用上这个索引

postgres=# explain (analyze,buffers) select * from test1 where c8=68747920 and c10=68747921 and c16=68747924 and c7=68747913 and c5=68747914;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test1 (cost=169609.00..169610.02 rows=1 width=64) (actual time=101.724..102.317 rows=1 loops=1)
Output: c1,c15,c16
Recheck Cond: ((test1.c5 = 68747914) AND (test1.c7 = 68747913) AND (test1.c8 = 68747920) AND (test1.c10 = 68747921))
Rows Removed by Index Recheck: 425
Filter: (test1.c16 = 68747924)
Heap Blocks: exact=425
Buffers: shared hit=20033
-> Bitmap Index Scan on idx_test1_1 (cost=0.00..169609.00 rows=1 width=0) (actual time=101.636..101.636 rows=426 loops=1)
Index Cond: ((test1.c5 = 68747914) AND (test1.c7 = 68747913) AND (test1.c8 = 68747920) AND (test1.c10 = 68747921))
Buffers: shared hit=19608
Planning time: 0.129 ms
Execution time: 102.364 ms
(12 rows)
postgres=# explain (analyze,buffers) select * from test1 where c8=68747920 and c10=68747921 and c16=68747924 and c7=68747913 and c5=68747914 and c12=68747922;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test1 (cost=194609.00..194610.03 rows=1 width=64) (actual time=54.702..54.746 rows=1 loops=1)
Output: c1,c16
Recheck Cond: ((test1.c5 = 68747914) AND (test1.c7 = 68747913) AND (test1.c8 = 68747920) AND (test1.c10 = 68747921) AND (test1.c12 = 68747922))
Rows Removed by Index Recheck: 27
Filter: (test1.c16 = 68747924)
Heap Blocks: exact=28
Buffers: shared hit=19636
-> Bitmap Index Scan on idx_test1_1 (cost=0.00..194609.00 rows=1 width=0) (actual time=54.667..54.667 rows=28 loops=1)
Index Cond: ((test1.c5 = 68747914) AND (test1.c7 = 68747913) AND (test1.c8 = 68747920) AND (test1.c10 = 68747921) AND (test1.c12 = 68747922))
Buffers: shared hit=19608
Planning time: 0.141 ms
Execution time: 54.814 ms
(12 rows)

如果使用其他的索引办法,任意条件组合查询,需要为每一种组合创建一个索引来支持.

而使用bloom索引办法,只需要创建一个索引就够了.http://www.postgresql.org/docs/9.6/static/bloom.html

更多深度技术内容,请存眷云栖社区微信公众号：yunqiinsight.

编程之家PHP培训学院每天发布《PostgreSQL的9.6黑科技绽放算法索引》等实战技能，PHP、MYSQL、LINUX、APP、JS,CSS全面培养人才。

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!