postgresql – Searching 300 million addresses with pg_trgm

Published: 2020-12-13 16:00:55 | Category: Encyclopedia | Source: compiled from the web

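I have 300 million addresses in my PostgreSQL 9.3 database, and I want to use pg_trgm to fuzzy-search the rows. The final goal is a search function that behaves like Google Maps search.

When I search these addresses with pg_trgm, it takes about 30 seconds to get results. Many rows meet the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index: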

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);

This is my query:

SELECT address,similarity(address,'981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;

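The test table in the production environment has been dropped, so the EXPLAIN output below comes from my test environment. That table has about 7 million rows, and the query takes about 1.6 seconds. With 300 million rows, it takes more than 30 seconds.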

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                    QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
 ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
     Sort Key: (similarity((address)::text,'781 maun st'::text))
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
           Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)

Is there a good way to improve the performance, or is table partitioning a good plan?

Solution

PostgreSQL 9.3 … Is there a good way to improve the performance or is it a good plan to do table partitioning?

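Table partitioning will not help at all.

But yes, there is a good way: upgrade to a current version of Postgres. GiST indexes have seen many improvements, particularly for the pg_trgm module and for big data in general. The query should be substantially faster with Postgres 9.6 or the upcoming Postgres 10 (currently in beta).

Your "nearest neighbor" query looks correct, but for a small LIMIT use this equivalent query instead: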

SELECT address, similarity(address, '981 maun st') AS sml
FROM   addresses
WHERE  address % '981 maun st'
ORDER  BY address <-> '981 maun st'
LIMIT  10;

Quoting the manual:

It will usually beat the first formulation when only a small number of the closest matches is wanted.
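
The two formulations are equivalent because the pg_trgm `<->` operator returns one minus the `similarity()` value, so sorting by ascending distance gives the same order as sorting by descending similarity. The following is a minimal pure-Python sketch of that relationship, not the actual pg_trgm implementation. It assumes the trigram rule described in the pg_trgm documentation (lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space) and ASCII-only input. The function names and the sample addresses are hypothetical.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text (approximation of the pg_trgm rule)."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits, compared in lowercase.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distance(a, b):
    """Counterpart of the <-> operator: one minus the similarity."""
    return 1.0 - similarity(a, b)


# Hypothetical sample rows standing in for the addresses table.
addresses = ["981 main st", "981 maun st", "189 main ave", "42 elm st"]
query = "981 maun st"

# First formulation: ORDER BY similarity DESC.
by_similarity = sorted(addresses, key=lambda s: similarity(s, query), reverse=True)
# Second formulation: ORDER BY distance ASC.
by_distance = sorted(addresses, key=lambda s: distance(s, query))

print(by_similarity[:2])
print(by_similarity == by_distance)
```

In this sketch, '981 main st' shares 9 of 15 distinct trigrams with the search string, a similarity of 0.6, which is well above the default threshold of 0.3. The sketch only shows that the two orderings agree. The speed difference comes from the database side, where the `<->` form lets the GiST index return rows in distance order so that the scan can stop once LIMIT rows have been found.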

(Editor: 李大同)