postgresql – 提高大型PostgresSQL表中COUNT / GROUP-BY的性能

发布时间：2020-12-13 16:23:04 所属栏目：百科来源：网络整理

导读：我正在运行PostgresSQL 9.2并且具有12列关系,大约6,700,000行.它包含3D空间中的节点,每个节点都引用一个用户(创建它的用户).要查询哪个用户创建了多少个节点,我将执行以下操作(添加了解释分析以获取更多信息)： EXPLAIN ANALYZE SELECT user_id,count(user_i

我正在运行PostgresSQL 9.2并且具有12列关系,大约6,700,000行.它包含3D空间中的节点,每个节点都引用一个用户(创建它的用户).要查询哪个用户创建了多少个节点,我将执行以下操作(添加了解释分析以获取更多信息)：

EXPLAIN ANALYZE SELECT user_id,count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                    QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1747.653 ms

如您所见,这需要大约1.7秒.考虑到数据量,这不是太糟糕,但我想知道这是否可以改进.我试图在用户列上添加BTree索引,但这没有任何帮助.

你有其他建议吗？

为了完整起见,这是包含所有索引的完整表定义(没有外键约束,引用和触发器)：

Column     |           Type           |                      Modifiers                    
---------------+--------------------------+------------------------------------------------------
 id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
 user_id       | bigint                   | not null
 creation_time | timestamp with time zone | not null default now()
 edition_time  | timestamp with time zone | not null default now()
 project_id    | bigint                   | not null
 location      | double3d                 | not null
 reviewer_id   | integer                  | not null default (-1)
 review_time   | timestamp with time zone |
 editor_id     | integer                  |
 parent_id     | bigint                   |
 radius        | double precision         | not null default 0
 confidence    | integer                  | not null default 5
 skeleton_id   | bigint                   |
Indexes:
    "treenode_pkey" PRIMARY KEY,btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT,btree (id)
    "skeleton_id_treenode_index" btree (skeleton_id)
    "treenode_editor_index" btree (editor_id)
    "treenode_location_x_index" btree (((location).x))
    "treenode_location_y_index" btree (((location).y))
    "treenode_location_z_index" btree (((location).z))
    "treenode_parent_id" btree (parent_id)
    "treenode_user_index" btree (user_id)

编辑：这是结果,当我使用@ypercube提出的查询(和索引)时(查询大约需要5.3秒而没有EXPLAIN ANALYZE)：

EXPLAIN ANALYZE SELECT u.id,( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
                                                                        QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
           ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                 Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                 Rows Removed by Index Recheck: 461076
                 ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                       Index Cond: ((project_id = 1) AND (user_id = u.id))
 Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms

编辑2：这是结果,当我使用project_id上的索引,user_id(但尚未进行模式优化)@ erwin-brandstetter建议时(查询以与原始查询相同的速度运行1.5秒)：

EXPLAIN ANALYZE SELECT user_id,count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                        QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1807.368 ms
(4 rows)

主要问题是缺少索引.但还有更多.

SELECT user_id,count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;

>你有很多bigint专栏.可能是矫枉过正.通常,integer对于像project_id和user_id这样的列来说已经足够了.这也有助于下一个项目.
在优化表定义时,请考虑此相关答案,重点是数据对齐和填充.但其余大多数也适用：

> Configuring PostgreSQL for read performance

>房间里的大象：project_id上没有索引.创建一个.这比其他答案更重要.
在此期间,将其作为多列索引：

CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id,user_id);

如果您遵循我的建议,整数将是完美的：

> Is a composite index also good for queries on the first field?

> user_id定义为NOT NULL,因此count(user_id)等效于count(*),但后者更短更快. (在此特定查询中,这甚至可以在不将user_id定义为NOT NULL的情况下应用.)
> id已经是主键,额外的UNIQUE约束是无用的镇流器.算了吧：

"treenode_pkey" PRIMARY KEY,btree (id)

  

  
   
  "treenode_id_key" UNIQUE CONSTRAINT,btree (id)

旁白：永远不要使用id作为列名.使用像treenode_id这样的描述性内容.

添加信息

问：有多少个不同的project_id和user_id？
答：不超过五个不同的project_id.

这意味着Postgres必须读取整个表的大约20％才能满足您的查询.除非它可以使用仅索引扫描,否则表上的顺序扫描将比涉及任何索引更快.除了通过优化表和服务器设置之外,没有更多的性能可以获得.

至于仅索引扫描：要查看效果如何,运行VACUUM ANALYZE甚至CLUSTER(使用上面的索引),如果你能负担得起(独占锁定表).然后再次尝试您的查询.现在应该只使用索引适度更快.首先阅读此相关答案：

> Optimize simple query using ORDER BY date and text

以及the manual page added with Postgres 9.6和Postgres Wiki on index-only scans.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!