sql – 在WHERE语句中使用OR进行慢速JOIN查询

发布时间：2020-12-12 07:26:55 所属栏目：MsSql教程来源：网络整理

导读：这是我的问题的一个简单示例： CREATE TABLE test1 (id SERIAL,key TEXT UNIQUE,value TEXT);CREATE TABLE test2 (id SERIAL,value TEXT);INSERT INTO test1 (key,value) SELECT i::TEXT,'ABC' || i::TEXT FROM generate_series(0,1000000) AS i;INSERT INTO

这是我的问题的一个简单示例：

CREATE TABLE test1 (id SERIAL,key TEXT UNIQUE,value TEXT);
CREATE TABLE test2 (id SERIAL,value TEXT);

INSERT INTO test1 (key,value) 
SELECT i::TEXT,'ABC' || i::TEXT 
FROM generate_series(0,1000000) AS i;

INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT 
FROM generate_series(0,600000) AS i;

INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT 
FROM generate_series(1000000,1200000) AS i;

CREATE INDEX test1_key ON test1 (key);
CREATE INDEX test1_value ON test1 (value);
CREATE INDEX test2_key ON test2 (key);
CREATE INDEX test2_value ON test2 (value);

VACUUM FULL VERBOSE ANALYZE test1;
VACUUM FULL VERBOSE ANALYZE test2;

这是我目前使用的查询,但需要超过6秒.

EXPLAIN ANALYZE 
SELECT test1.key AS key1,test1.value AS value1,test2.key AS key2,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234' OR test2.value = 'ABC1234';

 key1 | value1  | key2 | value2
------+---------+------+---------
 234  | ABC234  | 234  | ABC1234
 1234 | ABC1234 | 1234 | ABC2234
(2 rows)

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=27344.05..79728.10 rows=2 width=32) (actual time=5428.635..6097.098 rows=2 loops=1)
   Hash Cond: (test1.key = test2.key)
   Filter: ((test1.value = 'ABC1234'::text) OR (test2.value = 'ABC1234'::text))
   ->  Seq Scan on test1  (cost=0.00..16321.01 rows=1000001 width=15) (actual time=0.009..1057.315 rows=1000001 loops=1)
   ->  Hash  (cost=13047.02..13047.02 rows=800002 width=17) (actual time=2231.964..2231.964 rows=800002 loops=1)
         Buckets: 65536  Batches: 2  Memory Usage: 14551kB
         ->  Seq Scan on test2  (cost=0.00..13047.02 rows=800002 width=17) (actual time=0.010..980.232 rows=800002 loops=1)
 Total runtime: 6109.042 ms
(8 rows)

在这两个表中,只有极少数数据集符合要求,但似乎没有观察到这一事实.我可以改为使用这样的查询：

EXPLAIN ANALYZE 
SELECT coalesce(test1.key,test3.key1) AS key1,coalesce(test1.value,test3.value1) AS value1,coalesce(test2.key,test3.key2) AS key2,coalesce(test2.value,test3.value2) AS value2
FROM (SELECT test1.key AS key1,test2.value AS value2
      FROM (SELECT key,value FROM test1 WHERE value = 'ABC1234') AS test1
      FULL JOIN (SELECT key,value FROM test2 WHERE value = 'ABC1234') AS test2
      ON (test1.key = test2.key)) AS test3
LEFT OUTER JOIN test1 ON (test1.key = test3.key2)
LEFT OUTER JOIN test2 ON (test2.key = test3.key1)
WHERE test1.key IS NOT NULL;

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..33.56 rows=1 width=64) (actual time=0.075..0.083 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..25.19 rows=1 width=47) (actual time=0.066..0.072 rows=1 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..16.80 rows=1 width=32) (actual time=0.051..0.054 rows=1 loops=1)
               ->  Index Scan using test2_value_key on test2  (cost=0.00..8.41 rows=1 width=17) (actual time=0.026..0.027 rows=1 loops=1)
                     Index Cond: (value = 'ABC1234'::text)
               ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1)
                     Index Cond: (public.test1.key = public.test2.key)
                     Filter: (public.test1.value = 'ABC1234'::text)
         ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.011..0.013 rows=1 loops=1)
               Index Cond: ((public.test1.key IS NOT NULL) AND (public.test1.key = public.test2.key))
   ->  Index Scan using test2_key on test2  (cost=0.00..8.36 rows=1 width=17) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (public.test2.key = public.test1.key)
 Total runtime: 0.139 ms

以下查询更简单,但仍然太慢：

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
   OR EXISTS (SELECT 1 FROM test2 t WHERE t.key = test1.key AND t.value = 'ABC1234');

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.00..8446826.32 rows=500001 width=32) (actual time=615.706..1651.370 rows=2 loops=1)
   Merge Cond: (test1.key = test2.key)
   ->  Index Scan using test1_key on test1  (cost=0.00..8398983.25 rows=500001 width=15) (actual time=28.449..734.567 rows=2 loops=1)
         Filter: ((value = 'ABC1234'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
         SubPlan 1
           ->  Index Scan using test2_key on test2 t  (cost=0.00..8.36 rows=1 width=0) (never executed)
                 Index Cond: (key = $0)
                 Filter: (value = 'ABC1234'::text)
         SubPlan 2
           ->  Index Scan using test2_value on test2 t  (cost=0.00..8.37 rows=1 width=7) (actual time=0.376..0.380 rows=1 loops=1)
                 Index Cond: (value = 'ABC1234'::text)
   ->  Index Scan using test2_key on test2  (cost=0.00..39593.05 rows=800002 width=17) (actual time=0.019..498.456 rows=348894 loops=1)
 Total runtime: 1651.453 ms
(13 rows)

所以我的问题是：是否存在一个简单的查询,它将导致类似于第二个查询的类似快速执行计划,或者可能是规划者的索引或某种提示.

(我知道在这个例子中,只有一个表中包含两个值是合理的.但实际上表格更复杂,表格方案不能轻易改变.)

PostgreSQL Version: 9.0.3
shared_buffers = 64MB
effective_cache_size = 32MB
work_mem = 16MB
maintenance_work_mem = 32MB
temp_buffers = 8MB
wal_buffers= 1MB

编辑：根据Kipotlov的建议,这里是UNION版本.为什么正常的OR查询没有选择这么好的计划？

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
UNION
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test2.value = 'ABC1234';

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=33.64..33.66 rows=2 width=32) (actual time=0.114..0.119 rows=2 loops=1)
   ->  Sort  (cost=33.64..33.64 rows=2 width=32) (actual time=0.111..0.113 rows=2 loops=1)
         Sort Key: public.test1.key,public.test1.value,public.test2.key,public.test2.value
         Sort Method:  quicksort  Memory: 17kB
         ->  Append  (cost=0.00..33.63 rows=2 width=32) (actual time=0.046..0.097 rows=2 loops=1)
               ->  Nested Loop Left Join  (cost=0.00..16.81 rows=1 width=32) (actual time=0.044..0.050 rows=1 loops=1)
                     ->  Index Scan using test1_value_key on test1  (cost=0.00..8.44 rows=1 width=15) (actual time=0.023..0.024 rows=1 loops=1)
                           Index Cond: (value = 'ABC1234'::text)
                     ->  Index Scan using test2_key on test2  (cost=0.00..8.36 rows=1 width=17) (actual time=0.014..0.016 rows=1 loops=1)
                           Index Cond: (public.test1.key = public.test2.key)
               ->  Nested Loop  (cost=0.00..16.80 rows=1 width=32) (actual time=0.036..0.041 rows=1 loops=1)
                     ->  Index Scan using test2_value_key on test2  (cost=0.00..8.41 rows=1 width=17) (actual time=0.019..0.020 rows=1 loops=1)
                           Index Cond: (value = 'ABC1234'::text)
                     ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.013..0.015 rows=1 loops=1)
                           Index Cond: (public.test1.key = public.test2.key)
 Total runtime: 0.173 ms
(16 rows)

解决方法

首先,感谢非常详细的问题.在询问之前,很难找到研究过他们问题的人.

我一直在考虑这个问题,问题似乎是PostgreSQL希望加入所有行,因为test1中的每个非匹配行可能在test2中匹配 – 反之亦然.

解决方案是强制规划器分两步执行查询.一种方法是您已经尝试过的大型UNION查询 – 强制它在单独的查询中考虑每个表达式.

另一种方法是强制规划器首先找到匹配的密钥,然后执行连接,这样就不会有歧义：

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM (
    SELECT key FROM test1 WHERE value='ABC1234'
    UNION SELECT key FROM test2 WHERE value='ABC1234'
) AS matching_keys
INNER JOIN test1 USING (key)
LEFT OUTER JOIN test2 USING (key);

 Nested Loop Left Join  (cost=16.84..34.44 rows=2 width=32) (actual time=0.211..0.280 rows=2 loops=1)
   ->  Nested Loop  (cost=16.84..33.65 rows=2 width=15) (actual time=0.175..0.212 rows=2 loops=1)
         ->  Unique  (cost=16.84..16.85 rows=2 width=6) (actual time=0.132..0.136 rows=2 loops=1)
               ->  Sort  (cost=16.84..16.85 rows=2 width=6) (actual time=0.131..0.132 rows=2 loops=1)
                     Sort Key: public.test1.key
                     Sort Method: quicksort  Memory: 25kB
                     ->  Append  (cost=0.00..16.83 rows=2 width=6) (actual time=0.058..0.110 rows=2 loops=1)
                           ->  Index Scan using test1_value on test1  (cost=0.00..8.42 rows=1 width=6) (actual time=0.056..0.058 rows=1 loops=1)
                                 Index Cond: (value = 'ABC1234'::text)
                           ->  Index Scan using test2_value on test2  (cost=0.00..8.39 rows=1 width=7) (actual time=0.046..0.047 rows=1 loops=1)
                                 Index Cond: (value = 'ABC1234'::text)
         ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.032..0.033 rows=1 loops=2)
               Index Cond: (key = public.test1.key)
   ->  Index Scan using test2_key on test2  (cost=0.00..0.38 rows=1 width=17) (actual time=0.028..0.029 rows=1 loops=2)
         Index Cond: (public.test1.key = key)
 Total runtime: 0.390 ms
(16 rows)

同样,UNION扮演OR的角色.不幸的是,对于像>’ABC1234’这样的查询,这种方法仍然表现不佳.你可以通过提高work_mem来改善它.我在这里不知所措.

至于你的上一个问题：

Why does the normal OR query not choose such a good plan?

因为PostgreSQL规划器目前缺乏将OR’ed表达式拆分为单独的UNION查询的能??力.有一些警告使它比看起来更难.

PostgreSQL规划器已经相当精细,但到目前为止,利用已经可以手动重写SQL的优化并不是一个重要的优先事项.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!