加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 站长学院 > MsSql教程 > 正文

sql – 在WHERE语句中使用OR进行慢速JOIN查询

发布时间:2020-12-12 07:26:55 所属栏目:MsSql教程 来源:网络整理
导读:这是我的问题的一个简单示例: CREATE TABLE test1 (id SERIAL,key TEXT UNIQUE,value TEXT);CREATE TABLE test2 (id SERIAL,value TEXT);INSERT INTO test1 (key,value) SELECT i::TEXT,'ABC' || i::TEXT FROM generate_series(0,1000000) AS i;INSERT INTO
这是我的问题的一个简单示例:
CREATE TABLE test1 (id SERIAL,key TEXT UNIQUE,value TEXT);
CREATE TABLE test2 (id SERIAL,value TEXT);

INSERT INTO test1 (key,value) 
SELECT i::TEXT,'ABC' || i::TEXT 
FROM generate_series(0,1000000) AS i;

INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT 
FROM generate_series(0,600000) AS i;

INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT 
FROM generate_series(1000000,1200000) AS i;

CREATE INDEX test1_key ON test1 (key);
CREATE INDEX test1_value ON test1 (value);
CREATE INDEX test2_key ON test2 (key);
CREATE INDEX test2_value ON test2 (value);

VACUUM FULL VERBOSE ANALYZE test1;
VACUUM FULL VERBOSE ANALYZE test2;

这是我目前使用的查询,但需要超过6秒.

EXPLAIN ANALYZE 
SELECT test1.key AS key1,test1.value AS value1,test2.key AS key2,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234' OR test2.value = 'ABC1234';

 key1 | value1  | key2 | value2
------+---------+------+---------
 234  | ABC234  | 234  | ABC1234
 1234 | ABC1234 | 1234 | ABC2234
(2 rows)

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=27344.05..79728.10 rows=2 width=32) (actual time=5428.635..6097.098 rows=2 loops=1)
   Hash Cond: (test1.key = test2.key)
   Filter: ((test1.value = 'ABC1234'::text) OR (test2.value = 'ABC1234'::text))
   ->  Seq Scan on test1  (cost=0.00..16321.01 rows=1000001 width=15) (actual time=0.009..1057.315 rows=1000001 loops=1)
   ->  Hash  (cost=13047.02..13047.02 rows=800002 width=17) (actual time=2231.964..2231.964 rows=800002 loops=1)
         Buckets: 65536  Batches: 2  Memory Usage: 14551kB
         ->  Seq Scan on test2  (cost=0.00..13047.02 rows=800002 width=17) (actual time=0.010..980.232 rows=800002 loops=1)
 Total runtime: 6109.042 ms
(8 rows)

在这两个表中,只有极少数数据集符合要求,但似乎没有观察到这一事实.我可以改为使用这样的查询:

EXPLAIN ANALYZE 
SELECT coalesce(test1.key,test3.key1) AS key1,coalesce(test1.value,test3.value1) AS value1,coalesce(test2.key,test3.key2) AS key2,coalesce(test2.value,test3.value2) AS value2
FROM (SELECT test1.key AS key1,test2.value AS value2
      FROM (SELECT key,value FROM test1 WHERE value = 'ABC1234') AS test1
      FULL JOIN (SELECT key,value FROM test2 WHERE value = 'ABC1234') AS test2
      ON (test1.key = test2.key)) AS test3
LEFT OUTER JOIN test1 ON (test1.key = test3.key2)
LEFT OUTER JOIN test2 ON (test2.key = test3.key1)
WHERE test1.key IS NOT NULL;

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..33.56 rows=1 width=64) (actual time=0.075..0.083 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..25.19 rows=1 width=47) (actual time=0.066..0.072 rows=1 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..16.80 rows=1 width=32) (actual time=0.051..0.054 rows=1 loops=1)
               ->  Index Scan using test2_value_key on test2  (cost=0.00..8.41 rows=1 width=17) (actual time=0.026..0.027 rows=1 loops=1)
                     Index Cond: (value = 'ABC1234'::text)
               ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1)
                     Index Cond: (public.test1.key = public.test2.key)
                     Filter: (public.test1.value = 'ABC1234'::text)
         ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.011..0.013 rows=1 loops=1)
               Index Cond: ((public.test1.key IS NOT NULL) AND (public.test1.key = public.test2.key))
   ->  Index Scan using test2_key on test2  (cost=0.00..8.36 rows=1 width=17) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (public.test2.key = public.test1.key)
 Total runtime: 0.139 ms

以下查询更简单,但仍然太慢:

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
   OR EXISTS (SELECT 1 FROM test2 t WHERE t.key = test1.key AND t.value = 'ABC1234');

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.00..8446826.32 rows=500001 width=32) (actual time=615.706..1651.370 rows=2 loops=1)
   Merge Cond: (test1.key = test2.key)
   ->  Index Scan using test1_key on test1  (cost=0.00..8398983.25 rows=500001 width=15) (actual time=28.449..734.567 rows=2 loops=1)
         Filter: ((value = 'ABC1234'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
         SubPlan 1
           ->  Index Scan using test2_key on test2 t  (cost=0.00..8.36 rows=1 width=0) (never executed)
                 Index Cond: (key = $0)
                 Filter: (value = 'ABC1234'::text)
         SubPlan 2
           ->  Index Scan using test2_value on test2 t  (cost=0.00..8.37 rows=1 width=7) (actual time=0.376..0.380 rows=1 loops=1)
                 Index Cond: (value = 'ABC1234'::text)
   ->  Index Scan using test2_key on test2  (cost=0.00..39593.05 rows=800002 width=17) (actual time=0.019..498.456 rows=348894 loops=1)
 Total runtime: 1651.453 ms
(13 rows)

所以我的问题是:是否存在一个简单的查询,它将导致类似于第二个查询的类似快速执行计划,或者可能是规划者的索引或某种提示.

(我知道在这个例子中,只有一个表中包含两个值是合理的.但实际上表格更复杂,表格方案不能轻易改变.)

PostgreSQL Version: 9.0.3
shared_buffers = 64MB
effective_cache_size = 32MB
work_mem = 16MB
maintenance_work_mem = 32MB
temp_buffers = 8MB
wal_buffers= 1MB

编辑:根据Kipotlov的建议,这里是UNION版本.为什么正常的OR查询没有选择这么好的计划?

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
UNION
SELECT test1.key AS key1,test2.value AS value2
FROM test1 
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test2.value = 'ABC1234';

 key1 | value1  | key2 | value2
------+---------+------+---------
 1234 | ABC1234 | 1234 | ABC2234
 234  | ABC234  | 234  | ABC1234
(2 rows)

                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=33.64..33.66 rows=2 width=32) (actual time=0.114..0.119 rows=2 loops=1)
   ->  Sort  (cost=33.64..33.64 rows=2 width=32) (actual time=0.111..0.113 rows=2 loops=1)
         Sort Key: public.test1.key,public.test1.value,public.test2.key,public.test2.value
         Sort Method:  quicksort  Memory: 17kB
         ->  Append  (cost=0.00..33.63 rows=2 width=32) (actual time=0.046..0.097 rows=2 loops=1)
               ->  Nested Loop Left Join  (cost=0.00..16.81 rows=1 width=32) (actual time=0.044..0.050 rows=1 loops=1)
                     ->  Index Scan using test1_value_key on test1  (cost=0.00..8.44 rows=1 width=15) (actual time=0.023..0.024 rows=1 loops=1)
                           Index Cond: (value = 'ABC1234'::text)
                     ->  Index Scan using test2_key on test2  (cost=0.00..8.36 rows=1 width=17) (actual time=0.014..0.016 rows=1 loops=1)
                           Index Cond: (public.test1.key = public.test2.key)
               ->  Nested Loop  (cost=0.00..16.80 rows=1 width=32) (actual time=0.036..0.041 rows=1 loops=1)
                     ->  Index Scan using test2_value_key on test2  (cost=0.00..8.41 rows=1 width=17) (actual time=0.019..0.020 rows=1 loops=1)
                           Index Cond: (value = 'ABC1234'::text)
                     ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.013..0.015 rows=1 loops=1)
                           Index Cond: (public.test1.key = public.test2.key)
 Total runtime: 0.173 ms
(16 rows)

解决方法

首先,感谢非常详细的问题.在询问之前,很难找到研究过他们问题的人.

我一直在考虑这个问题,问题似乎是PostgreSQL希望加入所有行,因为test1中的每个非匹配行可能在test2中匹配 – 反之亦然.

解决方案是强制规划器分两步执行查询.一种方法是您已经尝试过的大型UNION查询 – 强制它在单独的查询中考虑每个表达式.

另一种方法是强制规划器首先找到匹配的密钥,然后执行连接,这样就不会有歧义:

EXPLAIN ANALYZE
SELECT test1.key AS key1,test2.value AS value2
FROM (
    SELECT key FROM test1 WHERE value='ABC1234'
    UNION SELECT key FROM test2 WHERE value='ABC1234'
) AS matching_keys
INNER JOIN test1 USING (key)
LEFT OUTER JOIN test2 USING (key);

 Nested Loop Left Join  (cost=16.84..34.44 rows=2 width=32) (actual time=0.211..0.280 rows=2 loops=1)
   ->  Nested Loop  (cost=16.84..33.65 rows=2 width=15) (actual time=0.175..0.212 rows=2 loops=1)
         ->  Unique  (cost=16.84..16.85 rows=2 width=6) (actual time=0.132..0.136 rows=2 loops=1)
               ->  Sort  (cost=16.84..16.85 rows=2 width=6) (actual time=0.131..0.132 rows=2 loops=1)
                     Sort Key: public.test1.key
                     Sort Method: quicksort  Memory: 25kB
                     ->  Append  (cost=0.00..16.83 rows=2 width=6) (actual time=0.058..0.110 rows=2 loops=1)
                           ->  Index Scan using test1_value on test1  (cost=0.00..8.42 rows=1 width=6) (actual time=0.056..0.058 rows=1 loops=1)
                                 Index Cond: (value = 'ABC1234'::text)
                           ->  Index Scan using test2_value on test2  (cost=0.00..8.39 rows=1 width=7) (actual time=0.046..0.047 rows=1 loops=1)
                                 Index Cond: (value = 'ABC1234'::text)
         ->  Index Scan using test1_key on test1  (cost=0.00..8.38 rows=1 width=15) (actual time=0.032..0.033 rows=1 loops=2)
               Index Cond: (key = public.test1.key)
   ->  Index Scan using test2_key on test2  (cost=0.00..0.38 rows=1 width=17) (actual time=0.028..0.029 rows=1 loops=2)
         Index Cond: (public.test1.key = key)
 Total runtime: 0.390 ms
(16 rows)

同样,UNION扮演OR的角色.不幸的是,对于像>’ABC1234’这样的查询,这种方法仍然表现不佳.你可以通过提高work_mem来改善它.我在这里不知所措.

至于你的上一个问题:

Why does the normal OR query not choose such a good plan?

因为PostgreSQL规划器目前缺乏将OR’ed表达式拆分为单独的UNION查询的能??力.有一些警告使它比看起来更难.

PostgreSQL规划器已经相当精细,但到目前为止,利用已经可以手动重写SQL的优化并不是一个重要的优先事项.

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读