sql – 在另一个表中没有匹配的DELETE记录
有两个表链接到一个id:
item_tbl (id) link_tbl (item_id) 在item_tbl中有一些记录在link_tbl中没有匹配的行.将会计算其数量的选择将是: SELECT COUNT(*) FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id WHERE itm.id IS NULL 我想从link_tbl中删除那些孤立的记录(那些没有匹配的记录),但我想到的唯一方法是: DELETE FROM link_tbl lnk WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm) 有 EXPLAIN DELETE FROM link_tbl lnk WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm) 返回: Delete on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
-> Seq Scan on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..79298.10 rows=3063207 width=4)
-> Seq Scan on item itm (cost=0.00..52016.07 rows=3063207 width=4)
问题是: >有没有更好的方法如何从link_tbl删除孤立记录? >编辑:根据Erwin Brandstetter评论修正. > shared_buffers = 368MB 解析度: 感谢大家的意见,这是非常有帮助的.我终于使用了Erwin Brandstetter https://stackoverflow.com/a/15959896/1331340建议的删除,但我调整了一点: DELETE FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 10000
AND lnk.item_id NOT IN (SELECT itm.id FROM item itm
WHERE itm.id BETWEEN 0 AND 10000)
我比较了NOT IN和NOT EXISTS的结果,输出结果如下,尽管我使用COUNT而不是DELETE,我认为应该是相同的(我的意思是为了相对比较): EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 20000
AND lnk.item_id NOT IN (SELECT itm.id
FROM item_tbl itm
WHERE itm.id BETWEEN 0 AND 20000);
QUERY PLAN
Aggregate (cost=6002667.56..6002667.57 rows=1 width=0) (actual time=226817.086..226817.088 rows=1 loops=1)
-> Seq Scan on link_tbl lnk (cost=1592.50..5747898.65 rows=101907564 width=0) (actual time=206.029..225289.570 rows=566625 loops=1)
Filter: ((item_id >= 0) AND (item_id <= 20000) AND (NOT (hashed SubPlan 1)))
SubPlan 1
-> Index Scan using item_tbl_pkey on item_tbl itm (cost=0.00..1501.95 rows=36221 width=4) (actual time=0.056..99.266 rows=17560 loops=1)
Index Cond: ((id >= 0) AND (id <= 20000))
Total runtime: 226817.211 ms
EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk WHERE lnk.item_id>0 AND lnk.item_id<20000
AND NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id=lnk.item_id);
QUERY PLAN
Aggregate (cost=8835772.00..8835772.01 rows=1 width=0)
(actual time=1209235.133..1209235.135 rows=1 loops=1)
-> Hash Anti Join (cost=102272.16..8835771.99 rows=1 width=0)
(actual time=19315.170..1207900.612 rows=566534 loops=1)
Hash Cond: (lnk.item_id = itm.id)
-> Seq Scan on link_tbl lnk (cost=0.00..5091076.55 rows=203815128 width=4) (actual time=0.016..599147.604 rows=200301872 loops=1)
Filter: ((item_id > 0) AND (item_id < 20000))
-> Hash (cost=52016.07..52016.07 rows=3063207 width=4) (actual time=19313.976..19313.976 rows=3033811 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 26672kB
-> Seq Scan on item_tbl itm (cost=0.00..52016.07 rows=3063207 width=4) (actual time=0.013..9274.158 rows=3033811 loops=1)
Total runtime: 1209260.228 ms
NOT EXISTS慢5倍. 实际删除的数据并不需要,只要我担心,我能够以5批(10000-20000,20000-100000,100000-200000,200000-1000000和1000000-1755441)删除它.起初我发现最大的item_id,我只需要经过一半的桌子. 当我尝试没有IN或EXISTS没有范围(选择计数)它甚至没有完成,我让它运行在夜间,它仍然在早上运行. 我想我正在用野猫寻求https://stackoverflow.com/a/15988033/1331340的DELETE,但是来得太晚了. DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;
解决方法我对四个典型查询进行了基准测试,对于{work_mem,effective_cache_size,random_page_cost}的不同设置,这些设置对所选计划的影响最大.我首先使用我的默认设置进行“运行”以加快缓存.注意:测试集足够小以允许所有需要的页面存在于缓存中. 测试集 SET search_path=tmp;
/************************/
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE one
( id SERIAL NOT NULL PRIMARY KEY,payload varchar
);
CREATE TABLE two
( id SERIAL NOT NULL PRIMARY KEY,one_id INTEGER REFERENCES one,payload varchar
);
INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload FROM two;
VACUUM ANALYZE one;
VACUUM ANALYZE two;
/***************/
查询: echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
WHERE t.one_id = o.id
);
echo NOT IN()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
;
echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;
echo USING (subquery self WHERE NOT EXISTS(two)))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
WHERE NOT EXISTS ( SELECT *
FROM two t WHERE t.one_id = o2.id
)
) sq
WHERE sq.id = o.id
;
结果(总结) NOT EXISTS() NOT IN() USING(LEFT JOIN NULL) USING(NOT EXISTS) 1) rpc=4.0.csz=1M wmm=64 80.358 14389.026 77.620 72.917 2) rpc=4.0.csz=1M wmm=64000 60.527 69.104 51.851 51.004 3) rpc=1.5.csz=1M wmm=64 69.804 10758.480 80.402 77.356 4) rpc=1.5.csz=1M wmm=64000 50.872 69.366 50.763 53.339 5) rpc=4.0.csz=1G wmm=64 84.117 7625.792 69.790 69.627 6) rpc=4.0.csz=1G wmm=64000 49.964 67.018 49.968 49.380 7) rpc=1.5.csz=1G wmm=64 68.567 3650.008 70.283 69.933 8) rpc=1.5.csz=1G wmm=64000 49.800 67.298 50.116 50.345 legend: rpc := "random_page_cost" csz := "effective_cache_size" wmm := "work_mem" 如你所见,NOT IN()变体对于work_mem的不足非常敏感.同意,设置64(KB)非常低,但这个“或多或少”对应于大型数据集,也不适用于哈希表. EXTRA:在暖身阶段,NOT EXISTS()查询遭受极端的FK触发争用.这是因为与真空脱水机的冲突的结果,在桌面设置后仍然活跃. PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu,compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1,64-bit
NOT EXISTS()
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Delete on one o (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
-> Hash Anti Join (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
Hash Cond: (o.id = t.one_id)
-> Seq Scan on one o (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
-> Hash (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1015kB
-> Seq Scan on two t (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
Total runtime: 467824.652 ms
(9 rows) (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |








