sql – 当索引扫描是更好的选择时Postgres不使用索引
我有一个简单的查询来连接两个非常慢的表.我发现查询计划对大表email_activities(~10m行)执行seq扫描,而我认为使用嵌套循环的索引实际上会更快.
我使用子查询重写了查询,试图强制使用索引,然后注意到一些有趣的东西.如果您查看下面的两个查询计划,您将看到当我将子查询的结果集限制为43k时,查询计划确实使用了email_activities上的索引,而将子查询中的限制设置为甚至44k将导致查询计划使用seq扫描email_activities.一个显然比另一个更有效,但Postgres似乎并不关心. 什么可能导致这个?如果其中一个集合大于特定大小,它是否在某处强制使用散列连接? explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1) -> Nested Loop (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1) -> HashAggregate (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1) -> Limit (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1) -> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1) Index Cond: (email_campaign_id = 1607) -> Index Only Scan using index_email_activities_on_email_recipient_id on email_activities (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000) Index Cond: (email_recipient_id = email_recipients.id) Heap Fetches: 40789 Total runtime: 224.675 ms 和: explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1) -> Hash Semi Join (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1) Hash Cond: (email_activities.email_recipient_id = email_recipients.id) -> Seq Scan on email_activities (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1) -> Hash (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1) Buckets: 8192 Batches: 1 Memory Usage: 1758kB -> Limit (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1) -> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1) Index Cond: (email_campaign_id = 1607) Total runtime: 3050.660 ms >版本:x86_64-unknown-linux-gnu上的PostgreSQL 9.3.10,由gcc编译(Ubuntu / Linaro 4.6.3-1ubuntu5)4.6.3,64位 解决方法索引扫描 – >位图索引扫描 – >顺序扫描对于少数行,运行索引扫描是值得的.如果要返回更多行(表的百分比更高,并且取决于数据分布,值频率和行宽),则更有可能在一个数据页上找到多行.然后切换到位图索引扫描是值得的.无论如何,一旦必须访问大部分数据页,运行顺序扫描,过滤剩余行并完全跳过索引的开销会更便宜. Postgres切换到顺序扫描,期望找到rows = 263962,这已经是整个表的3%. (虽然实际上只找到了行= 47935,但请参见下文.) 更多相关答案: > Efficient PostgreSQL query on timestamp using index or bitmap index scan? 小心强制查询计划 您不能直接在Postgres中强制使用某个计划程序方法,但是出于调试目的,您可以使其他方法看起来非常昂贵.请参见手册中的Planner Method Configuration. SET enable_seqscan = off(如另一个答案中建议的那样)对顺序扫描执行此操作.但这仅用于在会话中进行调试.除非您确切知道自己在做什么,否则不要将其用作生产中的常规设置.它可以强制荒谬的查询计划. Quoting the manual:
这已经是你需要的大部分建议了. > Keep PostgreSQL from sometimes choosing a bad query plan 在这种特殊情况下,Postgres预计email_activities.email_recipient_id的点击次数比实际发现次数多5-6次:
如果您经常运行此查询,则需要让ANALYZE查看更大的样本以获得有关特定列的更准确统计信息.你的桌子很大(约10M行),所以要: ALTER TABLE email_activities ALTER COLUMN email_recipient_id SET STATISTICS 3000; -- max 10000,default 100 然后ANALYZE email_activities; 最后的手段 在极少数情况下,您可能会在单独的事务中或在具有自己环境的函数中强制使用SET LOCAL enable_seqscan = off的索引.喜欢: CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int,_limit int) RETURNS bigint AS $func$ SELECT COUNT(DISTINCT a.email_recipient_id) FROM email_activities a WHERE a.email_recipient_id IN ( SELECT id FROM email_recipients WHERE email_campaign_id = $1 LIMIT $2) -- or consider query below $func$ LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off; 该设置仅适用于函数的本地范围. 警告:这只是一个概念证明.从长远来看,即使是这种不那么激进的人工干预也可能会让你感到痛苦.基数,价值频率,您的架构,全局Postgres设置,一切都随着时间的推移而变化.您将升级到新的Postgres版本.您现在强制执行的查询计划可能会在以后成为一个非常糟糕的主意. 通常,这只是解决您的设置问题的方法.更好地找到并修复它. 替代查询 问题中缺少基本信息,但这个等效查询可能更快,更可能使用索引(email_recipient_id) – 对于更大的LIMIT越来越多. SELECT COUNT(*) AS ct FROM ( SELECT id FROM email_recipients WHERE email_campaign_id = 1607 LIMIT 43000 ) r WHERE EXISTS ( SELECT 1 FROM email_activities WHERE email_recipient_id = r.id); (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |
- ArcCatalog创建FeatureClass提示“Database user
- sql-server – SQL Server支持哪种版本的.NET框架
- Big impact when set property type in-properly
- 强烈推荐 SQL 编程助手 SQL Assistant
- sql – SELECT COUNT(*);
- sqlserver 各种判断是否存在(表名、函数、存储过
- sqlserver2005/2008----JDBC连接数据库方法
- SqlServer 2012 让人蛋疼的日志文件
- linux系统下安装配置解压版的MySQL数据库图解
- SqlServer数据类型、C#SqlDbType对应关系及转换