Aggregate queries over a 50M-row table in PostgreSQL
Problem statement
I have a table "event_statistics", defined as follows:

CREATE TABLE public.event_statistics (
    id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass),
    client_id int4 NULL,
    session_id int4 NULL,
    action_name text NULL,
    value text NULL,
    product_id int8 NULL,
    product_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    url text NULL,
    url_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    visit int4 NULL DEFAULT 0,
    date_update timestamptz NULL,
    CONSTRAINT event_statistics_pkey PRIMARY KEY (id),
    CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for
        FOREIGN KEY (client_id, session_id) REFERENCES <?>() ON DELETE CASCADE ON UPDATE CASCADE
)
WITH (OIDS=FALSE);

CREATE INDEX regdate ON public.event_statistics (date_update timestamptz_ops);

and a table "clients":

CREATE TABLE public.clients (
    id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass),
    client_name text NULL,
    client_hash text NULL,
    CONSTRAINT clients_pkey PRIMARY KEY (id)
)
WITH (OIDS=FALSE);

CREATE INDEX clients_client_name_idx ON public.clients (client_name text_ops);

What I need is the count of events in "event_statistics" for a given "date_update" range, per "action_name" type and per time step, for one specific client. The goal is a dashboard on our site showing statistics of all relevant events for each client, with a selectable report period; depending on the length of the interval, the step in the chart should differ, e.g.:

> current day – counts per hour;

What I did:

SELECT t.date, A.actionName, count(E.id)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as t(date)
CROSS JOIN (VALUES ('page_open'), ('product_add'), ('product_buy'), ('product_event'),
            ('product_favourite'), ('product_open'), ('product_share'), ('session_start')) as A(actionName)
LEFT JOIN (
    SELECT action_name, date_trunc('day', e.date_update) as dateTime, e.id
    FROM event_statistics as e
    WHERE e.client_id = (SELECT id FROM clients as c WHERE c.client_name = 'client name')
      AND (date_update BETWEEN (current_date - interval '1 week') AND now())
) E ON t.date = E.dateTime AND A.actionName = E.action_name
GROUP BY A.actionName, t.date
ORDER BY A.actionName, t.date;

Counting events by type and date for the last week already takes too long, more than 10 seconds. I need to be able to do the same much faster, and for longer periods (weeks, months, years) with different grouping intervals (hourly for a day, daily for a month, then weeks, months).
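The varying chart steps described above (hourly for a day, daily for a month, and so on) can be produced from one query shape by using the same unit in both generate_series and date_trunc. A minimal sketch, not part of the original question; ":step", ":range" and ":client_id" are hypothetical placeholders substituted by the application (e.g. step = 'hour' with range = '1 day', or step = 'day' with range = '1 month'):

```sql
-- Sketch: one parameterized query for all granularities.
-- :step    -- bucket unit, e.g. 'hour', 'day', 'week', 'month'
-- :range   -- report span as an interval literal, e.g. '1 week'
SELECT t.bucket, e.action_name, count(e.id)
FROM generate_series(date_trunc(:step, now()) - :range::interval,
                     now(),
                     ('1 ' || :step)::interval) AS t(bucket)
LEFT JOIN event_statistics e
       ON date_trunc(:step, e.date_update) = t.bucket
      AND e.client_id = :client_id
GROUP BY t.bucket, e.action_name
ORDER BY t.bucket, e.action_name;
```

This only normalizes the query shape; it does not by itself solve the performance problem discussed below.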
Query plan:

GroupAggregate  (cost=171937.16..188106.84 rows=1600 width=44)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on clients c  (cost=0.00..1.07 rows=1 width=4)
          Filter: (client_name = 'client name'::text)
  ->  Merge Left Join  (cost=171936.08..183784.31 rows=574060 width=44)
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40)
              Sort Key: "*VALUES*".column1, t.date
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40)
                    ->  Function Scan on generate_series t  (cost=0.02..10.02 rows=1000 width=8)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32)
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32)
        ->  Materialize  (cost=171307.32..171881.38 rows=114812 width=24)
              ->  Sort  (cost=171307.32..171594.35 rows=114812 width=24)
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    ->  Index Scan using regdate on event_statistics e  (cost=0.57..159302.49 rows=114812 width=24)
                          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now()))
                          Filter: (client_id = $0)

The "event_statistics" table has more than 50 million rows, only grows as clients are added, and its records are never changed. I have tried many different query plans and indexes, but could not reach an acceptable speed when aggregating over wider date ranges. Among other things I tried:

> partitioning by client_id or by date range.

I also tried a btree index on event_statistics (client_id asc, action_name asc, date_update asc, id). With index-only scans it was faster, but still not fast enough, and it was also not great in terms of disk space usage.

What is the best way to solve this?
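The partitioning attempt mentioned above could look like the following under PostgreSQL 10+ declarative partitioning (older versions need table inheritance plus triggers instead). This is a hedged sketch, not the poster's actual setup; the table name "event_statistics_p" and the quarterly bounds are illustrative:

```sql
-- Sketch: range-partition the events by date_update so that a dashboard
-- query over "last week" / "last month" only touches a few partitions.
CREATE TABLE event_statistics_p (
    LIKE event_statistics INCLUDING DEFAULTS
) PARTITION BY RANGE (date_update);

-- One partition per quarter (bounds are examples only):
CREATE TABLE event_statistics_2017_q1 PARTITION OF event_statistics_p
    FOR VALUES FROM ('2017-01-01') TO ('2017-04-01');

-- Per-partition index matching the access pattern (client, then time, then action):
CREATE INDEX ON event_statistics_2017_q1 (client_id, date_update, action_name);
```

With the range predicate on date_update in the WHERE clause, the planner can prune partitions outside the report period, which keeps index size per partition small.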
Update

As requested, the output of EXPLAIN (ANALYZE, VERBOSE):

GroupAggregate  (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1)
  Output: t.date, "*VALUES*".column1, count(e.id)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on public.clients c  (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1)
          Output: c.id
          Filter: (c.client_name = 'client name'::text)
          Rows Removed by Filter: 5
  ->  Merge Left Join  (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1)
        Output: t.date, "*VALUES*".column1, e.id
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1)
              Output: t.date, "*VALUES*".column1
              Sort Key: "*VALUES*".column1, t.date
              Sort Method: quicksort  Memory: 30kB
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1)
                    Output: t.date, "*VALUES*".column1
                    ->  Function Scan on pg_catalog.generate_series t  (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1)
                          Output: t.date
                          Function Call: generate_series(((('now'::cstring)::date - '7 days'::interval))::timestamp with time zone, now(), '1 day'::interval)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8)
                          Output: "*VALUES*".column1
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1)
                                Output: "*VALUES*".column1
        ->  Materialize  (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1)
              Output: e.id, (date_trunc('day'::text, e.date_update)), e.action_name, e.date_update
              ->  Sort  (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1)
                    Output: e.id, (date_trunc('day'::text, e.date_update)), e.action_name, e.date_update
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Method: external merge  Disk: 39080kB
                    ->  Index Scan using regdate on public.event_statistics e  (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1)
                          Output: e.id, date_trunc('day'::text, e.date_update)
                          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now()))
                          Filter: (e.client_id = $0)
                          Rows Removed by Filter: 2983424
Planning time: 7.278 ms
Execution time: 54708.041 ms

Solution
Step one: perform the pre-aggregation in a subquery:
EXPLAIN
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday) -- calendar pseudo-table
CROSS JOIN (VALUES ('page_open'), ('session_start')) AS act(action_name)
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) AS theday, COUNT(DISTINCT es.id) AS the_count
    FROM event_statistics AS es
    WHERE es.client_id = (SELECT c.id FROM clients AS c WHERE c.client_name = 'client name')
      AND (es.date_update BETWEEN (current_date - interval '1 week') AND now())
    GROUP BY 1, 2
) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name, cal.theday
ORDER BY act.action_name, cal.theday;

Next step: put the VALUES into a CTE and reference it from the aggregating subquery.

EXPLAIN
WITH act(action_name) AS (VALUES ('page_open'), ('session_start'))
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
CROSS JOIN act
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) AS theday, COUNT(DISTINCT es.id) AS the_count
    FROM event_statistics AS es
    WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
      AND EXISTS (SELECT * FROM clients cli WHERE cli.id = es.client_id AND cli.client_name = 'client name')
      AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name)
    GROUP BY 1, 2
) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name, cal.theday
ORDER BY act.action_name, cal.theday;

Update: using a physical (temp) table will lead to better estimates.
-- Final attempt: materialize the cartesian product (timeseries * action_name)
-- into a temp table
CREATE TEMP TABLE grid AS (
    SELECT act.action_name, cal.theday
    FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
    CROSS JOIN (VALUES ('page_open'), ('session_start')) act(action_name)
);

-- Index will force statistics to be collected,
-- and will generate better estimates for the numbers of rows
CREATE UNIQUE INDEX ON grid(action_name, theday);
CREATE INDEX iii ON event_statistics (action_name, date_update);
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count
FROM grid
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) AS theday, COUNT(*) AS the_count
    FROM event_statistics AS es
    WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
      AND EXISTS (SELECT * FROM clients cli WHERE cli.id = es.client_id AND cli.client_name = 'client name')
   -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name)
    GROUP BY 1, 2
    ORDER BY 1, 2 --nonsense!
) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name
GROUP BY grid.action_name, grid.theday
ORDER BY grid.action_name, grid.theday;

Update #3 (sorry: I created the indexes on the base table here, you will need to edit that. I also dropped a column from the index on the timestamp.)

-- attempt#4:
-- - materialize the cartesian product (timeseries * action_name)
-- - sanitize the date interval logic
CREATE TEMP TABLE grid AS (
    SELECT act.action_name, cal.theday::date
    FROM generate_series(current_date - interval '1 week', current_date, interval '1 day') AS cal(theday)
    CROSS JOIN (VALUES ('page_open'), ('session_start')) act(action_name)
);

-- Index will force statistics to be collected,
-- and will generate better estimates for the numbers of rows
-- CREATE UNIQUE INDEX ON grid(action_name, theday);
-- CREATE INDEX iii ON event_statistics (action_name, date_update);
CREATE UNIQUE INDEX ON grid(theday, action_name);
CREATE INDEX iii ON event_statistics (date_update, action_name);
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT gr.action_name, gr.theday, COUNT(*) AS the_count
FROM grid gr
LEFT JOIN event_statistics AS es
       ON es.action_name = gr.action_name
      AND date_trunc('day', es.date_update)::date = gr.theday
      AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date
JOIN clients cli ON cli.id = es.client_id AND cli.client_name = 'client name'
GROUP BY gr.action_name, gr.theday
ORDER BY 1, 2;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=8.33..8.35 rows=1 width=17)
  Group Key: gr.action_name, gr.theday
  ->  Sort  (cost=8.33..8.34 rows=1 width=17)
        Sort Key: gr.action_name, gr.theday
        ->  Nested Loop  (cost=1.40..8.33 rows=1 width=17)
              ->  Nested Loop  (cost=1.31..7.78 rows=1 width=40)
                    Join Filter: (es.client_id = cli.id)
                    ->  Index Scan using clients_client_name_key on clients cli  (cost=0.09..2.30 rows=1 width=4)
                          Index Cond: (client_name = 'client name'::text)
                    ->  Bitmap Heap Scan on event_statistics es  (cost=1.22..5.45 rows=5 width=44)
                          Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
                          ->  Bitmap Index Scan on iii  (cost=0.00..1.22 rows=5 width=0)
                                Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
              ->  Index Only Scan using grid_theday_action_name_idx on grid gr  (cost=0.09..0.54 rows=1 width=17)
                    Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name))
(15 rows)
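Since the question states that rows are never updated once inserted, one further option, not part of the original answer and offered only as a hedged sketch, is to maintain a daily rollup and let the dashboard aggregate that much smaller table (assumption: slightly stale counts between refreshes are acceptable; the view name "event_daily_counts" is illustrative):

```sql
-- Sketch: precompute per-client / per-action daily counts once;
-- weekly, monthly and yearly reports then sum a few thousand rows
-- instead of scanning 50M events.
CREATE MATERIALIZED VIEW event_daily_counts AS
SELECT client_id,
       action_name,
       date_trunc('day', date_update)::date AS theday,
       count(*) AS the_count
FROM event_statistics
GROUP BY 1, 2, 3;

-- Unique index: speeds up lookups and is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON event_daily_counts (client_id, action_name, theday);

-- Refresh periodically, e.g. from cron:
REFRESH MATERIALIZED VIEW CONCURRENTLY event_daily_counts;
```

Hourly buckets for the "current day" view would still need to hit the base table, but only for a one-day range, which the indexes discussed above handle well.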