sql – 连续重复/重复的有序计数
我非常怀疑我是以最有效的方式做到这一点,这就是我在这里标记plpgsql的原因.对于一千个测量系统,我需要在20亿行上运行它.
您有测量系统,当它们失去连接时经常报告先前的值,并且它们经常失去连接,但有时会长时间失去连接.您需要聚合,但是当您这样做时,您需要查看它重复多长时间并根据该信息制作各种过滤器.假设你在汽车上测量mpg,但它在20英里/加仑的速度下停留一小时,而不是在20.1左右,依此类推.你会想要在卡住时评估准确性.您还可以放置一些替代规则来查找汽车在高速公路上的情况,并且通过窗口功能,您可以生成汽车的“状态”并且可以分组.无需再费周折: --here's my data,you have different systems,the time of measurement,and the actual measurement --as well,the raw data has whether or not it's a repeat (hense the included window function select * into temporary table cumulative_repeat_calculator_data FROM ( select system_measured,time_of_measurement,measurement,case when measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) then 1 else 0 end as repeat FROM ( SELECT 5 as measurement,1 as time_of_measurement,1 as system_measured UNION SELECT 150 as measurement,2 as time_of_measurement,1 as system_measured UNION SELECT 5 as measurement,3 as time_of_measurement,4 as time_of_measurement,2 as system_measured UNION SELECT 5 as measurement,2 as system_measured UNION SELECT 150 as measurement,5 as time_of_measurement,6 as time_of_measurement,7 as time_of_measurement,8 as time_of_measurement,2 as system_measured ) as data ) as data; --unfortunately you can't have window functions within window functions,so I had to break it down into subquery --what we need is something to partion on,the 'state' of the system if you will,so I ran a running total of the nonrepeats --this creates a row that stays the same when your data is repeating - aka something you can partition/group on select * into temporary table cumulative_repeat_calculator_step_1 FROM ( select *,sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system from cumulative_repeat_calculator_data order by system_measured,time_of_measurement ) as data; --finally,the query. I didn't bother showing my desired output,because this (finally) got it --I wanted a sequential count of repeats that restarts when it stops repeating,and starts with the first repeat --what you can do now is take the average measurement under some condition based on how long it was repeating,for example select *,case when repeat = 0 then 0 else row_number() over (partition by cumlative_sum_of_nonrepeats_by_system,system_measured order by time_of_measurement) - 1 end as ordered_repeat from cumulative_repeat_calculator_step_1 order by system_measured,time_of_measurement 那么,为了在巨大的桌子上运行它,或者你会使用什么替代工具,你会采取哪些不同的做法?我正在考虑plpgsql,因为我怀疑这需要在数据库中完成,或者在数据插入过程中,尽管我通常在数据加载后处理它.有没有办法在一次扫描中得到这个而不诉诸子查询? 我已经测试了一种替代方法,但它仍然依赖于子查询,我认为这更快.对于该方法,您可以使用start_timestamp,end_timestamp,system创建“启动和停止”表.然后你加入更大的表,如果时间戳在那些之间,你将它归类为处于该状态,这实际上是cumlative_sum_of_nonrepeats_by_system的替代.但是当你这样做时,你加入1 = 1的数千个设备和数千或数百万’事件’.你认为这是一个更好的方式吗? 解决方法测试用例首先,一个更有用的方式来呈现您的数据 – 甚至更好,在sqlfiddle中,准备好玩: CREATE TEMP TABLE data( system_measured int,time_of_measurement int,measurement int ); INSERT INTO data VALUES (1,1,5),(1,2,150),3,4,(2,5,6,7,8,5); 简化查询 由于目前尚不清楚,我只假设上述情况. WITH x AS ( SELECT *,CASE WHEN lag(measurement) OVER (PARTITION BY system_measured ORDER BY time_of_measurement) = measurement THEN 0 ELSE 1 END AS step FROM data ),y AS ( SELECT *,sum(step) OVER(PARTITION BY system_measured ORDER BY time_of_measurement) AS grp FROM x ) SELECT *,row_number() OVER (PARTITION BY system_measured,grp ORDER BY time_of_measurement) - 1 AS repeat_ct FROM y ORDER BY system_measured,time_of_measurement; 现在,虽然使用纯SQL一切都很好,但是使用plpgsql函数会更快,因为它可以在单个表扫描中执行此操作,此查询至少需要三次扫描. 使用plpgsql函数更快: CREATE OR REPLACE FUNCTION x.f_repeat_ct() RETURNS TABLE ( system_measured int,measurement int,repeat_ct int ) LANGUAGE plpgsql AS $func$ DECLARE r data; -- table name serves as record type r0 data; BEGIN -- SET LOCAL work_mem = '1000 MB'; -- uncomment an adapt if needed,see below! repeat_ct := 0; -- init FOR r IN SELECT * FROM data d ORDER BY d.system_measured,d.time_of_measurement LOOP IF r.system_measured = r0.system_measured AND r.measurement = r0.measurement THEN repeat_ct := repeat_ct + 1; -- start new array ELSE repeat_ct := 0; -- start new count END IF; RETURN QUERY SELECT r.*,repeat_ct; r0 := r; -- remember last row END LOOP; END $func$; 呼叫: SELECT * FROM x.f_repeat_ct(); 确保在这种plpgsql函数中始终对列名进行表限定,因为我们使用相同的名称作为输出参数,如果不合格则优先使用. 数十亿行 如果您有数十亿行,则可能需要将此操作拆分.我引用手册here:
考虑一次计算一个系统的行,或者为work_mem设置足够高的值以应对负载.请按照报价中提供的链接了解有关work_mem的更多信息. 一种方法是在函数中使用 (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |