sql – 连续重复/重复的有序计数

发布时间：2020-12-12 06:01:32 所属栏目：MsSql教程来源：网络整理

导读：我非常怀疑我是以最有效的方式做到这一点,这就是我在这里标记plpgsql的原因.对于一千个测量系统,我需要在20亿行上运行它. 您有测量系统,当它们失去连接时经常报告先前的值,并且它们经常失去连接,但有时会长时间失去连接.您需要聚合,但是当您这样做时,您需要查

我非常怀疑我是以最有效的方式做到这一点,这就是我在这里标记plpgsql的原因.对于一千个测量系统,我需要在20亿行上运行它.

您有测量系统,当它们失去连接时经常报告先前的值,并且它们经常失去连接,但有时会长时间失去连接.您需要聚合,但是当您这样做时,您需要查看它重复多长时间并根据该信息制作各种过滤器.假设你在汽车上测量mpg,但它在20英里/加仑的速度下停留一小时,而不是在20.1左右,依此类推.你会想要在卡住时评估准确性.您还可以放置一些替代规则来查找汽车在高速公路上的情况,并且通过窗口功能,您可以生成汽车的“状态”并且可以分组.无需再费周折：

--here's my data,you have different systems,the time of measurement,and the actual measurement
--as well,the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured,time_of_measurement,measurement,case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement,1 as time_of_measurement,1 as system_measured
    UNION
    SELECT 150 as measurement,2 as time_of_measurement,1 as system_measured
    UNION
    SELECT 5 as measurement,3 as time_of_measurement,4 as time_of_measurement,2 as system_measured
    UNION
    SELECT 5 as measurement,2 as system_measured
    UNION
    SELECT 150 as measurement,5 as time_of_measurement,6 as time_of_measurement,7 as time_of_measurement,8 as time_of_measurement,2 as system_measured
    ) as data
) as data;

--unfortunately you can't have window functions within window functions,so I had to break it down into subquery
--what we need is something to partion on,the 'state' of the system if you will,so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured,time_of_measurement
) as data;

--finally,the query. I didn't bother showing my desired output,because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating,and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating,for example  
select *,case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system,system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured,time_of_measurement

那么,为了在巨大的桌子上运行它,或者你会使用什么替代工具,你会采取哪些不同的做法？我正在考虑plpgsql,因为我怀疑这需要在数据库中完成,或者在数据插入过程中,尽管我通常在数据加载后处理它.有没有办法在一次扫描中得到这个而不诉诸子查询？

我已经测试了一种替代方法,但它仍然依赖于子查询,我认为这更快.对于该方法,您可以使用start_timestamp,end_timestamp,system创建“启动和停止”表.然后你加入更大的表,如果时间戳在那些之间,你将它归类为处于该状态,这实际上是cumlative_sum_of_nonrepeats_by_system的替代.但是当你这样做时,你加入1 = 1的数千个设备和数千或数百万’事件’.你认为这是一个更好的方式吗？

解决方法

测试用例

首先,一个更有用的方式来呈现您的数据 – 甚至更好,在sqlfiddle中,准备好玩：

CREATE TEMP TABLE data(
   system_measured int,time_of_measurement int,measurement int
);

INSERT INTO data VALUES
 (1,1,5),(1,2,150),3,4,(2,5,6,7,8,5);

简化查询

由于目前尚不清楚,我只假设上述情况.
接下来,我简化了您的查询以达到：

WITH x AS (
   SELECT *,CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   ),y AS (
   SELECT *,sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT *,row_number() OVER (PARTITION BY system_measured,grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured,time_of_measurement;

现在,虽然使用纯SQL一切都很好,但是使用plpgsql函数会更快,因为它可以在单个表扫描中执行此操作,此查询至少需要三次扫描.

使用plpgsql函数更快：

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int,measurement int,repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed,see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured,d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- start new array
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*,repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

呼叫：

SELECT * FROM x.f_repeat_ct();

确保在这种plpgsql函数中始终对列名进行表限定,因为我们使用相同的名称作为输出参数,如果不合格则优先使用.

数十亿行

如果您有数十亿行,则可能需要将此操作拆分.我引用手册here：

Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function,as
discussed above. That means that if a PL/pgSQL function produces a
very large result set,performance might be poor: data will be written
to disk to avoid memory exhaustion,but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation. Currently,the point at
which data begins being written to disk is controlled by the 07002
configuration variable. Administrators who have sufficient memory to
store larger result sets in memory should consider increasing this
parameter.

考虑一次计算一个系统的行,或者为work_mem设置足够高的值以应对负载.请按照报价中提供的链接了解有关work_mem的更多信息.

一种方法是在函数中使用SET LOCAL为work_mem设置一个非常高的值,这仅对当前事务有效.我在函数中添加了一条注释行.不要在全局范围内设置它,因为这可能会破坏您的服务器.阅读手册.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!