PostgreSQL – 查找具有特定值的最旧记录

发布时间：2020-12-13 15:56:04 所属栏目：百科来源：网络整理

导读：我有一个文档管理系统,它记录历史表中的所有历史事件.我被要求能够在给定日期为特定客户提供状态为5的最旧的doc_id.该表看起来像这样(为简单起见,截断)： doc_history: id integer doc_id integer event_date timestamp client_id integer status_id integer

我有一个文档管理系统,它记录历史表中的所有历史事件.我被要求能够在给定日期为特定客户提供状态为5的最旧的doc_id.该表看起来像这样(为简单起见,截断)：

doc_history:
    id integer
    doc_id integer
    event_date timestamp
    client_id integer
    status_id integer

client_id和status_id列是事件发生后文档的值.这意味着doc_id定义的文档的最大历史事件行将与文档表中的相同列匹配.通过特定事件日期限制事件,您可以查看当时文档的值.因为这些值不是静态的,所以我不能简单地搜索status_id为5的特定client_id,因为找到的结果可能与文档的max(id)不匹配.希望这是有道理的.

我发现什么工作,但速度很慢,如下：

select
    t.*
from
    (select
        distinct on (doc_id),*
    from
        doc_history
    where
        event_date <= '2013-02-17 23:59:59'
    order by
        doc_id,id desc) t
where
    t.client_id = 9999 and
    t.status_id = 5
limit 1;

基本上,我在给定的最大事件日期之前获取特定文档ID的最大ID,然后验证该最大历史记录项是否已分配给给定客户端,状态设置为5.

这样做的缺点是我正在扫描所有客户的所有历史记录以获得最大值,然后找到我正在寻找的客户和状态.截至目前,这扫描了大约1506万行,并在我的开发服务器上花了大约90秒(这不是很快速).

为了使问题更复杂,我需要在前一周的每一天执行此操作,或者每次运行总共执行七次.此外,系统中的所有文档都以状态5开头,表示新的.这使得这个查询只返回为该客户端输入的第一个文档：

select * from doc_history where client_id = 9999 and
    status_id = 5 and
    event_date <= '2013-02-17 23:59:59'
    order by id limit 1;

我希望做的是扫描,直到找到与特定客户端和状态值匹配的特定文档的最大历史记录,而不必首先为所有客户端找到所有文档ID的最大ID.我不知道这是否可以通过窗口函数(分区依据)或我目前没有看到的其他逻辑来完成.

doc_history表中的一个事件的示例：

# select id,doc_id,event,old_value,new_value,event_date,client_id,status_id from doc_history where doc_id = 9999999 order by id;
    id    | doc_id  | event | old_value | new_value |         event_date         | client_id | status_id
----------+---------+-------+-----------+-----------+----------------------------+-----------+-----------
 25362415 | 9999999 |    13 |           |           | 2013-02-14 11:49:50.032824 |      9999 |         5
 25428192 | 9999999 |    15 |           |           | 2013-02-18 11:15:48.272542 |      9999 |         5
 25428193 | 9999999 |     7 | 5         | 1         | 2013-02-18 11:15:48.301377 |      9999 |         1

事件7的状态已更改,旧值和新值显示它已从5更改为1,这反映在status_id列中.对于event_date小于或等于2013-02-17 23:59:59,上述记录将是最早的“NEW”文档,其status_id为5,但是在2013年2月17日之后它将没有.

解决方法

这应该快得多：

SELECT *
FROM   doc_history h1
WHERE  event_date < '2013-02-18 0:0'::timestamp
AND    client_id = 9999
AND    status_id = 5
AND NOT EXISTS (
   SELECT 1
   FROM   doc_history h2
   WHERE  h2.doc_id = h1.doc_id
   AND    h2.event_date < '2013-02-18 0:0'::timestamp
   AND    h2.event_date > h1.event_date  -- use event_date instead of id!
   )
ORDER  BY doc_id
LIMIT  1;

我很难理解你的描述.基本上,正如我现在所理解的那样,您希望给定时间戳之前具有最大doc_id的行(client_id,status_id)和event_date,其中不存在具有相同doc_id的更高id(等于event_date)的其他行.

请注意我如何替换示例中的条件：

WHERE  event_date <= '2013-02-17 23:59:59'

有：

WHERE  event_date < '2013-02-18 0:0'

由于你有小数秒,你的表达式将失败的时间戳,如：
‘2013-02-17 23：59：59.123’

我用h2.event_date> h1.event_date而不是h2.id> h1.id在NOT EXISTS半连接中因为我认为假设更大的id等于以后的event_date是不明智的.你应该单独依赖event_date.

为了快速实现这一目标,您需要一个multicolumn index的表格(更新)：

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id,status_id,event_date DESC);

我在你的反馈之后切换了doc_id,event_date DESC的位置,这应该更好地适应ORDER BY doc_id LIMIT 1.

如果条件status_id = 5是恒定的(你总是检查5),那么partial index应该更快,但是：

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id,event_date DESC)
WHERE status_id = 5;

和：

CREATE INDEX doc_history_id_idx ON doc_history (doc_id,event_date DESC);

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!