sql-server – 使用窗口函数的日期范围滚动总和

发布时间：2020-12-12 06:20:19 所属栏目：MsSql教程来源：网络整理

导读：我需要计算一个日期范围内的滚动总和.为了说明,使用 AdventureWorks sample database,以下假设语法将完全符合我的需要： SELECT TH.ProductID,TH.TransactionDate,TH.ActualCost,RollingSum45 = SUM(TH.ActualCost) OVER ( PARTITION BY TH.ProductID ORDER B

我需要计算一个日期范围内的滚动总和.为了说明,使用 AdventureWorks sample database,以下假设语法将完全符合我的需要：

SELECT
    TH.ProductID,TH.TransactionDate,TH.ActualCost,RollingSum45 = SUM(TH.ActualCost) OVER (
        PARTITION BY TH.ProductID
        ORDER BY TH.TransactionDate
        RANGE BETWEEN 
            INTERVAL 45 DAY PRECEDING
            AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,TH.ReferenceOrderID;

遗憾的是,RANGE窗口框架范围当前不允许SQL Server中的间隔.

我知道我可以使用子查询和常规(非窗口)聚合编写解决方案：

SELECT 
    TH.ProductID,RollingSum45 =
    (
        SELECT SUM(TH2.ActualCost)
        FROM Production.TransactionHistory AS TH2
        WHERE
            TH2.ProductID = TH.ProductID
            AND TH2.TransactionDate <= TH.TransactionDate
            AND TH2.TransactionDate >= DATEADD(DAY,-45,TH.TransactionDate)
    )
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,TH.ReferenceOrderID;

鉴于以下指数：

CREATE UNIQUE INDEX i
ON Production.TransactionHistory
    (ProductID,TransactionDate,ReferenceOrderID)
INCLUDE
    (ActualCost);

执行计划是：

虽然不是非常低效,但似乎应该可以仅使用SQL Server 2012,2014或2016中支持的窗口聚合和分析函数来表达此查询(到目前为止).

为清楚起见,我正在寻找一种对数据执行单次传递的解决方案.

在T-SQL中,这可能意味着the OVER clause将完成工作,执行计划将以Window Spools和Window Aggregates为特色.使用OVER子句的所有语言元素都是公平游戏.只要能保证产生正确的结果,SQLCLR解决方案是可以接受的.

对于T-SQL解决方案,执行计划中的哈希,排序和窗口假脱机/聚合越少越好.随意添加索引,但不允许使用单独的结构(例如,没有预先计算的表与触发器保持同步).允许使用参考表(数字表,日期等)

理想情况下,解决方案将以与上述子查询版本相同的顺序生成完全相同的结果,但任何可以说是正确的也是可接受的.性能始终是一个考虑因素,因此解决方案至少应该是合理有效的.

专用聊天室：我创建了一个公共聊天室,用于讨论与此问题及其答案相关的问题.任何拥有at least 20 reputation points的用户都可以直接参加.如果您的代表少于20人并想参加,请在下面的评论中给我打电话.

> Discussion for “Date range rolling sum using window functions”

解决方法

好问题,保罗！我使用了几种不同的方法,一种是在T-SQL中,另一种是在CLR中.

T-SQL快速摘要

T-SQL方法可以概括为以下步骤：

>以产品/日期的交叉产品为例
>合并观察到的销售数据
>将该数据汇总到产品/日期级别
>根据此汇总数据(包含填写的任何“缺失”天数)计算过去45天内的滚动总和
>将这些结果过滤为仅具有一个或多个销售额的产品/日期配对

使用SET STATISTICS IO ON,此方法报告表’TransactionHistory’.扫描计数1,逻辑读取484,确认表上的“单次通过”.作为参考,原始循环搜索查询报告表’TransactionHistory’.扫描计数113444,逻辑读取438366.

如SET STATISTICS TIME ON所报告的,CPU时间为514ms.对于原始查询,这有利地与2231ms相比.

CLR快速摘要

CLR摘要可归纳为以下步骤：

>将数据读入内存,按产品和日期排序
>在处理每笔交易时,添加到运行的总成本中.每当事务是与上一个事务不同的产品时,将运行总计重置为0.
>维护指向与当前事务具有相同(产品,日期)的第一个事务的指针.每当遇到与该(事务,日期)的最后一次交易时,计算该交易的滚动金额并将其应用于具有相同的所有交易(产品,日期)
>将所有结果返回给用户！

使用SET STATISTICS IO ON,此方法报告没有发生逻辑I / O！哇,一个完美的解决方案！ (实际上,SET STATISTICS IO似乎没有报告CLR中发生的I / O.但是从代码中可以很容易地看到表的一次扫描完成并按照Paul建议的顺序检索数据.

据SET STATISTICS TIME ON报告,CPU时间现在为187ms.所以这比T-SQL方法有了很大改进.不幸的是,两种方法的总体经过时间非常相似,每次大约半秒钟.但是,基于CLR的方法必须向控制台输出113K行(对于按产品/日期分组的T-SQL方法,只需52K),这就是为什么我专注于CPU时间.

这种方法的另一大优点是它产生与原始循环/搜索方法完全相同的结果,即使在同一天多次销售产品的情况下,每个交易也包括一行. (在AdventureWorks上,我专门比较了逐行结果,并确认它们与Paul的原始查询相关联.)

这种方法的缺点,至少在其当前形式中,是它读取存储器中的所有数据.但是,设计的算法在任何给定时间都只需要内存中的当前窗口框架,并且可以更新以适用于超出内存的数据集. Paul在他的回答中通过生成这种算法的实现来说明这一点,该算法仅在内存中存储滑动窗口.这是以授予CLR程序集更高权限为代价的,但在将此解决方案扩展到任意大型数据集时肯定是值得的.

T-SQL – 按日期分组的一次扫描

初始设置

USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID,ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
    INSERT INTO dbo.calendar (d) VALUES (@d)
    SELECT @d =  DATEADD(DAY,1,@d)
END
GO

查询

DECLARE @minAnalysisDate DATE = '2007-09-01',-- Customizable start date depending on business needs
        @maxAnalysisDate DATE = '2008-09-03'  -- Customizable end date depending on business needs
SELECT ProductID,ActualCost,RollingSum45,NumOrders
FROM (
    SELECT ProductID,NumOrders,SUM(ActualCost) OVER (
                PARTITION BY ProductId ORDER BY TransactionDate 
                ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
            ) AS RollingSum45
    FROM (
        -- The full cross-product of products and dates,combined with actual cost information for that product/date
        SELECT p.ProductID,c.d AS TransactionDate,COUNT(TH.ProductId) AS NumOrders,SUM(TH.ActualCost) AS ActualCost
        FROM Production.Product p
        JOIN dbo.calendar c
            ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
        LEFT OUTER JOIN Production.TransactionHistory TH
            ON TH.ProductId = p.productId
            AND TH.TransactionDate = c.d
        GROUP BY P.ProductID,c.d
    ) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID,TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)

执行计划

从执行计划中,我们看到Paul提出的原始索引足以允许我们执行Production.TransactionHistory的单个有序扫描,使用合并连接将事务历史记录与每个可能的产品/日期组合相结合.

假设

这种方法有一些重要的假设.我想保罗会决定他们是否可以接受:)

>我正在使用Production.Product表.此表在AdventureWorks2012上免费提供,并且该关系由Production.TransactionHistory中的外键强制执行,因此我将其解释为公平游戏.
>这种方法依赖于事务在AdventureWorks2012上没有时间组件;如果他们这样做,如果没有首先通过交易历史,就不可能生成全套产品/日期组合.
>我正在生成一个每个产品/日期对只包含一行的行集.我认为这是“可以说是正确的”,并且在许多情况下是一个更理想的结果.对于每个产品/日期,我添加了一个NumOrders列以指示发生了多少销售.如果产品在同一天出售多次(例如,319 / 2007-09-05 00：00：00.000),请参阅以下屏幕截图,以比较原始查询与建议查询的结果.

CLR – 一次扫描,完整的未分组结果集

主要功能体

这里没有什么可看的;函数的主体声明输入(必须与相应的SQL函数匹配),设置SQL连接,并打开SQLReader.

// SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory
[SqlFunction(DataAccess = DataAccessKind.Read,FillRowMethodName = "RollingSum_Fill",TableDefinition = "ProductId INT,TransactionDate DATETIME,ReferenceOrderID INT," +
                      "ActualCost FLOAT,PrevCumulativeSum FLOAT,RollingSum FLOAT")]
public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) {
    using (var connection = new SqlConnection("context connection=true;")) {
        connection.Open();
        List<TrxnRollingSum> trxns;
        using (var cmd = connection.CreateCommand()) {
            //Read the transaction history (note: the order is important!)
            cmd.CommandText = @"SELECT ProductId,ReferenceOrderID,CAST(ActualCost AS FLOAT) AS ActualCost 
                                FROM Production.TransactionHistory 
                                ORDER BY ProductId,TransactionDate";
            using (var reader = cmd.ExecuteReader()) {
                trxns = ComputeRollingSums(reader,rollingPeriodDays.Value);
            }
        }

        return trxns;
    }
}

核心逻辑

我已经将主要逻辑分离出来,以便更容易关注：

// Given a SqlReader with transaction history data,computes / returns the rolling sums
private static List<TrxnRollingSum> ComputeRollingSums(SqlDataReader reader,int rollingPeriodDays) {
    var startIndexOfRollingPeriod = 0;
    var rollingSumIndex = 0;
    var trxns = new List<TrxnRollingSum>();

    // Prior to the loop,initialize "next" to be the first transaction
    var nextTrxn = GetNextTrxn(reader,null);
    while (nextTrxn != null)
    {
        var currTrxn = nextTrxn;
        nextTrxn = GetNextTrxn(reader,currTrxn);
        trxns.Add(currTrxn);

        // If the next transaction is not the same product/date as the current
        // transaction,we can finalize the rolling sum for the current transaction
        // and all previous transactions for the same product/date
        var finalizeRollingSum = nextTrxn == null || (nextTrxn != null &&
                                (currTrxn.ProductId != nextTrxn.ProductId ||
                                currTrxn.TransactionDate != nextTrxn.TransactionDate));
        if (finalizeRollingSum)
        {
            // Advance the pointer to the first transaction (for the same product)
            // that occurs within the rolling period
            while (startIndexOfRollingPeriod < trxns.Count
                && trxns[startIndexOfRollingPeriod].TransactionDate <
                    currTrxn.TransactionDate.AddDays(-1 * rollingPeriodDays))
            {
                startIndexOfRollingPeriod++;
            }

            // Compute the rolling sum as the cumulative sum (for this product),// minus the cumulative sum for prior to the beginning of the rolling window
            var sumPriorToWindow = trxns[startIndexOfRollingPeriod].PrevSum;
            var rollingSum = currTrxn.ActualCost + currTrxn.PrevSum - sumPriorToWindow;
            // Fill in the rolling sum for all transactions sharing this product/date
            while (rollingSumIndex < trxns.Count)
            {
                trxns[rollingSumIndex++].RollingSum = rollingSum;
            }
        }

        // If this is the last transaction for this product,reset the rolling period
        if (nextTrxn != null && currTrxn.ProductId != nextTrxn.ProductId)
        {
            startIndexOfRollingPeriod = trxns.Count;
        }
    }

    return trxns;
}

助手

以下逻辑可以内联编写,但当它们被拆分为自己的方法时,它会更容易阅读.

private static TrxnRollingSum GetNextTrxn(SqlDataReader r,TrxnRollingSum currTrxn) {
    TrxnRollingSum nextTrxn = null;
    if (r.Read()) {
        nextTrxn = new TrxnRollingSum {
            ProductId = r.GetInt32(0),TransactionDate = r.GetDateTime(1),ReferenceOrderId = r.GetInt32(2),ActualCost = r.GetDouble(3),PrevSum = 0 };
        if (currTrxn != null) {
            nextTrxn.PrevSum = (nextTrxn.ProductId == currTrxn.ProductId)
                    ? currTrxn.PrevSum + currTrxn.ActualCost : 0;
        }
    }
    return nextTrxn;
}

// Represents the output to be returned
// Note that the ReferenceOrderId/PrevSum fields are for debugging only
private class TrxnRollingSum {
    public int ProductId { get; set; }
    public DateTime TransactionDate { get; set; }
    public int ReferenceOrderId { get; set; }
    public double ActualCost { get; set; }
    public double PrevSum { get; set; }
    public double RollingSum { get; set; }
}

// The function that generates the result data for each row
// (Such a function is mandatory for SQL CLR table-valued functions)
public static void RollingSum_Fill(object trxnWithRollingSumObj,out int productId,out DateTime transactionDate,out int referenceOrderId,out double actualCost,out double prevCumulativeSum,out double rollingSum) {
    var trxn = (TrxnRollingSum)trxnWithRollingSumObj;
    productId = trxn.ProductId;
    transactionDate = trxn.TransactionDate;
    referenceOrderId = trxn.ReferenceOrderId;
    actualCost = trxn.ActualCost;
    prevCumulativeSum = trxn.PrevSum;
    rollingSum = trxn.RollingSum;
}

在SQL中将它们捆绑在一起

到目前为止,一切都在C#中,所以让我们看看实际涉及的SQL. (或者,您可以使用this deployment script直接从我的程序集中创建程序集,而不是自己编译.)

USE AdventureWorks2012; /* GPATTERSON2SQL2014DEVELOPER */
GO

-- Enable CLR
EXEC sp_configure 'clr enabled',1;
GO
RECONFIGURE;
GO

-- Create the assembly based on the dll generated by compiling the CLR project
-- I've also included the "assembly bits" version that can be run without compiling
CREATE ASSEMBLY ClrPlayground
-- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
FROM 'C:FullPathGoesHereClrPlaygroundbinDebugClrPlayground.dll'
WITH PERMISSION_SET = safe;
GO

--Create a function from the assembly
CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
RETURNS TABLE ( ProductId INT,ActualCost FLOAT,RollingSum FLOAT)
-- The function yields rows in order,so let SQL Server know to avoid an extra sort
ORDER (ProductID,ReferenceOrderID)
AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
GO

-- Now we can actually use the TVF!
SELECT * 
FROM dbo.RollingSumTvf(45) 
ORDER BY ProductId,ReferenceOrderId
GO

注意事项

CLR方法为优化算法提供了更大的灵活性,C#专家可能会进一步调整它.但是,CLR策略也存在缺点.要记住以下几点：

>此CLR方法将数据集的副本保留在内存中.可以使用流式处理方法,但我遇到了最初的困难,并发现有an outstanding Connect issue抱怨SQL 2008中的更改使得使用这种方法更加困难.它仍然可能(如Paul所示),但通过将数据库设置为TRUSTWORTHY并向CLR程序集授予EXTERNAL_ACCESS,需要更高级别的权限.因此存在一些麻烦和潜在的安全隐患,但是回报是一种流式方法,可以比AdventureWorks上的方法更好地扩展到更大的数据集.
>某些DBA可能无法访问CLR,使得这样的功能更像是一个不透明的黑盒子,不易修改,不易于部署,也许不容易调试.与T-SQL方法相比,这是一个非常大的缺点.

额外奖励：T-SQL#2 – 我实际使用的实用方法

在尝试创造性思考这个问题一段时间后,我想我也会发布一个相当简单,实用的方法,如果它出现在我的日常工作中,我可能会选择解决这个问题.它确实利用了SQL 2012窗口功能,但没有采用问题所希望的突破性方式：

-- Compute all running costs into a #temp table; Note that this query could simply read
-- from Production.TransactionHistory,but a CROSS APPLY by product allows the window 
-- function to be computed independently per product,supporting a parallel query plan
SELECT t.*
INTO #runningCosts
FROM Production.Product p
CROSS APPLY (
    SELECT t.ProductId,t.TransactionDate,t.ReferenceOrderId,t.ActualCost,-- Running sum of the cost for this product,including all ties on TransactionDate
        SUM(t.ActualCost) OVER (
            ORDER BY t.TransactionDate 
            RANGE UNBOUNDED PRECEDING) AS RunningCost
    FROM Production.TransactionHistory t
    WHERE t.ProductId = p.ProductId
) t
GO

-- Key the table in our output order
ALTER TABLE #runningCosts
ADD PRIMARY KEY (ProductId,ReferenceOrderId)
GO

SELECT r.ProductId,r.TransactionDate,r.ReferenceOrderId,r.ActualCost,-- Cumulative running cost - running cost prior to the sliding window
    r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
FROM #runningCosts r
OUTER APPLY (
    -- For each transaction,find the running cost just before the sliding window begins
    SELECT TOP 1 b.RunningCost
    FROM #runningCosts b
    WHERE b.ProductId = r.ProductId
        AND b.TransactionDate < DATEADD(DAY,r.TransactionDate)
    ORDER BY b.TransactionDate DESC
) w
ORDER BY r.ProductId,r.ReferenceOrderId
GO

这实际上产生了一个相当简单的整体查询计划,即使同时查看两个相关的查询计划：

我喜欢这种方法的几个原因：

>它产生问题陈述中请求的完整结果集(与大多数其他T-SQL解决方案相反,后者返回结果的分组版本).
>易于解释,理解和调试;一年后我不会回来,想知道如何在不破坏正确性或性能的情况下做出一点改变
>它在提供的数据集上运行大约900ms,而不是原始循环搜索的2700ms
>如果数据更密集(每天更多的事务),计算复杂性不会随着滑动窗口中的事务数量呈二次方式增长(就像对原始查询一样);我认为这解决了保罗想要避免多次扫描的一部分问题
>由于new tempdb lazy write functionality,在SQL 2012的最新更新中导致基本上没有tempdb I / O.
>对于非常大的数据集,如果内存压力成为一个问题,将每个产品的工作分成不同的批次是微不足道的

一些潜在的警告：

>虽然技术上只扫描了Production.TransactionHistory一次,但它并不是真正的“一次扫描”方法,因为#temp表的大小相似,并且还需要在该表上执行额外的逻辑I / O.但是,由于我们已经定义了其精确的结构,因此我认为这与我们有更多手动控制的工作表有太大不同>根据您的环境,tempdb的使用可能被视为正面(例如,它位于一组单独的SSD驱动器上)或负面(服务器上的高并发性,已经存在大量的tempdb争用)

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!