sql-server – 使用窗口函数的日期范围滚动总和
我需要计算一个日期范围内的滚动总和.为了说明,使用
AdventureWorks sample database,以下假设语法将完全符合我的需要:
SELECT TH.ProductID,TH.TransactionDate,TH.ActualCost,RollingSum45 = SUM(TH.ActualCost) OVER ( PARTITION BY TH.ProductID ORDER BY TH.TransactionDate RANGE BETWEEN INTERVAL 45 DAY PRECEDING AND CURRENT ROW) FROM Production.TransactionHistory AS TH ORDER BY TH.ProductID,TH.ReferenceOrderID; 遗憾的是,RANGE窗口框架范围当前不允许SQL Server中的间隔. 我知道我可以使用子查询和常规(非窗口)聚合编写解决方案: SELECT TH.ProductID,RollingSum45 = ( SELECT SUM(TH2.ActualCost) FROM Production.TransactionHistory AS TH2 WHERE TH2.ProductID = TH.ProductID AND TH2.TransactionDate <= TH.TransactionDate AND TH2.TransactionDate >= DATEADD(DAY,-45,TH.TransactionDate) ) FROM Production.TransactionHistory AS TH ORDER BY TH.ProductID,TH.ReferenceOrderID; 鉴于以下指数: CREATE UNIQUE INDEX i ON Production.TransactionHistory (ProductID,TransactionDate,ReferenceOrderID) INCLUDE (ActualCost); 执行计划是: 虽然不是非常低效,但似乎应该可以仅使用SQL Server 2012,2014或2016中支持的窗口聚合和分析函数来表达此查询(到目前为止). 为清楚起见,我正在寻找一种对数据执行单次传递的解决方案. 在T-SQL中,这可能意味着the 对于T-SQL解决方案,执行计划中的哈希,排序和窗口假脱机/聚合越少越好.随意添加索引,但不允许使用单独的结构(例如,没有预先计算的表与触发器保持同步).允许使用参考表(数字表,日期等) 理想情况下,解决方案将以与上述子查询版本相同的顺序生成完全相同的结果,但任何可以说是正确的也是可接受的.性能始终是一个考虑因素,因此解决方案至少应该是合理有效的. 专用聊天室:我创建了一个公共聊天室,用于讨论与此问题及其答案相关的问题.任何拥有at least 20 reputation points的用户都可以直接参加.如果您的代表少于20人并想参加,请在下面的评论中给我打电话. > Discussion for “Date range rolling sum using window functions” 解决方法好问题,保罗!我使用了几种不同的方法,一种是在T-SQL中,另一种是在CLR中.T-SQL快速摘要 T-SQL方法可以概括为以下步骤: >以产品/日期的交叉产品为例 使用SET STATISTICS IO ON,此方法报告表’TransactionHistory’.扫描计数1,逻辑读取484,确认表上的“单次通过”.作为参考,原始循环搜索查询报告表’TransactionHistory’.扫描计数113444,逻辑读取438366. 如SET STATISTICS TIME ON所报告的,CPU时间为514ms.对于原始查询,这有利地与2231ms相比. CLR快速摘要 CLR摘要可归纳为以下步骤: >将数据读入内存,按产品和日期排序 使用SET STATISTICS IO ON,此方法报告没有发生逻辑I / O!哇,一个完美的解决方案! (实际上,SET STATISTICS IO似乎没有报告CLR中发生的I / O.但是从代码中可以很容易地看到表的一次扫描完成并按照Paul建议的顺序检索数据. 据SET STATISTICS TIME ON报告,CPU时间现在为187ms.所以这比T-SQL方法有了很大改进.不幸的是,两种方法的总体经过时间非常相似,每次大约半秒钟.但是,基于CLR的方法必须向控制台输出113K行(对于按产品/日期分组的T-SQL方法,只需52K),这就是为什么我专注于CPU时间. 这种方法的另一大优点是它产生与原始循环/搜索方法完全相同的结果,即使在同一天多次销售产品的情况下,每个交易也包括一行. (在AdventureWorks上,我专门比较了逐行结果,并确认它们与Paul的原始查询相关联.) 这种方法的缺点,至少在其当前形式中,是它读取存储器中的所有数据.但是,设计的算法在任何给定时间都只需要内存中的当前窗口框架,并且可以更新以适用于超出内存的数据集. Paul在他的回答中通过生成这种算法的实现来说明这一点,该算法仅在内存中存储滑动窗口.这是以授予CLR程序集更高权限为代价的,但在将此解决方案扩展到任意大型数据集时肯定是值得的. T-SQL – 按日期分组的一次扫描 初始设置 USE AdventureWorks2012 GO -- Create Paul's index CREATE UNIQUE INDEX i ON Production.TransactionHistory (ProductID,ReferenceOrderID) INCLUDE (ActualCost); GO -- Build calendar table for 2000 ~ 2020 CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY) GO DECLARE @d DATETIME = '1/1/2000' WHILE (@d < '1/1/2021') BEGIN INSERT INTO dbo.calendar (d) VALUES (@d) SELECT @d = DATEADD(DAY,1,@d) END GO 查询 DECLARE @minAnalysisDate DATE = '2007-09-01',-- Customizable start date depending on business needs @maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs SELECT ProductID,ActualCost,RollingSum45,NumOrders FROM ( SELECT ProductID,NumOrders,SUM(ActualCost) OVER ( PARTITION BY ProductId ORDER BY TransactionDate ROWS BETWEEN 45 PRECEDING AND CURRENT ROW ) AS RollingSum45 FROM ( -- The full cross-product of products and dates,combined with actual cost information for that product/date SELECT p.ProductID,c.d AS TransactionDate,COUNT(TH.ProductId) AS NumOrders,SUM(TH.ActualCost) AS ActualCost FROM Production.Product p JOIN dbo.calendar c ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate LEFT OUTER JOIN Production.TransactionHistory TH ON TH.ProductId = p.productId AND TH.TransactionDate = c.d GROUP BY P.ProductID,c.d ) aggsByDay ) rollingSums WHERE NumOrders > 0 ORDER BY ProductID,TransactionDate -- MAXDOP 1 to avoid parallel scan inflating the scan count OPTION (MAXDOP 1) 执行计划 从执行计划中,我们看到Paul提出的原始索引足以允许我们执行Production.TransactionHistory的单个有序扫描,使用合并连接将事务历史记录与每个可能的产品/日期组合相结合. 假设 这种方法有一些重要的假设.我想保罗会决定他们是否可以接受:) >我正在使用Production.Product表.此表在AdventureWorks2012上免费提供,并且该关系由Production.TransactionHistory中的外键强制执行,因此我将其解释为公平游戏. CLR – 一次扫描,完整的未分组结果集 主要功能体 这里没有什么可看的;函数的主体声明输入(必须与相应的SQL函数匹配),设置SQL连接,并打开SQLReader. // SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory [SqlFunction(DataAccess = DataAccessKind.Read,FillRowMethodName = "RollingSum_Fill",TableDefinition = "ProductId INT,TransactionDate DATETIME,ReferenceOrderID INT," + "ActualCost FLOAT,PrevCumulativeSum FLOAT,RollingSum FLOAT")] public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) { using (var connection = new SqlConnection("context connection=true;")) { connection.Open(); List<TrxnRollingSum> trxns; using (var cmd = connection.CreateCommand()) { //Read the transaction history (note: the order is important!) cmd.CommandText = @"SELECT ProductId,ReferenceOrderID,CAST(ActualCost AS FLOAT) AS ActualCost FROM Production.TransactionHistory ORDER BY ProductId,TransactionDate"; using (var reader = cmd.ExecuteReader()) { trxns = ComputeRollingSums(reader,rollingPeriodDays.Value); } } return trxns; } } 核心逻辑 我已经将主要逻辑分离出来,以便更容易关注: // Given a SqlReader with transaction history data,computes / returns the rolling sums private static List<TrxnRollingSum> ComputeRollingSums(SqlDataReader reader,int rollingPeriodDays) { var startIndexOfRollingPeriod = 0; var rollingSumIndex = 0; var trxns = new List<TrxnRollingSum>(); // Prior to the loop,initialize "next" to be the first transaction var nextTrxn = GetNextTrxn(reader,null); while (nextTrxn != null) { var currTrxn = nextTrxn; nextTrxn = GetNextTrxn(reader,currTrxn); trxns.Add(currTrxn); // If the next transaction is not the same product/date as the current // transaction,we can finalize the rolling sum for the current transaction // and all previous transactions for the same product/date var finalizeRollingSum = nextTrxn == null || (nextTrxn != null && (currTrxn.ProductId != nextTrxn.ProductId || currTrxn.TransactionDate != nextTrxn.TransactionDate)); if (finalizeRollingSum) { // Advance the pointer to the first transaction (for the same product) // that occurs within the rolling period while (startIndexOfRollingPeriod < trxns.Count && trxns[startIndexOfRollingPeriod].TransactionDate < currTrxn.TransactionDate.AddDays(-1 * rollingPeriodDays)) { startIndexOfRollingPeriod++; } // Compute the rolling sum as the cumulative sum (for this product),// minus the cumulative sum for prior to the beginning of the rolling window var sumPriorToWindow = trxns[startIndexOfRollingPeriod].PrevSum; var rollingSum = currTrxn.ActualCost + currTrxn.PrevSum - sumPriorToWindow; // Fill in the rolling sum for all transactions sharing this product/date while (rollingSumIndex < trxns.Count) { trxns[rollingSumIndex++].RollingSum = rollingSum; } } // If this is the last transaction for this product,reset the rolling period if (nextTrxn != null && currTrxn.ProductId != nextTrxn.ProductId) { startIndexOfRollingPeriod = trxns.Count; } } return trxns; } 助手 以下逻辑可以内联编写,但当它们被拆分为自己的方法时,它会更容易阅读. private static TrxnRollingSum GetNextTrxn(SqlDataReader r,TrxnRollingSum currTrxn) { TrxnRollingSum nextTrxn = null; if (r.Read()) { nextTrxn = new TrxnRollingSum { ProductId = r.GetInt32(0),TransactionDate = r.GetDateTime(1),ReferenceOrderId = r.GetInt32(2),ActualCost = r.GetDouble(3),PrevSum = 0 }; if (currTrxn != null) { nextTrxn.PrevSum = (nextTrxn.ProductId == currTrxn.ProductId) ? currTrxn.PrevSum + currTrxn.ActualCost : 0; } } return nextTrxn; } // Represents the output to be returned // Note that the ReferenceOrderId/PrevSum fields are for debugging only private class TrxnRollingSum { public int ProductId { get; set; } public DateTime TransactionDate { get; set; } public int ReferenceOrderId { get; set; } public double ActualCost { get; set; } public double PrevSum { get; set; } public double RollingSum { get; set; } } // The function that generates the result data for each row // (Such a function is mandatory for SQL CLR table-valued functions) public static void RollingSum_Fill(object trxnWithRollingSumObj,out int productId,out DateTime transactionDate,out int referenceOrderId,out double actualCost,out double prevCumulativeSum,out double rollingSum) { var trxn = (TrxnRollingSum)trxnWithRollingSumObj; productId = trxn.ProductId; transactionDate = trxn.TransactionDate; referenceOrderId = trxn.ReferenceOrderId; actualCost = trxn.ActualCost; prevCumulativeSum = trxn.PrevSum; rollingSum = trxn.RollingSum; } 在SQL中将它们捆绑在一起 到目前为止,一切都在C#中,所以让我们看看实际涉及的SQL. (或者,您可以使用this deployment script直接从我的程序集中创建程序集,而不是自己编译.) USE AdventureWorks2012; /* GPATTERSON2SQL2014DEVELOPER */ GO -- Enable CLR EXEC sp_configure 'clr enabled',1; GO RECONFIGURE; GO -- Create the assembly based on the dll generated by compiling the CLR project -- I've also included the "assembly bits" version that can be run without compiling CREATE ASSEMBLY ClrPlayground -- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version FROM 'C:FullPathGoesHereClrPlaygroundbinDebugClrPlayground.dll' WITH PERMISSION_SET = safe; GO --Create a function from the assembly CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT) RETURNS TABLE ( ProductId INT,ActualCost FLOAT,RollingSum FLOAT) -- The function yields rows in order,so let SQL Server know to avoid an extra sort ORDER (ProductID,ReferenceOrderID) AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf; GO -- Now we can actually use the TVF! SELECT * FROM dbo.RollingSumTvf(45) ORDER BY ProductId,ReferenceOrderId GO 注意事项 CLR方法为优化算法提供了更大的灵活性,C#专家可能会进一步调整它.但是,CLR策略也存在缺点.要记住以下几点: >此CLR方法将数据集的副本保留在内存中.可以使用流式处理方法,但我遇到了最初的困难,并发现有an outstanding Connect issue抱怨SQL 2008中的更改使得使用这种方法更加困难.它仍然可能(如Paul所示),但通过将数据库设置为TRUSTWORTHY并向CLR程序集授予EXTERNAL_ACCESS,需要更高级别的权限.因此存在一些麻烦和潜在的安全隐患,但是回报是一种流式方法,可以比AdventureWorks上的方法更好地扩展到更大的数据集. 额外奖励:T-SQL#2 – 我实际使用的实用方法 在尝试创造性思考这个问题一段时间后,我想我也会发布一个相当简单,实用的方法,如果它出现在我的日常工作中,我可能会选择解决这个问题.它确实利用了SQL 2012窗口功能,但没有采用问题所希望的突破性方式: -- Compute all running costs into a #temp table; Note that this query could simply read -- from Production.TransactionHistory,but a CROSS APPLY by product allows the window -- function to be computed independently per product,supporting a parallel query plan SELECT t.* INTO #runningCosts FROM Production.Product p CROSS APPLY ( SELECT t.ProductId,t.TransactionDate,t.ReferenceOrderId,t.ActualCost,-- Running sum of the cost for this product,including all ties on TransactionDate SUM(t.ActualCost) OVER ( ORDER BY t.TransactionDate RANGE UNBOUNDED PRECEDING) AS RunningCost FROM Production.TransactionHistory t WHERE t.ProductId = p.ProductId ) t GO -- Key the table in our output order ALTER TABLE #runningCosts ADD PRIMARY KEY (ProductId,ReferenceOrderId) GO SELECT r.ProductId,r.TransactionDate,r.ReferenceOrderId,r.ActualCost,-- Cumulative running cost - running cost prior to the sliding window r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45 FROM #runningCosts r OUTER APPLY ( -- For each transaction,find the running cost just before the sliding window begins SELECT TOP 1 b.RunningCost FROM #runningCosts b WHERE b.ProductId = r.ProductId AND b.TransactionDate < DATEADD(DAY,r.TransactionDate) ORDER BY b.TransactionDate DESC ) w ORDER BY r.ProductId,r.ReferenceOrderId GO 这实际上产生了一个相当简单的整体查询计划,即使同时查看两个相关的查询计划: 我喜欢这种方法的几个原因: >它产生问题陈述中请求的完整结果集(与大多数其他T-SQL解决方案相反,后者返回结果的分组版本). 一些潜在的警告: >虽然技术上只扫描了Production.TransactionHistory一次,但它并不是真正的“一次扫描”方法,因为#temp表的大小相似,并且还需要在该表上执行额外的逻辑I / O.但是,由于我们已经定义了其精确的结构,因此我认为这与我们有更多手动控制的工作表有太大不同>根据您的环境,tempdb的使用可能被视为正面(例如,它位于一组单独的SSD驱动器上)或负面(服务器上的高并发性,已经存在大量的tempdb争用) (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |