介绍PostgreSQL中的Lateral类型
PostgreSQL 9.3 用了一种新的联合类型! Lateral联合的推出比较低调,但它实现了之前需要使用编写程序才能获得的强大的新查询. 在本文中,我将会介绍一个在 PostgreSQL 9.2 不可能被实现的渠道转换分析. 对此的最佳描述在文档中 可选 FROM 语句清单 的底部: LATERAL 关键词可以在前缀一个 SELECT FROM 子项. 这能让 SELECT 子项在FROM项出现之前就引用到FROM项中的列. (没有 LATERAL 的话,每一个 SELECT 子项彼此都是独立的,因此不能够对其它的 FROM 项进行交叉引用.) 这种计算有一点密集。你可以比较松散的将 LATERAL 联合理解作一个 SQL 的foreach 选择,在这个循环中 PostgreSQL 将循环一个结果集中的每一行,并将那一行作为参数来执行一次子查询的计算. 我们可以用这个来干些什么? 看看下面这个用来记录点击事件的表结构: CREATE TABLE event ( user_id BIGINT,event_id BIGINT,time BIGINT NOT NULL,data JSON NOT NULL,PRIMARY KEY (user_id,event_id) ) 每一个事件都关联了一个用户,拥有一个ID,一个时间戳,还有一个带有事件属性的JSON blob. 在堆中,这些属性可能包含一次点击的DOM层级,窗口的标题,会话引用等等信息. 加入我们要优化我们的登录页面以增加注册. 第一步就是要计算看看我们的哪个渠道转换上正在丢失用户. 示例:一个注册流程的个步骤之间的渠道转换率.
SELECT user_id,view_homepage,view_homepage_time,enter_credit_card,enter_credit_card_time FROM ( -- Get the first time each user viewed the homepage. SELECT user_id,1 AS view_homepage,min(time) AS view_homepage_time FROM event WHERE data->>'type' = 'view_homepage' GROUP BY user_id ) e1 LEFT JOIN LATERAL ( -- For each row,get the first time the user_id did the enter_credit_card -- event,if one exists within two weeks of view_homepage_time. SELECT 1 AS enter_credit_card,time AS enter_credit_card_time FROM event WHERE user_id = e1.user_id AND data->>'type' = 'enter_credit_card' AND time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14) ORDER BY time LIMIT 1 ) e2 ON true 没有人会喜欢30多行的SQL查询,所以让我们将这些SQL分成片段来分析。第一块是一段普通的 SQL: SELECT user_id,min(time) AS view_homepage_time FROM event WHERE data->>'type' = 'view_homepage' GROUP BY user_id 也就是要获取到每个用户最开始触发 view_homepage 事件的时间. 然后我们的 lateral 联合就可以让我们迭代结果集的每一行,并会在接下来执行一次参数化的子查询. 这就等同于针对结果集的每一行都要执行一边下面的这个查询: SELECT 1 AS enter_credit_card,time AS enter_credit_card_time FROM event WHERE user_id = e1.user_id AND data->>'type' = 'enter_credit_card' AND time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14) ORDER BY time LIMIT 1 例如,对于每一个用户,要获取他们在触发 view_homepage_time 事件后的两星期内触发 enter_credit_card 事件的时间. 因为这是一个lateral联合,我们的子查询就可以从之前的子查询出引用到 view_homepage_time 结果集. 否则,子查询就只能单独执行,而没办法访问到另外一个子查询所计算出来的结果集. 之后哦我们整个封装成一个select,它会返回像下面这样的东西: user_id | view_homepage | view_homepage_time | enter_credit_card | enter_credit_card_time ---------+---------------+--------------------+-------------------+------------------------ 567 | 1 | 5234567890 | 1 | 5839367890 234 | 1 | 2234567890 | | 345 | 1 | 3234567890 | | 456 | 1 | 4234567890 | | 678 | 1 | 6234567890 | | 123 | 1 | 1234567890 | | ...
SELECT sum(view_homepage) AS viewed_homepage,sum(enter_credit_card) AS entered_credit_card FROM ( -- Get the first time each user viewed the homepage. SELECT user_id,min(time) AS view_homepage_time FROM event WHERE data->>'type' = 'view_homepage' GROUP BY user_id ) e1 LEFT JOIN LATERAL ( -- For each (user_id,view_homepage_time) tuple,get the first time that -- user did the enter_credit_card event,if one exists within two weeks. SELECT 1 AS enter_credit_card,time AS enter_credit_card_time FROM event WHERE user_id = e1.user_id AND data->>'type' = 'enter_credit_card' AND time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14) ORDER BY time LIMIT 1 ) e2 ON true … 它会输出: viewed_homepage | entered_credit_card -----------------+--------------------- 827 | 10
SELECT sum(view_homepage) AS viewed_homepage,sum(use_demo) AS use_demo,min(time) AS view_homepage_time FROM event WHERE data->>'type' = 'view_homepage' GROUP BY user_id ) e1 LEFT JOIN LATERAL ( -- For each row,get the first time the user_id did the use_demo -- event,if one exists within one week of view_homepage_time. SELECT user_id,1 AS use_demo,time AS use_demo_time FROM event WHERE user_id = e1.user_id AND data->>'type' = 'use_demo' AND time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*7) ORDER BY time LIMIT 1 ) e2 ON true LEFT JOIN LATERAL ( -- For each row,get the first time the user_id did the enter_credit_card -- event,if one exists within one week of use_demo_time. SELECT 1 AS enter_credit_card,time AS enter_credit_card_time FROM event WHERE user_id = e2.user_id AND data->>'type' = 'enter_credit_card' AND time BETWEEN use_demo_time AND (use_demo_time + 1000*60*60*24*7) ORDER BY time LIMIT 1 ) e3 ON true 这样就会输出: viewed_homepage | use_demo | entered_credit_card -----------------+----------+--------------------- 827 | 220 | 86
这些问题的答案会直接影响到产品的改进,它们可以从 PostgreSQL 数据库中找到答案,因为现在它支持 lateral 联合.
注意这些查询可以被优化,以变得更加高效. 在本例中,如果我们在 (user_id,(data->>'type'),time)上创建一个btree索引,我们只用一次索引查找就能针对每一个用户计算每一个渠道步骤. 如果你使用的是SSD,在上面做查找花费是很小的,那这就足够了。而如果不是,你就可能需要用稍微不同的手段来图示化你的数据,详细的内容我会留到另外一篇文章之中进行介绍.
(编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |