postgresql-分页重复数据探索

发布时间：2020-12-13 16:09:42 所属栏目：百科来源：网络整理

导读：# postgresql-分页重复数据探索## 问题背景许多开发和测试人员都可能遇到过列表的数据翻下一页的时候显示了上一页的数据，也就是翻页会有重复的数据。### 如何处理？这个问题出现的原因是因为选择的排序字段有重复，常见的处理办法就是排序的时候加上唯一字

# postgresql-分页重复数据探索## 问题背景许多开发和测试人员都可能遇到过列表的数据翻下一页的时候显示了上一页的数据，也就是翻页会有重复的数据。### 如何处理？这个问题出现的原因是因为选择的排序字段有重复，常见的处理办法就是排序的时候加上唯一字段，这样在分页的过程中数据就不会重复了。关于这个问题文档也有解释并非是一个bug。而是排序时需要选择唯一字段来做排序，不然返回的结果不确定### 排序返回数据重复的根本原因是什么呢？经常优化sql的同学可能会发现，执行计划里面会有Sort Method这个关键字，而这个关键字就是排序选择的方法。abase的排序分为三种```sqlquicksort?? ??? ??? ??? ??? ???? 快速排序?? ?top-N heapsort? Memory?? ??? ??? ?堆排序external merge? Disk?? ??? ??? ?归并排序```### 推测分页重复的问题和执行计划选择排序算法的稳定性有关。### 简单介绍下这三种排序算法的场景：在有索引的情况下：排序可以直接走索引。在没有索引的情况下：当表的数据量较小的时候选择快速排序（排序所需必须内存小于work_mem），当排序有limit，且耗费的内存不超过work_mem时选择堆排序，当work_mem不够时选择归并排序。### 验证推测1.创建表，初始化数据```sqlabase=# create table t_sort(n_int int,c_id varchar(300));CREATE TABLEabase=# insert into t_sort(n_int,c_id) select?? 100,generate_series(1,9);INSERT 0 9abase=# insert into t_sort(n_int,c_id) select?? 200,c_id) select?? 300,c_id) select?? 400,c_id) select?? 500,c_id) select?? 600,9);INSERT 0 9```### 三种排序```sql--快速排序 quicksortabase=# explain? analyze select ctid,n_int,c_id from t_sort order by n_int asc;?????????????????????????????????????????????? QUERY PLAN?????????????????????????????????????????????? ?------------------------------------------------------------?Sort? (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)?? Sort Key: n_int?? Sort Method: quicksort? Memory: 27kB?? ->? Seq Scan on t_sort? (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)?Planning time: 0.161 ms?Execution time: 0.104 ms(6 rows)--堆排序? top-N heapsortabase=# explain? analyze select ctid,c_id from t_sort order by n_int asc limit 10;????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????? ?? ------------------------------------------------------------?Limit? (cost=2.71..2.73 rows=10 width=12) (actual time=0.066..0.068 rows=10 loops=1)?? ->? Sort? (cost=2.71..2.84 rows=54 width=12) (actual time=0.065..0.066 rows=10 loops=1)???????? Sort Key: n_int???????? Sort Method: top-N heapsort? Memory: 25kB???????? ->? Seq Scan on t_sort? (cost=0.00..1.54 rows=54 width=12) (actual time=0.022..0.031 rows=54 loops=1)?Planning time: 0.215 ms?Execution time: 0.124 ms(7 rows)--归并排序? external sort? Disk--插入大量值为a的数据abase=# insert into t_sort(n_int,c_id) select?? generate_series(1000,2000),‘a‘;INSERT 0 1001abase=# set work_mem = ‘64kB‘;SETabase=# explain? analyze select ctid,c_id from t_sort order by n_int asc;???????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????? ?-------------------------------------------------------------?Sort? (cost=18.60..19.28 rows=270 width=12) (actual time=1.235..1.386 rows=1055 loops=1)?? Sort Key: n_int?? Sort Method: external sort? Disk: 32kB?? ->? Seq Scan on t_sort? (cost=0.00..7.70 rows=270 width=12) (actual time=0.030..0.247 rows=1055 loops=1)?Planning time: 0.198 ms?Execution time: 1.663 ms(6 rows)```### 快速排序```sql--快速排序abase=# explain? analyze select ctid,c_id from t_sort order by n_int asc;?????????????????????????????????????????????? QUERY PLAN?????????????????????????????????????????????? ?------------------------------------------------------------?Sort? (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)?? Sort Key: n_int?? Sort Method: quicksort? Memory: 27kB?? ->? Seq Scan on t_sort? (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)?Planning time: 0.161 ms?Execution time: 0.104 ms(6 rows)? ?--获取前20条数据??? abase=#? select ctid,c_id from t_sort order by n_int asc? limit 20;????? ctid? | n_int | c_id ??? --------+-------+------???? (0,7)? |?? 100 | 7???? (0,2)? |?? 100 | 2???? (0,4)? |?? 100 | 4???? (0,8)? |?? 100 | 8???? (0,3)? |?? 100 | 3???? (0,6)? |?? 100 | 6???? (0,5)? |?? 100 | 5???? (0,9)? |?? 100 | 9???? (0,1)? |?? 100 | 1???? (0,14) |?? 200 | 5???? (0,13) |?? 200 | 4???? (0,12) |?? 200 | 3???? (0,10) |?? 200 | 1???? (0,15) |?? 200 | 6???? (0,16) |?? 200 | 7???? (0,17) |?? 200 | 8???? (0,11) |?? 200 | 2???? (0,18) |?? 200 | 9???? (0,20) |?? 300 | 2???? (0,19) |?? 300 | 1??? (20 rows)??? --分页获取前10条数据??? abase=# select ctid,c_id from t_sort order by n_int asc limit 10 offset 0;????? ctid? | n_int | c_id ??? --------+-------+------???? (0,10) |?? 200 | 1??? (10 rows)??? --分页从第10条开始获取10条??? abase=#? select ctid,c_id from t_sort order by n_int asc? limit 10 offset 10;????? ctid? | n_int | c_id ??? --------+-------+------???? (0,19) |?? 300 | 1??? (10 rows)```limit 10 offset 0,limit 10 offset 10，连续取两页数据此处可以看到limit 10 offset 10结果中，第三条数据重复了第一页的最后一条： (0,10) |?? 200 | 1并且n_int = 200 and c_id = 5这条数据被“遗漏”了。### 堆排序```sqlabase=# select count(*) from t_sort;?count -------? 1055(1 row)--设置work_mem 4MBabase=# show work_mem ;?work_mem ----------?4MB(1 row)--top-N heapsortabase=# explain analyze select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;??????????????????????????????????????????????????????? QUERY PLAN????????????????????????????????????????? ????????????? ?-------------------------------------------------------------------------------------------------------------?Limit? (cost=2061.33..2061.45 rows=10 width=13) (actual time=15.247..15.251 rows=10 loops=1)?? ->? Limit? (cost=2061.33..2063.83 rows=1001 width=13) (actual time=15.245..15.247 rows=10 loops=1)???????? ->? Sort? (cost=2061.33..2135.72 rows=29757 width=13) (actual time=15.244..15.245 rows=10 loops=1)?????????????? Sort Key: test.n_int?????????????? Sort Method: top-N heapsort? Memory: 95kB?????????????? ->? Seq Scan on test? (cost=0.00..429.57 rows=29757 width=13) (actual time=0.042..7.627 rows=29757 loops=1)?Planning time: 0.376 ms?Execution time: 15.415 ms(8 rows)--获取limit 1001 offset 0，然后取10前10条数据abase=#? select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;?? ctid?? | n_int | c_id ----------+-------+------?(0,6)??? |?? 100 | 6?(0,2)??? |?? 100 | 2?(0,5)??? |?? 100 | 5?(87,195) |?? 100 | 888?(0,3)??? |?? 100 | 3?(0,1)??? |?? 100 | 1?(0,8)??? |?? 100 | 8?(0,55)?? |?? 100 | 888?(44,12)? |?? 100 | 888?(0,9)??? |?? 100 | 9(10 rows)---获取limit 1001 offset 1，然后取10前10条数据abase=#? select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;?? ctid?? | n_int | c_id ----------+-------+------?(44,5)??? |?? 100 | 5?(0,9)??? |?? 100 | 9?(87,7)??? |?? 100 | 7?(0,4)??? |?? 100 | 4(10 rows)---获取limit 1001 offset 2，然后取10前10条数据abase=# select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;?? ctid?? | n_int | c_id ----------+-------+------?(0,55)?? |?? 100 | 888?(0,9)??? |?? 100 | 9?(0,3)??? |?? 100 | 3?(44,7)??? |?? 100 | 7?(87,4)??? |?? 100 | 4(10 rows)堆排序使用内存： Sort Method: top-N heapsort? Memory: 95kB```当offset从0变成1后，以及变成2后，会发现查询出的10条数据不是有顺序的。### 归并排序```sql--将work_mem设置为64kb让其走归并排序。abase=# set work_mem =‘64kB‘;SETabase=# show work_mem;?work_mem ----------?64kB(1 row)-- external merge? Diskabase=# explain analyze? select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;??????????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????????????? ?---------------------------------------------------------------------------------------------------------------------------?Limit? (cost=2061.33..2061.45 rows=10 width=13) (actual time=27.912..27.916 rows=10 loops=1)?? ->? Limit? (cost=2061.33..2063.83 rows=1001 width=13) (actual time=27.910..27.913 rows=10 loops=1)???????? ->? Sort? (cost=2061.33..2135.72 rows=29757 width=13) (actual time=27.909..27.911 rows=10 loops=1)?????????????? Sort Key: test.n_int?????????????? Sort Method: external merge? Disk: 784kB?????????????? ->? Seq Scan on test? (cost=0.00..429.57 rows=29757 width=13) (actual time=0.024..6.730 rows=29757 loops=1)?Planning time: 0.218 ms?Execution time: 28.358 ms(8 rows)--同堆排序一样，获取limit 1001 offset 0，然后取10前10条数据abase=# select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;? ctid? | n_int | c_id --------+-------+------?(0,1)? |?? 100 | 1?(0,2)? |?? 100 | 2?(0,4)? |?? 100 | 4?(0,8)? |?? 100 | 8?(0,9)? |?? 100 | 9?(0,5)? |?? 100 | 5?(0,3)? |?? 100 | 3?(0,6)? |?? 100 | 6?(0,55) |?? 100 | 888?(0,7)? |?? 100 | 7(10 rows)--同堆排序一样，获取limit 1001 offset 1，然后取10前10条数据abase=# select * from ( select ctid,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;?? ctid?? | n_int | c_id ----------+-------+------?(0,4)??? |?? 100 | 4?(0,195) |?? 100 | 888(10 rows)--同堆排序一样，获取limit 1001 offset 2，然后取10前10条数据abase=# select * from ( select ctid,195) |?? 100 | 888?(44,12)? |?? 100 | 888(10 rows)```减小work_mem使用归并排序的时候，offset从0变成1后以及变成2后，任然有序。### 还有一种情况，那就是在查询前面几页的时候会有重复，但是越往后面翻就不会重复了，现在也可以解释清楚。```sql如果每页10条数据，当offse较小的时候使用的内存较少。当offse不断增大，所耗费的内存也就越多。--设置work_mem =64kbabase=# show work_mem;?work_mem ----------?64kB(1 row)--查询limit 10 offset 10abase=# explain analyze? select * from ( select ctid,c_id from test order by n_int asc limit 10 offset 10) td limit 10;??????????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????????????? ?---------------------------------------------------------------------------------------------------------------------------?Limit? (cost=1221.42..1221.54 rows=10 width=13) (actual time=12.881..12.884 rows=10 loops=1)?? ->? Limit? (cost=1221.42..1221.44 rows=10 width=13) (actual time=12.879..12.881 rows=10 loops=1)???????? ->? Sort? (cost=1221.39..1295.79 rows=29757 width=13) (actual time=12.877..12.879 rows=20 loops=1)?????????????? Sort Key: test.n_int?????????????? Sort Method: top-N heapsort? Memory: 25kB?????????????? ->? Seq Scan on test? (cost=0.00..429.57 rows=29757 width=13) (actual time=0.058..6.363 rows=29757 loops=1)?Planning time: 0.230 ms?Execution time: 12.976 ms(8 rows)--查询limit 10 offset 1000abase=# explain analyze? select * from ( select ctid,c_id from test order by n_int asc limit 10 offset 1000) td limit 10;??????????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????????????? ?---------------------------------------------------------------------------------------------------------------------------?Limit? (cost=2065.75..2065.88 rows=10 width=13) (actual time=27.188..27.192 rows=10 loops=1)?? ->? Limit? (cost=2065.75..2065.78 rows=10 width=13) (actual time=27.186..27.188 rows=10 loops=1)???????? ->? Sort? (cost=2063.25..2137.64 rows=29757 width=13) (actual time=26.940..27.138 rows=1010 loops=1)?????????????? Sort Key: test.n_int?????????????? Sort Method: external merge? Disk: 784kB?????????????? ->? Seq Scan on test? (cost=0.00..429.57 rows=29757 width=13) (actual time=0.026..6.374 rows=29757 loops=1)?Planning time: 0.207 ms?Execution time: 27.718 ms(8 rows)可以看到当offset从10增加到1000的时候，使用的内存增加，排序的方法从堆排序变成了归并排序。而归并排序为稳定排序，所以后面的分页不会再有后一页出现前一页数据的情况。```## 结语1.关于分页重复数据的问题主要是排序字段不唯一并且执行计划走了快速排序和堆排序导致。2.当排序有重复字段，但是如果查询是归并排序，便不会存在有重复数据的问题。3.当用重复字段排序，前面的页重复，随着offset的增大导致work_mem不足以后使用归并排序，就不存在重复的数据了。4.排序和算法的稳定性有关，当执行计划选择不同的排序算法时，返回的结果不一样。5.处理重复数据的常见手段就是，排序的时候选择唯一字段来排序。[参考资料](https://stackoverflow.com/questions/13580826/postgresql-repeating-rows-from-limit-offset)[参考资料](https://www.postgresql.org/docs/8.3/queries-limit.html)

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!