Configuring PostgreSQL for read performance
Our system writes a lot of data (a kind of Big Data system). Write performance is good enough for our needs, but read performance is far too slow.
The primary key (constraint) has a similar structure in all our tables: timestamp (Timestamp); index (smallint); key (integer).

A table can hold millions or even billions of rows, and a read request usually targets a specific period (timestamp / index) and tag. A query that returns around 200k rows is common. Currently we can read about 15k rows per second, but we need to be 10 times faster. Is this possible, and if so, how?

Note: PostgreSQL is packaged with our software, so the hardware differs from one client to another.

The machine is a VM used for testing. The VM's host is Windows Server 2008 R2 x64 with 24.0 GB of RAM.

Server spec (VMware virtual machine)

    Server 2008 R2 x64
    2.00 GB of memory
    Intel Xeon W3520 @ 2.67GHz (2 cores)

postgresql.conf optimisations

    shared_buffers = 512MB (default: 32MB)
    effective_cache_size = 1024MB (default: 128MB)
    checkpoint_segments = 32 (default: 3)
    checkpoint_completion_target = 0.9 (default: 0.5)
    default_statistics_target = 1000 (default: 100)
    work_mem = 100MB (default: 1MB)
    maintenance_work_mem = 256MB (default: 16MB)

Table definition

    CREATE TABLE "AnalogTransition"
    (
      "KeyTag" integer NOT NULL,
      "Timestamp" timestamp with time zone NOT NULL,
      "TimestampQuality" smallint,
      "TimestampIndex" smallint NOT NULL,
      "Value" numeric,
      "Quality" boolean,
      "QualityFlags" smallint,
      "UpdateTimestamp" timestamp without time zone, -- (UTC)
      CONSTRAINT "PK_AnalogTransition" PRIMARY KEY ("Timestamp", "TimestampIndex", "KeyTag"),
      CONSTRAINT "FK_AnalogTransition_Tag" FOREIGN KEY ("KeyTag")
          REFERENCES "Tag" ("Key") MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION
    )
    WITH (
      OIDS=FALSE,
      autovacuum_enabled=true
    );

Query

The query takes about 30 seconds to execute in pgAdmin3, but we would like the same result in under 5 seconds, if possible.
    SELECT "AnalogTransition"."KeyTag",
           "AnalogTransition"."Timestamp" AT TIME ZONE 'UTC',
           "AnalogTransition"."TimestampQuality",
           "AnalogTransition"."TimestampIndex",
           "AnalogTransition"."Value",
           "AnalogTransition"."Quality",
           "AnalogTransition"."QualityFlags",
           "AnalogTransition"."UpdateTimestamp"
    FROM "AnalogTransition"
    WHERE "AnalogTransition"."Timestamp" >= '2013-05-16 00:00:00.000'
      AND "AnalogTransition"."Timestamp" <= '2013-05-17 00:00:00.00'
      AND ("AnalogTransition"."KeyTag" = 56
        OR "AnalogTransition"."KeyTag" = 57
        OR "AnalogTransition"."KeyTag" = 58
        OR "AnalogTransition"."KeyTag" = 59
        OR "AnalogTransition"."KeyTag" = 60)
    ORDER BY "AnalogTransition"."Timestamp" DESC, "AnalogTransition"."TimestampIndex" DESC
    LIMIT 500000;

Explain 1

    "Limit  (cost=0.00..125668.31 rows=500000 width=33) (actual time=2.193..3241.319 rows=500000 loops=1)"
    "  Buffers: shared hit=190147"
    "  ->  Index Scan Backward using "PK_AnalogTransition" on "AnalogTransition"  (cost=0.00..389244.53 rows=1548698 width=33) (actual time=2.187..1893.283 rows=500000 loops=1)"
    "        Index Cond: (("Timestamp" >= '2013-05-16 01:00:00-04'::timestamp with time zone) AND ("Timestamp" <= '2013-05-16 15:00:00-04'::timestamp with time zone))"
    "        Filter: (("KeyTag" = 56) OR ("KeyTag" = 57) OR ("KeyTag" = 58) OR ("KeyTag" = 59) OR ("KeyTag" = 60))"
    "        Buffers: shared hit=190147"
    "Total runtime: 3863.028 ms"

Explain 2

In my latest test, it took 7 minutes to select my data! See below:

    "Limit  (cost=0.00..313554.08 rows=250001 width=35) (actual time=0.040..410721.033 rows=250001 loops=1)"
    "  ->  Index Scan using "PK_AnalogTransition" on "AnalogTransition"  (cost=0.00..971400.46 rows=774511 width=35) (actual time=0.037..410088.960 rows=250001 loops=1)"
    "        Index Cond: (("Timestamp" >= '2013-05-22 20:00:00-04'::timestamp with time zone) AND ("Timestamp" <= '2013-05-24 20:00:00-04'::timestamp with time zone) AND ("KeyTag" = 16))"
    "Total runtime: 411044.175 ms"
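Plans like those above can be captured by prefixing the query with EXPLAIN (ANALYZE); adding BUFFERS yields the Buffers lines seen in the first plan. The following is only a minimal sketch, assuming the same table: the column list is shortened for readability, and the KeyTag conditions are written with IN, which is equivalent to the OR chain in the query. Keep in mind that ANALYZE actually executes the statement.

```sql
-- Sketch: capture the actual plan, timings and buffer usage for the read query.
-- Assumption: shortened column list; IN (...) replaces the equivalent OR chain.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "KeyTag",
       "Timestamp" AT TIME ZONE 'UTC',
       "TimestampIndex",
       "Value"
FROM   "AnalogTransition"
WHERE  "Timestamp" >= '2013-05-16 00:00:00.000'
AND    "Timestamp" <= '2013-05-17 00:00:00.00'
AND    "KeyTag" IN (56, 57, 58, 59, 60)
ORDER  BY "Timestamp" DESC, "TimestampIndex" DESC
LIMIT  500000;
```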
Data alignment and storage size
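Actually, the overhead per tuple is 24 bytes for the tuple header plus 4 bytes for the item pointer. See this related answer:

> Use GIN to index bit strings

Basics of data alignment and padding are covered in this related answer on SO:

> Calculating and saving space in PostgreSQL

We have three columns for the primary key:

    PRIMARY KEY ("Timestamp", "TimestampIndex", "KeyTag")

    "Timestamp"      timestamp (8 bytes)
    "TimestampIndex" smallint  (2 bytes)
    "KeyTag"         integer   (4 bytes)

This results in:

     4 bytes item pointer in the page header (not counting towards multiple of 8 bytes)
    ---
    23 bytes for the tuple header
     1 byte  padding for data alignment (or NULL bitmap)
     8 bytes "Timestamp"
     2 bytes "TimestampIndex"
     2 bytes padding for data alignment
     4 bytes "KeyTag"
     0 padding to the nearest multiple of 8 bytes
    -----
    44 bytes per tuple

More about measuring object size in this related answer:

> Measure the size of a PostgreSQL table row

Order of columns in a multicolumn index

For background, read this related question and its answers:

> Is a composite index also good for queries on the first field?

The way you have your index (primary key), you can retrieve rows without a sort step, which is appealing, especially with LIMIT. But retrieving the rows seems to be extremely expensive.

Generally, in a multicolumn index, "equality" columns should go first and "range" columns last:

> Multicolumn index and performance

Therefore, try an additional index with reversed column order:

    CREATE INDEX analogransition_mult_idx1
    ON "AnalogTransition" ("KeyTag", "TimestampIndex", "Timestamp");

It depends on data distribution. But with millions of rows, this might be substantially faster.

Tuple size is 8 bytes bigger due to data alignment and padding. If you are using this as a plain index, you might try dropping the third column "Timestamp". That may or may not be a bit faster (since the column might help with sorting).

You might want to keep both indexes. Depending on a number of factors, your original index may be preferable, in particular with a small LIMIT.

autovacuum and table statistics

Your table statistics need to be up to date. I am sure you have autovacuum running.

Since your table seems to be huge and statistics are important for the right query plan, I would substantially increase the statistics target for relevant columns:

    ALTER TABLE "AnalogTransition" ALTER "Timestamp" SET STATISTICS 1000;

... or even higher with billions of rows. The maximum is 10000, the default is 100.

Do that for all columns involved in WHERE or ORDER BY clauses. Then run ANALYZE.

Table layout

While you are at it, if you apply what you have learned about data alignment and padding, this optimized table layout should save some disk space and help performance a little (ignoring pk & fk):

    CREATE TABLE "AnalogTransition"(
      "Timestamp" timestamp with time zone NOT NULL,
      "KeyTag" integer NOT NULL,
      "TimestampIndex" smallint NOT NULL,
      "TimestampQuality" smallint,
      "UpdateTimestamp" timestamp without time zone, -- (UTC)
      "QualityFlags" smallint,
      "Quality" boolean,
      "Value" numeric
    );

CLUSTER / pg_repack

To optimize read performance for queries that use a certain index (be it your original one or my suggested alternative), you can rewrite the table in the physical order of the index. CLUSTER does that, but it takes an exclusive lock on the table for the duration of the operation; pg_repack can achieve the same result without holding an exclusive lock the whole time.

Memory

Generally, 2 GB of physical RAM is just not enough to deal with billions of rows quickly. More RAM might go a long way, accompanied by adapted settings: obviously a bigger effective_cache_size to begin with.

(Editor: 李大同)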
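The statistics and clustering advice in this answer can be put together as a short maintenance script. This is a minimal sketch, assuming the table name from the question and that the alternative index analogransition_mult_idx1 has already been created; the target of 1000 is simply the value suggested in the answer, not a tuned figure. CLUSTER holds an exclusive lock on the table while it runs, so it belongs in a maintenance window.

```sql
-- Raise the per-column statistics target for every column that appears
-- in the WHERE or ORDER BY clauses of the read query.
ALTER TABLE "AnalogTransition" ALTER COLUMN "Timestamp"      SET STATISTICS 1000;
ALTER TABLE "AnalogTransition" ALTER COLUMN "TimestampIndex" SET STATISTICS 1000;
ALTER TABLE "AnalogTransition" ALTER COLUMN "KeyTag"         SET STATISTICS 1000;

-- Collect fresh statistics using the new targets.
ANALYZE "AnalogTransition";

-- Rewrite the table in the physical order of the chosen index.
-- Assumption: analogransition_mult_idx1 exists; this takes an exclusive lock.
CLUSTER "AnalogTransition" USING analogransition_mult_idx1;

-- CLUSTER changes the physical row order, so refresh statistics afterwards.
ANALYZE "AnalogTransition";
```

CLUSTER is a one-time rewrite: rows inserted later are not kept in index order, so the step would need to be repeated periodically on a table that keeps growing.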