
PostgreSQL pg_bulkload speed test


This article compares the speed of conventional data loading against loading with pg_bulkload:
1. Plain copy into an unlogged table.
2. Plain copy into a logged table.
3. pg_bulkload into an unlogged table.
4. pg_bulkload into a logged table.
The test environment is as follows.
Database build options:

pg93@db-172-16-3-150-> pg_config |grep CONFIG
CONFIGURE = '--prefix=/home/pg93/pgsql9.3.3' '--with-pgport=1922' '--with-perl' '--with-tcl' '--with-python' '--with-openssl' '--with-pam' '--without-ldap' '--with-libxml' '--with-libxslt' '--enable-thread-safety' '--with-wal-blocksize=64' '--with-blocksize=32' '--enable-dtrace' '--enable-debug' '--enable-cassert'

Database configuration:
listen_addresses = '0.0.0.0' # what IP address(es) to listen on;
port = 1921 # (change requires restart)
max_connections = 100 # (change requires restart)
superuser_reserved_connections = 3 # (change requires restart)
unix_socket_directories = '.' # comma-separated list of directories
unix_socket_permissions = 0700 # begin with 0 to use octal notation
tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10 # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10 # TCP_KEEPCNT;
shared_buffers = 4096MB # min 128kB
maintenance_work_mem = 512MB # min 1MB
shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
vacuum_cost_delay = 10 # 0-100 milliseconds
vacuum_cost_limit = 10000 # 1-10000 credits
bgwriter_delay = 10ms # 10-10000ms between rounds
wal_level = hot_standby # minimal,archive,or hot_standby
synchronous_commit = off # synchronization level;
wal_buffers = 16MB # min 32kB,-1 sets based on shared_buffers
wal_writer_delay = 10ms # 1-10000 milliseconds
checkpoint_segments = 512 # in logfile segments,min 1,16MB each
archive_mode = on # allows archiving to be done
archive_command = '/bin/date' # command to use to archive a logfile segment
max_wal_senders = 32 # max number of walsender processes
hot_standby = on # "on" allows queries during recovery
wal_receiver_status_interval = 1s # send replies at least this often
hot_standby_feedback = on # send info from standby to prevent
random_page_cost = 2.0 # same scale as above
effective_cache_size = 96GB
log_destination = 'csvlog' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_truncate_on_rotation = on # If on,an existing log file with the
log_min_messages = log # values in order of decreasing detail:
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose # terse,default,or verbose messages
log_lock_waits = on # log lock waits >= deadlock_timeout
log_statement = 'ddl' # none,ddl,mod,all
log_timezone = 'PRC'
track_activities = on
track_counts = on
track_functions = all # none, pl, all
track_activity_query_size = 1024 # (change requires restart)
autovacuum = off # Enable autovacuum subprocess? 'on'
log_autovacuum_min_duration = 0 # -1 disables,0 logs all actions and
autovacuum_naptime = 3s # time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.0002 # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.0001 # fraction of table size before analyze
datestyle = 'iso,mdy'
timezone = 'PRC'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
pg_stat_statements.max = 1000
pg_stat_statements.track = all

Databases and collation:
psql
psql (9.3.3)
Type "help" for help.
digoal=# \l
                             List of databases
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges
-----------+----------+----------+---------+-------+-----------------------
 digoal    | postgres | UTF8     | C       | C     |
 postgres  | postgres | UTF8     | C       | C     |
 stats     | postgres | UTF8     | C       | C     | =Tc/postgres          +
           |          |          |         |       | postgres=CTc/postgres +
           |          |          |         |       | stats=CTc/postgres
 statsrepo | postgres | UTF8     | C       | C     |
 template0 | postgres | UTF8     | C       | C     | =c/postgres           +
           |          |          |         |       | postgres=CTc/postgres
 template1 | postgres | UTF8     | C       | C     | =c/postgres           +
           |          |          |         |       | postgres=CTc/postgres
(6 rows)

Generate 50 million rows of test data:

digoal=# create table test(id int primary key, info text, crt_time timestamp);
CREATE TABLE
digoal=# insert into test select generate_series(1,50000000),md5(random()::text),clock_timestamp();
INSERT 0 50000000
digoal=# \dt+ test
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+---------+-------------
public | test | table | postgres | 3634 MB |
(1 row)
digoal=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------+-------+----------+-------+---------+-------------
public | test_pkey | index | postgres | test | 1063 MB |
(1 row)
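
A side note on the generator: clock_timestamp() is used rather than now() because now() is frozen at transaction start, so all 50 million rows would carry the same timestamp, while clock_timestamp() keeps advancing during the insert. A one-line check in the same session:

digoal=# select now() = clock_timestamp();  -- returns f: now() is fixed per transaction, clock_timestamp() is not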

Export the table to a CSV file:

digoal=# copy test to '/ssd3/pg93/test.dmp' with (format csv, delimiter ',', null '\N', quote '"', force_quote *);
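
Before reloading, the dump can be sanity-checked from the shell (a minimal check, assuming the same path; with this data every row is a single line):

$ wc -l /ssd3/pg93/test.dmp     # expect 50000000
$ head -n 1 /ssd3/pg93/test.dmp # eyeball delimiter, quoting and null handling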

Test loading with copy:

digoal=# truncate test;
TRUNCATE TABLE
digoal=# \timing
Timing is on.
Load the data with copy:
digoal=# copy test from '/ssd3/pg93/test.dmp' with (format csv, quote '"');
COPY 50000000
Time: 411245.879 ms

Change to an unlogged table and retest:
digoal=# update pg_class set relpersistence='u' where relname='test';
UPDATE 1
digoal=# update pg_class set relpersistence='u' where relname='test_pkey';
UPDATE 1
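Flipping relpersistence directly in pg_catalog is a test-bench shortcut: it only rewrites the catalog flag, not the underlying files. For reference, PostgreSQL 9.5 and later provide a supported command for this (not available on the 9.3 instance used here), which properly rewrites the table and its indexes:

digoal=# alter table test set unlogged;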
Load the data with copy:
COPY 50000000
Time: 363699.466 ms
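
In round numbers that is 50,000,000 rows / 411.2 s ≈ 121,600 rows/s into the logged table versus 50,000,000 / 363.7 s ≈ 137,500 rows/s into the unlogged one. Skipping WAL for the table data saves only about 12% here, so most of the cost must lie in index maintenance and the shared buffer path, which is exactly what pg_bulkload attacks next.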

Next, let's use pg_bulkload, which bypasses the shared buffer, to load the data.
First, install pg_bulkload:
wget http://pgfoundry.org/frs/download.php/3566/pg_bulkload-3.1.5.tar.gz
[root@db-172-16-3-150 ~]# export PATH=/home/pg93/pgsql9.3.3/bin:$PATH
[root@db-172-16-3-150 ~]# cd /opt/soft_bak/pg_bulkload-3.1.5
[root@db-172-16-3-150 pg_bulkload-3.1.5]# which pg_config
/home/pg93/pgsql9.3.3/bin/pg_config
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make install
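
A quick way to confirm the client binary was installed into the intended tree:

[root@db-172-16-3-150 pg_bulkload-3.1.5]# which pg_bulkload   # should resolve under /home/pg93/pgsql9.3.3/bin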


Truncate the test table, create the pg_bulkload extension, and switch the table back to logged:
psql
psql (9.3.3)
Type "help" for help.
digoal=# truncate test;
TRUNCATE TABLE
digoal=# create extension pg_bulkload;
CREATE EXTENSION
digoal=# update pg_class set relpersistence='p' where relname='test_pkey';
UPDATE 1
digoal=# update pg_class set relpersistence='p' where relname='test';
UPDATE 1
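
A quick check that the extension is in place (same session):

digoal=# \dx pg_bulkload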

Restart the database using the postgresql script shipped with pg_bulkload:
postgresql stop -m fast
waiting for server to shut down..... done
server stopped
postgresql start
server starting

Note that pg_bulkload connects to the /tmp unix socket by default. If the server listens on a different socket directory, either switch it to /tmp or add /tmp to the listened directories; otherwise the parallel writer cannot connect. (A -h value beginning with / is interpreted by libpq as a unix socket directory, which is why -h $PGDATA works here with unix_socket_directories = '.'.)
$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE
NOTICE: BULK LOAD START
ERROR: query failed: ERROR:  could not establish connection to parallel writer
DETAIL:  could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.1921"?
HINT: Refer to the following if it is an authentication error.
	Specify an authentication method that does not need a password in pg_hba.conf (e.g. trust or ident), or put the password in the password file of the operating system user who runs the PostgreSQL server. If neither is possible, specify WRITER=DIRECT.
query was: SELECT * FROM pg_bulkload($1)

Add /tmp to the unix socket directories, then restart with the postgresql script provided by pg_bulkload:
vi $PGDATA/postgresql.conf
unix_socket_directories = '.,/tmp'
postgresql restart -m fast
.... done
server stopped
server starting

Run pg_bulkload again:

$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE

While the load runs, a new pg_bulkload directory appears under $PGDATA; it holds the status of the load in progress. If the load aborts, restarting the database with the postgresql script repairs it automatically; alternatively, repair with pg_bulkload before starting the database with pg_ctl.
cd $PGDATA
ll pg_bulkload/
total 4.0K
-rw------- 1 pg93 pg93 512 Mar 28 09:36 16384.34315.loadstatus
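
If the database is instead started with pg_ctl, run pg_bulkload's recovery mode first (flags per the pg_bulkload 3.1 documentation; -D defaults to $PGDATA):

$ pg_bulkload -r -D $PGDATA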

The log:

[root@db-172-16-3-150 pg93]# cat test.log

pg_bulkload 3.1.5 on 2014-03-28 13:32:31.32559+08

INPUT = /ssd3/pg93/test.dmp
PARSE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133231_digoal_public_test.prs.csv
LOGFILE = /ssd3/pg93/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133231_digoal_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO

  0 Rows skipped.
  50000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2014-03-28 13:32:31.32559+08
Run ended on 2014-03-28 13:35:13.019018+08

CPU 1.55s/128.55u sec elapsed 161.69 sec

With pg_bulkload in DIRECT + MULTI_PROCESS (i.e. parallel) mode, the load took 161 seconds in total, more than twice as fast as the 411 seconds of a plain copy into the logged table.
Change to an unlogged table and retest with pg_bulkload:
digoal=# update pg_class set relpersistence='u' where relname='test';
UPDATE 1
digoal=# update pg_class set relpersistence='u' where relname='test_pkey';
UPDATE 1
digoal=# truncate test;
TRUNCATE TABLE
digoal=# checkpoint;
CHECKPOINT
$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE
pg_bulkload 3.1.5 on 2014-03-28 13:36:15.602787+08

INPUT = /ssd3/pg93/test.dmp
PARSE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133615_digoal_public_test.prs.csv
LOGFILE = /ssd3/pg93/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133615_digoal_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO

  0 Rows skipped.
  50000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2014-03-28 13:36:15.602787+08
Run ended on 2014-03-28 13:38:57.506558+08

CPU 2.26s/129.23u sec elapsed 161.90 sec

The load again took 161 seconds in total, more than twice as fast as the 363 seconds of a plain copy into the unlogged table.
Because the shared buffer is bypassed, pg_bulkload performs the same whether the target table is logged or unlogged.
Finally, the result of DIRECT mode without MULTI_PROCESS: 256 seconds, still faster than the 363-second copy.
"WRITER=DIRECT"d $PGDATABASE
pg_bulkload 3.15 on 4110.93457820140328134110499999984527.00794110.68s243.64u256.07 sec
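
To summarize the four scenarios (50 million rows, same machine and data file):

            method             | target table | time (s) | ~rows/s
-------------------------------+--------------+----------+---------
 copy                          | logged       |      411 | 121,600
 copy                          | unlogged     |      364 | 137,500
 pg_bulkload DIRECT + PARALLEL | logged       |      162 | 309,000
 pg_bulkload DIRECT + PARALLEL | unlogged     |      162 | 309,000
 pg_bulkload DIRECT only       | unlogged     |      256 | 195,300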
http://blog.163.com/digoal@126/blog/static/163877040201422883734398/
