
PostgreSQL pg_bulkload speed test


This article compares the speed of conventional data loading against loading with pg_bulkload:
1. Plain copy into an unlogged table.
2. Plain copy into a logged table.
3. pg_bulkload into an unlogged table.
4. pg_bulkload into a logged table.
The test environment is as follows.
Database build options:

pg93@db-172-16-3-150-> pg_config |grep CONFIG
CONFIGURE = '--prefix=/home/pg93/pgsql9.3.3' '--with-pgport=1922' '--with-perl' '--with-tcl' '--with-python' '--with-openssl' '--with-pam' '--without-ldap' '--with-libxml' '--with-libxslt' '--enable-thread-safety' '--with-wal-blocksize=64' '--with-blocksize=32' '--enable-dtrace' '--enable-debug' '--enable-cassert'

Database configuration:
listen_addresses = '0.0.0.0' # what IP address(es) to listen on;
port = 1921 # (change requires restart)
max_connections = 100 # (change requires restart)
superuser_reserved_connections = 3 # (change requires restart)
unix_socket_directories = '.' # comma-separated list of directories
unix_socket_permissions = 0700 # begin with 0 to use octal notation
tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10 # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10 # TCP_KEEPCNT;
shared_buffers = 4096MB # min 128kB
maintenance_work_mem = 512MB # min 1MB
shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
vacuum_cost_delay = 10 # 0-100 milliseconds
vacuum_cost_limit = 10000 # 1-10000 credits
bgwriter_delay = 10ms # 10-10000ms between rounds
wal_level = hot_standby # minimal,archive,or hot_standby
synchronous_commit = off # synchronization level;
wal_buffers = 16MB # min 32kB,-1 sets based on shared_buffers
wal_writer_delay = 10ms # 1-10000 milliseconds
checkpoint_segments = 512 # in logfile segments,min 1,16MB each
archive_mode = on # allows archiving to be done
archive_command = '/bin/date' # command to use to archive a logfile segment
max_wal_senders = 32 # max number of walsender processes
hot_standby = on # "on" allows queries during recovery
wal_receiver_status_interval = 1s # send replies at least this often
hot_standby_feedback = on # send info from standby to prevent
random_page_cost = 2.0 # same scale as above
effective_cache_size = 96GB
log_destination = 'csvlog' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_truncate_on_rotation = on # If on,an existing log file with the
log_min_messages = log # values in order of decreasing detail:
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose # terse,default,or verbose messages
log_lock_waits = on # log lock waits >= deadlock_timeout
log_statement = 'ddl' # none,ddl,mod,all
log_timezone = 'PRC'
track_activities = on
track_counts = on
track_functions = all # none, pl, all
track_activity_query_size = 1024 # (change requires restart)
autovacuum = off # Enable autovacuum subprocess? 'on'
log_autovacuum_min_duration = 0 # -1 disables,0 logs all actions and
autovacuum_naptime = 3s # time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.0002 # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.0001 # fraction of table size before analyze
datestyle = 'iso,mdy'
timezone = 'PRC'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
pg_stat_statements.max = 1000
pg_stat_statements.track = all

Databases and collation:
psql
psql (9.3.3)
Type "help" for help.
digoal=# \l
                             List of databases
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges
-----------+----------+----------+---------+-------+-----------------------
 digoal    | postgres | UTF8     | C       | C     |
 postgres  | postgres | UTF8     | C       | C     |
 stats     | postgres | UTF8     | C       | C     | =Tc/postgres          +
           |          |          |         |       | postgres=CTc/postgres +
           |          |          |         |       | stats=CTc/postgres
 statsrepo | postgres | UTF8     | C       | C     |
 template0 | postgres | UTF8     | C       | C     | =c/postgres           +
           |          |          |         |       | postgres=CTc/postgres
 template1 | postgres | UTF8     | C       | C     | =c/postgres           +
           |          |          |         |       | postgres=CTc/postgres
(6 rows)

Generate 50 million rows of test data:

digoal=# create table test(id int primary key, info text, crt_time timestamp);
CREATE TABLE
digoal=# insert into test select generate_series(1,50000000),md5(random()::text),clock_timestamp();
INSERT 0 50000000
digoal=# \dt+ test
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+---------+-------------
public | test | table | postgres | 3634 MB |
(1 row)
digoal=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------+-------+----------+-------+---------+-------------
public | test_pkey | index | postgres | test | 1063 MB |
(1 row)
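
A side note on the generator: clock_timestamp() is used rather than now() because now() is frozen at transaction start, so all 50 million rows would carry the same timestamp, while clock_timestamp() keeps advancing during the insert. A one-line check in the same session:

digoal=# select now() = clock_timestamp();  -- returns f: now() is fixed per transaction, clock_timestamp() is not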

Export the table to a CSV file:

digoal=# copy test to '/ssd3/pg93/test.dmp' with (format csv, delimiter ',', null '\N', quote '"', force_quote *);
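
Before reloading, the dump can be sanity-checked from the shell (a minimal check, assuming the same path; with this data every row is a single line):

$ wc -l /ssd3/pg93/test.dmp     # expect 50000000
$ head -n 1 /ssd3/pg93/test.dmp # eyeball delimiter, quoting and null handling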

Test loading with copy:

digoal=# truncate test;
TRUNCATE TABLE
digoal=# \timing
Timing is on.
Load the data with copy:
digoal=# copy test from '/ssd3/pg93/test.dmp' with (format csv, quote '"');
COPY 50000000
Time: 411245.879 ms

Change to an unlogged table and retest:
digoal=# update pg_class set relpersistence='u' where relname='test';
UPDATE 1
digoal=# update pg_class set relpersistence='u' where relname='test_pkey';
UPDATE 1
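Flipping relpersistence directly in pg_catalog is a test-bench shortcut: it only rewrites the catalog flag, not the underlying files. For reference, PostgreSQL 9.5 and later provide a supported command for this (not available on the 9.3 instance used here), which properly rewrites the table and its indexes:

digoal=# alter table test set unlogged;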
Load the data with copy:
COPY 50000000
Time: 363699.466 ms
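
In round numbers that is 50,000,000 rows / 411.2 s ≈ 121,600 rows/s into the logged table versus 50,000,000 / 363.7 s ≈ 137,500 rows/s into the unlogged one. Skipping WAL for the table data saves only about 12% here, so most of the cost must lie in index maintenance and the shared buffer path, which is exactly what pg_bulkload attacks next.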

Next, let's use pg_bulkload, which bypasses the shared buffer, to load the data.
First, install pg_bulkload:
wget http://pgfoundry.org/frs/download.php/3566/pg_bulkload-3.1.5.tar.gz
[root@db-172-16-3-150 ~]# export PATH=/home/pg93/pgsql9.3.3/bin:$PATH
[root@db-172-16-3-150 ~]# cd /opt/soft_bak/pg_bulkload-3.1.5
[root@db-172-16-3-150 pg_bulkload-3.1.5]# which pg_config
/home/pg93/pgsql9.3.3/bin/pg_config
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make
[root@db-172-16-3-150 pg_bulkload-3.1.5]# make install
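
A quick way to confirm the client binary was installed into the intended tree:

[root@db-172-16-3-150 pg_bulkload-3.1.5]# which pg_bulkload   # should resolve under /home/pg93/pgsql9.3.3/bin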


Truncate the test table, create the pg_bulkload extension, and switch the table back to logged:
psql
psql (9.3.3)
Type "help" for help.
digoal=# truncate test;
TRUNCATE TABLE
digoal=# create extension pg_bulkload;
CREATE EXTENSION
digoal=# update pg_class set relpersistence='p' where relname='test_pkey';
UPDATE 1
digoal=# update pg_class set relpersistence='p' where relname='test';
UPDATE 1
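
A quick check that the extension is in place (same session):

digoal=# \dx pg_bulkload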

Restart the database using the postgresql script shipped with pg_bulkload:
postgresql stop -m fast
waiting for server to shut down..... done
server stopped
postgresql start
server starting

Note that pg_bulkload connects to the /tmp unix socket by default. If the server listens on a different socket directory, either switch it to /tmp or add /tmp to the listened directories; otherwise the parallel writer cannot connect. (A -h value beginning with / is interpreted by libpq as a unix socket directory, which is why -h $PGDATA works here with unix_socket_directories = '.'.)
$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE
NOTICE: BULK LOAD START
ERROR: query failed: ERROR:  could not establish connection to parallel writer
DETAIL:  could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.1921"?
HINT: Refer to the following if it is an authentication error.
	Specify an authentication method that does not need a password in pg_hba.conf (e.g. trust or ident), or put the password in the password file of the operating system user who runs the PostgreSQL server. If neither is possible, specify WRITER=DIRECT.
query was: SELECT * FROM pg_bulkload($1)

Add /tmp to the unix socket directories, then restart with the postgresql script provided by pg_bulkload:
vi $PGDATA/postgresql.conf
unix_socket_directories = '.,/tmp'
postgresql restart -m fast
.... done
server stopped
server starting

Run pg_bulkload again:

$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE

While the load runs, a new pg_bulkload directory appears under $PGDATA; it holds the status of the load in progress. If the load aborts, restarting the database with the postgresql script repairs it automatically; alternatively, repair with pg_bulkload before starting the database with pg_ctl.
cd $PGDATA
ll pg_bulkload/
total 4.0K
-rw------- 1 pg93 pg93 512 Mar 28 09:36 16384.34315.loadstatus
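
If the database is instead started with pg_ctl, run pg_bulkload's recovery mode first (flags per the pg_bulkload 3.1 documentation; -D defaults to $PGDATA):

$ pg_bulkload -r -D $PGDATA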

The log:

[root@db-172-16-3-150 pg93]# cat test.log

pg_bulkload 3.1.5 on 2014-03-28 13:32:31.32559+08

INPUT = /ssd3/pg93/test.dmp
PARSE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133231_digoal_public_test.prs.csv
LOGFILE = /ssd3/pg93/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133231_digoal_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO

  0 Rows skipped.
  50000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2014-03-28 13:32:31.32559+08
Run ended on 2014-03-28 13:35:13.019018+08

CPU 1.55s/128.55u sec elapsed 161.69 sec

With pg_bulkload in DIRECT + MULTI_PROCESS (i.e. parallel) mode, the load took 161 seconds in total, more than twice as fast as the 411 seconds of a plain copy into the logged table.
Change to an unlogged table and retest with pg_bulkload:
digoal=# update pg_class set relpersistence='u' where relname='test';
UPDATE 1
digoal=# update pg_class set relpersistence='u' where relname='test_pkey';
UPDATE 1
digoal=# truncate test;
TRUNCATE TABLE
digoal=# checkpoint;
CHECKPOINT
$ pg_bulkload -i /ssd3/pg93/test.dmp -O test -l /ssd3/pg93/test.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -h $PGDATA -p $PGPORT -d $PGDATABASE
pg_bulkload 3.1.5 on 2014-03-28 13:36:15.602787+08

INPUT = /ssd3/pg93/test.dmp
PARSE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133615_digoal_public_test.prs.csv
LOGFILE = /ssd3/pg93/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /ssd4/pg_root/pg_bulkload/20140328133615_digoal_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO

  0 Rows skipped.
  50000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2014-03-28 13:36:15.602787+08
Run ended on 2014-03-28 13:38:57.506558+08

CPU 2.26s/129.23u sec elapsed 161.90 sec

The load again took 161 seconds in total, more than twice as fast as the 363 seconds of a plain copy into the unlogged table.
Because the shared buffer is bypassed, pg_bulkload performs the same whether the target table is logged or unlogged.
Finally, the result of DIRECT mode without MULTI_PROCESS: 256 seconds, still faster than the 363-second copy.
"WRITER=DIRECT"d $PGDATABASE
pg_bulkload 3.15 on 4110.93457820140328134110499999984527.00794110.68s243.64u256.07 sec
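
To summarize the four scenarios (50 million rows, same machine and data file):

            method             | target table | time (s) | ~rows/s
-------------------------------+--------------+----------+---------
 copy                          | logged       |      411 | 121,600
 copy                          | unlogged     |      364 | 137,500
 pg_bulkload DIRECT + PARALLEL | logged       |      162 | 309,000
 pg_bulkload DIRECT + PARALLEL | unlogged     |      162 | 309,000
 pg_bulkload DIRECT only       | unlogged     |      256 | 195,300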
http://blog.163.com/digoal@126/blog/static/163877040201422883734398/
