PostgreSQL TPC-H测试
When benchmarking PostgreSQL database,pgbenchis probably the first choice. But the default pgbench transactions are rather OLTP-like (it's a TCP-B-like stress test) so it's not very usable when you need to test a DSS/DWH-like workload. I needed to do that recently,and I wasn't very happy with the available tools. Luckily,the TPC Council providesDBGEN,a tool that makes it quite easy to generate data and scripts - and it's not very difficult to make that work with PostgreSQL. So let's see how to make it work ..
Tweet
I've found several tools that claim to implement TPC-H like benchmark,but I've found them unusable for various reasons. For exampleDBT-3is a bit old (last update in 2005) and the dbgen command keeps failing for strange reasons. The tool from TPC Council works quite well,and although it does not support PostgreSQL out of the box,it's not very difficult to make it work. You can get it right at theTPC-H page(see the right column,below the TPC-H specification). Download the tgz package (37MB). compilingAfter extracting it,you have to prepare a Makefile - enter the dbgen directory,copy the makefile.suite and modify those four variables (about line 109): CC=gcc DATABASE=ORACLE MACHINE=LINUX WORKLOAD=TPCH If you're not running Linux,set the MACHINE accordingly. Then just build the tool using make. generating dataSo now we have a "dbgen" tool that generates data in a CSV format. Just like pgbench it has a scaling factor that influences the amount of data generater - the default value (1) means about 1GB of raw data,i.e. about 2GB of data after populating the database. So let's create 10GB of raw data: $ ./dbgen -s 10 That gives us eight .tbl files with a CSV format,each containing data for one table. The problem is each row contains an extra "|" separator at the end of the line,so PostgreSQL fails to load that. But it's quite easily fixable with sed - just run this: for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done; Now we have eight CSV files that may be loaded into the database. But we have to create it first. populating the databaseAlthough theTPC-H specificationdescribes the database structure,the create scripts are not part of the package. I've prepared acreate scriptthat creates all the tables and analter scriptthat creates the foreign keys (after the database is populated). So create a database and create the tables $ createdb tpch $ psql tpch < tpch-create.sql Now it's time to populate the database with generated data. Don't forget theCOPYrequires absolute paths,so you may use a script like this dir=`pwd` opts="-h localhost pgbench" psql $opts -c "COPY part FROM '$dir/part.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY region FROM '$dir/region.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY nation FROM '$dir/nation.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY supplier FROM '$dir/supplier.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY customer FROM '$dir/customer.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY partsupp FROM '$dir/partsupp.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY orders FROM '$dir/orders.csv' WITH (FORMAT csv,DELIMITER '|')" psql $opts -c "COPY lineitem FROM '$dir/lineitem.csv' WITH (FORMAT csv,DELIMITER '|')" and finally create the foreign keys $ psql tpch < tpch-alter.sql So now we have a database populated with data,so let's prepare some queries. generating queriesTPC-H describes 22 queries (or rather templates),stored in the "queries" directory. Then there is a "qgen" tool that generates queries from the templates (mostly fill them with random data). We have to fix some minor problems before running it. First,most of the queries uses a ROWCOUNT (or other db-specific variant) instead of LIMIT. This is quite easy to fix. Second,there are about five queries that use correlated aggregate subqueries with huge outer tables - that does not work very well in PostgreSQL,but it's easy to rewrite to a JOIN. So intead of query like this select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem,part where p_partkey = l_partkey and p_brand = ':1' and p_container = ':2' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey ) LIMIT 1; you get a query like this that produces the same result and is much faster (actually the first query never finished for me). I'm not quite sure how strict the TPC-H is about the query format,but my goal was not to to a fully compliant TPC-H benchmark,it's rather a good starting point to do a DSS benchmark. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |