asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mingda li <limingda1...@gmail.com>
Subject Time of Multiple Joins in AsterixDB
Date Wed, 21 Dec 2016 00:06:38 GMT
Dear all,

I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple
joins to see if there is a big difference with different join order. This
is the reason for our research on multiple join and the result will apppear
in our paper which is to be submitted to VLDB soon. Could you help us to
make sure that the test results make sense for AsterixDB?

We configure the AsterixDB 0.8.9 ( use
asterix-server-0.8.9-SNAPSHOT-binary-assembly)
in our cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores
and 2 hyper-threads per core), 32GB of RAM and 1TB of disk capacity. The
operating system is 64-bit Ubuntu 12.04. JDK version 1.8.0. During
configuration, I follow the NCService instruction here
https://ci.apache.org/projects/asterixdb/ncservice.html. And I set the
cc.conf as in attachment. (Each node work as nc and the first node also
work as cc).

For experiment, we use 3 fact tables from TPC-DS: inventory; catalog_sales;
catalog_returns with TPC-DS scale factor 1g and 10g. The multiple join
query we use in AsterixDB are as following:

Good Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
JOIN catalog_returns cr1*
* ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk =
cr1.cr_item_sk))  m1 JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;*

Bad Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN
inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN catalog_returns
cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk =
cr1.cr_item_sk);*

We load the data to AsterixDB firstly and run the two different queries.
(The complete version of all queries for AsterixDB is in attachment)  We
assume the data has already been stored in AsterixDB and only count the
time for multiple join.

Meanwhile, we use the same dataset and query to test Spark, Pig and Hive.
The result is shown in the attachment's figure. And you can find
AsterixDB's time is always better than others  no matter good or bad
order:-) (BTW, the y scale of figure is time in log scale. You can see the
time by the label of each bar.)

Thanks for your help.

Bests,
Mingda

Mime
View raw message