asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tyson Condie" <>
Subject RE: Time of Multiple Joins in AsterixDB
Date Wed, 21 Dec 2016 00:21:02 GMT
Mingda: can you check again please. I’m not seeing the plots.




From: mingda li [] 
Sent: Tuesday, December 20, 2016 4:16 PM
Cc: Michael Carey <>; Tyson Condie <>
Subject: Re: Time of Multiple Joins in AsterixDB


Sorry for the wrong version of cc.conf. I convert it to pdf version as attachment.


On Tue, Dec 20, 2016 at 4:06 PM, mingda li < <>
> wrote:

Dear all,


I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple joins to see if there
is a big difference with different join order. This is the reason for our research on multiple
join and the result will apppear in our paper which is to be submitted to VLDB soon. Could
you help us to make sure that the test results make sense for AsterixDB?


We configure the AsterixDB 0.8.9 ( use asterix-server-0.8.9-SNAPSHOT-binary-assembly) in our
cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads per
core), 32GB of RAM and 1TB of disk capacity. The operating system is 64-bit Ubuntu 12.04.
JDK version 1.8.0. During configuration, I follow the NCService instruction here
And I set the cc.conf as in attachment. (Each node work as nc and the first node also work
as cc). 


For experiment, we use 3 fact tables from TPC-DS: inventory; catalog_sales; catalog_returns
with TPC-DS scale factor 1g and 10g. The multiple join query we use in AsterixDB are as following:


Good Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN catalog_returns

 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = cr1.cr_item_sk))  m1 JOIN
inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;


Bad Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN inventory i1 ON
cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number
AND cs1.cs_item_sk = cr1.cr_item_sk);


We load the data to AsterixDB firstly and run the two different queries. (The complete version
of all queries for AsterixDB is in attachment)  We assume the data has already been stored
in AsterixDB and only count the time for multiple join. 


Meanwhile, we use the same dataset and query to test Spark, Pig and Hive. The result is shown
in the attachment's figure. And you can find AsterixDB's time is always better than others
 no matter good or bad order:-) (BTW, the y scale of figure is time in log scale. You can
see the time by the label of each bar.)


Thanks for your help.







  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message