asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tyson Condie" <tcondie.u...@gmail.com>
Subject RE: Time of Multiple Joins in AsterixDB
Date Wed, 21 Dec 2016 00:40:53 GMT
Mingda: Please also share the numbers for 100GB, which show AsterixDB not quite doing as well
as Spark. These 100GB results will not be in our submission version, since they’re not needed
for the desired message: picking the right join order matters. Nevertheless, I’d like to
get a better understanding of what’s going on in the larger dataset regime. 

 

-Tyson

 

From: Yingyi Bu [mailto:buyingyi@gmail.com] 
Sent: Tuesday, December 20, 2016 4:30 PM
To: dev@asterixdb.apache.org
Cc: Michael Carey <mjcarey@ics.uci.edu>; Tyson Condie <tcondie.ucla@gmail.com>
Subject: Re: Time of Multiple Joins in AsterixDB

 

Hi Mingda,

 

     It looks that you didn't attach the pdf?

     Thanks!

 

Best,

Yingyi

 

On Tue, Dec 20, 2016 at 4:15 PM, mingda li <limingda1993@gmail.com <mailto:limingda1993@gmail.com>
> wrote:

Sorry for the wrong version of cc.conf. I convert it to pdf version as attachment.

 

On Tue, Dec 20, 2016 at 4:06 PM, mingda li <limingda1993@gmail.com <mailto:limingda1993@gmail.com>
> wrote:

Dear all,

 

I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple joins to see if there
is a big difference with different join order. This is the reason for our research on multiple
join and the result will apppear in our paper which is to be submitted to VLDB soon. Could
you help us to make sure that the test results make sense for AsterixDB?

 

We configure the AsterixDB 0.8.9 ( use asterix-server-0.8.9-SNAPSHOT-binary-assembly) in our
cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads per
core), 32GB of RAM and 1TB of disk capacity. The operating system is 64-bit Ubuntu 12.04.
JDK version 1.8.0. During configuration, I follow the NCService instruction here https://ci.apache.org/projects/asterixdb/ncservice.html.
And I set the cc.conf as in attachment. (Each node work as nc and the first node also work
as cc). 

 

For experiment, we use 3 fact tables from TPC-DS: inventory; catalog_sales; catalog_returns
with TPC-DS scale factor 1g and 10g. The multiple join query we use in AsterixDB are as following:

 

Good Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN catalog_returns
cr1

 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = cr1.cr_item_sk))  m1 JOIN
inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;

 

Bad Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN inventory i1 ON
cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number
AND cs1.cs_item_sk = cr1.cr_item_sk);

 

We load the data to AsterixDB firstly and run the two different queries. (The complete version
of all queries for AsterixDB is in attachment)  We assume the data has already been stored
in AsterixDB and only count the time for multiple join. 

 

Meanwhile, we use the same dataset and query to test Spark, Pig and Hive. The result is shown
in the attachment's figure. And you can find AsterixDB's time is always better than others
 no matter good or bad order:-) (BTW, the y scale of figure is time in log scale. You can
see the time by the label of each bar.)

 

Thanks for your help.

 

Bests,

Mingda

 

 

 

 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message