asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yingyi Bu <>
Subject Re: Time of Multiple Joins in AsterixDB
Date Wed, 21 Dec 2016 00:30:05 GMT
Hi Mingda,

     It looks that you didn't attach the pdf?


On Tue, Dec 20, 2016 at 4:15 PM, mingda li <> wrote:

> Sorry for the wrong version of cc.conf. I convert it to pdf version as
> attachment.
> On Tue, Dec 20, 2016 at 4:06 PM, mingda li <> wrote:
>> Dear all,
>> I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple
>> joins to see if there is a big difference with different join order. This
>> is the reason for our research on multiple join and the result will apppear
>> in our paper which is to be submitted to VLDB soon. Could you help us to
>> make sure that the test results make sense for AsterixDB?
>> We configure the AsterixDB 0.8.9 ( use asterix-server-0.8.9-SNAPSHOT-binary-assembly)
>> in our cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores
>> and 2 hyper-threads per core), 32GB of RAM and 1TB of disk capacity. The
>> operating system is 64-bit Ubuntu 12.04. JDK version 1.8.0. During
>> configuration, I follow the NCService instruction here
>> And I set the
>> cc.conf as in attachment. (Each node work as nc and the first node also
>> work as cc).
>> For experiment, we use 3 fact tables from TPC-DS: inventory;
>> catalog_sales; catalog_returns with TPC-DS scale factor 1g and 10g. The
>> multiple join query we use in AsterixDB are as following:
>> Good Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> JOIN catalog_returns cr1*
>> * ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk =
>> cr1.cr_item_sk))  m1 JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;*
>> Bad Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN
>> catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND
>> cs1.cs_item_sk = cr1.cr_item_sk);*
>> We load the data to AsterixDB firstly and run the two different queries.
>> (The complete version of all queries for AsterixDB is in attachment)  We
>> assume the data has already been stored in AsterixDB and only count the
>> time for multiple join.
>> Meanwhile, we use the same dataset and query to test Spark, Pig and Hive.
>> The result is shown in the attachment's figure. And you can find
>> AsterixDB's time is always better than others  no matter good or bad
>> order:-) (BTW, the y scale of figure is time in log scale. You can see the
>> time by the label of each bar.)
>> Thanks for your help.
>> Bests,
>> Mingda

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message