asterixdb-dev mailing list archives

From Yingyi Bu <buyin...@gmail.com>
Subject Re: Time of Multiple Joins in AsterixDB
Date Wed, 21 Dec 2016 01:58:17 GMT
Mingda,

     1. Can you paste the returned JSON of
http://<master node>:19002/admin/cluster at your side? (Please replace
<master node> with the actual master node name or IP; a sample command is
sketched after this list.)
     2. Can you list the individual size of each dataset involved in the
query, e.g., catalog_returns, catalog_sales, and inventory?  (I assume
100GB is the overall size?)
     3. Do Spark/Hive/Pig saturate all CPUs on all machines, i.e., how many
partitions are running on each machine?  (It seems that your AsterixDB
configuration wouldn't saturate all CPUs for queries --- in the current
AsterixDB master, the computation parallelism is set to be the same as the
storage parallelism (i.e., the number of iodevices on each NC). I've
submitted a new patch that allows flexible computation parallelism, which
should be merged into master very soon.)
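     (For item 1, one way to capture that JSON, assuming curl is available
on a machine that can reach the master node -- only the host name below is a
placeholder, the port and path are the ones named above:

         curl http://<master node>:19002/admin/cluster

     The endpoint returns the cluster state as a single JSON document that
can be pasted directly into a reply.)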
     Thanks!

Best,
Yingyi

On Tue, Dec 20, 2016 at 5:44 PM, mingda li <limingda1993@gmail.com> wrote:

> Oh, sure. When we test the 100G multiple join, we find AsterixDB is slower
> than Spark (but still faster than Pig and Hive).
> I can share both plots with you: 1-10G.eps and 1-100G.eps. (We will
> only use 1-10G.eps in our paper.)
> And thanks for Ian's advice: "The dev list generally strips attachments.
> Maybe you can just put the config inline? Or link to a pastebin/gist?"
> That explains why you couldn't see the attachments, so I have moved the
> plots and the two documents to my Dropbox.
> You can find the
> 1-10G.eps here: https://www.dropbox.com/s/rk3xg6gigsfcuyq/1-10G.eps?dl=0
> 1-100G.eps here: https://www.dropbox.com/s/tyxnmt6ehau2ski/1-100G.eps?dl=0
> cc_conf.pdf here: https://www.dropbox.com/s/y3of1s17qdstv5f/cc_conf.pdf?dl=0
> CompleteQuery.pdf here:
> https://www.dropbox.com/s/lml3fzxfjcmf2c1/CompleteQuery.pdf?dl=0
>
> On Tue, Dec 20, 2016 at 4:40 PM, Tyson Condie <tcondie.ucla@gmail.com>
> wrote:
>
> > Mingda: Please also share the numbers for 100GB, which show AsterixDB not
> > quite doing as well as Spark. These 100GB results will not be in our
> > submission version, since they’re not needed for the desired message:
> > picking the right join order matters. Nevertheless, I’d like to get a
> > better understanding of what’s going on in the larger dataset regime.
> >
> >
> >
> > -Tyson
> >
> >
> >
> > From: Yingyi Bu [mailto:buyingyi@gmail.com]
> > Sent: Tuesday, December 20, 2016 4:30 PM
> > To: dev@asterixdb.apache.org
> > Cc: Michael Carey <mjcarey@ics.uci.edu>; Tyson Condie <
> > tcondie.ucla@gmail.com>
> > Subject: Re: Time of Multiple Joins in AsterixDB
> >
> >
> >
> > Hi Mingda,
> >
> >
> >
> >      It looks like you didn't attach the PDF?
> >
> >      Thanks!
> >
> >
> >
> > Best,
> >
> > Yingyi
> >
> >
> >
> > On Tue, Dec 20, 2016 at 4:15 PM, mingda li <limingda1993@gmail.com
> > <mailto:limingda1993@gmail.com> > wrote:
> >
> > Sorry for the wrong version of cc.conf. I have converted it to a PDF and
> > attached it.
> >
> >
> >
> > On Tue, Dec 20, 2016 at 4:06 PM, mingda li <limingda1993@gmail.com
> > <mailto:limingda1993@gmail.com> > wrote:
> >
> > Dear all,
> >
> >
> >
> > I am testing multiple joins on different systems (AsterixDB, Spark, Hive,
> > Pig) to see whether different join orders make a big difference. This is
> > the motivation for our research on multiple joins, and the results will
> > appear in our paper, which is to be submitted to VLDB soon. Could you
> > help us make sure that the test results make sense for AsterixDB?
> >
> >
> >
> > We configured AsterixDB 0.8.9 (using asterix-server-0.8.9-SNAPSHOT-
> > binary-assembly) on our cluster of 16 machines, each with a 3.40GHz i7
> > processor (4 cores and 2 hyper-threads per core), 32GB of RAM, and 1TB of
> > disk capacity. The operating system is 64-bit Ubuntu 12.04, and the JDK
> > version is 1.8.0. During configuration, I followed the NCService
> > instructions here:
> > https://ci.apache.org/projects/asterixdb/ncservice.html. And I set
> > cc.conf as in the attachment. (Each node works as an NC, and the first
> > node also works as the CC.)
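> >
> > For readers without the attachment, the file follows the ini layout from
> > the NCService page linked above. A minimal sketch of its general shape is
> > below; the host names and paths are placeholders, and the key names follow
> > that documentation's sample and may differ slightly between AsterixDB
> > versions (our actual values are in the attached cc.conf):
> >
> > [nc/node1]
> > address=node1
> > iodevices=/data/asterix/node1
> >
> > [nc/node2]
> > address=node2
> > iodevices=/data/asterix/node2
> >
> > [nc]
> > command=asterixnc
> >
> > [cc]
> > address=node1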
> >
> >
> >
> > For the experiments, we use 3 fact tables from TPC-DS: inventory,
> > catalog_sales, and catalog_returns, at TPC-DS scale factors of 1GB and
> > 10GB. The multiple-join queries we use in AsterixDB are as follows:
> >
> >
> >
> > Good Join Order:
> > SELECT COUNT(*) FROM
> >   (SELECT * FROM catalog_sales cs1 JOIN catalog_returns cr1
> >    ON (cs1.cs_order_number = cr1.cr_order_number AND
> >        cs1.cs_item_sk = cr1.cr_item_sk)) m1
> > JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;
> >
> >
> >
> > Bad Join Order:
> > SELECT COUNT(*) FROM
> >   (SELECT * FROM catalog_sales cs1 JOIN inventory i1
> >    ON cs1.cs_item_sk = i1.inv_item_sk) m1
> > JOIN catalog_returns cr1
> >   ON (cs1.cs_order_number = cr1.cr_order_number AND
> >       cs1.cs_item_sk = cr1.cr_item_sk);
> >
> >
> >
> > We first load the data into AsterixDB and then run the two queries. (The
> > complete set of statements we use for AsterixDB is in the attachment.) We
> > assume the data is already stored in AsterixDB and only count the time
> > for the multiple join itself.
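> >
> > As an illustration of that loading step, the statement below is a rough
> > sketch rather than the exact one from the attachment; the dataset name,
> > file path, and delimiter are placeholders, and only the LOAD DATASET ...
> > USING localfs form is taken from the AsterixDB documentation:
> >
> > LOAD DATASET catalog_sales USING localfs
> >   (("path"="node1:///data/tpcds/catalog_sales.dat"),
> >    ("format"="delimited-text"),
> >    ("delimiter"="|"));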
> >
> >
> >
> > Meanwhile, we use the same datasets and queries to test Spark, Pig, and
> > Hive. The results are shown in the attached figure, and you can see that
> > AsterixDB's time is always better than the others, no matter whether the
> > join order is good or bad :-) (BTW, the y axis of the figure is time on a
> > log scale; you can read the exact time from the label on each bar.)
> >
> >
> >
> > Thanks for your help.
> >
> >
> >
> > Bests,
> >
> > Mingda
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
