tajo-dev mailing list archives

From Jihoon Son <jihoon...@apache.org>
Subject Re: Feedback for tajo-0.10.0
Date Mon, 16 Mar 2015 06:58:16 GMT
Azuryy, thanks for your feedback.
They are very interesting results.
Would you mind telling me how Tajo with Parquet is slower than Tajo with
RCFile?

Thanks,
Jihoon

On Mon, Mar 16, 2015 at 3:39 PM Hyunsik Choi <hyunsik@apache.org> wrote:

> Hi Azuryy,
>
> Thanks for sharing the test results. They are very inspiring to us.
> Also, I'll file some JIRA issues about the problems that you found.
>
> Best regards,
> Hyunsik
>
> On Sun, Mar 15, 2015 at 10:58 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> > Another correction:
> > My comparison of Impala-2.1.2 and Tajo-0.10.0 was unfair, because I used
> > Parquet with Impala and snappy-compressed RCFILE with Tajo. I should use
> > the same file format for both.
> >
> > I have already concluded that Impala works better on Parquet than Tajo
> > does, so I used RCFILE as the test data this time.
> >
> > *Tajo*:
> > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> > bigint)),sum(cast(movie_pt as bigint)) from snappy;
> > Progress: 0%, response time: 1.598 sec
> > Progress: 0%, response time: 1.6 sec
> > Progress: 0%, response time: 2.003 sec
> > Progress: 0%, response time: 2.806 sec
> > Progress: 37%, response time: 3.808 sec
> > Progress: 100%, response time: 4.792 sec
> > ?sum_3,  ?sum_4,  ?sum_5
> > -------------------------------
> > 22557920,  19648838,  2005366694576
> > (1 rows, 4.792 sec, 32 B selected)
> >
> > *Impala*:
> >  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> > bigint)),sum(cast(movie_pt as bigint)) from snappy;
> > +-------------------------------+-------------------------------+-------------------------------+
> > | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
> > +-------------------------------+-------------------------------+-------------------------------+
> > | 22557920                      | 19648838                      | 2005366694576                 |
> > +-------------------------------+-------------------------------+-------------------------------+
> > Fetched 1 row(s) in 11.12s
> >
> >
> >
> > On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> >
> >> There was a typo in my email. Here is the corrected version:
> >>
> >> for example:
> >>
> >>   <property>
> >>     <name>tajo.master.umbilical-rpc.address</name>
> >>     <value>1-1-1-1:26001</value>
> >>   </property>
> >>
> >> which works under tajo-0.9.0, but tajo-0.10.0 complains that
> >> "1-1-1-1:26001" is not a valid network address.
> >>
> >> I have to change it to:
> >>   <property>
> >>     <name>tajo.master.umbilical-rpc.address</name>
> >>     <value>1.1.1.1:26001</value>
> >>   </property>
> >>
> >>
> >> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> >>
> >>> Hi,
> >>> I compiled the tajo-0.10 source against hadoop-2.6.0, and here is some
> >>> feedback.
> >>>
> >>> My cluster:
> >>> 1 tajo-master, 9 tajo-workers
> >>> 24 CPU(logic), 64GB mem, 4TB*12 HDD
> >>>
> >>> Feedback:
> >>> 1) The task progress estimate is now correct on partitioned tables; it
> >>> was sometimes incorrect in tajo-0.9.0.
> >>> 2) The Tajo configuration doesn't support hostnames in tajo-site.xml.
> >>> For example:
> >>>
> >>>   <property>
> >>>     <name>tajo.master.umbilical-rpc.address</name>
> >>>     <value>1-1-1-1:26001</value>
> >>>   </property>
> >>>
> >>> which works under tajo-0.9.0, but now it complains that
> >>> "1-1-1-1:26001" is not a valid network address.
> >>>
> >>> I have to change it to:
> >>>   <property>
> >>>     <name>tajo.master.umbilical-rpc.address</name>
> >>>     <value>1.1.1.1:26001</value>
> >>>   </property>
> >>>
> >>> But we don't use IP addresses in our cluster, only hostnames, so I made
> >>> a small change in the code:
> >>> org.apache.tajo.validation.NetworkAddressValidator.java:
> >>> hostnamePattern = Pattern.compile("\\d*-\\d*-\\d*-\\d");
> >>> Then it works.
> >>>
> >>> 3) I ran some tests on Parquet, RCFILE (snappy compressed), and
> >>> RCFILE (GZIP compressed).
> >>>
> >>> They hold the same data and differ only in file format.
> >>> The table has six partitions and 20 RCFILEs; each Parquet file is 1GB.
> >>>
> >>> RCFILE with snappy performs similarly to RCFILE with GZIP,
> >>> but both are two to three times faster than Parquet.
> >>>
> >>> 4) I compared tajo-0.10 and Impala-2.1.2.
> >>> Impala provides very good support for Parquet, much better than Tajo.
> >>>
> >>> But Impala is *slower* than Tajo with other formats.
> >>> For example (I don't use WHERE because I want to query all six
> >>> partitions together):
> >>>
> >>> *Impala*:
> >>>  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> >>> bigint)),sum(cast(movie_pt as bigint)) from par;
> >>>
> >>> +-------------------------------+-------------------------------+-------------------------------+
> >>> | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
> >>> +-------------------------------+-------------------------------+-------------------------------+
> >>> | 22557920                      | 19648838                      | 2005366694576                 |
> >>> +-------------------------------+-------------------------------+-------------------------------+
> >>> Fetched 1 row(s) in 6.02s
> >>>
> >>> *Tajo:*
> >>>
> >>> *default*> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> >>> bigint)),sum(cast(movie_pt as bigint)) from snappy;
> >>> Progress: 0%, response time: 1.598 sec
> >>> Progress: 0%, response time: 1.6 sec
> >>> Progress: 0%, response time: 2.003 sec
> >>> Progress: 0%, response time: 2.806 sec
> >>> Progress: 37%, response time: 3.808 sec
> >>> Progress: 100%, response time: 4.792 sec
> >>> ?sum_3,  ?sum_4,  ?sum_5
> >>> -------------------------------
> >>> 22557920,  19648838,  2005366694576
> >>> (1 rows, 4.792 sec, 32 B selected)
> >>>
> >>
>
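
A note on the hostname issue from item 2 of the quoted feedback: the pattern in that patch is written for digit-and-hyphen names such as 1-1-1-1. A more general check might follow the RFC 1123 hostname rules (letters, digits, and hyphens in each label, with no hyphen at either end of a label). The sketch below is only an illustration under that assumption. The class name HostPortCheck and the method isValidHostPort are hypothetical, and this is not the actual code of Tajo's NetworkAddressValidator.

```java
import java.util.regex.Pattern;

public class HostPortCheck {
    // One DNS label: alphanumeric at both ends, hyphens allowed inside, at most 63 chars.
    private static final String LABEL =
            "[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?";

    // A hostname is one or more labels separated by dots.
    // Dotted IPv4 addresses such as 1.1.1.1 also fit this shape.
    private static final Pattern HOSTNAME =
            Pattern.compile(LABEL + "(?:\\." + LABEL + ")*");

    // Hypothetical helper: checks a "host:port" string such as "1-1-1-1:26001".
    public static boolean isValidHostPort(String address) {
        int colon = address.lastIndexOf(':');
        if (colon <= 0 || colon == address.length() - 1) {
            return false; // missing host or missing port
        }
        String host = address.substring(0, colon);
        String portText = address.substring(colon + 1);
        if (!portText.matches("\\d{1,5}")) {
            return false; // port must be 1 to 5 digits
        }
        int port = Integer.parseInt(portText);
        return port >= 1 && port <= 65535
                && host.length() <= 253
                && HOSTNAME.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidHostPort("1-1-1-1:26001")); // true
        System.out.println(isValidHostPort("1.1.1.1:26001")); // true
        System.out.println(isValidHostPort("-bad-:26001"));   // false
    }
}
```

With a pattern of this shape, both the hyphenated hostname and the dotted address from the tajo-site.xml examples would pass, so no cluster-specific pattern would be needed.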
