tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Azuryy Yu <azury...@gmail.com>
Subject Re: Feedback for tajo-0.10.0
Date Mon, 16 Mar 2015 08:02:54 GMT
Hi Jihoon,

Impala works on Parquet is more faster than other file formats. and Impala
advice don't make more small parquet files. 1GB would be better.




On Mon, Mar 16, 2015 at 3:57 PM, Jihoon Son <jihoonson@apache.org> wrote:

> Thanks!
> It is really interesting.
> I suspect that the large file size of Parquet makes Tajo slower. This is
> because Parquet is non-splittable, which means that only 4 workers read
> data from HDFS. In addition, if the HDFS block size is smaller than 1GB, a
> lot of data can be moved over network during the scan phase.
>
> But, I have no idea why Impala shows good performance.
> Maybe, its cache scheme improved it.
>
> Best regards,
> Jihoon
>
> On Mon, Mar 16, 2015 at 4:16 PM Azuryy Yu <azuryyyu@gmail.com> wrote:
>
> > PS. my Parquet data was generated by Impala: "Insert into a parquet table
> > [SHUFFLE] ... AS select .... from a text table"
> >
> > On Mon, Mar 16, 2015 at 3:11 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> >
> > > Hi Jihoon,
> > >
> > > Here is an example:
> > > My data: (Parquet file is 1GB limited)
> > >  hadoop fs -ls /data/basetable/par/dt=20150301/pf=pc
> > >
> > > -rw-r--r--   9 hadoop tajo 1062932057 2015-03-12 15:08
> > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > 3ead7e35ecf0da8_448517166_data.0.parq
> > > -rw-r--r--   9 hadoop tajo 1063205684 2015-03-12 15:11
> > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > 3ead7e35ecf0da8_448517166_data.1.parq
> > > -rw-r--r--   9 hadoop tajo 1063236005 2015-03-12 15:14
> > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > 3ead7e35ecf0da8_448517166_data.2.parq
> > > -rw-r--r--   9 hadoop tajo  543786632 2015-03-12 15:16
> > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > 3ead7e35ecf0da8_448517166_data.3.parq
> > >
> > > hadoop fs -ls /data/basetable/snappy/dt=20150301/pf=pc
> > >
> > > -rw-r--r--   9 tajo tajo  144059045 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00000
> > > -rw-r--r--   9 tajo tajo  144178118 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00001
> > > -rw-r--r--   9 tajo tajo  143642438 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00002
> > > -rw-r--r--   9 tajo tajo  143553142 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00003
> > > -rw-r--r--   9 tajo tajo  143849627 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00004
> > > -rw-r--r--   9 tajo tajo  144648456 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00005
> > > -rw-r--r--   9 tajo tajo  144647502 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00006
> > > -rw-r--r--   9 tajo tajo  144551053 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00007
> > > -rw-r--r--   9 tajo tajo  144017287 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00008
> > > -rw-r--r--   9 tajo tajo  144205111 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00009
> > > -rw-r--r--   9 tajo tajo  145066506 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00010
> > > -rw-r--r--   9 tajo tajo  144740791 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00011
> > > -rw-r--r--   9 tajo tajo  144198266 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00012
> > > -rw-r--r--   9 tajo tajo  143575440 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00013
> > > -rw-r--r--   9 tajo tajo  143922343 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00014
> > > -rw-r--r--   9 tajo tajo  143930019 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00015
> > > -rw-r--r--   9 tajo tajo  144253019 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00016
> > > -rw-r--r--   9 tajo tajo  144175506 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00017
> > > -rw-r--r--   9 tajo tajo  143072995 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00018
> > > -rw-r--r--   9 tajo tajo  143818118 2015-03-16 11:48
> > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00019
> > >
> > > Result:
> > >
> > > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> > > bigint)),sum(cast(movie_pt as bigint)) from snappy where pf='pc';
> > > Progress: 19%, response time: 1.87 sec
> > > Progress: 19%, response time: 1.873 sec
> > > Progress: 19%, response time: 2.276 sec
> > > Progress: 100%, response time: 2.372 sec
> > > ?sum_3,  ?sum_4,  ?sum_5
> > > -------------------------------
> > > 6928463,  6183665,  6055494385
> > > (1 rows, 2.372 sec, 27 B selected)
> > > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
> > > bigint)),sum(cast(movie_pt as bigint)) from par where pf='pc';
> > > Progress: 0%, response time: 0.751 sec
> > > Progress: 0%, response time: 0.753 sec
> > > Progress: 0%, response time: 1.155 sec
> > > Progress: 0%, response time: 1.959 sec
> > > Progress: 0%, response time: 2.962 sec
> > > Progress: 0%, response time: 3.965 sec
> > > Progress: 0%, response time: 4.968 sec
> > > Progress: 0%, response time: 5.97 sec
> > > Progress: 12%, response time: 6.974 sec
> > > Progress: 12%, response time: 7.977 sec
> > > Progress: 12%, response time: 8.979 sec
> > > Progress: 12%, response time: 9.982 sec
> > > Progress: 25%, response time: 10.985 sec
> > > Progress: 100%, response time: 11.14 sec
> > > ?sum_3,  ?sum_4,  ?sum_5
> > > -------------------------------
> > > 6928463,  6183665,  6055494385
> > > (1 rows, 11.14 sec, 27 B selected)
> > >
> > > On Mon, Mar 16, 2015 at 2:58 PM, Jihoon Son <jihoonson@apache.org>
> > wrote:
> > >
> > >> Azuryy, thanks for your feedbacks.
> > >> They are very interesting results.
> > >> Would you mind telling me how Tajo with Parquet is slower than Tajo
> with
> > >> RCFile?
> > >>
> > >> Thanks,
> > >> Jihoon
> > >>
> > >> On Mon, Mar 16, 2015 at 3:39 PM Hyunsik Choi <hyunsik@apache.org>
> > wrote:
> > >>
> > >> > Hi Azuryy,
> > >> >
> > >> > Thank for sharing the test results. They are very inspiring to us.
> > >> > Also, I'll make some jira about the problems that you found.
> > >> >
> > >> > Best regards,
> > >> > Hyunsik
> > >> >
> > >> > On Sun, Mar 15, 2015 at 10:58 PM, Azuryy Yu <azuryyyu@gmail.com>
> > wrote:
> > >> > > Another fix:
> > >> > > My test result is unfair during compare Imapla-2.1.2 and
> > Tajo-0.10.0,
> > >> > > because I used Parquet with Impala and RCFILE snappy with Tajo.
I
> > >> should
> > >> > > use the same file format to compare.
> > >> > >
> > >> > > because I've got a clear conclusion that Imapala works better
on
> > >> Parquet
> > >> > > than Tajo, so I use RCFILE as the test data.
> > >> > >
> > >> > > *Tajo*:
> > >> > > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv
> as
> > >> > > bigint)),sum(cast(movie_pt as bigint)) from snappy;
> > >> > > Progress: 0%, response time: 1.598 sec
> > >> > > Progress: 0%, response time: 1.6 sec
> > >> > > Progress: 0%, response time: 2.003 sec
> > >> > > Progress: 0%, response time: 2.806 sec
> > >> > > Progress: 37%, response time: 3.808 sec
> > >> > > Progress: 100%, response time: 4.792 sec
> > >> > > ?sum_3,  ?sum_4,  ?sum_5
> > >> > > -------------------------------
> > >> > > 22557920,  19648838,  2005366694576
> > >> > > (1 rows, 4.792 sec, 32 B selected)
> > >> > >
> > >> > > *Impala*:
> > >> > >  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv
as
> > >> > > bigint)),sum(cast(movie_pt as bigint)) from snappy;
> > >> > > +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > > | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint))
|
> > >> > > sum(cast(movie_pt as bigint)) |
> > >> > > +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > > | 22557920                      | 19648838                  
   |
> > >> > > 2005366694576                 |
> > >> > > +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > > Fetched 1 row(s) in 11.12s
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu <azuryyyu@gmail.com>
> > >> wrote:
> > >> > >
> > >> > >> There is a typo in my Email. I corrected here:
> > >> > >>
> > >> > >> for example:
> > >> > >>
> > >> > >>   <property>
> > >> > >>     <name>tajo.master.umbilical-rpc.address</name>
> > >> > >>     <value>1-1-1-1:26001</value>
> > >> > >>   </property>
> > >> > >>
> > >> > >> which does work under tajo-0.9.0, but it complain "1-1-1-1:2601"
> is
> > >> not
> > >> > a
> > >> > >> valid network address under tajo-0.10.0.
> > >> > >>
> > >> > >> I have to change to:
> > >> > >>   <property>
> > >> > >>     <name>tajo.master.umbilical-rpc.address</name>
> > >> > >>     <value>1.1.1.1:26001</value>
> > >> > >>   </property>
> > >> > >>
> > >> > >>
> > >> > >> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu <azuryyyu@gmail.com>
> > >> wrote:
> > >> > >>
> > >> > >>> Hi,
> > >> > >>> I compiled tajo-0.10 source based on hadoop-2.6.0, then
post
> some
> > >> > >>> feedback here.
> > >> > >>>
> > >> > >>> My cluster:
> > >> > >>> 1 tajo-master, 9 tajo-worker
> > >> > >>> 24 CPU(logic), 64GB mem, 4TB*12 HDD
> > >> > >>>
> > >> > >>> Feedback:
> > >> > >>> 1) tajo task progress estimate is normal on partitioned
table,
> > >> which is
> > >> > >>> incorrect sometimes in tajo-0.9.0
> > >> > >>> 2) Tajo configuration doesn't support hostname in tajo-site.xml.
> > >> > >>> for example:
> > >> > >>>
> > >> > >>>   <property>
> > >> > >>>     <name>tajo.master.umbilical-rpc.address</name>
> > >> > >>>     <value>1-1-1-1:26001</value>
> > >> > >>>   </property>
> > >> > >>>
> > >> > >>> which does work under tajo-0.9.0, but it complain "1-1-1-1:2601"
> > is
> > >> > not a
> > >> > >>> valid network address.
> > >> > >>>
> > >> > >>> I have to change to:
> > >> > >>>   <property>
> > >> > >>>     <name>tajo.master.umbilical-rpc.address</name>
> > >> > >>>     <value>1.1.1.1:26001</value>
> > >> > >>>   </property>
> > >> > >>>
> > >> > >>> but we don't use IP in our cluster, only hostname. so
I did a
> > >> little in
> > >> > >>> the code:
> > >> > >>> org.apache.tajo.validation.NetworkAddressValidator.java:
> > >> > >>> hostnamePattern = Pattern.compile("\\d*-\\d*-\\d*-\\d");
> > >> > >>> then It works.
> > >> > >>>
> > >> > >>> 3) I did some test on the parquet, RCFILE(snappy compressed),
> > >> > >>> RCFILE(GZIP compressed)
> > >> > >>>
> > >> > >>> they are the same data, only different from file format.
> > >> > >>> the table has six partitions, 20 RCFILES, each parquet
file is
> > 1GB.
> > >> > >>>
> > >> > >>> then rcfile with snappy's performance is similiar to
rcfile with
> > >> gzip.
> > >> > >>> but they are all two~three times better than parquet.
> > >> > >>>
> > >> > >>> 4) I compared tajo-0.10 and Impala-2.1.2,
> > >> > >>> Impala can provide very good support for parquet. more
better
> than
> > >> > Tajo.
> > >> > >>>
> > >> > >>> but impala is more *slow *with other format than Tajo.
> > >> > >>> such as(I don't use WHERE because I want query all six
> partitions
> > >> > >>> together):
> > >> > >>>
> > >> > >>> *Impala*:
> > >> > >>>  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv
as
> > >> > >>> bigint)),sum(cast(movie_pt as bigint)) from par;
> > >> > >>>
> > >> > >>> +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > >>> | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as
bigint))
> |
> > >> > >>> sum(cast(movie_pt as bigint)) |
> > >> > >>>
> > >> > >>> +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > >>> | 22557920                      | 19648838
> |
> > >> > >>> 2005366694576           |
> > >> > >>>
> > >> > >>> +-------------------------------+---------------------------
> > >> > ----+-------------------------------+
> > >> > >>> Fetched 1 row(s) in 6.02s
> > >> > >>>
> > >> > >>> *Tajo:*
> > >> > >>>
> > >> > >>> *default*> select sum (cast(movie_vv as bigint)),
> > sum(cast(movie_cv
> > >> as
> > >> > >>> bigint)),sum(cast(movie_pt as bigint)) from snappy;
> > >> > >>> Progress: 0%, response time: 1.598 sec
> > >> > >>> Progress: 0%, response time: 1.6 sec
> > >> > >>> Progress: 0%, response time: 2.003 sec
> > >> > >>> Progress: 0%, response time: 2.806 sec
> > >> > >>> Progress: 37%, response time: 3.808 sec
> > >> > >>> Progress: 100%, response time: 4.792 sec
> > >> > >>> ?sum_3,  ?sum_4,  ?sum_5
> > >> > >>> -------------------------------
> > >> > >>> 22557920,  19648838,  2005366694576
> > >> > >>> (1 rows, 4.792 sec, 32 B selected)
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message