tajo-dev mailing list archives

From Azuryy Yu <azury...@gmail.com>
Subject Re: Feedback for tajo-0.10.0
Date Mon, 16 Mar 2015 07:14:33 GMT
The Snappy RCFile data was generated by my MR job, which reads text files and
writes RCFile output compressed with Snappy-1.1.2.


On Mon, Mar 16, 2015 at 3:13 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:

> PS. my Parquet data was generated by Impala: "Insert into a parquet table
> [SHUFFLE] ... AS select .... from a text table"
>
> On Mon, Mar 16, 2015 at 3:11 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>
>> Hi Jihoon,
>>
>> Here is an example:
>> My data: (each Parquet file is limited to 1GB)
>>  hadoop fs -ls /data/basetable/par/dt=20150301/pf=pc
>>
>> -rw-r--r--   9 hadoop tajo 1062932057 2015-03-12 15:08
>> /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.0.parq
>> -rw-r--r--   9 hadoop tajo 1063205684 2015-03-12 15:11
>> /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.1.parq
>> -rw-r--r--   9 hadoop tajo 1063236005 2015-03-12 15:14
>> /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.2.parq
>> -rw-r--r--   9 hadoop tajo  543786632 2015-03-12 15:16
>> /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.3.parq
>>
>> hadoop fs -ls /data/basetable/snappy/dt=20150301/pf=pc
>>
>> -rw-r--r--   9 tajo tajo  144059045 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00000
>> -rw-r--r--   9 tajo tajo  144178118 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00001
>> -rw-r--r--   9 tajo tajo  143642438 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00002
>> -rw-r--r--   9 tajo tajo  143553142 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00003
>> -rw-r--r--   9 tajo tajo  143849627 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00004
>> -rw-r--r--   9 tajo tajo  144648456 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00005
>> -rw-r--r--   9 tajo tajo  144647502 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00006
>> -rw-r--r--   9 tajo tajo  144551053 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00007
>> -rw-r--r--   9 tajo tajo  144017287 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00008
>> -rw-r--r--   9 tajo tajo  144205111 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00009
>> -rw-r--r--   9 tajo tajo  145066506 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00010
>> -rw-r--r--   9 tajo tajo  144740791 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00011
>> -rw-r--r--   9 tajo tajo  144198266 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00012
>> -rw-r--r--   9 tajo tajo  143575440 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00013
>> -rw-r--r--   9 tajo tajo  143922343 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00014
>> -rw-r--r--   9 tajo tajo  143930019 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00015
>> -rw-r--r--   9 tajo tajo  144253019 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00016
>> -rw-r--r--   9 tajo tajo  144175506 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00017
>> -rw-r--r--   9 tajo tajo  143072995 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00018
>> -rw-r--r--   9 tajo tajo  143818118 2015-03-16 11:48
>> /data/basetable/snappy/dt=20150301/pf=pc/part-r-00019
>>
>> Result:
>>
>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>> bigint)),sum(cast(movie_pt as bigint)) from snappy where pf='pc';
>> Progress: 19%, response time: 1.87 sec
>> Progress: 19%, response time: 1.873 sec
>> Progress: 19%, response time: 2.276 sec
>> Progress: 100%, response time: 2.372 sec
>> ?sum_3,  ?sum_4,  ?sum_5
>> -------------------------------
>> 6928463,  6183665,  6055494385
>> (1 rows, 2.372 sec, 27 B selected)
>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>> bigint)),sum(cast(movie_pt as bigint)) from par where pf='pc';
>> Progress: 0%, response time: 0.751 sec
>> Progress: 0%, response time: 0.753 sec
>> Progress: 0%, response time: 1.155 sec
>> Progress: 0%, response time: 1.959 sec
>> Progress: 0%, response time: 2.962 sec
>> Progress: 0%, response time: 3.965 sec
>> Progress: 0%, response time: 4.968 sec
>> Progress: 0%, response time: 5.97 sec
>> Progress: 12%, response time: 6.974 sec
>> Progress: 12%, response time: 7.977 sec
>> Progress: 12%, response time: 8.979 sec
>> Progress: 12%, response time: 9.982 sec
>> Progress: 25%, response time: 10.985 sec
>> Progress: 100%, response time: 11.14 sec
>> ?sum_3,  ?sum_4,  ?sum_5
>> -------------------------------
>> 6928463,  6183665,  6055494385
>> (1 rows, 11.14 sec, 27 B selected)
>>
>> On Mon, Mar 16, 2015 at 2:58 PM, Jihoon Son <jihoonson@apache.org> wrote:
>>
>>> Azuryy, thanks for your feedback.
>>> They are very interesting results.
>>> Would you mind telling me how Tajo with Parquet is slower than Tajo with
>>> RCFile?
>>>
>>> Thanks,
>>> Jihoon
>>>
>>> On Mon, Mar 16, 2015 at 3:39 PM Hyunsik Choi <hyunsik@apache.org> wrote:
>>>
>>> > Hi Azuryy,
>>> >
>>> > Thanks for sharing the test results. They are very inspiring to us.
>>> > Also, I'll file some JIRA issues for the problems that you found.
>>> >
>>> > Best regards,
>>> > Hyunsik
>>> >
>>> > On Sun, Mar 15, 2015 at 10:58 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>>> > > Another fix:
>>> > > My test result is unfair when comparing Impala-2.1.2 and Tajo-0.10.0,
>>> > > because I used Parquet with Impala and RCFILE snappy with Tajo. I should
>>> > > use the same file format to compare.
>>> > >
>>> > > Because I've reached a clear conclusion that Impala works better on
>>> > > Parquet than Tajo, I use RCFILE as the test data.
>>> > >
>>> > > *Tajo*:
>>> > > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>>> > > bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > > Progress: 0%, response time: 1.598 sec
>>> > > Progress: 0%, response time: 1.6 sec
>>> > > Progress: 0%, response time: 2.003 sec
>>> > > Progress: 0%, response time: 2.806 sec
>>> > > Progress: 37%, response time: 3.808 sec
>>> > > Progress: 100%, response time: 4.792 sec
>>> > > ?sum_3,  ?sum_4,  ?sum_5
>>> > > -------------------------------
>>> > > 22557920,  19648838,  2005366694576
>>> > > (1 rows, 4.792 sec, 32 B selected)
>>> > >
>>> > > *Impala*:
>>> > >  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>>> > > bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > | 22557920                      | 19648838                      | 2005366694576                 |
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > Fetched 1 row(s) in 11.12s
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>>> > >
>>> > >> There was a typo in my email. Corrected here:
>>> > >>
>>> > >> for example:
>>> > >>
>>> > >>   <property>
>>> > >>     <name>tajo.master.umbilical-rpc.address</name>
>>> > >>     <value>1-1-1-1:26001</value>
>>> > >>   </property>
>>> > >>
>>> > >> which does work under tajo-0.9.0, but it complains that "1-1-1-1:26001"
>>> > >> is not a valid network address under tajo-0.10.0.
>>> > >>
>>> > >> I have to change to:
>>> > >>   <property>
>>> > >>     <name>tajo.master.umbilical-rpc.address</name>
>>> > >>     <value>1.1.1.1:26001</value>
>>> > >>   </property>
>>> > >>
>>> > >>
>>> > >> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>> I compiled the tajo-0.10 source against hadoop-2.6.0; here is some
>>> > >>> feedback.
>>> > >>>
>>> > >>> My cluster:
>>> > >>> 1 tajo-master, 9 tajo-worker
>>> > >>> 24 CPU(logic), 64GB mem, 4TB*12 HDD
>>> > >>>
>>> > >>> Feedback:
>>> > >>> 1) Tajo's task progress estimate is now correct on partitioned tables;
>>> > >>> it was sometimes incorrect in tajo-0.9.0.
>>> > >>> 2) Tajo configuration doesn't support hostnames in tajo-site.xml.
>>> > >>> for example:
>>> > >>>
>>> > >>>   <property>
>>> > >>>     <name>tajo.master.umbilical-rpc.address</name>
>>> > >>>     <value>1-1-1-1:26001</value>
>>> > >>>   </property>
>>> > >>>
>>> > >>> which does work under tajo-0.9.0, but it complains that "1-1-1-1:26001"
>>> > >>> is not a valid network address.
>>> > >>>
>>> > >>> I have to change to:
>>> > >>>   <property>
>>> > >>>     <name>tajo.master.umbilical-rpc.address</name>
>>> > >>>     <value>1.1.1.1:26001</value>
>>> > >>>   </property>
>>> > >>>
>>> > >>> But we don't use IPs in our cluster, only hostnames, so I made a small
>>> > >>> change in the code:
>>> > >>> org.apache.tajo.validation.NetworkAddressValidator.java:
>>> > >>> hostnamePattern = Pattern.compile("\\d*-\\d*-\\d*-\\d");
>>> > >>> Then it works.
>>> > >>>
>>> > >>> 3) I ran some tests on Parquet, RCFILE (snappy compressed), and
>>> > >>> RCFILE (GZIP compressed).
>>> > >>>
>>> > >>> They hold the same data and differ only in file format.
>>> > >>> The table has six partitions, 20 RCFILES, and each Parquet file is 1GB.
>>> > >>>
>>> > >>> RCFile with snappy performs similarly to RCFile with gzip,
>>> > >>> but both are two to three times faster than Parquet.
>>> > >>>
>>> > >>> 4) I compared tajo-0.10 and Impala-2.1.2.
>>> > >>> Impala provides very good support for Parquet, much better than Tajo.
>>> > >>>
>>> > >>> But Impala is *slower* than Tajo with other formats,
>>> > >>> such as (I don't use WHERE because I want to query all six partitions
>>> > >>> together):
>>> > >>>
>>> > >>> *Impala*:
>>> > >>>  > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>>> > >>> bigint)),sum(cast(movie_pt as bigint)) from par;
>>> > >>>
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> | 22557920                      | 19648838                      | 2005366694576                 |
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> Fetched 1 row(s) in 6.02s
>>> > >>>
>>> > >>> *Tajo:*
>>> > >>>
>>> > >>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as
>>> > >>> bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > >>> Progress: 0%, response time: 1.598 sec
>>> > >>> Progress: 0%, response time: 1.6 sec
>>> > >>> Progress: 0%, response time: 2.003 sec
>>> > >>> Progress: 0%, response time: 2.806 sec
>>> > >>> Progress: 37%, response time: 3.808 sec
>>> > >>> Progress: 100%, response time: 4.792 sec
>>> > >>> ?sum_3,  ?sum_4,  ?sum_5
>>> > >>> -------------------------------
>>> > >>> 22557920,  19648838,  2005366694576
>>> > >>> (1 rows, 4.792 sec, 32 B selected)
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>
>>> >
>>>
>>
>>
>
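The hostname workaround quoted above (feedback item 2) swaps in a pattern that matches only names shaped like "1-1-1-1". A more general check could look roughly like the sketch below. This is an illustration only, not the actual code of org.apache.tajo.validation.NetworkAddressValidator: the class name HostPortCheck, the method isValid, and the choice to accept any dot-separated hostname labels followed by a port in 0..65535 are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Tajo: accepts "host:port" where host is
// made of dot-separated labels (letters, digits, inner hyphens).
final class HostPortCheck {
    // One label: starts and ends with a letter or digit, up to 63 characters.
    private static final String LABEL =
        "[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?";

    // Group 1 is the host, group 2 is the port digits.
    private static final Pattern HOST_PORT =
        Pattern.compile("(" + LABEL + "(?:\\." + LABEL + ")*):(\\d{1,5})");

    static boolean isValid(String address) {
        Matcher m = HOST_PORT.matcher(address);
        if (!m.matches()) {
            return false;
        }
        // At most five digits, so parseInt cannot overflow.
        int port = Integer.parseInt(m.group(2));
        return port <= 65535;
    }

    public static void main(String[] args) {
        // Both forms from the thread pass this check.
        System.out.println(isValid("1-1-1-1:26001"));
        System.out.println(isValid("1.1.1.1:26001"));
        // A leading hyphen or an out-of-range port is rejected.
        System.out.println(isValid("-1-1-1:26001"));
        System.out.println(isValid("1-1-1-1:99999"));
    }
}
```

Unlike the quoted one-line patch, this accepts ordinary hostnames as well as dotted IPv4 addresses, so it would not have to be changed again for hosts that follow a different naming scheme.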

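For reading the timings quoted above side by side, the ratios can be worked out with a few lines of Java. This is only a sketch: the class name SpeedupCalc and the method ratio are made up for the example, the inputs are the response times printed in the thread, and each is a single run, so the ratios are rough indications rather than benchmark results.

```java
// Hypothetical helper for comparing the response times quoted in the thread.
final class SpeedupCalc {
    // How many times longer the slower run took than the faster one.
    static double ratio(double slowerSec, double fasterSec) {
        if (fasterSec <= 0) {
            throw new IllegalArgumentException("fasterSec must be positive");
        }
        return slowerSec / fasterSec;
    }

    public static void main(String[] args) {
        // Tajo, partition pf='pc': Parquet (11.14 s) vs RCFile snappy (2.372 s).
        System.out.printf("Tajo Parquet / Tajo RCFile(snappy): %.2fx%n",
            ratio(11.14, 2.372));
        // RCFile snappy, all partitions: Impala (11.12 s) vs Tajo (4.792 s).
        System.out.printf("Impala RCFile / Tajo RCFile: %.2fx%n",
            ratio(11.12, 4.792));
        // All partitions: Impala on Parquet (6.02 s) vs Tajo on RCFile (4.792 s).
        System.out.printf("Impala Parquet / Tajo RCFile: %.2fx%n",
            ratio(6.02, 4.792));
    }
}
```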