tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinho Kim <jh...@apache.org>
Subject Re: Feedback for tajo-0.10.0
Date Tue, 17 Mar 2015 01:49:30 GMT
Impalad is a c++ daemon with an embedded JVM. They has own c++ parquet code
In my opinion, you can find more information from impala community or github


-Jinho
Best regards

2015-03-17 10:00 GMT+09:00 Azuryy Yu <azuryyyu@gmail.com>:

> Impala using C wrapper, which start JVM internal,  you can use jps to find
> this JVM after start Impala.
>
> BTW, I used the same JVM options and HDFS configuration during my test on
> Impala and Tajo.
>
>
> On Mon, Mar 16, 2015 at 7:34 PM, Jihoon Son <jihoonson@apache.org> wrote:
>
> > Even I haven't tried, it seems to be not possible because their
> > implementation is in C.
> >
> > On Mon, Mar 16, 2015 at 6:49 PM Azuryy Yu <azuryyyu@gmail.com> wrote:
> >
> > > Does that possible Tajo reuse Parquet bundle from Impala?
> > >
> > >
> > > On Mon, Mar 16, 2015 at 5:35 PM, Jihoon Son <jihoonson@apache.org>
> > wrote:
> > >
> > > > Honestly, even if there are sufficiently many input files, Impala may
> > be
> > > > faster than Tajo becuase their optimization for Parquet if better
> than
> > > that
> > > > of Tajo.
> > > >
> > > > On Mon, Mar 16, 2015 at 6:16 PM Jihoon Son <jihoonson@apache.org>
> > wrote:
> > > >
> > > > > Interesting.
> > > > > Here are some reasons I think.
> > > > >
> > > > > Impala can process Parquet files in a distributed manner even if
> > their
> > > > > size is very large. This is becuase of its DBMS-like architecture.
> > > Impala
> > > > > has a storage manager and a query executor, which have totally
> > > separated
> > > > > roles during query processing. Once a query is executed, the
> storage
> > > > > manager continuously loads data from HDFS into memory. Query
> > executors
> > > > just
> > > > > process data loaded in memory according to the query plan. As a
> > result,
> > > > > even though the number of data files is small, huge amounts of
> query
> > > > > executors can process them simultaneously.
> > > > >
> > > > > However, in Tajo, each query executor (i.e., task) has a scanner
to
> > > read
> > > > > data from HDFS. During query processing, their roles are closely
> > > related.
> > > > > Thus, the number of tasks is mainly decided based on the number of
> > > files.
> > > > > (Of course, when input data are dispersed on the entire cluster
> > nodes,
> > > > the
> > > > > number of tasks is decided based on the number of cpu cores and
> disks
> > > per
> > > > > worker.)
> > > > >
> > > > > So, I expect that Tajo's performance with Parquet will be improved
> > when
> > > > > there are sufficiently many input files.
> > > > > As I aforementioned, this is just my suspection.
> > > > > I will investigate further.
> > > > >
> > > > > Thanks,
> > > > > Jihoon
> > > > >
> > > > >
> > > > > On Mon, Mar 16, 2015 at 5:45 PM Azuryy Yu <azuryyyu@gmail.com>
> > wrote:
> > > > >
> > > > >> HDFS block size is also 1GB
> > > > >>
> > > > >>
> > > > >> On Mon, Mar 16, 2015 at 4:18 PM, Jihoon Son <jihoonson@apache.org
> >
> > > > wrote:
> > > > >>
> > > > >> > Right. A large file size can improves the sequential scan
on
> > > Parquet.
> > > > >> > However, if you want to use the large file size, it is
> recommended
> > > to
> > > > >> also
> > > > >> > increase the HDFS block size to reduce the remote read cost.
> > > > >> > How large size did you set for HDFS blocks?
> > > > >> >
> > > > >> > On Impala's good performance, I will also investigate it.
> > > > >> > It seems to be related with Impala's storage manager.
> > > > >> >
> > > > >> > Best,
> > > > >> > Jihoon
> > > > >> >
> > > > >> > On Mon, Mar 16, 2015 at 5:05 PM Azuryy Yu <azuryyyu@gmail.com>
> > > wrote:
> > > > >> >
> > > > >> > > Hi Jihoon,
> > > > >> > >
> > > > >> > > Impala works on Parquet is more faster than other file
> formats.
> > > and
> > > > >> > Impala
> > > > >> > > advice don't make more small parquet files. 1GB would
be
> better.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > On Mon, Mar 16, 2015 at 3:57 PM, Jihoon Son <
> > jihoonson@apache.org
> > > >
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > > Thanks!
> > > > >> > > > It is really interesting.
> > > > >> > > > I suspect that the large file size of Parquet
makes Tajo
> > slower.
> > > > >> This
> > > > >> > is
> > > > >> > > > because Parquet is non-splittable, which means
that only 4
> > > workers
> > > > >> read
> > > > >> > > > data from HDFS. In addition, if the HDFS block
size is
> smaller
> > > > than
> > > > >> > 1GB,
> > > > >> > > a
> > > > >> > > > lot of data can be moved over network during the
scan phase.
> > > > >> > > >
> > > > >> > > > But, I have no idea why Impala shows good performance.
> > > > >> > > > Maybe, its cache scheme improved it.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > > Jihoon
> > > > >> > > >
> > > > >> > > > On Mon, Mar 16, 2015 at 4:16 PM Azuryy Yu <
> azuryyyu@gmail.com
> > >
> > > > >> wrote:
> > > > >> > > >
> > > > >> > > > > PS. my Parquet data was generated by Impala:
"Insert into
> a
> > > > >> parquet
> > > > >> > > table
> > > > >> > > > > [SHUFFLE] ... AS select .... from a text
table"
> > > > >> > > > >
> > > > >> > > > > On Mon, Mar 16, 2015 at 3:11 PM, Azuryy Yu
<
> > > azuryyyu@gmail.com>
> > > > >> > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi Jihoon,
> > > > >> > > > > >
> > > > >> > > > > > Here is an example:
> > > > >> > > > > > My data: (Parquet file is 1GB limited)
> > > > >> > > > > >  hadoop fs -ls /data/basetable/par/dt=20150301/pf=pc
> > > > >> > > > > >
> > > > >> > > > > > -rw-r--r--   9 hadoop tajo 1062932057
2015-03-12 15:08
> > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > > > >> > > > > 3ead7e35ecf0da8_448517166_data.0.parq
> > > > >> > > > > > -rw-r--r--   9 hadoop tajo 1063205684
2015-03-12 15:11
> > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > > > >> > > > > 3ead7e35ecf0da8_448517166_data.1.parq
> > > > >> > > > > > -rw-r--r--   9 hadoop tajo 1063236005
2015-03-12 15:14
> > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > > > >> > > > > 3ead7e35ecf0da8_448517166_data.2.parq
> > > > >> > > > > > -rw-r--r--   9 hadoop tajo  543786632
2015-03-12 15:16
> > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-
> > > > >> > > > > 3ead7e35ecf0da8_448517166_data.3.parq
> > > > >> > > > > >
> > > > >> > > > > > hadoop fs -ls /data/basetable/snappy/dt=20150301/pf=pc
> > > > >> > > > > >
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144059045
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00000
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144178118
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00001
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143642438
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00002
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143553142
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00003
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143849627
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00004
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144648456
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00005
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144647502
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00006
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144551053
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00007
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144017287
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00008
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144205111
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00009
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  145066506
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00010
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144740791
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00011
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144198266
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00012
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143575440
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00013
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143922343
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00014
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143930019
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00015
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144253019
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00016
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  144175506
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00017
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143072995
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00018
> > > > >> > > > > > -rw-r--r--   9 tajo tajo  143818118
2015-03-16 11:48
> > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00019
> > > > >> > > > > >
> > > > >> > > > > > Result:
> > > > >> > > > > >
> > > > >> > > > > > default> select sum (cast(movie_vv
as bigint)),
> > > > >> sum(cast(movie_cv
> > > > >> > as
> > > > >> > > > > > bigint)),sum(cast(movie_pt as bigint))
from snappy where
> > > > >> pf='pc';
> > > > >> > > > > > Progress: 19%, response time: 1.87 sec
> > > > >> > > > > > Progress: 19%, response time: 1.873
sec
> > > > >> > > > > > Progress: 19%, response time: 2.276
sec
> > > > >> > > > > > Progress: 100%, response time: 2.372
sec
> > > > >> > > > > > ?sum_3,  ?sum_4,  ?sum_5
> > > > >> > > > > > -------------------------------
> > > > >> > > > > > 6928463,  6183665,  6055494385
> > > > >> > > > > > (1 rows, 2.372 sec, 27 B selected)
> > > > >> > > > > > default> select sum (cast(movie_vv
as bigint)),
> > > > >> sum(cast(movie_cv
> > > > >> > as
> > > > >> > > > > > bigint)),sum(cast(movie_pt as bigint))
from par where
> > > pf='pc';
> > > > >> > > > > > Progress: 0%, response time: 0.751 sec
> > > > >> > > > > > Progress: 0%, response time: 0.753 sec
> > > > >> > > > > > Progress: 0%, response time: 1.155 sec
> > > > >> > > > > > Progress: 0%, response time: 1.959 sec
> > > > >> > > > > > Progress: 0%, response time: 2.962 sec
> > > > >> > > > > > Progress: 0%, response time: 3.965 sec
> > > > >> > > > > > Progress: 0%, response time: 4.968 sec
> > > > >> > > > > > Progress: 0%, response time: 5.97 sec
> > > > >> > > > > > Progress: 12%, response time: 6.974
sec
> > > > >> > > > > > Progress: 12%, response time: 7.977
sec
> > > > >> > > > > > Progress: 12%, response time: 8.979
sec
> > > > >> > > > > > Progress: 12%, response time: 9.982
sec
> > > > >> > > > > > Progress: 25%, response time: 10.985
sec
> > > > >> > > > > > Progress: 100%, response time: 11.14
sec
> > > > >> > > > > > ?sum_3,  ?sum_4,  ?sum_5
> > > > >> > > > > > -------------------------------
> > > > >> > > > > > 6928463,  6183665,  6055494385
> > > > >> > > > > > (1 rows, 11.14 sec, 27 B selected)
> > > > >> > > > > >
> > > > >> > > > > > On Mon, Mar 16, 2015 at 2:58 PM, Jihoon
Son <
> > > > >> jihoonson@apache.org>
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > >> Azuryy, thanks for your feedbacks.
> > > > >> > > > > >> They are very interesting results.
> > > > >> > > > > >> Would you mind telling me how Tajo
with Parquet is
> slower
> > > > than
> > > > >> > Tajo
> > > > >> > > > with
> > > > >> > > > > >> RCFile?
> > > > >> > > > > >>
> > > > >> > > > > >> Thanks,
> > > > >> > > > > >> Jihoon
> > > > >> > > > > >>
> > > > >> > > > > >> On Mon, Mar 16, 2015 at 3:39 PM
Hyunsik Choi <
> > > > >> hyunsik@apache.org>
> > > > >> > > > > wrote:
> > > > >> > > > > >>
> > > > >> > > > > >> > Hi Azuryy,
> > > > >> > > > > >> >
> > > > >> > > > > >> > Thank for sharing the test
results. They are very
> > > inspiring
> > > > >> to
> > > > >> > us.
> > > > >> > > > > >> > Also, I'll make some jira about
the problems that you
> > > > found.
> > > > >> > > > > >> >
> > > > >> > > > > >> > Best regards,
> > > > >> > > > > >> > Hyunsik
> > > > >> > > > > >> >
> > > > >> > > > > >> > On Sun, Mar 15, 2015 at 10:58
PM, Azuryy Yu <
> > > > >> azuryyyu@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > >> > > Another fix:
> > > > >> > > > > >> > > My test result is unfair
during compare
> Imapla-2.1.2
> > > and
> > > > >> > > > > Tajo-0.10.0,
> > > > >> > > > > >> > > because I used Parquet
with Impala and RCFILE
> snappy
> > > with
> > > > >> > Tajo.
> > > > >> > > I
> > > > >> > > > > >> should
> > > > >> > > > > >> > > use the same file format
to compare.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > because I've got a clear
conclusion that Imapala
> > works
> > > > >> better
> > > > >> > on
> > > > >> > > > > >> Parquet
> > > > >> > > > > >> > > than Tajo, so I use RCFILE
as the test data.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > *Tajo*:
> > > > >> > > > > >> > > default> select sum
(cast(movie_vv as bigint)),
> > > > >> > > sum(cast(movie_cv
> > > > >> > > > as
> > > > >> > > > > >> > > bigint)),sum(cast(movie_pt
as bigint)) from snappy;
> > > > >> > > > > >> > > Progress: 0%, response
time: 1.598 sec
> > > > >> > > > > >> > > Progress: 0%, response
time: 1.6 sec
> > > > >> > > > > >> > > Progress: 0%, response
time: 2.003 sec
> > > > >> > > > > >> > > Progress: 0%, response
time: 2.806 sec
> > > > >> > > > > >> > > Progress: 37%, response
time: 3.808 sec
> > > > >> > > > > >> > > Progress: 100%, response
time: 4.792 sec
> > > > >> > > > > >> > > ?sum_3,  ?sum_4,  ?sum_5
> > > > >> > > > > >> > > -------------------------------
> > > > >> > > > > >> > > 22557920,  19648838, 
2005366694576
> > > > >> > > > > >> > > (1 rows, 4.792 sec, 32
B selected)
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > *Impala*:
> > > > >> > > > > >> > >  > select sum (cast(movie_vv
as bigint)),
> > > > >> sum(cast(movie_cv as
> > > > >> > > > > >> > > bigint)),sum(cast(movie_pt
as bigint)) from snappy;
> > > > >> > > > > >> > > +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > > | sum(cast(movie_vv as
bigint)) | sum(cast(movie_cv
> > as
> > > > >> > bigint))
> > > > >> > > |
> > > > >> > > > > >> > > sum(cast(movie_pt as bigint))
|
> > > > >> > > > > >> > > +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > > | 22557920           
          | 19648838
> > > > >> > > |
> > > > >> > > > > >> > > 2005366694576        
        |
> > > > >> > > > > >> > > +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > > Fetched 1 row(s) in 11.12s
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > On Mon, Mar 16, 2015 at
1:49 PM, Azuryy Yu <
> > > > >> > azuryyyu@gmail.com>
> > > > >> > > > > >> wrote:
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >> There is a typo in
my Email. I corrected here:
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> for example:
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>   <property>
> > > > >> > > > > >> > >>     <name>tajo.master.umbilical-rpc.address</name>
> > > > >> > > > > >> > >>     <value>1-1-1-1:26001</value>
> > > > >> > > > > >> > >>   </property>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> which does work under
tajo-0.9.0, but it complain
> > > > >> > > "1-1-1-1:2601"
> > > > >> > > > is
> > > > >> > > > > >> not
> > > > >> > > > > >> > a
> > > > >> > > > > >> > >> valid network address
under tajo-0.10.0.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> I have to change to:
> > > > >> > > > > >> > >>   <property>
> > > > >> > > > > >> > >>     <name>tajo.master.umbilical-rpc.address</name>
> > > > >> > > > > >> > >>     <value>1.1.1.1:26001</value>
> > > > >> > > > > >> > >>   </property>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> On Mon, Mar 16, 2015
at 1:44 PM, Azuryy Yu <
> > > > >> > azuryyyu@gmail.com
> > > > >> > > >
> > > > >> > > > > >> wrote:
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>> Hi,
> > > > >> > > > > >> > >>> I compiled tajo-0.10
source based on
> hadoop-2.6.0,
> > > then
> > > > >> post
> > > > >> > > > some
> > > > >> > > > > >> > >>> feedback here.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> My cluster:
> > > > >> > > > > >> > >>> 1 tajo-master,
9 tajo-worker
> > > > >> > > > > >> > >>> 24 CPU(logic),
64GB mem, 4TB*12 HDD
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> Feedback:
> > > > >> > > > > >> > >>> 1) tajo task progress
estimate is normal on
> > > partitioned
> > > > >> > table,
> > > > >> > > > > >> which is
> > > > >> > > > > >> > >>> incorrect sometimes
in tajo-0.9.0
> > > > >> > > > > >> > >>> 2) Tajo configuration
doesn't support hostname in
> > > > >> > > tajo-site.xml.
> > > > >> > > > > >> > >>> for example:
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>   <property>
> > > > >> > > > > >> > >>>
>  <name>tajo.master.umbilical-rpc.address</name>
> > > > >> > > > > >> > >>>     <value>1-1-1-1:26001</value>
> > > > >> > > > > >> > >>>   </property>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> which does work
under tajo-0.9.0, but it complain
> > > > >> > > "1-1-1-1:2601"
> > > > >> > > > > is
> > > > >> > > > > >> > not a
> > > > >> > > > > >> > >>> valid network
address.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> I have to change
to:
> > > > >> > > > > >> > >>>   <property>
> > > > >> > > > > >> > >>>
>  <name>tajo.master.umbilical-rpc.address</name>
> > > > >> > > > > >> > >>>     <value>1.1.1.1:26001</value>
> > > > >> > > > > >> > >>>   </property>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> but we don't use
IP in our cluster, only
> hostname.
> > > so I
> > > > >> did
> > > > >> > a
> > > > >> > > > > >> little in
> > > > >> > > > > >> > >>> the code:
> > > > >> > > > > >> > >>>
> > > > org.apache.tajo.validation.NetworkAddressValidator.java:
> > > > >> > > > > >> > >>> hostnamePattern
=
> > > > Pattern.compile("\\d*-\\d*-\\d*-\\d");
> > > > >> > > > > >> > >>> then It works.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> 3) I did some
test on the parquet, RCFILE(snappy
> > > > >> > compressed),
> > > > >> > > > > >> > >>> RCFILE(GZIP compressed)
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> they are the same
data, only different from file
> > > > format.
> > > > >> > > > > >> > >>> the table has
six partitions, 20 RCFILES, each
> > > parquet
> > > > >> file
> > > > >> > is
> > > > >> > > > > 1GB.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> then rcfile with
snappy's performance is similiar
> > to
> > > > >> rcfile
> > > > >> > > with
> > > > >> > > > > >> gzip.
> > > > >> > > > > >> > >>> but they are all
two~three times better than
> > parquet.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> 4) I compared
tajo-0.10 and Impala-2.1.2,
> > > > >> > > > > >> > >>> Impala can provide
very good support for parquet.
> > > more
> > > > >> > better
> > > > >> > > > than
> > > > >> > > > > >> > Tajo.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> but impala is
more *slow *with other format than
> > > Tajo.
> > > > >> > > > > >> > >>> such as(I don't
use WHERE because I want query
> all
> > > six
> > > > >> > > > partitions
> > > > >> > > > > >> > >>> together):
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> *Impala*:
> > > > >> > > > > >> > >>>  > select sum
(cast(movie_vv as bigint)),
> > > > >> sum(cast(movie_cv
> > > > >> > as
> > > > >> > > > > >> > >>> bigint)),sum(cast(movie_pt
as bigint)) from par;
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > >>> | sum(cast(movie_vv
as bigint)) |
> sum(cast(movie_cv
> > > as
> > > > >> > > bigint))
> > > > >> > > > |
> > > > >> > > > > >> > >>> sum(cast(movie_pt
as bigint)) |
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > >>> | 22557920   
                  | 19648838
> > > > >> > > > |
> > > > >> > > > > >> > >>> 2005366694576
          |
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> +-----------------------------
> > > > >> --+---------------------------
> > > > >> > > > > >> > ----+-------------------------------+
> > > > >> > > > > >> > >>> Fetched 1 row(s)
in 6.02s
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> *Tajo:*
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> *default*>
select sum (cast(movie_vv as bigint)),
> > > > >> > > > > sum(cast(movie_cv
> > > > >> > > > > >> as
> > > > >> > > > > >> > >>> bigint)),sum(cast(movie_pt
as bigint)) from
> snappy;
> > > > >> > > > > >> > >>> Progress: 0%,
response time: 1.598 sec
> > > > >> > > > > >> > >>> Progress: 0%,
response time: 1.6 sec
> > > > >> > > > > >> > >>> Progress: 0%,
response time: 2.003 sec
> > > > >> > > > > >> > >>> Progress: 0%,
response time: 2.806 sec
> > > > >> > > > > >> > >>> Progress: 37%,
response time: 3.808 sec
> > > > >> > > > > >> > >>> Progress: 100%,
response time: 4.792 sec
> > > > >> > > > > >> > >>> ?sum_3,  ?sum_4,
 ?sum_5
> > > > >> > > > > >> > >>> -------------------------------
> > > > >> > > > > >> > >>> 22557920,  19648838,
 2005366694576
> > > > >> > > > > >> > >>> (1 rows, 4.792
sec, 32 B selected)
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message