Return-Path: X-Original-To: apmail-tajo-dev-archive@minotaur.apache.org Delivered-To: apmail-tajo-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 95A911768F for ; Tue, 17 Mar 2015 01:49:31 +0000 (UTC) Received: (qmail 81588 invoked by uid 500); 17 Mar 2015 01:49:31 -0000 Delivered-To: apmail-tajo-dev-archive@tajo.apache.org Received: (qmail 81545 invoked by uid 500); 17 Mar 2015 01:49:31 -0000 Mailing-List: contact dev-help@tajo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tajo.apache.org Delivered-To: mailing list dev@tajo.apache.org Received: (qmail 81527 invoked by uid 99); 17 Mar 2015 01:49:31 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Mar 2015 01:49:31 +0000 Received: from mail-ob0-f176.google.com (mail-ob0-f176.google.com [209.85.214.176]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id E693D1A02E4 for ; Tue, 17 Mar 2015 01:49:30 +0000 (UTC) Received: by obbgg8 with SMTP id gg8so49667313obb.1 for ; Mon, 16 Mar 2015 18:49:30 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.202.202.72 with SMTP id a69mr8759890oig.5.1426556970202; Mon, 16 Mar 2015 18:49:30 -0700 (PDT) Received: by 10.76.173.161 with HTTP; Mon, 16 Mar 2015 18:49:30 -0700 (PDT) In-Reply-To: References: Date: Tue, 17 Mar 2015 10:49:30 +0900 Message-ID: Subject: Re: Feedback for tajo-0.10.0 From: Jinho Kim To: dev@tajo.apache.org Content-Type: multipart/alternative; boundary=001a113524087617ca0511722d69 --001a113524087617ca0511722d69 Content-Type: text/plain; charset=UTF-8 Impalad is a c++ daemon with an embedded JVM. They has own c++ parquet code In my opinion, you can find more information from impala community or github -Jinho Best regards 2015-03-17 10:00 GMT+09:00 Azuryy Yu : > Impala using C wrapper, which start JVM internal, you can use jps to find > this JVM after start Impala. > > BTW, I used the same JVM options and HDFS configuration during my test on > Impala and Tajo. > > > On Mon, Mar 16, 2015 at 7:34 PM, Jihoon Son wrote: > > > Even I haven't tried, it seems to be not possible because their > > implementation is in C. > > > > On Mon, Mar 16, 2015 at 6:49 PM Azuryy Yu wrote: > > > > > Does that possible Tajo reuse Parquet bundle from Impala? > > > > > > > > > On Mon, Mar 16, 2015 at 5:35 PM, Jihoon Son > > wrote: > > > > > > > Honestly, even if there are sufficiently many input files, Impala may > > be > > > > faster than Tajo becuase their optimization for Parquet if better > than > > > that > > > > of Tajo. > > > > > > > > On Mon, Mar 16, 2015 at 6:16 PM Jihoon Son > > wrote: > > > > > > > > > Interesting. > > > > > Here are some reasons I think. > > > > > > > > > > Impala can process Parquet files in a distributed manner even if > > their > > > > > size is very large. This is becuase of its DBMS-like architecture. > > > Impala > > > > > has a storage manager and a query executor, which have totally > > > separated > > > > > roles during query processing. Once a query is executed, the > storage > > > > > manager continuously loads data from HDFS into memory. Query > > executors > > > > just > > > > > process data loaded in memory according to the query plan. As a > > result, > > > > > even though the number of data files is small, huge amounts of > query > > > > > executors can process them simultaneously. > > > > > > > > > > However, in Tajo, each query executor (i.e., task) has a scanner to > > > read > > > > > data from HDFS. During query processing, their roles are closely > > > related. > > > > > Thus, the number of tasks is mainly decided based on the number of > > > files. > > > > > (Of course, when input data are dispersed on the entire cluster > > nodes, > > > > the > > > > > number of tasks is decided based on the number of cpu cores and > disks > > > per > > > > > worker.) > > > > > > > > > > So, I expect that Tajo's performance with Parquet will be improved > > when > > > > > there are sufficiently many input files. > > > > > As I aforementioned, this is just my suspection. > > > > > I will investigate further. > > > > > > > > > > Thanks, > > > > > Jihoon > > > > > > > > > > > > > > > On Mon, Mar 16, 2015 at 5:45 PM Azuryy Yu > > wrote: > > > > > > > > > >> HDFS block size is also 1GB > > > > >> > > > > >> > > > > >> On Mon, Mar 16, 2015 at 4:18 PM, Jihoon Son > > > > > wrote: > > > > >> > > > > >> > Right. A large file size can improves the sequential scan on > > > Parquet. > > > > >> > However, if you want to use the large file size, it is > recommended > > > to > > > > >> also > > > > >> > increase the HDFS block size to reduce the remote read cost. > > > > >> > How large size did you set for HDFS blocks? > > > > >> > > > > > >> > On Impala's good performance, I will also investigate it. > > > > >> > It seems to be related with Impala's storage manager. > > > > >> > > > > > >> > Best, > > > > >> > Jihoon > > > > >> > > > > > >> > On Mon, Mar 16, 2015 at 5:05 PM Azuryy Yu > > > wrote: > > > > >> > > > > > >> > > Hi Jihoon, > > > > >> > > > > > > >> > > Impala works on Parquet is more faster than other file > formats. > > > and > > > > >> > Impala > > > > >> > > advice don't make more small parquet files. 1GB would be > better. > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > On Mon, Mar 16, 2015 at 3:57 PM, Jihoon Son < > > jihoonson@apache.org > > > > > > > > >> > wrote: > > > > >> > > > > > > >> > > > Thanks! > > > > >> > > > It is really interesting. > > > > >> > > > I suspect that the large file size of Parquet makes Tajo > > slower. > > > > >> This > > > > >> > is > > > > >> > > > because Parquet is non-splittable, which means that only 4 > > > workers > > > > >> read > > > > >> > > > data from HDFS. In addition, if the HDFS block size is > smaller > > > > than > > > > >> > 1GB, > > > > >> > > a > > > > >> > > > lot of data can be moved over network during the scan phase. > > > > >> > > > > > > > >> > > > But, I have no idea why Impala shows good performance. > > > > >> > > > Maybe, its cache scheme improved it. > > > > >> > > > > > > > >> > > > Best regards, > > > > >> > > > Jihoon > > > > >> > > > > > > > >> > > > On Mon, Mar 16, 2015 at 4:16 PM Azuryy Yu < > azuryyyu@gmail.com > > > > > > > >> wrote: > > > > >> > > > > > > > >> > > > > PS. my Parquet data was generated by Impala: "Insert into > a > > > > >> parquet > > > > >> > > table > > > > >> > > > > [SHUFFLE] ... AS select .... from a text table" > > > > >> > > > > > > > > >> > > > > On Mon, Mar 16, 2015 at 3:11 PM, Azuryy Yu < > > > azuryyyu@gmail.com> > > > > >> > wrote: > > > > >> > > > > > > > > >> > > > > > Hi Jihoon, > > > > >> > > > > > > > > > >> > > > > > Here is an example: > > > > >> > > > > > My data: (Parquet file is 1GB limited) > > > > >> > > > > > hadoop fs -ls /data/basetable/par/dt=20150301/pf=pc > > > > >> > > > > > > > > > >> > > > > > -rw-r--r-- 9 hadoop tajo 1062932057 2015-03-12 15:08 > > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3- > > > > >> > > > > 3ead7e35ecf0da8_448517166_data.0.parq > > > > >> > > > > > -rw-r--r-- 9 hadoop tajo 1063205684 2015-03-12 15:11 > > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3- > > > > >> > > > > 3ead7e35ecf0da8_448517166_data.1.parq > > > > >> > > > > > -rw-r--r-- 9 hadoop tajo 1063236005 2015-03-12 15:14 > > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3- > > > > >> > > > > 3ead7e35ecf0da8_448517166_data.2.parq > > > > >> > > > > > -rw-r--r-- 9 hadoop tajo 543786632 2015-03-12 15:16 > > > > >> > > > > > /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3- > > > > >> > > > > 3ead7e35ecf0da8_448517166_data.3.parq > > > > >> > > > > > > > > > >> > > > > > hadoop fs -ls /data/basetable/snappy/dt=20150301/pf=pc > > > > >> > > > > > > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144059045 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00000 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144178118 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00001 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143642438 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00002 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143553142 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00003 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143849627 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00004 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144648456 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00005 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144647502 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00006 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144551053 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00007 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144017287 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00008 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144205111 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00009 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 145066506 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00010 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144740791 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00011 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144198266 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00012 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143575440 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00013 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143922343 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00014 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143930019 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00015 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144253019 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00016 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 144175506 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00017 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143072995 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00018 > > > > >> > > > > > -rw-r--r-- 9 tajo tajo 143818118 2015-03-16 11:48 > > > > >> > > > > > /data/basetable/snappy/dt=20150301/pf=pc/part-r-00019 > > > > >> > > > > > > > > > >> > > > > > Result: > > > > >> > > > > > > > > > >> > > > > > default> select sum (cast(movie_vv as bigint)), > > > > >> sum(cast(movie_cv > > > > >> > as > > > > >> > > > > > bigint)),sum(cast(movie_pt as bigint)) from snappy where > > > > >> pf='pc'; > > > > >> > > > > > Progress: 19%, response time: 1.87 sec > > > > >> > > > > > Progress: 19%, response time: 1.873 sec > > > > >> > > > > > Progress: 19%, response time: 2.276 sec > > > > >> > > > > > Progress: 100%, response time: 2.372 sec > > > > >> > > > > > ?sum_3, ?sum_4, ?sum_5 > > > > >> > > > > > ------------------------------- > > > > >> > > > > > 6928463, 6183665, 6055494385 > > > > >> > > > > > (1 rows, 2.372 sec, 27 B selected) > > > > >> > > > > > default> select sum (cast(movie_vv as bigint)), > > > > >> sum(cast(movie_cv > > > > >> > as > > > > >> > > > > > bigint)),sum(cast(movie_pt as bigint)) from par where > > > pf='pc'; > > > > >> > > > > > Progress: 0%, response time: 0.751 sec > > > > >> > > > > > Progress: 0%, response time: 0.753 sec > > > > >> > > > > > Progress: 0%, response time: 1.155 sec > > > > >> > > > > > Progress: 0%, response time: 1.959 sec > > > > >> > > > > > Progress: 0%, response time: 2.962 sec > > > > >> > > > > > Progress: 0%, response time: 3.965 sec > > > > >> > > > > > Progress: 0%, response time: 4.968 sec > > > > >> > > > > > Progress: 0%, response time: 5.97 sec > > > > >> > > > > > Progress: 12%, response time: 6.974 sec > > > > >> > > > > > Progress: 12%, response time: 7.977 sec > > > > >> > > > > > Progress: 12%, response time: 8.979 sec > > > > >> > > > > > Progress: 12%, response time: 9.982 sec > > > > >> > > > > > Progress: 25%, response time: 10.985 sec > > > > >> > > > > > Progress: 100%, response time: 11.14 sec > > > > >> > > > > > ?sum_3, ?sum_4, ?sum_5 > > > > >> > > > > > ------------------------------- > > > > >> > > > > > 6928463, 6183665, 6055494385 > > > > >> > > > > > (1 rows, 11.14 sec, 27 B selected) > > > > >> > > > > > > > > > >> > > > > > On Mon, Mar 16, 2015 at 2:58 PM, Jihoon Son < > > > > >> jihoonson@apache.org> > > > > >> > > > > wrote: > > > > >> > > > > > > > > > >> > > > > >> Azuryy, thanks for your feedbacks. > > > > >> > > > > >> They are very interesting results. > > > > >> > > > > >> Would you mind telling me how Tajo with Parquet is > slower > > > > than > > > > >> > Tajo > > > > >> > > > with > > > > >> > > > > >> RCFile? > > > > >> > > > > >> > > > > >> > > > > >> Thanks, > > > > >> > > > > >> Jihoon > > > > >> > > > > >> > > > > >> > > > > >> On Mon, Mar 16, 2015 at 3:39 PM Hyunsik Choi < > > > > >> hyunsik@apache.org> > > > > >> > > > > wrote: > > > > >> > > > > >> > > > > >> > > > > >> > Hi Azuryy, > > > > >> > > > > >> > > > > > >> > > > > >> > Thank for sharing the test results. They are very > > > inspiring > > > > >> to > > > > >> > us. > > > > >> > > > > >> > Also, I'll make some jira about the problems that you > > > > found. > > > > >> > > > > >> > > > > > >> > > > > >> > Best regards, > > > > >> > > > > >> > Hyunsik > > > > >> > > > > >> > > > > > >> > > > > >> > On Sun, Mar 15, 2015 at 10:58 PM, Azuryy Yu < > > > > >> azuryyyu@gmail.com > > > > >> > > > > > > >> > > > > wrote: > > > > >> > > > > >> > > Another fix: > > > > >> > > > > >> > > My test result is unfair during compare > Imapla-2.1.2 > > > and > > > > >> > > > > Tajo-0.10.0, > > > > >> > > > > >> > > because I used Parquet with Impala and RCFILE > snappy > > > with > > > > >> > Tajo. > > > > >> > > I > > > > >> > > > > >> should > > > > >> > > > > >> > > use the same file format to compare. > > > > >> > > > > >> > > > > > > >> > > > > >> > > because I've got a clear conclusion that Imapala > > works > > > > >> better > > > > >> > on > > > > >> > > > > >> Parquet > > > > >> > > > > >> > > than Tajo, so I use RCFILE as the test data. > > > > >> > > > > >> > > > > > > >> > > > > >> > > *Tajo*: > > > > >> > > > > >> > > default> select sum (cast(movie_vv as bigint)), > > > > >> > > sum(cast(movie_cv > > > > >> > > > as > > > > >> > > > > >> > > bigint)),sum(cast(movie_pt as bigint)) from snappy; > > > > >> > > > > >> > > Progress: 0%, response time: 1.598 sec > > > > >> > > > > >> > > Progress: 0%, response time: 1.6 sec > > > > >> > > > > >> > > Progress: 0%, response time: 2.003 sec > > > > >> > > > > >> > > Progress: 0%, response time: 2.806 sec > > > > >> > > > > >> > > Progress: 37%, response time: 3.808 sec > > > > >> > > > > >> > > Progress: 100%, response time: 4.792 sec > > > > >> > > > > >> > > ?sum_3, ?sum_4, ?sum_5 > > > > >> > > > > >> > > ------------------------------- > > > > >> > > > > >> > > 22557920, 19648838, 2005366694576 > > > > >> > > > > >> > > (1 rows, 4.792 sec, 32 B selected) > > > > >> > > > > >> > > > > > > >> > > > > >> > > *Impala*: > > > > >> > > > > >> > > > select sum (cast(movie_vv as bigint)), > > > > >> sum(cast(movie_cv as > > > > >> > > > > >> > > bigint)),sum(cast(movie_pt as bigint)) from snappy; > > > > >> > > > > >> > > +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > > | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv > > as > > > > >> > bigint)) > > > > >> > > | > > > > >> > > > > >> > > sum(cast(movie_pt as bigint)) | > > > > >> > > > > >> > > +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > > | 22557920 | 19648838 > > > > >> > > | > > > > >> > > > > >> > > 2005366694576 | > > > > >> > > > > >> > > +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > > Fetched 1 row(s) in 11.12s > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu < > > > > >> > azuryyyu@gmail.com> > > > > >> > > > > >> wrote: > > > > >> > > > > >> > > > > > > >> > > > > >> > >> There is a typo in my Email. I corrected here: > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> for example: > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> tajo.master.umbilical-rpc.address > > > > >> > > > > >> > >> 1-1-1-1:26001 > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> which does work under tajo-0.9.0, but it complain > > > > >> > > "1-1-1-1:2601" > > > > >> > > > is > > > > >> > > > > >> not > > > > >> > > > > >> > a > > > > >> > > > > >> > >> valid network address under tajo-0.10.0. > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> I have to change to: > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> tajo.master.umbilical-rpc.address > > > > >> > > > > >> > >> 1.1.1.1:26001 > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > >> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu < > > > > >> > azuryyyu@gmail.com > > > > >> > > > > > > > >> > > > > >> wrote: > > > > >> > > > > >> > >> > > > > >> > > > > >> > >>> Hi, > > > > >> > > > > >> > >>> I compiled tajo-0.10 source based on > hadoop-2.6.0, > > > then > > > > >> post > > > > >> > > > some > > > > >> > > > > >> > >>> feedback here. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> My cluster: > > > > >> > > > > >> > >>> 1 tajo-master, 9 tajo-worker > > > > >> > > > > >> > >>> 24 CPU(logic), 64GB mem, 4TB*12 HDD > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> Feedback: > > > > >> > > > > >> > >>> 1) tajo task progress estimate is normal on > > > partitioned > > > > >> > table, > > > > >> > > > > >> which is > > > > >> > > > > >> > >>> incorrect sometimes in tajo-0.9.0 > > > > >> > > > > >> > >>> 2) Tajo configuration doesn't support hostname in > > > > >> > > tajo-site.xml. > > > > >> > > > > >> > >>> for example: > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > tajo.master.umbilical-rpc.address > > > > >> > > > > >> > >>> 1-1-1-1:26001 > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> which does work under tajo-0.9.0, but it complain > > > > >> > > "1-1-1-1:2601" > > > > >> > > > > is > > > > >> > > > > >> > not a > > > > >> > > > > >> > >>> valid network address. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> I have to change to: > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > tajo.master.umbilical-rpc.address > > > > >> > > > > >> > >>> 1.1.1.1:26001 > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> but we don't use IP in our cluster, only > hostname. > > > so I > > > > >> did > > > > >> > a > > > > >> > > > > >> little in > > > > >> > > > > >> > >>> the code: > > > > >> > > > > >> > >>> > > > > org.apache.tajo.validation.NetworkAddressValidator.java: > > > > >> > > > > >> > >>> hostnamePattern = > > > > Pattern.compile("\\d*-\\d*-\\d*-\\d"); > > > > >> > > > > >> > >>> then It works. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> 3) I did some test on the parquet, RCFILE(snappy > > > > >> > compressed), > > > > >> > > > > >> > >>> RCFILE(GZIP compressed) > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> they are the same data, only different from file > > > > format. > > > > >> > > > > >> > >>> the table has six partitions, 20 RCFILES, each > > > parquet > > > > >> file > > > > >> > is > > > > >> > > > > 1GB. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> then rcfile with snappy's performance is similiar > > to > > > > >> rcfile > > > > >> > > with > > > > >> > > > > >> gzip. > > > > >> > > > > >> > >>> but they are all two~three times better than > > parquet. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> 4) I compared tajo-0.10 and Impala-2.1.2, > > > > >> > > > > >> > >>> Impala can provide very good support for parquet. > > > more > > > > >> > better > > > > >> > > > than > > > > >> > > > > >> > Tajo. > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> but impala is more *slow *with other format than > > > Tajo. > > > > >> > > > > >> > >>> such as(I don't use WHERE because I want query > all > > > six > > > > >> > > > partitions > > > > >> > > > > >> > >>> together): > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> *Impala*: > > > > >> > > > > >> > >>> > select sum (cast(movie_vv as bigint)), > > > > >> sum(cast(movie_cv > > > > >> > as > > > > >> > > > > >> > >>> bigint)),sum(cast(movie_pt as bigint)) from par; > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > >>> | sum(cast(movie_vv as bigint)) | > sum(cast(movie_cv > > > as > > > > >> > > bigint)) > > > > >> > > > | > > > > >> > > > > >> > >>> sum(cast(movie_pt as bigint)) | > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > >>> | 22557920 | 19648838 > > > > >> > > > | > > > > >> > > > > >> > >>> 2005366694576 | > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> +----------------------------- > > > > >> --+--------------------------- > > > > >> > > > > >> > ----+-------------------------------+ > > > > >> > > > > >> > >>> Fetched 1 row(s) in 6.02s > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> *Tajo:* > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> *default*> select sum (cast(movie_vv as bigint)), > > > > >> > > > > sum(cast(movie_cv > > > > >> > > > > >> as > > > > >> > > > > >> > >>> bigint)),sum(cast(movie_pt as bigint)) from > snappy; > > > > >> > > > > >> > >>> Progress: 0%, response time: 1.598 sec > > > > >> > > > > >> > >>> Progress: 0%, response time: 1.6 sec > > > > >> > > > > >> > >>> Progress: 0%, response time: 2.003 sec > > > > >> > > > > >> > >>> Progress: 0%, response time: 2.806 sec > > > > >> > > > > >> > >>> Progress: 37%, response time: 3.808 sec > > > > >> > > > > >> > >>> Progress: 100%, response time: 4.792 sec > > > > >> > > > > >> > >>> ?sum_3, ?sum_4, ?sum_5 > > > > >> > > > > >> > >>> ------------------------------- > > > > >> > > > > >> > >>> 22557920, 19648838, 2005366694576 > > > > >> > > > > >> > >>> (1 rows, 4.792 sec, 32 B selected) > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >>> > > > > >> > > > > >> > >> > > > > >> > > > > >> > > > > > >> > > > > >> > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > > > > > > --001a113524087617ca0511722d69--