drill-dev mailing list archives

From Khurram Faraaz <kfar...@maprtech.com>
Subject Re: [Drill-Questions] Speed difference between GZ and BZ2
Date Thu, 04 Aug 2016 09:37:44 GMT
Can you please run EXPLAIN PLAN on the two aggregate queries? That way we
can tell where most of the time is being spent: in the query planning
phase or in query execution. Please share the query plans and the time
taken by each of those EXPLAIN PLAN statements.
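
For reference, here is a sketch of what those statements could look like. It
reuses the queries and file paths from the original message below. The plan
output is not shown, since it depends on the cluster and the Drill version.

```sql
-- Plan for the aggregate over the gzip-compressed file
explain plan for
select channelid, count(serverTime)
from dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz`
group by channelid;

-- Plan for the same aggregate over the bzip2-compressed file
explain plan for
select channelid, count(serverTime)
from dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2`
group by channelid;
```

The elapsed time that sqlline prints after each statement should give a rough
idea of the planning cost, which can then be compared with the 86 s and 459 s
totals reported for the full queries.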

On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com>
wrote:

> It is plain JSON (one JSON object per line).
> Size of each JSON message = ~4 KB
> No. of JSON messages = ~5 million
>
> store.parquet.compression = snappy (I don't think this parameter is
> used, as I am only running SELECT queries.)
>
>
> On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kfaraaz@maprtech.com>
> wrote:
>
> > What is the data format within those .gz and .bz2 files? Is it Parquet,
> > JSON, or plain text (CSV)?
> > Also, what was the config parameter `store.parquet.compression` set to
> > when you ran your test?
> >
> > - Khurram
> >
> > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane
> > <shankar.mane@games24x7.com> wrote:
> >
> > > Awaiting a response.
> > >
> > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com>
> > > wrote:
> > >
> > > >
> > >
> > > > I am comparing query speed between GZ and BZ2.
> > > >
> > > > Below are the two files and their sizes (both files contain the same data):
> > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > >
> > > >
> > > >
> > > > Results:
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (86.034 seconds)
> > > >
> > > >
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (459.079 seconds)
> > > >
> > > >
> > > >
> > > > Questions:
> > > > 1. As per the above test, GZ is about 5x faster than BZ2 (86 s vs. 459 s). Why is that?
> > > > 2. How can we speed up BZ2? Is there any configuration to tune?
> > > > 3. As BZ2 is a splittable format, how does Drill use it?
> > > >
> > > >
> > > > regards,
> > > > shankar
> > >
> >
>
