drill-dev mailing list archives

From Shankar Mane <shankar.m...@games24x7.com>
Subject Re: [Drill-Questions] Speed difference between GZ and BZ2
Date Fri, 05 Aug 2016 15:40:31 GMT
Yes, I went through the benchmarks and started testing this myself.

I have tested this using Hadoop MapReduce, and BZ2 ran faster than GZ. As I
understand it, GZ is non-splittable while BZ2 is splittable. Hadoop MR takes
advantage of the splittable property and launches multiple mappers and
reducers (multiple CPUs), whereas with GZ only a single mapper runs (a single
CPU).

Can't Drill use this splittable property?
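To illustrate the parallelism point above, here is a simplified model of how an input format turns a file into splits (one task per split). `compute_splits` is a hypothetical helper for illustration only, not a Drill or Hadoop API:

```python
def compute_splits(file_size, block_size, splittable):
    """Return (offset, length) splits; one reader task is launched per split."""
    # A non-splittable file (e.g. .gz) must be decompressed sequentially from
    # the start, so it yields a single split and therefore a single CPU.
    if not splittable:
        return [(0, file_size)]
    # A splittable file (e.g. .bz2) can be cut at block boundaries, so
    # several tasks can read different ranges of it in parallel.
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

print(compute_splits(300, 128, False))  # [(0, 300)] -- one task, like .gz
print(compute_splits(300, 128, True))   # [(0, 128), (128, 128), (256, 44)]
```

Under this model, a 1.8 GB .gz file is always one split, while the same data in .bz2 can fan out across workers.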



On Fri, Aug 5, 2016 at 8:50 PM, Khurram Faraaz <kfaraaz@maprtech.com> wrote:

> Shankar,
>
> This is expected behavior; bzip2 decompression is four to twelve times
> slower than gzip decompression.
> You can look at the comparison benchmark here for numbers -
> http://tukaani.org/lzma/benchmarks.html
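As a rough local check of that decompression gap, one can compress the same data both ways and time only the decompression. This is a self-contained sketch; the sample file is synthetic, not the data from this thread:

```shell
# Generate a sample, compress it with both codecs, then time decompression.
seq 1 500000 > sample.txt
gzip -c sample.txt > sample.txt.gz
bzip2 -c sample.txt > sample.txt.bz2
time gzip -dc sample.txt.gz > /dev/null
time bzip2 -dc sample.txt.bz2 > /dev/null
```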
>
> On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <shankar.mane@games24x7.com>
> wrote:
>
> > Please find the query plans for both queries. FYI: I am not seeing
> > any planning difference between these two queries except the cost.
> >
> >
> > /******************************** Query on GZ
> > ****************************************/
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> > kafka_3_25-Jul-2016-12a.json.gz,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> > 2016-12a.json.gz]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
> >     "cost" : 1800981.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 900490.5
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 90049.05
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.729 seconds)
> > 0: jdbc:drill:>
> >
> >
> >
> > /******************************** Query on BZ
> > ****************************************/
> >
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> > kafka_3_25-Jul-2016-12a.json.bz2,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> > 2016-12a.json.bz2]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
> >     "cost" : 1148224.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 574112.0
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 57411.2
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.381 seconds)
> > 0: jdbc:drill:>
> >
> >
> > On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kfaraaz@maprtech.com>
> > wrote:
> >
> > > Can you please run an explain plan over the two aggregate queries? That
> > > way we can tell where most of the time is being spent: in the query
> > > planning phase, or in query execution. Please share the query plans and
> > > the time taken for those explain plan statements.
> > >
> > > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com>
> > > wrote:
> > >
> > > > It is plain JSON (one JSON object per line).
> > > > Each JSON message size = ~4 KB
> > > > No. of JSON messages = ~5 million
> > > >
> > > > store.parquet.compression = snappy (I don't think this parameter gets
> > > > used, as I am only running SELECT queries.)
> > > >
> > > >
> > > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kfaraaz@maprtech.com>
> > > > wrote:
> > > >
> > > > > What is the data format within those .gz and .bz2 files? Is it
> > > > > Parquet, JSON, or plain text (CSV)?
> > > > > Also, what was the config parameter `store.parquet.compression` set
> > > > > to when you ran your test?
> > > > >
> > > > > - Khurram
> > > > >
> > > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <shankar.mane@games24x7.com>
> > > > > wrote:
> > > > >
> > > > > > Awaiting a response.
> > > > > >
> > > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com>
> > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > I am comparing query speed between GZ and BZ2.
> > > > > > >
> > > > > > > Below are the two files and their sizes (both files contain the
> > > > > > > same data):
> > > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > > kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Results:
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (86.034 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (459.079 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Questions:
> > > > > > > 1. As per the above test, GZ is about 5x faster than BZ2
> > > > > > > (86 s vs. 459 s). Why is that?
> > > > > > > 2. How can we speed up BZ2? Is there any configuration to tune?
> > > > > > > 3. As BZ2 is a splittable format, how is Drill using it?
> > > > > > >
> > > > > > >
> > > > > > > regards,
> > > > > > > shankar
> > > > > >
> > > > >
> > > >
> > >
> >
>
