hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bennie Leo <tben...@hotmail.com>
Subject RE: Hive indexing optimization
Date Tue, 30 Jun 2015 00:25:47 GMT
I've attached the output. Thanks.
  B
                    
 Subject: Re: Hive indexing optimization
From: jpullokkaran@hortonworks.com
To: user@hive.apache.org
Date: Mon, 29 Jun 2015 19:17:44 +0000






Could you post explain extended output?





From: Bennie Leo <tbenleo@hotmail.com>

Reply-To: "user@hive.apache.org" <user@hive.apache.org>

Date: Monday, June 29, 2015 at 10:35 AM

To: "user@hive.apache.org" <user@hive.apache.org>

Subject: RE: Hive indexing optimization







Here is the explain output:

 

STAGE PLANS:

  Stage: Stage-1

    Tez

      Edges:

        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)

      Vertices:

        Map 1 

            Map Operator Tree:

                TableScan

                  alias: logontable

                  filterExpr: isipv4(ip) (type: boolean)

                  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL Column stats:
NONE

                  Filter Operator

                    predicate: isipv4(ip) (type: boolean)

                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

                    Select Operator

                      expressions: ip (type: bigint)

                      outputColumnNames: _col0

                      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats:
NONE

                      Reduce Output Operator

                        sort order: 

                        Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats:
NONE

                        value expressions: _col0 (type: bigint)

        Map 3 

            Map Operator Tree:

                TableScan

                  alias: ipv4geotable

                  Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE
Column stats: NONE

                  Select Operator

                    expressions: startip (type: bigint), endip (type: bigint), country (type:
string)

                    outputColumnNames: _col0, _col1, _col2

                    Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE
Column stats: NONE

                    Reduce Output Operator

                      sort order: 

                      Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE
Column stats: NONE

                      value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2
(type: string)

        Reducer 2 

            Reduce Operator Tree:

              Join Operator

                condition map:

                     Left Outer Join0 to 1

                condition expressions:

                  0 {VALUE._col0} 

                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}

                filter predicates:

                  0 {isipv4(VALUE._col0)}

                  1 

                outputColumnNames: _col0, _col1, _col2, _col3

                Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE
Column stats: NONE

                Filter Operator

                  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean)

                  Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE
Column stats: NONE

                  Select Operator

                    expressions: _col0 (type: bigint), _col3 (type: string)

                    outputColumnNames: _col0, _col1

                    Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE
Column stats: NONE

                    File Output Operator

                      compressed: false

                      Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE
Column stats: NONE

                      table:

                          input format: org.apache.hadoop.mapred.TextInputFormat

                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1


 

Thank you,

B

 

> Subject: Re: Hive indexing optimization

> From: jpullokkaran@hortonworks.com

> To: user@hive.apache.org

> CC: tbenleo@hotmail.com

> Date: Sat, 27 Jun 2015 16:02:08 +0000

> 

> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been

> rewritten as a scan against index table.

> 

> BitMap Indexes seems to support inequalities (<=, <, >=).

> 

> Post the explain plan.

> 

> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gopalv@apache.org> wrote:

> 

> >Hi,

> >

> >Hive indexes won¹t really help you speed up that query right now, because

> >of the plan it generates due to the <= clauses.

> >

> >> CREATETABLE ipv4table

> >> AS

> >> SELECT logon.IP, ipv4.Country

> >> FROM

> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon

> >> LEFT OUTER JOIN

> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON

> >> isIpv4(logon.IP)

> >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp;

> >

> >That¹s a cross-product join, which can¹t be distributed at all & will take

> >forever, even if you use Tez/LLAP.

> >

> >Range join queries have partial distribution rewrites, but AFAIK none of

> >them use filter indexes in hive.

> >

> >But before I suggest a rewrite, can you post the ³explain <query>;² for

> >the above query, so that I can check if this is indeed producing a

> >cross-product + filter?

> >

> >Cheers,

> >Gopal

> >

> >

> >

> >

> >

> >

> >

> >

> 





 		 	   		  
Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message