Date: Wed, 17 Sep 2014 19:25:34 +0000 (UTC)
From: "Na Yang (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Commented] (HIVE-8162) hive.optimize.sort.dynamic.partition causes RuntimeException for inserting into dynamic partitioned table when map function is used in the subquery

    [ https://issues.apache.org/jira/browse/HIVE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137784#comment-14137784 ]

Na Yang commented on HIVE-8162:
-------------------------------

The operator tree for this query is:

    TS0-FIL9-SEL2-GBY4-RS5-GBY6-SEL7-RS10-EX11-FS8
The task graph for this query is:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: associateddata
            Statistics: Num rows: 25374 Data size: 101496 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (sm_campaign_id) IN (10187171, 1090942, 10541943, 10833443, 8635630, 10187170, 9445296, 10696334, 11398585, 9524211, 1145211) (type: boolean)
              Statistics: Num rows: 12687 Data size: 50748 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: map('x_product_id':'') (type: map<string,string>), day_id (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 12687 Data size: 50748 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  keys: _col0 (type: map<string,string>), _col1 (type: int)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 12687 Data size: 50748 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: map<string,string>), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: map<string,string>), _col1 (type: int)
                    Statistics: Num rows: 12687 Data size: 50748 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: map<string,string>), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 6343 Data size: 25372 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: 2 (type: int), _col0 (type: map<string,string>), _col1 (type: int)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 6343 Data size: 25372 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col2 (type: int), _col0 (type: map<string,string>), _col1 (type: int)
              sort order: +++
              Map-reduce partition columns: _col2 (type: int)
              Statistics: Num rows: 6343 Data size: 25372 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col0 (type: int), _col1 (type: map<string,string>), _col2 (type: int)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 6343 Data size: 25372 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 6343 Data size: 25372 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: default.agg_pv_associateddata_c

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            day_id
          replace: false
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.agg_pv_associateddata_c

  Stage: Stage-3
    Stats-Aggr Operator

The exception happens while executing Stage-2. The ReduceSinkDesc for RS10 has key columns of type {int, map<string,string>, int}, and the intermediate file for this stage is stored in SequenceFileInputFormat using LazyBinarySerDe. However, the LazyBinarySerDe is not able to deserialize the non-primitive (map) type from the intermediate file, which causes the exception. When TextInputFormat and LazySimpleSerDe are used for the intermediate file instead, the exception goes away; however, changing the intermediate file's InputFormat and SerDe is not a preferred solution.
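The failure mode in the stack trace (java.io.EOFException inside BinarySortableSerDe.deserializeInt) is the reducer-side deserializer reading past the end of the serialized key bytes. The mechanism can be sketched in a few lines of plain Python; this is illustrative only, with an invented byte layout, not Hive's actual encoding: a reader that expects an {int, map, int} key layout hits EOF when the writer produced a shorter layout.

```python
import io
import struct

# Illustrative sketch (not Hive code): a writer and a reader that
# disagree about the key layout. The reader runs out of bytes and
# raises EOFError, analogous to the java.io.EOFException above.

def write_key(buf, first):
    """Writer emits only a single big-endian 4-byte int (short layout)."""
    buf.write(struct.pack(">i", first))

def read_exact(buf, n):
    """Read exactly n bytes or raise EOFError, like InputByteBuffer.read."""
    data = buf.read(n)
    if len(data) < n:
        raise EOFError("ran out of bytes while deserializing key")
    return data

def read_key_int_map_int(buf):
    """Reader expects int, then a one-entry map (two ints), then int."""
    first = struct.unpack(">i", read_exact(buf, 4))[0]
    k = struct.unpack(">i", read_exact(buf, 4))[0]   # map key
    v = struct.unpack(">i", read_exact(buf, 4))[0]   # map value
    last = struct.unpack(">i", read_exact(buf, 4))[0]
    return first, {k: v}, last

buf = io.BytesIO()
write_key(buf, 2)
buf.seek(0)
try:
    read_key_int_map_int(buf)
    outcome = "ok"
except EOFError:
    outcome = "EOFError"
print(outcome)  # EOFError
```

The same shape of mismatch arises here because the map-side serializer and the reduce-side BinarySortableSerDe disagree about whether the middle key column is a primitive or a map.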
> hive.optimize.sort.dynamic.partition causes RuntimeException for inserting into dynamic partitioned table when map function is used in the subquery
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8162
>                 URL: https://issues.apache.org/jira/browse/HIVE-8162
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Na Yang
>         Attachments: 47rows.txt
>
> Exception:
> Diagnostic Messages for this Task:
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x129x51x83x14x1x128x0x0x2x1x1x1x120x95x112x114x111x100x117x99x116x95x105x100x0x1x0x0x255 with properties {columns=reducesinkkey0,reducesinkkey1,reducesinkkey2, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=+++, columns.types=int,map,int}
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
> 	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:518)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:462)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:282)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1122)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:271)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x129x51x83x14x1x128x0x0x2x1x1x1x120x95x112x114x111x100x117x99x116x95x105x100x0x1x0x0x255 with properties {columns=reducesinkkey0,reducesinkkey1,reducesinkkey2, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=+++, columns.types=int,map,int}
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:222)
> 	... 7 more
> Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.EOFException
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:189)
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:220)
> 	... 7 more
> Caused by: java.io.EOFException
> 	at org.apache.hadoop.hive.serde2.binarysortable.InputByteBuffer.read(InputByteBuffer.java:54)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserializeInt(BinarySortableSerDe.java:533)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:236)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:185)
> 	... 8 more
>
> Steps to reproduce the exception:
> -------------------------------------------------
> CREATE TABLE associateddata(
>   creative_id int, creative_group_id int, placement_id int, sm_campaign_id int,
>   browser_id string, trans_type_p string, trans_time_p string, group_name string,
>   event_name string, order_id string, revenue float, currency string,
>   trans_type_ci string, trans_time_ci string, f16 map<string,string>,
>   campaign_id int, user_agent_cat string, geo_country string, geo_city string,
>   geo_state string, geo_zip string, geo_dma string, geo_area string,
>   geo_isp string, site_id int, section_id int, f16_ci map<string,string>)
> PARTITIONED BY(day_id int, hour_id int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
>
> LOAD DATA LOCAL INPATH '/tmp/47rows.txt' INTO TABLE associateddata
> PARTITION(day_id=20140814, hour_id=2014081417);
>
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
>
> CREATE EXTERNAL TABLE IF NOT EXISTS agg_pv_associateddata_c (
>   vt_tran_qty int COMMENT 'The count of view thru transactions',
>   pair_value_txt string COMMENT 'F16 name values pairs'
> )
> PARTITIONED BY (day_id int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/user/prodman/agg_pv_associateddata_c';
>
> INSERT INTO TABLE agg_pv_associateddata_c PARTITION (day_id)
> select 2 as vt_tran_qty, pair_value_txt, day_id
> from (select map('x_product_id', coalesce(F16['x_product_id'],'')) as pair_value_txt, day_id, hour_id
>       from associateddata
>       where hour_id = 2014081417
>         and sm_campaign_id in (10187171,1090942,10541943,10833443,8635630,10187170,9445296,10696334,11398585,9524211,1145211)
>      ) a
> GROUP BY pair_value_txt, day_id;
>
> The query worked fine in Hive-0.12. It starts failing in Hive-0.13. If hive.optimize.sort.dynamic.partition is turned off in Hive-0.13, the exception goes away.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
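For reference, the workaround described in the issue (disabling the optimization), spelled out as Hive session commands around the same repro query:

```sql
-- Workaround from the report: turn off the sort-based dynamic
-- partition optimization for this session before the INSERT.
set hive.optimize.sort.dynamic.partition=false;

INSERT INTO TABLE agg_pv_associateddata_c PARTITION (day_id)
select 2 as vt_tran_qty, pair_value_txt, day_id
from (select map('x_product_id', coalesce(F16['x_product_id'],'')) as pair_value_txt, day_id, hour_id
      from associateddata
      where hour_id = 2014081417
        and sm_campaign_id in (10187171,1090942,10541943,10833443,8635630,10187170,9445296,10696334,11398585,9524211,1145211)
     ) a
GROUP BY pair_value_txt, day_id;
```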