hive-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6326) Split generation in ORC may generate wrong split boundaries because of unaccounted padded bytes
Date Sat, 15 Feb 2014 00:22:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902193#comment-13902193 ]

Owen O'Malley commented on HIVE-6326:
-------------------------------------

You may also want to protect line 732 with

{code}
if (sarg != null &&
   stripeStats != null &&
   idx < stripeStats.size() &&
   !isStripeSatisfyPredicate(...)) {
{code}
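
To illustrate why the order of these checks matters, here is a minimal, self-contained sketch. It is not the OrcInputFormat code: `StripeGuard`, `canSkipStripe`, and the use of `Predicate` as a stand-in for the search argument evaluation are all hypothetical, and the arguments to `isStripeSatisfyPredicate(...)` remain elided as in the comment above. It only shows how short-circuit evaluation keeps the predicate from being applied when there is no search argument, no stripe statistics, or no statistics entry for the given index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class StripeGuard {
    /**
     * Hypothetical helper: returns true only when statistics exist for stripe
     * idx AND the predicate rejects them. In every other case the stripe is
     * kept, which is the safe default.
     */
    static <S> boolean canSkipStripe(Predicate<S> sarg, List<S> stripeStats, int idx) {
        return sarg != null                      // no search argument: never skip
            && stripeStats != null               // no statistics at all: never skip
            && idx < stripeStats.size()          // no entry for this stripe: never skip
            && !sarg.test(stripeStats.get(idx)); // stand-in for !isStripeSatisfyPredicate(...)
    }

    public static void main(String[] args) {
        // Assumed toy statistics: one "max value" per stripe.
        List<Integer> stats = Arrays.asList(5, 50);
        Predicate<Integer> sarg = max -> max > 10; // hypothetical predicate

        System.out.println(canSkipStripe(sarg, stats, 0)); // true
        System.out.println(canSkipStripe(sarg, stats, 1)); // false
        System.out.println(canSkipStripe(sarg, stats, 2)); // false
        System.out.println(canSkipStripe(null, stats, 0)); // false
    }
}
```

Because `&&` evaluates left to right and stops at the first false operand, `stripeStats.get(idx)` is reached only after the null and size checks have passed.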

> Split generation in ORC may generate wrong split boundaries because of unaccounted padded bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6326
>                 URL: https://issues.apache.org/jira/browse/HIVE-6326
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-6326.1.patch, HIVE-6326.2.patch, HIVE-6326.3.patch
>
>
> HIVE-5091 added padding to ORC files to keep ORC stripes from straddling HDFS blocks. The length of these padding bytes is not stored in the stripe information. OrcInputFormat.getSplits() uses stripeInformation.getLength() for split computation, and stripeInformation.getLength() is the sum of the index length, data length, and stripe footer length. It does not account for the padding bytes, which may result in wrong split boundaries.
> The fix is to use the offset of the next stripe to determine the length of the current stripe, which then includes the padding bytes as well.
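
The arithmetic behind the proposed fix can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `StripeSpans`, `Stripe`, and `paddedLengths` are hypothetical names, `Stripe` is a simplified stand-in for the real stripe metadata, and the fallback for the last stripe (which has no following stripe) is an assumption made only so the example is complete.

```java
import java.util.ArrayList;
import java.util.List;

class StripeSpans {
    /** Hypothetical stand-in for stripe metadata; not the real ORC interface. */
    static final class Stripe {
        final long offset;
        final long indexLength;
        final long dataLength;
        final long footerLength;

        Stripe(long offset, long indexLength, long dataLength, long footerLength) {
            this.offset = offset;
            this.indexLength = indexLength;
            this.dataLength = dataLength;
            this.footerLength = footerLength;
        }

        /** Index + data + stripe footer; excludes any padding after the stripe. */
        long getLength() {
            return indexLength + dataLength + footerLength;
        }
    }

    /**
     * For every stripe except the last, the span is the distance to the next
     * stripe's offset, so padding between the two is included. For the last
     * stripe this sketch falls back to getLength() (an assumption).
     */
    static List<Long> paddedLengths(List<Stripe> stripes) {
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < stripes.size(); i++) {
            Stripe current = stripes.get(i);
            if (i + 1 < stripes.size()) {
                result.add(stripes.get(i + 1).offset - current.offset);
            } else {
                result.add(current.getLength());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = new ArrayList<>();
        stripes.add(new Stripe(3, 10, 80, 10));   // ends at byte 103
        stripes.add(new Stripe(128, 10, 80, 10)); // starts after 25 bytes of padding
        System.out.println(paddedLengths(stripes)); // prints [125, 100]
    }
}
```

In this toy layout the first stripe reports a length of 100 bytes, but the next stripe starts 125 bytes after it, so a split boundary computed from `getLength()` alone would land 25 bytes short of the next stripe.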



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
