hive-issues mailing list archives

From "Abhishek Somani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15390) Orc reader unnecessarily reading stripe footers with hive.optimize.index.filter set to true
Date Thu, 08 Dec 2016 15:38:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732538#comment-15732538
] 

Abhishek Somani commented on HIVE-15390:
----------------------------------------

I think this is happening due to a bug in OrcRawRecordMerger initialization where we reset
the split's length to Long.MAX_VALUE. 

{code:java}
if (isOriginal) {
  options = options.clone();
  options.range(options.getOffset(), Long.MAX_VALUE);
  pair = new OriginalReaderPair(key, reader, bucket, minKey, maxKey,
           options);
}
{code}

This causes the reader to incorrectly assume that all the stripes from the start offset to
the end of the file are its responsibility to read. In RecordReaderImpl.java:

{code:java}
long offset = options.getOffset();
long maxOffset = options.getMaxOffset();
for(StripeInformation stripe: fileReader.getStripes()) {
  long stripeStart = stripe.getOffset();
  if (offset > stripeStart) {
    skippedRows += stripe.getNumberOfRows();
  } else if (stripeStart < maxOffset) {
    this.stripes.add(stripe);
    rows += stripe.getNumberOfRows();
  }
}
{code}
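
To illustrate the effect of the widened range on that loop, here is a minimal, self-contained sketch. It is not Hive code: the class `StripeSelectionSketch`, the method `selectStripes`, and the stripe offsets are hypothetical, and it assumes the max offset is computed as offset + length, capped at Long.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the stripe selection loop; a stripe is
// represented only by its start offset.
class StripeSelectionSketch {

  // Returns the start offsets of the stripes that the loop would select
  // for a split described by (offset, length).
  static List<Long> selectStripes(long[] stripeStarts, long offset, long length) {
    // Assumption: maxOffset is offset + length, capped at Long.MAX_VALUE.
    long maxOffset =
        (length > Long.MAX_VALUE - offset) ? Long.MAX_VALUE : offset + length;
    List<Long> selected = new ArrayList<>();
    for (long stripeStart : stripeStarts) {
      if (offset > stripeStart) {
        continue; // stripe starts before the split: skipped
      } else if (stripeStart < maxOffset) {
        selected.add(stripeStart); // stripe starts inside [offset, maxOffset)
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    long[] stripeStarts = {3L, 1000L, 2000L, 3000L, 4000L};
    // Split covering [1000, 2000): only the stripe at 1000 is selected.
    System.out.println(selectStripes(stripeStarts, 1000L, 1000L));
    // Same start offset with length reset to Long.MAX_VALUE: every stripe
    // from offset 1000 to the end of the file is selected.
    System.out.println(selectStripes(stripeStarts, 1000L, Long.MAX_VALUE));
  }
}
```

With these made-up offsets, the first call prints `[1000]` and the second prints `[1000, 2000, 3000, 4000]`.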

Although the task does not read data that it should not be reading, because it also relies
on the min and max row key to make that decision, in some cases it ends up reading the
footers of all the stripes, increasing task completion time.

Once all the rows from a stripe have been read, RecordReaderImpl.advanceToNextRow() is invoked
which reads the footer of the next stripe and evaluates the sarg against the rowgroups in
that stripe using the metadata in the footer. If sarg evaluation fails for all the rowgroups
in the stripe, it moves on to the next stripe and repeats the same process there, ending up
reading stripe footers for all the stripes in the file.
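
The footer-reading behaviour described above can be sketched with a toy model. Everything here is hypothetical (the class `FooterReadSketch`, the method `countFooterReads`, and the per-stripe sarg results). It assumes that reading a selected stripe always costs one footer read, and that the reader stops once it produces a row from a stripe beyond the split, because that row exceeds the max row key.

```java
// Toy model of advancing through the selected stripes; not Hive code.
class FooterReadSketch {

  // sargMatches[i]: whether at least one rowgroup of stripe i passes the sarg.
  // lastStripeOfSplit: index of the last stripe that really belongs to the split.
  // lastSelectedStripe: index of the last stripe selected by the reader.
  static int countFooterReads(boolean[] sargMatches, int firstStripe,
                              int lastStripeOfSplit, int lastSelectedStripe) {
    int footerReads = 0;
    for (int i = firstStripe; i <= lastSelectedStripe; i++) {
      footerReads++; // the footer is read before the sarg can be evaluated
      if (sargMatches[i] && i > lastStripeOfSplit) {
        // Assumption: a row past the max row key is produced and reading stops.
        break;
      }
      // Otherwise the stripe is either consumed or skipped, and we advance.
    }
    return footerReads;
  }

  public static void main(String[] args) {
    // Only stripe 0 matches the sarg; the split really covers stripe 0 only.
    boolean[] sargMatches = {true, false, false, false, false};
    System.out.println(countFooterReads(sargMatches, 0, 0, 0)); // bounded range
    System.out.println(countFooterReads(sargMatches, 0, 0, 4)); // range to end of file
  }
}
```

In this model the bounded range costs 1 footer read and the unbounded range costs 5, one per stripe in the file, even though no extra rows are returned.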

Is my understanding correct?

> Orc reader unnecessarily reading stripe footers with hive.optimize.index.filter set to true
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15390
>                 URL: https://issues.apache.org/jira/browse/HIVE-15390
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.2.1
>            Reporter: Abhishek Somani
>            Assignee: Abhishek Somani
>
> In a split given to a task, the task's orc reader is unnecessarily reading stripe footers
> for stripes that are not its responsibility to read. This is happening with
> hive.optimize.index.filter set to true.
> Assuming one split per task (no tez grouping considered), a task should not need to read
> beyond the split's end offset. Even in some split computation strategies where a split's end
> offset can be in the middle of a stripe, it should not need to read more than one stripe beyond
> the split's end offset (to fully read a stripe that started in it). However, I see that some
> tasks make unnecessary filesystem calls to read all the stripe footers in a file from the
> split start offset till the end of the file.



