hive-dev mailing list archives

From "Prasanth J (JIRA)" <>
Subject [jira] [Commented] (HIVE-5632) Eliminate splits based on SARGs using stripe statistics in ORC
Date Thu, 31 Oct 2013 00:49:25 GMT


Prasanth J commented on HIVE-5632:

[~ehans] Row-group-level skipping (row groups of 10,000 rows) is already implemented as part of PPD.
This patch adds stripe-level skipping. With this patch, a stripe will NOT be read if its min/max
metadata shows that no row in it can satisfy the predicate.

To make this clearer: OrcInputFormat creates input splits based on two MapReduce configs,
mapred.min.split.size and mapred.max.split.size. The default mapred.min.split.size is 16MB and
the default mapred.max.split.size is 256MB. If an ORC stripe is smaller than mapred.max.split.size,
it is merged with the adjacent ORC stripe. Stripes keep being merged until mapred.max.split.size
is reached, so a split can contain more than one ORC stripe. Before merging a stripe into a
split, this patch checks the stripe's min/max statistics against the predicate. If the stripe can
satisfy the predicate, it is merged into the current split; otherwise the stripe is eliminated and
a new split is started. Only the final list of input splits is submitted for execution, which
ensures that byte ranges (essentially ORC stripes) that are not required are not read.
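
The merge-or-eliminate loop described above can be sketched roughly as follows. This is only an
illustration, not the code in the patch: the class SplitElimSketch, the Stripe and Split types,
and the single-column range check in mayMatch are hypothetical stand-ins for the real ORC
metadata and SARG evaluation. It assumes stripes are contiguous and listed in file order, closes
a split before it would exceed the max size, and ignores mapred.min.split.size.

```java
import java.util.ArrayList;
import java.util.List;

class SplitElimSketch {
    /** Hypothetical stripe descriptor: byte range plus min/max of one column. */
    static final class Stripe {
        final long offset, length, min, max;
        Stripe(long offset, long length, long min, long max) {
            this.offset = offset; this.length = length; this.min = min; this.max = max;
        }
    }

    /** Hypothetical split: a contiguous byte range covering one or more stripes. */
    static final class Split {
        final long offset;
        long length;
        Split(long offset, long length) { this.offset = offset; this.length = length; }
        @Override public String toString() { return "[" + offset + ", +" + length + ")"; }
    }

    /** Stand-in for SARG evaluation: can "lo <= col <= hi" match any row in the stripe? */
    static boolean mayMatch(Stripe s, long lo, long hi) {
        return s.max >= lo && s.min <= hi;
    }

    static List<Split> buildSplits(List<Stripe> stripes, long lo, long hi, long maxSplitSize) {
        List<Split> splits = new ArrayList<>();
        Split current = null;
        for (Stripe s : stripes) {
            if (!mayMatch(s, lo, hi)) {
                // Stripe eliminated: close the current split so its byte range
                // never covers the pruned stripe, then start fresh.
                if (current != null) { splits.add(current); current = null; }
                continue;
            }
            // Close the current split if adding this stripe would exceed the max size.
            if (current != null && current.length + s.length > maxSplitSize) {
                splits.add(current);
                current = null;
            }
            if (current == null) {
                current = new Split(s.offset, s.length);
            } else {
                current.length += s.length; // merge adjacent stripe into the split
            }
        }
        if (current != null) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = new ArrayList<>();
        stripes.add(new Stripe(0, 100, 0, 10));
        stripes.add(new Stripe(100, 100, 50, 60)); // cannot match 0..10, so it is pruned
        stripes.add(new Stripe(200, 100, 5, 8));
        System.out.println(buildSplits(stripes, 0, 10, 1000));
    }
}
```

In this sketch the pruned middle stripe forces two separate splits, so the byte range of the
eliminated stripe is not covered by any split that gets submitted.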

> Eliminate splits based on SARGs using stripe statistics in ORC
> --------------------------------------------------------------
>                 Key: HIVE-5632
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-5632.1.patch.txt, HIVE-5632.2.patch.txt, orc_split_elim.orc
> HIVE-5562 provides stripe-level statistics in ORC. Stripe-level statistics combined with
predicate pushdown in ORC (HIVE-4246) can be used to eliminate the stripes (and thereby splits)
that don't satisfy the predicate condition. This can greatly reduce unnecessary reads.

This message was sent by Atlassian JIRA
