orc-dev mailing list archives

From "Shardul Mahadik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-220) Stripe size too small for wide tables
Date Mon, 31 Jul 2017 18:03:00 GMT
Shardul Mahadik created ORC-220:

             Summary: Stripe size too small for wide tables
                 Key: ORC-220
                 URL: https://issues.apache.org/jira/browse/ORC-220
             Project: ORC
          Issue Type: Bug
    Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.1.0, 1.0.0
            Reporter: Shardul Mahadik

For a wide table with, e.g., 100 columns, I observed that very small stripes were generated.
As an example, for a table with 133 columns and a stripe size of 128MB with ZLIB, Hive 1.1
generated 35k stripes of 0.03MB each; with Hive 2 the situation improved to 1.2k stripes of
0.8MB each (mostly because Hive 2 selected a 64KB compression buffer size instead of the
specified 256KB).
I came across PR https://github.com/apache/hive/pull/118, which was submitted to the Hive repo.
The PR suggests using ByteBuffer.position() instead of ByteBuffer.capacity() to estimate the
stripe size. This is really useful for wide tables, where the difference between the position
and the capacity of the buffers can add up significantly. In our case, with this patch, the
number of stripes went down to 115, each stripe being 8.3MB. The patch reduced the value
returned by estimateStripeSize() by approx. 15MB, which delayed the flushing of the stripes.
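
To illustrate the difference between the two estimates, here is a minimal standalone sketch. It is
not the ORC writer code or the code from the PR: the class StripeEstimateSketch and its methods are
hypothetical names, and the layout of one buffer per column holding 1KB of data is an assumption
made only for the example. It relies on nothing beyond java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not ORC's writer implementation.
public class StripeEstimateSketch {

    // Sums the allocated size of every buffer, even if a buffer is almost empty.
    public static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Sums only the bytes actually written so far into each buffer.
    public static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Assumed layout: one buffer per column, each partially filled.
    public static List<ByteBuffer> makeBuffers(int columns, int bufferSize, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            ByteBuffer b = ByteBuffer.allocate(bufferSize);
            b.put(new byte[bytesWritten]); // advances position by bytesWritten
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // 133 columns with 256KB buffers, each holding only 1KB of data (assumed numbers).
        List<ByteBuffer> buffers = makeBuffers(133, 256 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers) + " bytes");
        System.out.println("position-based estimate: " + estimateByPosition(buffers) + " bytes");
    }
}
```

With many mostly empty buffers, the capacity-based sum grows with the column count regardless of how
much data has been written, so a writer comparing it against the stripe size would flush much
earlier than one using the position-based sum.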
Would like to know your thoughts on this.

