hive-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC
Date Mon, 02 May 2016 15:16:12 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266769#comment-15266769 ]

Owen O'Malley commented on HIVE-9660:
-------------------------------------

After looking at this patch, I feel like we can do it more cleanly. I'd propose that we:
* add a capability to register callbacks on PositionedOutputStream. A callback is called
immediately if there are no uncompressed bytes, or after the next compression block finishes.
* add a similar capability to the run length encoders, which wait until the end of the current
run and then pass the callback down to the PositionedOutputStream.
* have the ORC WriterImpl create callbacks that finalize the RowIndexEntry once all of the
streams for that column have completed their run length encoding block and compression block.
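
As a rough illustration of the first and third points, here is a minimal Java sketch. It is
not the patch and does not use ORC's real classes: ToyCompressedStream, EndOffsetCallback,
RowGroupEndTracker, the fixed block size, and the pretend 2:1 compression ratio are all
hypothetical names and assumptions made up for this example. The run length encoder layer
from the second point is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupCallbackSketch {
    /** Hypothetical callback: receives the stream's compressed end offset. */
    interface EndOffsetCallback {
        void onEnd(long compressedEndOffset);
    }

    /** Toy stand-in for a compressed stream; NOT ORC's PositionedOutputStream. */
    static class ToyCompressedStream {
        private final int blockSize;
        private int bufferedUncompressed = 0;
        private long compressedWritten = 0;
        private final List<EndOffsetCallback> pending = new ArrayList<>();

        ToyCompressedStream(int blockSize) {
            this.blockSize = blockSize;
        }

        /** Fire now if nothing is buffered, otherwise when the next block finishes. */
        void register(EndOffsetCallback cb) {
            if (bufferedUncompressed == 0) {
                cb.onEnd(compressedWritten);
            } else {
                pending.add(cb);
            }
        }

        /** Buffer bytes and emit a compression block whenever the buffer is full. */
        void write(int numBytes) {
            bufferedUncompressed += numBytes;
            while (bufferedUncompressed >= blockSize) {
                finishBlock(blockSize);
            }
        }

        private void finishBlock(int uncompressedLen) {
            bufferedUncompressed -= uncompressedLen;
            compressedWritten += (uncompressedLen + 1) / 2; // assumed 2:1 ratio
            List<EndOffsetCallback> ready = new ArrayList<>(pending);
            pending.clear();
            for (EndOffsetCallback cb : ready) {
                cb.onEnd(compressedWritten);
            }
        }
    }

    /** Hypothetical per-row-group state: complete once every stream has reported. */
    static class RowGroupEndTracker {
        final long[] endOffsets;
        private int remaining;

        RowGroupEndTracker(int streamCount) {
            endOffsets = new long[streamCount];
            remaining = streamCount;
        }

        EndOffsetCallback forStream(int index) {
            return offset -> {
                endOffsets[index] = offset;
                remaining--;
            };
        }

        boolean isComplete() {
            return remaining == 0;
        }
    }

    /** One column with two streams: one idle, one with buffered bytes. */
    static RowGroupEndTracker demo() {
        ToyCompressedStream present = new ToyCompressedStream(8);
        ToyCompressedStream data = new ToyCompressedStream(8);
        RowGroupEndTracker tracker = new RowGroupEndTracker(2);

        data.write(5);                           // 5 bytes buffered, no block yet
        present.register(tracker.forStream(0));  // nothing buffered: fires at once
        data.register(tracker.forStream(1));     // deferred until a block finishes
        data.write(6);                           // buffer reaches 11: one block is emitted
        return tracker;
    }

    public static void main(String[] args) {
        RowGroupEndTracker t = demo();
        System.out.println(t.isComplete() + " " + t.endOffsets[0] + " " + t.endOffsets[1]);
    }
}
```

In this sketch the tracker plays the role of the callback that would finalize a RowIndexEntry:
it only becomes complete after both streams have reported an exact end offset, so nothing
has to be estimated.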

This makes most of the column types really straightforward. The only messy case is the
string column types, because of the delayed writing caused by the dictionary.

I should have a first draft of such a patch today for everyone to look at.

Thoughts?

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, HIVE-9660.03.patch, HIVE-9660.04.patch,
> HIVE-9660.05.patch, HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, HIVE-9660.08.patch,
> HIVE-9660.09.patch, HIVE-9660.10.patch, HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch,
> HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of extra data
> being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of compressed
> buffers for each RG, or end offset, or something, to remove this estimation magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
