hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC
Date Mon, 02 May 2016 18:23:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267179#comment-15267179 ]

Sergey Shelukhin edited comment on HIVE-9660 at 5/2/16 6:22 PM:
----------------------------------------------------------------

{quote}
The run length encoder doesn't perform the callback, but when its RLE block is finished passes
the same callback to the OutStream for when the OutStream finishes the next compression block.
Thus it is easy to guarantee that you only get called back when compression block finishes
after the RLE finishes, which is the required condition. Obviously, for cases where there
isn't an RLE, it just puts the callback directly on the OutStream and it works exactly the
same way.
{quote}
An RG can span several RLE blocks, so the RLE writer would need to know when to pass the
callback (assuming the callback maps to an RG; otherwise, how does the WriterImpl know which
RG is done after a callback?). An RLE block can also contain several RGs. Moreover, in the
case of a boolean writer, there are two levels of buffering: the current byte, and the RLE
buffer in the underlying byte writer.
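
To make the two buffering levels concrete, here is a minimal sketch of a boolean writer that
packs bits into a current byte and then stages full bytes in a second buffer before anything
reaches the stream. This is not the ORC writer code: the class name, the method names, and the
128-byte buffer capacity are assumptions for illustration, and no actual run-length encoding
is performed.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of two-level buffering; not the actual ORC classes. */
class BooleanBufferingSketch {
    // Level 2: stands in for the byte RLE writer's staging buffer.
    // The capacity is an assumed value chosen only for this example.
    static final int RLE_BUFFER_SIZE = 128;
    private final byte[] rleBuffer = new byte[RLE_BUFFER_SIZE];
    private int rleCount = 0;

    // Level 1: the byte currently being filled with bits.
    private int currentByte = 0;
    private int bitsInCurrentByte = 0;

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    void write(boolean value) {
        currentByte = (currentByte << 1) | (value ? 1 : 0);
        bitsInCurrentByte++;
        if (bitsInCurrentByte == 8) {
            // The byte is complete, but it only moves to the second buffer.
            rleBuffer[rleCount++] = (byte) currentByte;
            currentByte = 0;
            bitsInCurrentByte = 0;
            if (rleCount == RLE_BUFFER_SIZE) {
                // A real writer would run-length encode here; the sketch just flushes.
                out.write(rleBuffer, 0, rleCount);
                rleCount = 0;
            }
        }
    }

    /** Values accepted by write() that are not yet visible in the output stream. */
    long pendingValues() {
        return (long) rleCount * 8 + bitsInCurrentByte;
    }

    /** Values whose bytes have reached the output stream. */
    long flushedValues() {
        return (long) out.size() * 8;
    }

    public static void main(String[] args) {
        BooleanBufferingSketch writer = new BooleanBufferingSketch();
        for (int i = 0; i < 10_003; i++) {
            writer.write(i % 3 == 0);
        }
        System.out.println("flushed=" + writer.flushedValues()
            + " pending=" + writer.pendingValues());
    }
}
```

At any RG boundary, the position in the stream lags the values written by whatever sits in
both buffers, which is why a callback fired by the stream alone cannot identify the RG.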

There's also the issue of dictionaries and strings, where isPresent is written normally but
the entries cannot be finalized.
In general, I feel like all the coordination complexity will still be necessary; it would
just end up moving around a bit.

For uncompressed streams, if the exact boundary had to be determined, a callback would need
to be invoked for every RLE buffer, and in some cases, such as the boolean writer, that could
be as often as every few bytes.
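
A rough way to see the difference in callback frequency is to simulate it. The sketch below
is a toy model, not the proposed patch: the class and method names are hypothetical, the RLE
buffer is assumed to flush every fixed number of bytes, and a block size of 0 is used to mean
"uncompressed". With compression, a callback armed at an RLE flush fires only when the next
compression block finishes; without compression, it fires at every RLE flush.

```java
/** Hypothetical toy model of callback frequency; not the actual ORC writer logic. */
class CallbackFrequencySketch {

    /**
     * Counts callback invocations while writing totalBytes.
     *
     * @param totalBytes     number of bytes written to the stream
     * @param rleBufferBytes assumed number of bytes between RLE buffer flushes
     * @param blockBytes     compression block size, or 0 for an uncompressed stream
     */
    static int countCallbacks(int totalBytes, int rleBufferBytes, int blockBytes) {
        int callbacks = 0;
        boolean callbackArmed = false;
        for (int written = 1; written <= totalBytes; written++) {
            if (written % rleBufferBytes == 0) {
                // The RLE buffer flushes at this byte.
                if (blockBytes == 0) {
                    callbacks++;           // uncompressed: the boundary is exact, fire now
                } else {
                    callbackArmed = true;  // compressed: wait for the block to finish
                }
            }
            if (blockBytes > 0 && written % blockBytes == 0 && callbackArmed) {
                callbacks++;               // the compression block finished after an RLE flush
                callbackArmed = false;
            }
        }
        return callbacks;
    }

    public static void main(String[] args) {
        // Assumed example values: 1,000,000 bytes, an RLE flush every 8 bytes,
        // and 256 KB compression blocks.
        System.out.println("compressed:   " + countCallbacks(1_000_000, 8, 262_144));
        System.out.println("uncompressed: " + countCallbacks(1_000_000, 8, 0));
    }
}
```

Under these assumed numbers the compressed case fires only once per finished block, while the
uncompressed case fires once per RLE flush, which is several orders of magnitude more often.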



> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, HIVE-9660.03.patch, HIVE-9660.04.patch,
> HIVE-9660.05.patch, HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, HIVE-9660.08.patch,
> HIVE-9660.09.patch, HIVE-9660.10.patch, HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch,
> HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of extra data
> being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of compressed
> buffers for each RG, or end offset, or something, to remove this estimation magic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)