hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <>
Subject [jira] [Commented] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC
Date Fri, 08 Apr 2016 21:18:25 GMT


Prasanth Jayachandran commented on HIVE-9660:

I still don't think we need a config for the writer. I can see that the config was added to
avoid writing wrong lengths, or to disable the feature. But the problem is that we won't be
able to identify the files that have already been written wrongly. So I would recommend bumping
the writerVersion to reflect this jira (HIVE-9660). With this we can identify files that were
written after HIVE-9660. In future, if we find anything wrong, we bump the writerVersion again
and make the reader resilient by ignoring lengths from files written with HIVE-9660. There
should also be a reader config that chooses between using the lengths when available and
falling back to the old codepath.
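
The reader-side decision described above could look roughly like the sketch below. It is
only an illustration, not code from any patch on this jira: the class, the enum constants,
the method name and the "distrusted versions" set are all hypothetical and do not correspond
to the actual ORC writer version enum or to any existing Hive config key.

```java
import java.util.Set;

// Hypothetical sketch: decide whether the reader should trust stored RG end offsets.
class RowGroupLengthPolicy {

    // Hypothetical writer versions, ordered oldest to newest.
    enum WriterVersion {
        ORIGINAL(0),    // files written before end offsets were stored
        HIVE_9660(1),   // assumed: first version that stores end offsets per RG
        LATER_FIX(2);   // assumed: a later bump if HIVE_9660 lengths turn out wrong

        private final int id;

        WriterVersion(int id) {
            this.id = id;
        }

        // True if this version is the same as or newer than 'other'.
        boolean includes(WriterVersion other) {
            return id >= other.id;
        }
    }

    // Returns true only when every condition for using stored lengths holds:
    // the reader config allows it, the file actually carries lengths, the file
    // was written at or after HIVE_9660, and that version is not known to be bad.
    // Otherwise the caller should fall back to the old estimation codepath.
    static boolean useStoredLengths(WriterVersion fileVersion,
                                    boolean readerConfigEnabled,
                                    boolean lengthsPresent,
                                    Set<WriterVersion> distrustedVersions) {
        if (!readerConfigEnabled || !lengthsPresent) {
            return false;
        }
        if (!fileVersion.includes(WriterVersion.HIVE_9660)) {
            return false;
        }
        return !distrustedVersions.contains(fileVersion);
    }
}
```

With an empty distrusted set, a HIVE_9660 file with lengths present is read using the stored
offsets. If a bug is later found, adding HIVE_9660 to the distrusted set makes the reader
ignore lengths from those files, while files written with the bumped version still use them.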

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>                 Key: HIVE-9660
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, HIVE-9660.03.patch, HIVE-9660.04.patch,
HIVE-9660.05.patch, HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, HIVE-9660.patch,
> Right now the end offset is estimated, which in some cases results in tons of extra data
> being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of compressed
> buffers for each RG, or end offset, or something, to remove this estimation magic
