Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 18 Jan 2013 14:38:15 +0000 (UTC)
From: "Yin Huai (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12626790.1357748365038.158684.1358519895986@arcas>
In-Reply-To: <JIRA.12626790.1357748365038@arcas>
References: <JIRA.12626790.1357748365038@arcas>
Subject: [jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar
 file format for Hive
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557245#comment-13557245 ] 

Yin Huai commented on HIVE-3874:
--------------------------------

One question: why must a small row group size be used to speed up secondary index lookups? With a large row group size, if each column is stored as multiple blocks and the row group carries an index for that column, we do not need to read the entire column from disk. We can also use row numbers to locate which blocks need to be read from the other columns in the row group.
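
The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions that are not taken from the proposed format: every column in a row group is split into blocks of a fixed number of rows (`BLOCK_ROWS`), and the indexed column keeps one min/max entry per block. All names (`BlockIndexEntry`, `build_block_index`, `lookup`) are hypothetical.

```python
from dataclasses import dataclass

BLOCK_ROWS = 4  # hypothetical number of rows per column block


@dataclass
class BlockIndexEntry:
    first_row: int   # row number of the first value in the block
    min_value: int   # smallest value stored in the block
    max_value: int   # largest value stored in the block


def build_block_index(column):
    """Build a per-block min/max index for one column of a row group."""
    index = []
    for start in range(0, len(column), BLOCK_ROWS):
        block = column[start:start + BLOCK_ROWS]
        index.append(BlockIndexEntry(start, min(block), max(block)))
    return index


def lookup(row_group, indexed_col, key, wanted_cols):
    """Return the wanted columns of rows where indexed_col == key.

    Also returns how many column blocks were touched, to show that the
    whole column (and the whole row group) does not have to be read.
    """
    # In a real file the index would be stored, not rebuilt per lookup.
    index = build_block_index(row_group[indexed_col])
    results = []
    blocks_read = 0
    for entry in index:
        if not (entry.min_value <= key <= entry.max_value):
            continue  # the index rules this block out, so skip it
        start = entry.first_row
        end = start + BLOCK_ROWS
        key_block = row_group[indexed_col][start:end]
        blocks_read += 1
        matches = [start + i for i, v in enumerate(key_block) if v == key]
        if not matches:
            continue
        # Row numbers identify the matching block in every other column.
        other_blocks = {c: row_group[c][start:end] for c in wanted_cols}
        blocks_read += len(wanted_cols)
        for row in matches:
            results.append({c: other_blocks[c][row - start] for c in wanted_cols})
    return results, blocks_read


# One "large" row group with three columns of 16 rows each (4 blocks per column).
row_group = {
    "id": list(range(16)),
    "name": [f"n{i}" for i in range(16)],
    "unused": [0] * 16,
}
rows, blocks_read = lookup(row_group, "id", 9, ["name"])
```

In this toy case only two of the twelve column blocks are touched: one block of `id` and the matching block of `name`.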

If a small row group size is used, each column within a row group can be very small, so a single buffered read may retrieve a lot of unnecessary data belonging to unneeded columns.
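
Some rough arithmetic for this concern, as a sketch only. It assumes equal-sized columns laid out back to back inside a row group and a buffered read that starts at the beginning of the wanted column; the function name and the sizes used below are made up for illustration.

```python
def wasted_fraction(row_group_bytes, num_columns, read_buffer_bytes):
    """Fraction of one buffered read that falls outside the wanted column.

    Assumes equal-sized columns stored contiguously and a read that
    starts at the first byte of the wanted column.
    """
    column_bytes = row_group_bytes / num_columns
    useful = min(column_bytes, read_buffer_bytes)
    return 1.0 - useful / read_buffer_bytes


KB = 1024
MB = 1024 * KB

# Hypothetical sizes: 64 columns, 256 KB buffered reads.
small = wasted_fraction(4 * MB, 64, 256 * KB)     # 64 KB per column
large = wasted_fraction(256 * MB, 64, 256 * KB)   # 4 MB per column
```

Under these assumptions, three quarters of each read is wasted with the 4 MB row group, while none of it is wasted with the 256 MB row group.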
                
> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or per row group
> * there is no mechanism for seeking to a particular row number, which is required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file
