hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
Date Thu, 10 Jan 2013 20:32:13 GMT


He Yongqiang commented on HIVE-3874:

I want to list a few reasons why I think the ORC solution is a much more appealing one.

1. For a big data warehouse that stores more than 90% of its existing data in RCFile (like
FB's >100PB warehouse), converting data from one format to another is something that should
definitely be avoided. It is possible to convert some tables if there is a big space-saving
advantage. But managing two distinct formats that have no compatibility or interoperability,
or that even live in two different code repositories, is another big headache that we would
want to avoid in the first place.
2. Developing the new ORC format in the Hive/HCatalog codebase will make Hive development and
operations much easier.
3. Letting the new ORC format have some backward compatibility with RCFile will save a lot of

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>                 Key: HIVE-3874
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
> There are several limitations of the current RC File format that I'd like to address
> by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is required for
>   external indexes
> * there is no mechanism for storing lightweight indexes within the file to enable push-down
>   filters to skip entire row groups
> * the type of the rows isn't stored in the file
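
To make the points about row counts, row-number seeking, and lightweight indexes more concrete, here is a minimal sketch in Python. It is not the ORC design or any Hive API: the names RowGroupIndex, build_index, scan_greater_than, and find_group_for_row are hypothetical, and it assumes a single integer column split into non-empty row groups. It only illustrates how per-row-group min/max statistics could let a pushed-down filter skip whole groups, and how stored row counts could let a reader locate a row by number.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroupIndex:
    """Hypothetical lightweight index entry, one per row group."""
    first_row: int   # row number of the first row in this group
    row_count: int   # number of rows stored in this group
    min_value: int   # smallest column value in this group
    max_value: int   # largest column value in this group


def build_index(groups: List[List[int]]) -> List[RowGroupIndex]:
    """Compute one index entry per row group (assumes every group is non-empty)."""
    index = []
    first_row = 0
    for values in groups:
        index.append(RowGroupIndex(first_row, len(values), min(values), max(values)))
        first_row += len(values)
    return index


def scan_greater_than(groups: List[List[int]],
                      index: List[RowGroupIndex],
                      threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, skipping groups the index rules out.

    Returns the matching values and the number of row groups actually read.
    """
    matches = []
    groups_read = 0
    for values, entry in zip(groups, index):
        if entry.max_value <= threshold:
            continue  # no row in this group can match, so it is never read
        groups_read += 1
        matches.extend(v for v in values if v > threshold)
    return matches, groups_read


def find_group_for_row(index: List[RowGroupIndex], row_number: int) -> int:
    """Return the position of the row group that contains row_number."""
    for position, entry in enumerate(index):
        if entry.first_row <= row_number < entry.first_row + entry.row_count:
            return position
    raise IndexError(f"row {row_number} is out of range")
```

With three row groups [1, 2, 3], [10, 20, 30], and [4, 5, 6], a filter for values greater than 6 reads only the second group, and row number 4 resolves to that same group without scanning the data.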

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
