hive-dev mailing list archives

From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
Date Thu, 14 Feb 2013 19:21:18 GMT


Phabricator commented on HIVE-3874:

kevinwilfong has commented on the revision "HIVE-3874 [jira] Create a new Optimized Row Columnar
file format for Hive".

  A couple of minor style comments, per the style guide:

  There are a number of places in the code where you're missing spaces around + operators
(e.g. line 58 in DynamicByteArray), a space between for and ( (e.g. line 63 in
DynamicByteArray), and a space before the : in a for-each loop (e.g. line 191 in OrcStruct).

  Mentioning these now as I don't want them to hold up a commit later.
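  To illustrate the style-guide points above, here are hypothetical snippets (not lines from the actual patch) showing the expected spacing:

```java
// Hypothetical examples of the spacing rules flagged above;
// these are not the actual DynamicByteArray/OrcStruct code.
public class StyleExamples {

    // Spaces around binary operators such as +:
    public static int sum(int offset, int length) {
        return offset + length;           // good: "offset + length", not "offset+length"
    }

    // A space between the "for" keyword and the opening parenthesis:
    public static int countChars(String[] names) {
        int n = 0;
        for (int i = 0; i < names.length; i++) {  // good: "for (", not "for("
            n += names[i].length();
        }
        return n;
    }

    // A space before the ":" in a for-each loop:
    public static String joinNames(String[] names) {
        StringBuilder sb = new StringBuilder();
        for (String name : names) {       // good: "name : names", not "name: names"
            sb.append(name);
        }
        return sb.toString();
    }
}
```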

  ql/src/java/org/apache/hadoop/hive/ql/orc/ Is this loop necessary? result is a boolean
array, so all of these entries will default to false anyway.
  ql/src/java/org/apache/hadoop/hive/ql/orc/ I'm a little confused by this: if compressed
is null, why aren't you initializing overflow as well?
  ql/src/java/org/apache/hadoop/hive/ql/orc/ I saw issues with this, and with TypeInfoUtils
expecting an array instead of a list.
  ql/src/java/org/apache/hadoop/hive/ql/orc/ As far as I can tell, by storing the
intermediate string data in these structures, which do not write to a stream until
writeStripe is called, the size of string columns is not being accounted for at all
when determining whether or not to write out the stripe. (This could be fixed as a follow-up.)
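  A minimal sketch of the kind of fix suggested above (hypothetical class and method names, not the patch's actual API): have in-memory string buffers report their size, so the writer can count buffered string data toward the stripe-size threshold.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a string column buffer that tracks the bytes it
// is holding in memory, so the writer's memory estimate can include
// string columns when deciding whether to flush a stripe.
public class StringColumnBuffer {
    private final List<byte[]> pending = new ArrayList<>();
    private long bufferedBytes = 0;

    public void add(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        pending.add(utf8);
        bufferedBytes += utf8.length;
    }

    // Reported to the writer so buffered string data counts toward the
    // stripe-size threshold instead of being invisible until writeStripe.
    public long estimateMemory() {
        return bufferedBytes;
    }
}
```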


To: JIRA, omalley
Cc: kevinwilfong

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>                 Key: HIVE-3874
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, OrcFileIntro.pptx, orc.tgz
> There are several limitations of the current RC File format that I'd like to address
by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is required for
external indexes.
> * there is no mechanism for storing lightweight indexes within the file to enable push-down
filters to skip entire row groups.
> * the types of the rows aren't stored in the file
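The lightweight-index idea in the quoted description can be sketched as follows (a hypothetical min/max statistics check, not ORC's actual index format): keeping per-row-group min/max values lets a push-down filter rule out entire row groups without reading their data.

```java
// Hypothetical sketch of row-group skipping via a lightweight min/max
// index; this is not ORC's actual index format.
public class RowGroupIndex {
    public static final class Stats {
        public final long min;
        public final long max;
        public Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // Return true when the predicate "value == target" could match
    // somewhere inside the row group; false means the whole row group
    // can be skipped without decompressing or deserializing it.
    public static boolean mightContain(Stats stats, long target) {
        return target >= stats.min && target <= stats.max;
    }
}
```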

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
