avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-806) add a column-major codec for data files
Date Fri, 22 Jul 2011 21:36:09 GMT

    [ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069792#comment-13069792 ]

Doug Cutting commented on AVRO-806:

Yes, CIF file looks promising.  It's great to see all the benchmarks!

I wonder if the advantages of CIF could be had without a custom HDFS block placement strategy?
For example, one might pack the files of a split directory into a single file whose block
size is set to the size of the file, forcing it into a single block.  This would guarantee
locality for the columns of a split.

In other words, instead of groups of column-major records within a file ("block columnar"
in Raymie's document) on one hand or a file-per-column on the other ("file columnar"), we
have a single group per file.  Since splits might often be bigger than RAM, creating these
would probably require two steps: writing a set of temporary local files, one per column,
then appending these into the final output.  The file would have an index indicating where
each column lies, and each column within the file would permit efficient skipping, in the
style of CIF.
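
The two-step write described above might be sketched roughly as follows. This is only an
illustration under stated assumptions, not the format proposed in the patch: the function
names (write_single_group, read_column), the trailer and index layout, and the encoding of
every value as a length-prefixed UTF-8 string are all hypothetical, chosen to keep the
example short.

```python
import os
import shutil
import struct
import tempfile

# Hypothetical layout: [column 0 bytes][column 1 bytes]...[index][trailer]
TRAILER = struct.Struct(">IQ")   # column count, offset of the index
ENTRY = struct.Struct(">QQ")     # column start offset, column length


def write_single_group(records, field_names, out_path):
    """Write all records as one column group, using per-column temp files."""
    tmp_dir = tempfile.mkdtemp()
    tmp_paths = [os.path.join(tmp_dir, "col%d.tmp" % i)
                 for i in range(len(field_names))]

    # Step 1: stream records, appending each field to its column's temp file.
    tmp_files = [open(p, "wb") for p in tmp_paths]
    try:
        for rec in records:
            for name, f in zip(field_names, tmp_files):
                value = str(rec[name]).encode("utf-8")
                # Length prefix lets a reader skip values without decoding them.
                f.write(struct.pack(">I", len(value)))
                f.write(value)
    finally:
        for f in tmp_files:
            f.close()

    # Step 2: append the column files into the final output, then the index.
    index = []
    with open(out_path, "wb") as out:
        for name, path in zip(field_names, tmp_paths):
            start = out.tell()
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            index.append((name, start, out.tell() - start))
            os.remove(path)
        index_offset = out.tell()
        for name, start, length in index:
            name_bytes = name.encode("utf-8")
            out.write(struct.pack(">H", len(name_bytes)))
            out.write(name_bytes)
            out.write(ENTRY.pack(start, length))
        out.write(TRAILER.pack(len(index), index_offset))
    os.rmdir(tmp_dir)


def read_column(path, wanted):
    """Read one column via the index, without touching the other columns."""
    with open(path, "rb") as f:
        f.seek(-TRAILER.size, os.SEEK_END)
        count, index_offset = TRAILER.unpack(f.read(TRAILER.size))
        f.seek(index_offset)
        for _ in range(count):
            (name_len,) = struct.unpack(">H", f.read(2))
            name = f.read(name_len).decode("utf-8")
            start, length = ENTRY.unpack(f.read(ENTRY.size))
            if name == wanted:
                break
        else:
            raise KeyError(wanted)
        f.seek(start)
        data = f.read(length)

    values, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        values.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return values
```

Only one record is held in memory at a time during step 1, which is the point of spilling
to temporary files when a split is bigger than RAM.  Putting the index at the end of the
file means the writer never has to know column sizes in advance.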

> add a column-major codec for data files
> ---------------------------------------
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
> Define a codec that, when a data file's schema is a record schema, writes blocks within
> the file in column-major order.  This would permit better compression and also permit efficient
> skipping of fields that are not of interest.

