avro-dev mailing list archives

From "Raymie Stata (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-806) add a column-major codec for data files
Date Tue, 24 Apr 2012 17:37:35 GMT

    [ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260732#comment-13260732 ]

Raymie Stata commented on AVRO-806:

This is the second attempt at a column-major codec.  The whole goal of col-major formats is
to optimize performance.  Thus, to drive this exercise forward, it seems necessary to have
some kind of benchmark for testing.  (I don't think a micro-benchmark is sufficient
-- rather, the right benchmark would use a query planner (Hive?) that can take advantage of these
formats.)  With such a benchmark in place, we'd compare the existing row-major Avro formats
(as a baseline) against the various proposed col-major formats, to make sure that
we're getting the kind of performance improvements (2x, 4x or more) needed to justify the
complexity of a col-major format.

Some comments more specific to this proposal: First, I'd like to see the Type Mapping section
for Avro filled in; this would give us a much better idea of what you're trying to do.  Second,
at first glance, it seems like your design replicates some of the features of RCFiles that
the CIF paper claims cause performance problems (but, again, maybe this issue is better addressed
via some benchmarking).

Regarding your implementation of this proposal, it re-implements all the lower levels of Avro.
It seems like this double implementation will be a maintenance problem.

> add a column-major codec for data files
> ---------------------------------------
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.7.0
>         Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
> Define a codec that, when a data file's schema is a record schema, writes blocks within
> the file in column-major order.  This would permit better compression and also permit efficient
> skipping of fields that are not of interest.
