avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-806) add a column-major codec for data files
Date Wed, 05 Sep 2012 23:40:09 GMT

    [ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449265#comment-13449265 ]

Doug Cutting commented on AVRO-806:
-----------------------------------

Jakob, I think the more common case will be that fields with small values produce
small columns, where seek time becomes significant.  When seek time is significant, the returns
from greater parallelism diminish unless replication is also increased, which is unlikely.

With multiple row groups per file you have to choose a size for the row groups.  Would you
ever choose a size smaller than 64MB, the typical HDFS block size?  Column files are only
an advantage when there are multiple columns, so the amount read will typically be a fraction
of the row group size.
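
To make the trade-off concrete, here is a rough back-of-the-envelope sketch, not a
measurement.  It assumes equal-sized columns, one seek per column read, a 10 ms seek and
100 MB/s sequential transfer; the function name and all of these figures are illustrative
assumptions, not values from the patch.

```python
def seek_overhead_fraction(row_group_bytes, num_columns, columns_read,
                           seek_seconds=0.010, bytes_per_second=100e6):
    """Estimate the fraction of read time spent seeking rather than transferring.

    Assumes (hypothetically) that all columns are the same size and that
    reading each selected column costs exactly one seek.
    """
    column_bytes = row_group_bytes / num_columns
    transfer_seconds = columns_read * column_bytes / bytes_per_second
    seek_total = columns_read * seek_seconds
    return seek_total / (seek_total + transfer_seconds)


MB = 2 ** 20

# A 64MB row group with 16 columns, reading 2 of them.
large = seek_overhead_fraction(64 * MB, 16, 2)

# A 1MB row group with the same layout: each column is tiny, so seeks dominate.
small = seek_overhead_fraction(1 * MB, 16, 2)

print(f"64MB row group: {large:.0%} of read time is seeking")
print(f" 1MB row group: {small:.0%} of read time is seeking")
```

Under these assumptions, shrinking the row group shrinks every column with it, so the
fixed per-column seek cost takes up a growing share of the read.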

In what cases do you imagine a row group size smaller than a file being useful?
                
> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-806.patch, AVRO-806.patch, AVRO-806-v2.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes blocks within
> the file in column-major order.  This would permit better compression and also permit efficient
> skipping of fields that are not of interest.

