avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-806) add a column-major codec for data files
Date Wed, 20 Apr 2011 22:08:06 GMT

    [ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022395#comment-13022395 ]

Doug Cutting commented on AVRO-806:

The question is not whether elements at depth > 1 are included, but whether each is
stored in a distinct column.  Regardless, one reads the data file in the same way, using a
schema with a subset of the fields, even when the column-major codec is not in use at all.
So if a query scans only field x.y.z, then storing values for x.y in a column will still be
faster than row-major order, but perhaps not as fast as if x.y.z values were stored in their
own column, especially if y has a lot of other fields.  Note that Avro is already fast at
skipping string and binary values that are not desired: it reads the length and increments
the buffer pointer.  So column-major will provide the biggest speedup for structures that
have a lot of numeric fields that are often ignored by queries.
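
To illustrate the skipping point above, here is a minimal Python sketch, not Avro's actual
implementation. It assumes the binary encoding described in the Avro spec (longs as zigzag
variable-length integers, strings as a long length followed by UTF-8 bytes). All function
names are hypothetical helpers made up for this example.

```python
def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding of a 64-bit signed value."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> small codes
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits left
        out.append((z & 0x7F) | 0x80) # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length-prefixed UTF-8 string."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def decode_long(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, position after the value)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos   # undo zigzag


def skip_string(buf: bytes, pos: int) -> int:
    """Skip a string: read the length, then jump over the payload."""
    length, pos = decode_long(buf, pos)
    return pos + length               # payload bytes are never inspected


def skip_long(buf: bytes, pos: int) -> int:
    """Skip a long: every byte must be inspected to find where it ends."""
    while buf[pos] & 0x80:
        pos += 1
    return pos + 1


# One row-major record: a long string followed by two numeric fields.
row = encode_string("x" * 1000) + encode_long(300) + encode_long(-7)
after_string = skip_string(row, 0)          # one jump over 1000 payload bytes
after_first = skip_long(row, after_string)  # byte-by-byte scan
value, end = decode_long(row, after_first)
```

Skipping the string costs one length read regardless of payload size, while each unwanted
numeric field has to be scanned byte by byte, which is why a separate column for numeric
fields would help most in this sketch.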

> add a column-major codec for data files
> ---------------------------------------
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
> Define a codec that, when a data file's schema is a record schema, writes blocks within
> the file in column-major order.  This would permit better compression and also permit efficient
> skipping of fields that are not of interest.
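
The column-major idea in the description can be sketched in Python as a transposition of
one block of records. This is only an illustration under assumed names (to_column_major,
to_row_major, and the sample fields are hypothetical); it says nothing about how the codec
would actually lay out bytes on disk.

```python
def to_column_major(records: list[dict], fields: list[str]) -> dict[str, list]:
    """Group a block's values by field instead of by record."""
    return {f: [r[f] for r in records] for f in fields}


def to_row_major(columns: dict[str, list], fields: list[str]) -> list[dict]:
    """Rebuild the records of a block from its columns."""
    count = len(columns[fields[0]]) if fields else 0
    return [{f: columns[f][i] for f in fields} for i in range(count)]


# A block of records with one string field and two numeric fields.
block = [
    {"name": "a", "hits": 3, "bytes": 1200},
    {"name": "b", "hits": 7, "bytes": 90},
    {"name": "c", "hits": 1, "bytes": 45000},
]
fields = ["name", "hits", "bytes"]

columns = to_column_major(block, fields)

# A query that only needs "hits" touches one column and ignores the rest.
hits_only = columns["hits"]
```

Because each column holds values of a single type next to one another, a reader interested
in a few fields can ignore the other columns entirely, and similar values sit together for
the compressor.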

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
