avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-806) add a column-major codec for data files
Date Wed, 20 Apr 2011 18:57:06 GMT

    [ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022298#comment-13022298 ]

Doug Cutting commented on AVRO-806:

I was thinking of just creating columns for the fields of the top-level record.
 In this approach, a union would be written as a union, prefixed with a varint indicating
the branch taken.
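
A minimal sketch of this depth-1 layout, in Python for brevity. The helper names (encode_long, encode_nullable_string, columnize) are hypothetical and are not part of any Avro API. The only thing taken from Avro is the binary encoding: longs are zigzag varints, strings are a length followed by UTF-8 bytes, and a union is its branch index followed by the value. Each top-level field gets one buffer, and a union field stays inline in its own column.

```python
def encode_long(n: int) -> bytes:
    """Zigzag varint, the encoding Avro uses for int and long values."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_nullable_string(v) -> bytes:
    """Union ["null", "string"]: branch index first, then the value if any."""
    if v is None:
        return encode_long(0)
    return encode_long(1) + encode_string(v)


def columnize(records, encoders):
    """Write one buffer per top-level field (hypothetical depth-1 layout)."""
    columns = {name: bytearray() for name in encoders}
    for rec in records:
        for name, enc in encoders.items():
            columns[name] += enc(rec[name])
    return {name: bytes(buf) for name, buf in columns.items()}


records = [{"id": 1, "tag": "a"}, {"id": 2, "tag": None}]
columns = columnize(records, {"id": encode_long, "tag": encode_nullable_string})
print(columns)  # {'id': b'\x02\x04', 'tag': b'\x02\x02a\x00'}
```

A reader interested only in "id" can skip the "tag" buffer entirely, which is the scan acceleration the issue is after.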

If we stored union branches separately then we'd also need a column that has the varint. 
Iterators would then use this to decide when a column has a value.  For nested unions I think
the iterators would need to have a list of pointers to varints.
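
To make the alternative concrete, here is a rough sketch of storing union branches separately, with one extra column holding the branch indexes. The function names and the in-memory lists standing in for column buffers are assumptions for illustration only. The iterator consults the branch column to decide which value column, if any, supplies the next value.

```python
def split_union(values, branch_of, null_branch=0):
    """Split a union column into a branch-index column plus one column per branch."""
    branches = []
    per_branch = {}
    for v in values:
        b = branch_of(v)
        branches.append(b)
        if b != null_branch:  # the null branch carries no data
            per_branch.setdefault(b, []).append(v)
    return branches, per_branch


def iterate_union(branches, per_branch):
    """Rebuild row order: the branch column says which column has the next value."""
    cursors = {b: iter(col) for b, col in per_branch.items()}
    for b in branches:
        cursor = cursors.get(b)
        yield None if cursor is None else next(cursor)


def branch_of(v):
    """Branch index for a hypothetical union ["null", "long", "string"]."""
    if v is None:
        return 0
    return 1 if isinstance(v, int) else 2


values = [5, None, "x", 7]
branches, per_branch = split_union(values, branch_of)
print(branches)    # [1, 0, 2, 1]
print(per_branch)  # {1: [5, 7], 2: ['x']}
print(list(iterate_union(branches, per_branch)) == values)  # True
```

A nested union would need one such branch column per level, which is where the list of pointers to varints comes from.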

The use case is to accelerate scans of a subset of fields.  Further acceleration is possible
if things are columnized more deeply, but we probably want to stop at some fixed depth in
each block regardless.  So I'm effectively proposing a depth of 1.  Increasing the depth increases
the number of buffer pointers and the complexity of row iteration.  I don't have a clear sense
of when that becomes significant.  One way to limit the depth would be to specify a maximum
number of columns, and use a breadth-first walk of the schema until that number of columns
is reached.  However, I wonder whether we're over-engineering this.
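
For what it's worth, the breadth-first limit could look roughly like the sketch below. It works on a schema in Avro's JSON form loaded as Python dicts. The function name choose_columns and the dotted column paths are made up for illustration, and the rule for when to stop splitting is just one possible reading: a nested record is split into its fields only if the total column count stays within the maximum, and the top-level fields always become columns even if they already exceed it.

```python
from collections import deque


def choose_columns(schema, max_columns):
    """Pick column paths by walking a record schema breadth-first."""
    queue = deque((f["name"], f["type"]) for f in schema["fields"])
    columns = []
    while queue:
        path, ftype = queue.popleft()
        is_record = isinstance(ftype, dict) and ftype.get("type") == "record"
        subfields = ftype["fields"] if is_record else []
        # Splitting replaces this one column with len(subfields) columns.
        total_if_split = len(columns) + len(queue) + len(subfields)
        if subfields and total_if_split <= max_columns:
            for sub in subfields:
                queue.append((path + "." + sub["name"], sub["type"]))
        else:
            columns.append(path)
    return columns


addr = {"type": "record", "name": "Addr", "fields": [
    {"name": "city", "type": "string"},
    {"name": "zip", "type": "string"},
]}
user = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "addr", "type": addr},
]}
schema = {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "user", "type": user},
    {"name": "ts", "type": "long"},
]}

print(choose_columns(schema, 3))  # ['id', 'user', 'ts']
print(choose_columns(schema, 4))  # ['id', 'ts', 'user.name', 'user.addr']
print(choose_columns(schema, 5))  # ['id', 'ts', 'user.name', 'user.addr.city', 'user.addr.zip']
```

With a maximum of 3 this degenerates to the depth-1 proposal. Note that the columns come out in breadth-first order rather than schema order.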

> add a column-major codec for data files
> ---------------------------------------
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes blocks within
> the file in column-major order.  This would permit better compression and also permit efficient
> skipping of fields that are not of interest.

