crunch-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Fri, 07 Nov 2014 14:59:34 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202113#comment-14202113 ]

Tom White commented on CRUNCH-480:
----------------------------------

We hashed out the differences between projection and read schemas on https://github.com/Parquet/parquet-mr/pull/246,
and came to the conclusion that they are orthogonal. Read schemas are for schema evolution
in the usual Avro fashion, whereas projection schemas are just a convenient way to select
a subset of the columns that you want to read.
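To make the distinction concrete, here is a toy model in plain Java. It does not use the Parquet or Avro APIs: rows are maps, a projection is a list of column names, and a read schema is a map from field name to default value. The class and method names are made up for illustration. It only shows that the two steps do different jobs and can be applied independently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: maps stand in for records, nothing here is a Parquet or Avro API.
public class SchemaRolesSketch {

    // Projection: keep only the requested columns that exist in the stored row.
    static Map<String, Object> project(Map<String, Object> row, List<String> columns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : columns) {
            if (row.containsKey(column)) {
                out.put(column, row.get(column));
            }
        }
        return out;
    }

    // Read-schema resolution: the result has exactly the reader's fields.
    // A field missing from the input takes the reader's default value.
    static Map<String, Object> resolve(Map<String, Object> row, Map<String, Object> readerDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            out.put(name, row.containsKey(name) ? row.get(name) : field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 7);
        stored.put("name", "ada");
        stored.put("payload", "large blob");

        // Reader's view of the record; "country" was added after the row was written.
        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("id", 0);
        reader.put("name", "");
        reader.put("country", "unknown");

        Map<String, Object> projected = project(stored, Arrays.asList("id", "name"));
        System.out.println(projected);                  // {id=7, name=ada}
        System.out.println(resolve(projected, reader)); // {id=7, name=ada, country=unknown}
    }
}
```

The projection decides which columns are fetched at all; the read schema decides what shape the record has once it reaches the caller.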

I think the change needed in AvroParquetFileSource is a new constructor that takes both a
projection schema and a read schema. Both existing constructors that take a schema treat it
as a projection schema, and they can't be changed (e.g. to take read schemas instead) for
compatibility reasons.
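Roughly the shape I have in mind, as a standalone sketch rather than the real Crunch class. DualSchemaSourceSketch, its string-typed schemas, the "path" entry, and the projection key are all placeholders invented for this example; only the avro.read.schema key name comes from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch, not AvroParquetFileSource: schemas are plain strings here.
public class DualSchemaSourceSketch {
    // Hypothetical key, made up for this sketch.
    static final String PROJECTION_KEY = "example.projection.schema";
    // Key name taken from the issue description.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    private final Map<String, String> conf = new HashMap<>();

    // Existing-style constructor: the schema keeps its projection meaning.
    DualSchemaSourceSketch(String path, String projectionSchema) {
        this(path, projectionSchema, null);
    }

    // New constructor: projection and read schema are supplied separately.
    DualSchemaSourceSketch(String path, String projectionSchema, String readSchema) {
        conf.put("path", path);
        if (projectionSchema != null) {
            conf.put(PROJECTION_KEY, projectionSchema);
        }
        if (readSchema != null) {
            conf.put(READ_SCHEMA_KEY, readSchema);
        }
    }

    Map<String, String> configuration() {
        return conf;
    }

    public static void main(String[] args) {
        DualSchemaSourceSketch old = new DualSchemaSourceSketch("/data/in", "projection-json");
        DualSchemaSourceSketch both =
            new DualSchemaSourceSketch("/data/in", "projection-json", "read-json");
        System.out.println(old.configuration().containsKey(READ_SCHEMA_KEY));  // false
        System.out.println(both.configuration().get(READ_SCHEMA_KEY));         // read-json
    }
}
```

Callers of the two-argument form see no change in behavior, which is the compatibility constraint; only code that opts into the three-argument form gets a read schema configured.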

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied
> requestedSchema and, instead, looks for the key avro.read.schema in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.
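
The lookup described in the issue can be modelled in a few lines of plain Java. This is a toy, not Parquet's code: chooseAvroSchema is a made-up name, schemas are strings, and falling back to the file's schema when the key is absent is an assumption made for the sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of the behavior reported in the issue, not Parquet's implementation.
public class ReadSchemaLookupSketch {
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    // requestedSchema is accepted but never consulted, mirroring the report.
    static String chooseAvroSchema(String fileSchema,
                                   String requestedSchema,
                                   Map<String, String> readSupportMetadata) {
        if (readSupportMetadata != null && readSupportMetadata.containsKey(READ_SCHEMA_KEY)) {
            return readSupportMetadata.get(READ_SCHEMA_KEY);
        }
        return fileSchema; // assumed fallback when no read schema is in the metadata
    }

    public static void main(String[] args) {
        Map<String, String> empty = Collections.emptyMap();
        // No metadata entry: the user's schema is silently dropped.
        System.out.println(chooseAvroSchema("file-v1", "user-v2", empty)); // file-v1

        Map<String, String> metadata = new HashMap<>();
        metadata.put(READ_SCHEMA_KEY, "user-v2");
        System.out.println(chooseAvroSchema("file-v1", "user-v2", metadata)); // user-v2
    }
}
```

In this model the only way to get the user's schema honored is to populate the metadata map, which is the step the issue says Crunch never performs.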



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
