crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Thu, 06 Nov 2014 13:41:33 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200171#comment-14200171 ]

Josh Wills commented on CRUNCH-480:
-----------------------------------

Is there a Parquet issue that fixes the underlying problem that I should keep an eye on?

I don't see how to pass the readSupportMetadata to the InputFormat right now. Is this
on a later version of Parquet than the one we're using here?

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, Parquet's AvroReadSupport#prepareForRead() appears to ignore the supplied
> requestedSchema and instead looks for the key avro.read.schema in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.
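
The lookup described in the issue can be modeled in isolation. What follows is a minimal
sketch, not Parquet's actual code: ReadSchemaLookup and resolveReadSchema are hypothetical
names, schemas are plain strings rather than Avro Schema objects, and falling back to the
file schema when the key is absent is an assumption. Only the avro.read.schema key comes
from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the behavior described in CRUNCH-480.
class ReadSchemaLookup {
    // Key named in the issue description.
    static final String AVRO_READ_SCHEMA_KEY = "avro.read.schema";

    // Mirrors the reported behavior: requestedSchema is accepted but never used.
    // The schema is taken from readSupportMetadata if the key is present;
    // otherwise (assumption) the schema stored in the file is used.
    static String resolveReadSchema(String fileSchema,
                                    String requestedSchema,
                                    Map<String, String> readSupportMetadata) {
        if (readSupportMetadata != null) {
            String fromMetadata = readSupportMetadata.get(AVRO_READ_SCHEMA_KEY);
            if (fromMetadata != null) {
                return fromMetadata;
            }
        }
        return fileSchema;
    }

    public static void main(String[] args) {
        String fileSchema = "file-schema";
        String userSchema = "user-schema";

        // No metadata supplied: the user's schema is passed as
        // requestedSchema but has no effect.
        System.out.println(resolveReadSchema(fileSchema, userSchema, null));

        // Metadata carries the key: the user's schema is honored.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(AVRO_READ_SCHEMA_KEY, userSchema);
        System.out.println(resolveReadSchema(fileSchema, userSchema, metadata));
    }
}
```

In this model, the first call corresponds to the situation reported here: with no
readSupportMetadata, the user-supplied schema has no way to reach the reader.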



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)