crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Thu, 06 Nov 2014 15:33:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200332#comment-14200332 ]

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

It looks like the situation in [AvroReadSupport in Parquet|https://github.com/apache/incubator-parquet-mr/blob/0148455170be07f89bd6b9230960a6cd510c7ca6/parquet-avro/src/main/java/parquet/avro/AvroReadSupport.java#L54-L87]
has been cleaned up quite a bit in the meantime.

If upgrading to Parquet 1.4.x is an option, a short-term workaround that I tried out and that
seems to work is to pass the read schema to the constructor of AvroParquetFileSource
and then make this call:
{code}
import parquet.avro.AvroReadSupport;

// Sets the read schema on the shared pipeline configuration.
AvroReadSupport.setAvroReadSchema(
        pipeline.getConfiguration(),
        readSchema);
{code}

Unfortunately, that sets the read schema globally for the pipeline, so it will be a problem
if you're reading multiple Parquet sources within the same pipeline.

As for the structural fix, I think the following should do it:
* upgrade to Parquet 1.4.x or later
* do the equivalent of {{AvroReadSupport.setAvroReadSchema}} in {{AvroParquetFileSource#getBundle}}
based on the schema that is passed in to the AvroParquetFileSource constructor (if there is
one)
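
To illustrate why the second bullet avoids the global-setting problem, here is a minimal, self-contained sketch of the per-source idea. It deliberately uses plain {{java.util}} maps rather than the real Crunch or Parquet classes: {{SourceBundle}}, {{bundleFor}} and {{effectiveConf}} are hypothetical names invented for this example, and the key {{avro.read.schema}} is simply the one mentioned in the issue description, which may not be the key that Parquet 1.4.x actually uses.

```java
import java.util.HashMap;
import java.util.Map;

class PerSourceReadSchema {
    // Assumption: key name taken from the issue description, for illustration only.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    /** Hypothetical stand-in for the per-source configuration a source could carry. */
    static final class SourceBundle {
        private final Map<String, String> extraConf = new HashMap<>();

        SourceBundle set(String key, String value) {
            extraConf.put(key, value);
            return this;
        }

        /** Overlays this source's settings on a copy of the shared pipeline configuration. */
        Map<String, String> effectiveConf(Map<String, String> pipelineConf) {
            Map<String, String> merged = new HashMap<>(pipelineConf);
            merged.putAll(extraConf);
            return merged;
        }
    }

    /** Builds a bundle, recording the read schema only if the caller supplied one. */
    static SourceBundle bundleFor(String readSchemaJson) {
        SourceBundle bundle = new SourceBundle();
        if (readSchemaJson != null) {
            bundle.set(READ_SCHEMA_KEY, readSchemaJson);
        }
        return bundle;
    }

    public static void main(String[] args) {
        Map<String, String> pipelineConf = new HashMap<>();
        SourceBundle withSchema = bundleFor("{\"type\":\"record\",\"name\":\"A\",\"fields\":[]}");
        SourceBundle withoutSchema = bundleFor(null);

        // Each source sees its own read schema (or none); the shared map is never modified.
        System.out.println(withSchema.effectiveConf(pipelineConf).get(READ_SCHEMA_KEY));
        System.out.println(withoutSchema.effectiveConf(pipelineConf).containsKey(READ_SCHEMA_KEY));
        System.out.println(pipelineConf.isEmpty());
    }
}
```

Because the schema lives with the source rather than in the shared configuration, two Parquet sources in one pipeline could each carry a different read schema without overwriting each other. How this maps onto the real {{AvroParquetFileSource#getBundle}} would need to be checked against the Crunch code.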

Does that sound right to you [~tomwhite]? Or are there other nuances to projection and/or
read schemas in Parquet that I'm missing?

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, Parquet's AvroReadSupport#prepareForRead() appears to ignore the supplied
> requestedSchema and, instead, looks for the key avro.read.schema in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
