Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Thu, 6 Nov 2014 18:48:34 +0000 (UTC)
From: "Gabriel Reid (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12753324.1415259323000.436272.1415299714398@Atlassian.JIRA>
In-Reply-To: <JIRA.12753324.1415259323000@Atlassian.JIRA>
References: <JIRA.12753324.1415259323000@Atlassian.JIRA>
 <JIRA.12753324.1415259323294@arcas>
Subject: [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't
 properly configure user-supplied read schema
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200632#comment-14200632 ] 

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

Having thought about this a bit more, I see that my proposed fix was an over-simplification. Just doing the equivalent of {{AvroReadSupport.setAvroReadSchema}} whenever a custom schema is provided would tie the two settings together: supplying a projection schema would always imply a custom read schema, and vice versa.

I guess the situations that need to be supported are:
* no projection and use the write schema for reading
* use projection, but use the write schema for reading (which means some fields will just be null)
* use projection and a custom read schema

I'm not sure whether a custom read schema without a projection is something that would be needed. [~esammer], could you elaborate on your use case?
I'm guessing that using a projection

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, Parquet's AvroReadSupport#prepareForRead() appears to ignore the supplied requestedSchema and instead look for the key avro.read.schema in the readSupportMetadata map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.
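
To make the described lookup concrete, here is a minimal model of that behaviour. It is not Parquet's actual code: {{PrepareForReadSketch}} and {{chooseReadSchema}} are hypothetical names, schemas are represented as plain strings, and the fallback to the file's own schema when the key is absent is an assumption inferred from the description.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical model of the lookup described in the issue; not Parquet source.
final class PrepareForReadSketch {
    // Key named in the issue description.
    static final String READ_SCHEMA_METADATA_KEY = "avro.read.schema";

    static String chooseReadSchema(String fileSchema,
                                   String requestedSchema,
                                   Map<String, String> readSupportMetadata) {
        // requestedSchema is deliberately never consulted, mirroring the report.
        String override = (readSupportMetadata == null)
                ? null
                : readSupportMetadata.get(READ_SCHEMA_METADATA_KEY);
        // Assumed fallback: with no metadata entry, the file's schema is used.
        return (override != null) ? override : fileSchema;
    }

    public static void main(String[] args) {
        // No metadata supplied: the requested schema has no effect.
        System.out.println(chooseReadSchema("write", "requested", null));
        // Metadata carries the key: that value is what gets used.
        Map<String, String> meta =
                Collections.singletonMap(READ_SCHEMA_METADATA_KEY, "custom");
        System.out.println(chooseReadSchema("write", "requested", meta));
    }
}
```

Under this model, a caller that never populates readSupportMetadata always ends up reading with the file's schema, whatever requestedSchema it passes.
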


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)