Date: Thu, 6 Nov 2014 18:48:34 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema

    [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200632#comment-14200632 ]

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

And now having thought about this a bit more, I see that my proposed fix was an over-simplification: just doing the equivalent of {{AvroReadSupport.setAvroReadSchema}} when a custom schema is provided means that a projection schema always implies a custom read schema, and
vice versa.

I guess the situations that need to be supported are:
* no projection and use the write schema for reading
* use projection, but use the write schema for reading (which means some fields will just be null)
* use projection and a custom read schema

I'm not clear if a custom read schema without a projection is something that would be needed. [~esammer], could you elaborate on your use case? I'm guessing that using a projection

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
> this.recordConverter = readSupport.prepareForRead(
>     configuration, extraMetadata, fileSchema,
>     new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied requestedSchema and, instead, looks for the key avro.read.schema in the readSupportMetadata map. This is seriously kookie code in Parquet (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
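
The three situations listed in the comment above can be read as two independent, optional settings rather than one combined switch. The following is a minimal, library-free sketch of that idea and not Crunch's or Parquet's actual API: the class ReadConfigSketch, the method configure, and both key names are hypothetical, and a plain map stands in for the job configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadConfigSketch {
    // Hypothetical key names, for illustration only (not the real Parquet keys).
    public static final String PROJECTION_KEY = "example.projection.schema";
    public static final String READ_SCHEMA_KEY = "example.read.schema";

    /**
     * Builds a stand-in configuration. Each argument is optional (null means
     * "not supplied"), and setting one never implies setting the other.
     */
    public static Map<String, String> configure(String projectionSchema, String readSchema) {
        Map<String, String> conf = new LinkedHashMap<>();
        if (projectionSchema != null) {
            // Projection limits which columns are materialized from the file.
            conf.put(PROJECTION_KEY, projectionSchema);
        }
        if (readSchema != null) {
            // A custom read schema replaces the write schema when building records.
            conf.put(READ_SCHEMA_KEY, readSchema);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Case 1: no projection, records are read with the write schema.
        System.out.println(configure(null, null));
        // Case 2: projection only, write schema still used for reading.
        System.out.println(configure("projection-schema-json", null));
        // Case 3: projection plus a custom read schema.
        System.out.println(configure("projection-schema-json", "read-schema-json"));
    }
}
```

The sketch also accepts a read schema without a projection; whether that combination is actually needed is the open question raised in the comment.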