crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Thu, 06 Nov 2014 18:48:34 GMT


Gabriel Reid commented on CRUNCH-480:

Having thought about this a bit more, I see that I was oversimplifying things with my
proposed fix of just doing the equivalent of {{AvroReadSupport.setAvroReadSchema}} whenever
a custom schema is provided. That would couple the two settings: a projection schema would
always imply a custom read schema, and vice versa.

I guess the situations that need to be supported are:
* no projection and use the write schema for reading
* use projection, but use the write schema for reading (which means some fields will just
be null)
* use projection and a custom read schema
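
As a rough sketch of keeping those situations independent, the snippet below sets a projection and a read schema separately, each only when it is supplied. It is a toy model rather than Crunch or Parquet code: a plain map stands in for the Hadoop {{Configuration}}, the class and method names are hypothetical, and the two property names are my assumption of what parquet-avro uses (they should be checked against {{AvroReadSupport}}).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadSchemaConfigSketch {
    // Assumed property names; verify against AvroReadSupport before relying on them.
    static final String PROJECTION_KEY = "parquet.avro.projection";
    static final String READ_SCHEMA_KEY = "parquet.avro.read.schema";

    /**
     * Sets each property only when its schema is supplied, so a projection
     * never implies a custom read schema and vice versa.
     * Schemas are passed as JSON strings; null means "not supplied".
     */
    static Map<String, String> configure(String projectionJson, String readSchemaJson) {
        Map<String, String> conf = new LinkedHashMap<>();
        if (projectionJson != null) {
            conf.put(PROJECTION_KEY, projectionJson);
        }
        if (readSchemaJson != null) {
            conf.put(READ_SCHEMA_KEY, readSchemaJson);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Placeholder schema strings, for illustration only.
        String projection = "{\"type\":\"record\",\"name\":\"User\",\"fields\":[]}";
        String readSchema = "{\"type\":\"record\",\"name\":\"UserView\",\"fields\":[]}";

        // 1. no projection, write schema used for reading
        System.out.println(configure(null, null));
        // 2. projection only, write schema still used for reading
        System.out.println(configure(projection, null));
        // 3. projection plus a custom read schema
        System.out.println(configure(projection, readSchema));
    }
}
```

The same method would also accept a read schema without a projection ({{configure(null, readSchema)}}), which is the case whose usefulness is still unclear.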

I'm not sure whether a custom read schema without a projection is something that would be needed.
[~esammer], could you elaborate on your use case?
I'm guessing that using a projection

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>                 Key: CRUNCH-480
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied
> requestedSchema and, instead, looks for the key in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.
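
To make the behaviour described in the report concrete, here is a toy model of that lookup. It is not Parquet's actual code: the class and method names are hypothetical, and so is the key name, since the report does not show the real one. Schemas are represented as plain strings.

```java
import java.util.Map;

public class ReadSupportMetadataSketch {
    // Hypothetical key name; the report does not show the real one.
    static final String HYPOTHETICAL_READ_SCHEMA_KEY = "example.read.schema";

    /**
     * Mimics the reported lookup: requestedSchema is accepted but never
     * consulted. Only an entry in readSupportMetadata can override the
     * schema stored in the file.
     */
    static String resolveReadSchema(String fileSchema,
                                    String requestedSchema,
                                    Map<String, String> readSupportMetadata) {
        if (readSupportMetadata != null
                && readSupportMetadata.containsKey(HYPOTHETICAL_READ_SCHEMA_KEY)) {
            return readSupportMetadata.get(HYPOTHETICAL_READ_SCHEMA_KEY);
        }
        // No metadata supplied, so the read falls back to the file schema.
        return fileSchema;
    }
}
```

Under this model, a caller that passes no readSupportMetadata always gets the file schema back, whatever requestedSchema it supplies, which matches the symptom described above.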

This message was sent by Atlassian JIRA
