crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Wed, 12 Nov 2014 19:37:34 GMT


Gabriel Reid commented on CRUNCH-480:

[~jwills] that looks good to me. I think that the constructor issue is now a non-issue since
the last patch that I posted, as the projection schema is now only set if it has been explicitly
set in the builder. I believe the situation is now the following:
* the avro "writer" (i.e. file) schema is taken from the parquet file
* the avro "reader" schema is taken from the PType or supplied schema in the builder
* the parquet projection is by default null (which means that it is the same as the writer
schema), but can be supplied by the builder or AvroParquetFileSource constructor

The issue that I was referring to previously, where the defaults would not get filled in if
you supplied a reader schema that was different than the file schema but didn't supply a projection
schema, is no longer an issue, and there is a test in the last patch(es) that demonstrates
this. I think this is ready to go as-is.
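
To make the three roles above concrete, here is a rough stand-alone sketch in plain Java. It is only a model: a "schema" is just a map from field name to default value, and SchemaRolesSketch, schema, effectiveProjection and read are names made up for illustration, not Crunch, Avro or Parquet APIs. It assumes every extra reader field carries a default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model only; none of these names exist in Crunch, Avro or Parquet.
class SchemaRolesSketch {

    // Build a toy "schema": alternating field name and default value.
    static Map<String, Object> schema(Object... nameThenDefault) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameThenDefault.length; i += 2) {
            fields.put((String) nameThenDefault[i], nameThenDefault[i + 1]);
        }
        return fields;
    }

    // A null projection means "same columns as the writer (file) schema".
    static Map<String, Object> effectiveProjection(Map<String, Object> projection,
                                                   Map<String, Object> writerSchema) {
        return projection != null ? projection : writerSchema;
    }

    // Shape one stored record to the reader schema. Fields that are not
    // projected, or not present in the file, take the reader's default.
    // (Real Avro resolution rejects a missing field that has no default;
    // this toy version simply assumes a default is always given.)
    static Map<String, Object> read(Map<String, Object> storedRecord,
                                    Map<String, Object> writerSchema,
                                    Map<String, Object> readerSchema,
                                    Map<String, Object> projection) {
        Map<String, Object> columns = effectiveProjection(projection, writerSchema);
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerSchema.entrySet()) {
            String name = field.getKey();
            boolean readFromFile = columns.containsKey(name) && storedRecord.containsKey(name);
            result.put(name, readFromFile ? storedRecord.get(name) : field.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> writer = schema("id", null, "name", null);
        Map<String, Object> reader = schema("id", 0, "name", "", "country", "unknown");
        Map<String, Object> stored = schema("id", 7, "name", "crunch");

        // Reader schema differs from the file schema and no projection is supplied.
        System.out.println(read(stored, writer, reader, null));
    }
}
```

In this model, leaving the projection as null reads both file columns and fills "country" from the reader default, which is the case that previously went wrong.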

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>                 Key: CRUNCH-480
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.2.patch, CRUNCH-480.3.patch, CRUNCH-480.patch
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied
> requestedSchema and, instead, looks for the key in the readSupportMetadata
> map. This is seriously kookie code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.

This message was sent by Atlassian JIRA
