crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Tue, 11 Nov 2014 19:05:34 GMT


Gabriel Reid commented on CRUNCH-480:

[~tomwhite] - I wouldn't really call it a bug in Parquet's default handling; it's more a
consequence of misusing the API. The situation is as follows: if you have Parquet files
written with Schema v1, then create Schema v2 with a new field that has a default value,
and read the original files using Schema v2, the defaults won't be filled in during
reading.

This happens because the reader schema (Schema v2) is then also used as the Parquet projection
schema. In the constructor of parquet.avro.AvroIndexedRecordConverter, the default value handling
is based on the difference between the projection schema and the reader schema, and because
in this case these are both the same schema, no default value handling is done at all.
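
As a rough illustration of the behaviour described above, here is a toy model in plain Java.
It is not Parquet's code: schemas are reduced to maps from field name to default value, the
class and method names are made up for this sketch, and returning null for a projected field
that is missing from the file is an assumption about what "no default handling" looks like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only (hypothetical names): a "schema" is a map of field name -> default value.
final class DefaultFillSketch {

    // Builds an ordered map from alternating name/value arguments.
    static Map<String, Object> fields(Object... nameThenValue) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < nameThenValue.length; i += 2) {
            m.put((String) nameThenValue[i], nameThenValue[i + 1]);
        }
        return m;
    }

    // Defaults are applied only to fields that are in the reader schema
    // but NOT in the projection schema, mirroring the description above.
    static Map<String, Object> read(Map<String, Object> stored,
                                    Map<String, Object> projectionSchema,
                                    Map<String, Object> readerSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerSchema.entrySet()) {
            String name = field.getKey();
            if (stored.containsKey(name)) {
                out.put(name, stored.get(name));   // value present in the file
            } else if (!projectionSchema.containsKey(name)) {
                out.put(name, field.getValue());   // new field: default is filled in
            } else {
                out.put(name, null);               // assumed: projected but absent, no default
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> schemaV1 = fields("id", 0);
        Map<String, Object> schemaV2 = fields("id", 0, "country", "unknown");
        Map<String, Object> recordWrittenWithV1 = fields("id", 7);

        // Reader schema doubles as projection schema: the default is skipped.
        System.out.println(read(recordWrittenWithV1, schemaV2, schemaV2));
        // Projection schema left at the file's schema: the default is applied.
        System.out.println(read(recordWrittenWithV1, schemaV1, schemaV2));
    }
}
```

In this model the only difference between the two calls is the projection schema, which is
the point of the explanation above: once projection and reader schema are identical, there
is no difference left for the default handling to act on.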

...and while writing this, then going back over your comments and seeing the question of
why getBundle sets a projection schema even when one isn't supplied, it occurs to me that
removing this might fix everything (or almost everything). In any case, the situation I just
described is fixed by adding the projection schema only when one has been supplied. I'll
upload a patch that does that in just a moment.
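
A minimal sketch of what "conditionally adding the projection schema" could look like, using
a plain map in place of the real job configuration. This is not the contents of the patch:
the class name, the method, and both configuration keys are placeholders invented for the
example, not Crunch's or Parquet's actual names.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for the job configuration, and both keys are hypothetical.
final class ConditionalProjectionSketch {
    static final String READ_SCHEMA_KEY = "example.read.schema";       // hypothetical key
    static final String PROJECTION_KEY = "example.projection.schema";  // hypothetical key

    // Always records the read schema; records a projection schema only if one was given.
    static Map<String, String> configure(String readSchema, String projectionSchema) {
        Map<String, String> conf = new HashMap<>();
        conf.put(READ_SCHEMA_KEY, readSchema);
        if (projectionSchema != null) {
            conf.put(PROJECTION_KEY, projectionSchema);
        }
        return conf;
    }

    public static void main(String[] args) {
        // No projection supplied: the key is left unset instead of reusing the read schema.
        System.out.println(configure("schema-v2", null).containsKey(PROJECTION_KEY));
        // Projection supplied explicitly: the key is set.
        System.out.println(configure("schema-v2", "id-only").containsKey(PROJECTION_KEY));
    }
}
```

Leaving the projection key unset when the caller gave none means the reader schema is no
longer silently reused as the projection, which is the condition that suppressed the defaults.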

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>                 Key: CRUNCH-480
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.patch
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied
> requestedSchema and, instead, looks for the key in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.

This message was sent by Atlassian JIRA
