crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema
Date Tue, 11 Nov 2014 13:47:34 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-480:
--------------------------------
    Attachment: CRUNCH-480.1.patch

I think you've got me convinced [~jwills]. I was actually finally taking a closer look at
this too, and had put together some integration tests which I've added to your changes in
the attached patch.

I think it's still probably necessary to do something with the AvroParquetFileSource.Builder
class to make setting a reader schema clearer. As it currently stands, doing something
like this:
{code}
AvroParquetFileSource.builder(readSchemaWithSupersetOfFields).build()
{code}
will create an AvroParquetFileSource instance that uses the same schema for the Parquet projection
and for Avro reading. This seems to work OK, except that default handling doesn't work
completely correctly within Parquet when you do this, and seeing as default handling
is a basic requirement for using a custom read schema, that's an issue. If you instead give
the builder a projection schema whose fields are a subset (proper or not) of the writer
schema's fields, then everything seems to work fine.

Maybe supplying a custom read schema with Parquet is enough of a non-default option that it
can just be made clear that you need to use the constructor (and not the builder) if you want
to supply a custom reader schema. I'm not sure, but it seems difficult to fit the ability
to specify a different reader schema into the builder as-is without making its API overly
complicated.

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.patch
>
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required
> to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied
> requestedSchema and, instead, looks for the key avro.read.schema in the readSupportMetadata
> map. This is seriously kooky code in Parquet (i.e. wrong), but because Crunch doesn't supply
> readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.
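
For readers following the quoted description, here is a minimal, self-contained sketch of the lookup it describes. It is not Parquet's actual code. The class ReadSchemaLookup and the method resolveReadSchema are hypothetical names, schemas are modelled as plain strings, and the fallback to the file schema when the key is absent is an assumption made for illustration. Only the key name avro.read.schema and the fact that requestedSchema is ignored come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the behaviour described in the issue; NOT Parquet's real code.
final class ReadSchemaLookup {

    // Metadata key named in the issue description.
    static final String AVRO_READ_SCHEMA_KEY = "avro.read.schema";

    // Hypothetical helper. Schemas are plain strings to keep the sketch self-contained.
    static String resolveReadSchema(Map<String, String> readSupportMetadata,
                                    String requestedSchema,
                                    String fileSchema) {
        // requestedSchema is deliberately unused, mirroring the reported behaviour.
        String fromMetadata = (readSupportMetadata == null)
                ? null
                : readSupportMetadata.get(AVRO_READ_SCHEMA_KEY);
        // Assumption: with no metadata entry, reading falls back to the file's own schema.
        return (fromMetadata != null) ? fromMetadata : fileSchema;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // No metadata entry: the caller's requested schema has no effect.
        System.out.println(resolveReadSchema(metadata, "requested", "file"));

        // Only an entry under the metadata key changes the schema used for reading.
        metadata.put(AVRO_READ_SCHEMA_KEY, "custom");
        System.out.println(resolveReadSchema(metadata, "requested", "file"));
    }
}
```

Under these assumptions, a caller that passes only a requested schema and leaves the metadata map empty never gets its custom read schema applied, which matches the symptom reported here.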



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
