flink-dev mailing list archives

From "Bruno Dumon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-5039) Avro GenericRecord support is broken
Date Wed, 09 Nov 2016 13:43:58 GMT
Bruno Dumon created FLINK-5039:
----------------------------------

             Summary: Avro GenericRecord support is broken
                 Key: FLINK-5039
                 URL: https://issues.apache.org/jira/browse/FLINK-5039
             Project: Flink
          Issue Type: Bug
          Components: Batch Connectors and Input/Output Formats
    Affects Versions: 1.1.3
            Reporter: Bruno Dumon
            Priority: Minor


Avro GenericRecord support was introduced in FLINK-3691, but GenericRecords do not appear to
be (de)serialized correctly.

This can be easily seen with a program like this:

{noformat}
  env.createInput(new AvroInputFormat<>(new Path("somefile.avro"), GenericRecord.class))
    .first(10)
    .print();
{noformat}

which will print records in which all fields have the same value:

{noformat}
{"foo": 1478628723066, "bar": 1478628723066, "baz": 1478628723066, ...}
{"foo": 1478628723179, "bar": 1478628723179, "baz": 1478628723179, ...}
{noformat}

If I'm not mistaken, AvroInputFormat essentially calls TypeExtractor.getForClass(GenericRecord.class),
but GenericRecords are not POJOs.

Furthermore, each GenericRecord holds a reference to its record schema. I guess the current
naive approach serializes this schema with every record, which is quite inefficient (the
schema is typically more complex and much larger than the data). We probably need a TypeInformation
and TypeSerializer specific to Avro GenericRecords, which could simply use Avro serialization.
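
To make the size argument concrete, here is a rough plain-Java sketch. It uses neither Flink nor Avro; the class SchemaOnceSketch, its methods, and the three-field all-long "schema" are hypothetical names made up for illustration. It only compares writing field names with every record against a serializer that knows the schema up front and writes values only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Flink or Avro API.
class SchemaOnceSketch {

    // Assumption: the "schema" is just an ordered list of long-typed field names.
    static final List<String> SCHEMA = Arrays.asList("foo", "bar", "baz");

    // Naive approach: every record carries its field names next to the values.
    public static byte[] writeWithSchema(long[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < SCHEMA.size(); i++) {
                out.writeUTF(SCHEMA.get(i));
                out.writeLong(values[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Schema-aware approach: the serializer holds the schema, so only the
    // values are written, in schema field order.
    public static byte[] writeValuesOnly(long[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < SCHEMA.size(); i++) {
                out.writeLong(values[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the values back, relying on the schema for field count and order.
    public static long[] readValuesOnly(byte[] data) {
        long[] values = new long[SCHEMA.size()];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readLong();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        long[] record = {1478628723066L, 42L, 7L};
        byte[] naive = writeWithSchema(record);
        byte[] compact = writeValuesOnly(record);
        System.out.println("with field names: " + naive.length + " bytes");
        System.out.println("values only:      " + compact.length + " bytes");
        System.out.println("round trip:       " + Arrays.toString(readValuesOnly(compact)));
    }
}
```

A real fix would presumably delegate to Avro's own binary encoding inside a dedicated TypeSerializer rather than hand-rolling the format as above; the sketch is only meant to show why keeping the schema out of the per-record payload matters.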



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
