hive-issues mailing list archives

From "Anthony Hsu (JIRA)" <>
Subject [jira] [Commented] (HIVE-17394) AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row
Date Tue, 12 Sep 2017 22:44:00 GMT


Anthony Hsu commented on HIVE-17394:

Thanks, [~cwsteinbach] and [~rdsr] for the reviews!

> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row
> -------------------------------------------------------------------------------------
>                 Key: HIVE-17394
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 1.1.0, 3.0.0
>            Reporter: Ratandeep Ratti
>            Assignee: Anthony Hsu
>             Fix For: 3.0.0
>         Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png, HIVE-17394.1.patch
> The following methods in {{AvroDeserializer}} keep regenerating {{TypeInfo}} objects for every nullable field in every row:
> {code}
> private Object deserializeNullableUnion(Object datum, Schema fileSchema, Schema recordSchema) throws AvroSerdeException {
>   // elided
>   // line 312:
>   return worker(datum, fileSchema, newRecordSchema,
>       SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
> }
> ..
> private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema, Schema recordSchema) throws AvroSerdeException {
>   // elided
>   // line 357:
>   return worker(datum, currentFileSchema, schema,
>       SchemaToTypeInfo.generateTypeInfo(schema, null));
> }
> {code}
> This is really bad for performance, because a new {{TypeInfo}} is generated for every nullable field of every row. I'm not sure why we don't use the {{TypeInfo}} we already have instead of generating it again. If you look at the {{worker}} method, which calls {{deserializeNullableUnion}}, the {{TypeInfo}} corresponding to the nullable column has already been determined.
> Moreover, the cache in the {{SchemaToTypeInfo}} class does not help in the nullable Avro record case, because checking whether an Avro record schema object already exists in the cache requires traversing all the fields in the record schema.
> I've attached a profiling snapshot, which shows that most of the time is being spent in the cache.
> One way of fixing this, IMO, might be to make use of the column {{TypeInfo}} that is already passed to the {{worker}} method.
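
The difference between the two approaches described in the quoted issue can be illustrated with a small, self-contained sketch. It does not use Hive or Avro classes and is not the attached patch: {{ToyTypeInfo}}, {{ToyNullableField}}, {{generateTypeInfo}} and {{worker}} below are hypothetical stand-ins, and the sketch only counts how often type information is derived in each approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeInfoReuseSketch {
    /** Hypothetical stand-in for Hive's TypeInfo; it only carries a type name. */
    public static final class ToyTypeInfo {
        final String typeName;
        ToyTypeInfo(String typeName) { this.typeName = typeName; }
    }

    /** Hypothetical stand-in for a nullable Avro field (a union of null and one other type). */
    public static final class ToyNullableField {
        final String name;
        final String nonNullType;
        public ToyNullableField(String name, String nonNullType) {
            this.name = name;
            this.nonNullType = nonNullType;
        }
    }

    // Counts invocations of the stand-in generator below.
    static int generateCalls = 0;

    /** Stand-in for the schema-to-TypeInfo conversion; assumed to be the expensive step. */
    static ToyTypeInfo generateTypeInfo(ToyNullableField field) {
        generateCalls++;
        return new ToyTypeInfo(field.nonNullType);
    }

    /** Stand-in for the per-value deserialization step that consumes a TypeInfo. */
    static void worker(ToyTypeInfo info) {
        if (info == null) {
            throw new IllegalArgumentException("type info must not be null");
        }
    }

    /** Pattern described in the issue: derive the type info again for every field of every row. */
    public static int deserializeRegenerating(List<ToyNullableField> fields, int rows) {
        generateCalls = 0;
        for (int r = 0; r < rows; r++) {
            for (ToyNullableField f : fields) {
                worker(generateTypeInfo(f));
            }
        }
        return generateCalls;
    }

    /** Suggested direction: resolve each column's type info once and pass it through. */
    public static int deserializeReusing(List<ToyNullableField> fields, int rows) {
        generateCalls = 0;
        List<ToyTypeInfo> columnTypes = new ArrayList<>();
        for (ToyNullableField f : fields) {
            columnTypes.add(generateTypeInfo(f));
        }
        for (int r = 0; r < rows; r++) {
            for (ToyTypeInfo info : columnTypes) {
                worker(info);
            }
        }
        return generateCalls;
    }

    public static void main(String[] args) {
        List<ToyNullableField> fields = Arrays.asList(
            new ToyNullableField("a", "string"),
            new ToyNullableField("b", "int"),
            new ToyNullableField("c", "double"));
        System.out.println("regenerating: " + deserializeRegenerating(fields, 1000));
        System.out.println("reusing: " + deserializeReusing(fields, 1000));
    }
}
```

In the first variant the number of generator calls grows with rows times nullable fields; in the second it depends only on the number of columns. How the real {{worker}} would map a union column's {{TypeInfo}} to the non-null branch is not covered by this sketch.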

