avro-dev mailing list archives

From "Jay Kreps (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-6) Better support for using customized in memory types with Avro GenericDatumReader and GenericDatumWriter
Date Sat, 11 Apr 2009 18:16:14 GMT

    [ https://issues.apache.org/jira/browse/AVRO-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698118#action_12698118 ]

Jay Kreps commented on AVRO-6:

With respect to (2), one approach would be the following:

Avro schemas implicitly assign a slot number to each field. Schema.getFields currently
returns field names. It should instead return Map<String, Field> (or something like
that), where Field is a class that holds (a) the field name, (b) the calculated slot number,
and (c) the field's schema.

The record can then support get(Field) as well as get(String).

Internally, the GenericRecord implementation can stop using a HashMap and just use an Object[],
where the array index is the slot calculated for the field. A get("my_key") then translates
to get(schema.getField("my_key")).
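
A minimal sketch of the idea, to make the shape concrete. The names SketchField, SketchSchema, and SlotRecord are hypothetical stand-ins and not the Avro API; the per-field schema from (c) is left out to keep the sketch short, and slots are assumed to follow declaration order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not the Avro API.
final class SlotRecordDemo {
    public static void main(String[] args) {
        SketchSchema schema = new SketchSchema("id", "name");
        SlotRecord record = new SlotRecord(schema);

        SketchField name = schema.getField("name"); // resolved once, reusable
        record.put(name, "avro");                   // access by field (slot)
        record.put("id", 6);                        // access by name still works

        System.out.println(record.get(name) + " " + record.get("id"));
    }
}

/** Field descriptor: the name plus the slot implied by declaration order. */
final class SketchField {
    final String name;
    final int slot;

    SketchField(String name, int slot) {
        this.name = name;
        this.slot = slot;
    }
}

/** Stand-in for a record schema; assigns slots 0..n-1 in declaration order. */
final class SketchSchema {
    private final Map<String, SketchField> fields = new LinkedHashMap<>();

    SketchSchema(String... names) {
        for (String n : names) {
            if (fields.containsKey(n)) {
                throw new IllegalArgumentException("Duplicate field: " + n);
            }
            fields.put(n, new SketchField(n, fields.size()));
        }
    }

    Map<String, SketchField> getFields() {
        return fields;
    }

    SketchField getField(String name) {
        SketchField f = fields.get(name);
        if (f == null) {
            throw new IllegalArgumentException("No such field: " + name);
        }
        return f;
    }
}

/** Record backed by a plain Object[] indexed by slot instead of a HashMap. */
final class SlotRecord {
    private final SketchSchema schema;
    private final Object[] values;

    SlotRecord(SketchSchema schema) {
        this.schema = schema;
        this.values = new Object[schema.getFields().size()];
    }

    Object get(SketchField field) {
        return values[field.slot];
    }

    void put(SketchField field, Object value) {
        values[field.slot] = value;
    }

    // Name-based access delegates to the slot-based path.
    Object get(String name) {
        return get(schema.getField(name));
    }

    void put(String name, Object value) {
        put(schema.getField(name), value);
    }
}
```

Only the name-based overloads pay for a hash lookup; callers that hold on to a field descriptor go straight to the array.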

This should be a tiny bit faster, since in Hadoop you will be able to look up the Field
once per mapper/reducer and reuse it many times, avoiding re-hashing. It should also be denser,
because the backing array is just an Object[], not a Map.Entry[], so you avoid creating lots
of entry objects and maintaining the 25% sparseness.
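
The "resolve once, use many times" pattern might look roughly like the following. This is only a sketch: a plain Map<String, Integer> stands in for the schema's name-to-slot mapping, Object[] rows stand in for records, and sumColumn is a hypothetical helper, not Hadoop or Avro code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: a name-to-slot map plays the schema, Object[] plays the record.
final class HoistedLookupDemo {

    /** Sums one numeric column, resolving the field name to a slot a single time. */
    static long sumColumn(Map<String, Integer> slots, List<Object[]> rows, String fieldName) {
        Integer resolved = slots.get(fieldName); // the only hash lookup in this method
        if (resolved == null) {
            throw new IllegalArgumentException("No such field: " + fieldName);
        }
        int slot = resolved;

        long total = 0;
        for (Object[] row : rows) {
            // Per-record access is a plain array index; no hashing inside the loop.
            total += ((Number) row[slot]).longValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> slots = new HashMap<>();
        slots.put("user", 0);
        slots.put("clicks", 1);

        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"a", 3});
        rows.add(new Object[] {"b", 4});

        System.out.println(sumColumn(slots, rows, "clicks"));
    }
}
```

In a mapper or reducer, the lookup would presumably happen once during setup and the resolved slot would be reused for every record processed.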

> Better support for using customized in memory types with Avro GenericDatumReader and GenericDatumWriter
> -------------------------------------------------------------------------------------------------------
>                 Key: AVRO-6
>                 URL: https://issues.apache.org/jira/browse/AVRO-6
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Hong Tang
> Currently Avro's GenericDatumReader/Writer requires that Record, Array, and Map be subclasses
> of GenericRecord, GenericArray, and Map. Additionally, STRING and BYTES are mapped to Utf8
> and ByteBuffer. Finally, Record fields are accessed through field names; this may be less
> efficient if a user-defined record class supports field access by position (such as PIG Tuples).
> I suggest we improve the interface to (1) have more flexibility to use user types with
> Avro; (2) support access to RECORDs by either field names or field positions.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
