avro-user mailing list archives

From Marshall Bockrath-Vandegrift <llas...@gmail.com>
Subject Re: Mapreduce Strings from reader, when Avro is clearly Utf8
Date Tue, 27 Aug 2013 21:11:36 GMT
Anna Lahoud <annalahoud@gmail.com> writes:

> I am experiencing a problem and I found that another user wrote in
> about this same issue in March 2013 but there were no replies to his
> question. I am really hoping that there is someone who can explain
> this or offer suggestions. I cut and paste his message in since I
> could only find it in an archive.
> I have Avro files that clearly contain Utf8 and if I run
> non-mapreduce, I get Utf8 out. However, with the same files, I get
> String objects back from the mapper. Help!?!?!

There are some confusing differences between the (now so-named) “data models”
used by the `mapred` and `mapreduce` APIs.

The Generic{Data,Datum{Reader,Writer}} and Specific implementations
generate `Utf8` instances by default.  The Reflect implementation
generates only `String` instances (as far as I can tell).
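
To see that default in isolation, here's a minimal sketch that round-trips one record through the Generic writer and reader; the `Utf8Demo` class, the schema, and the field name are all invented for illustration:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

public class Utf8Demo {
  // Round-trip one record through the Generic implementation and
  // return the runtime class of its string field.
  static Class<?> roundTripStringClass() throws Exception {
    // Hypothetical one-field schema, just for this demo
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\","
        + "\"fields\":[{\"name\":\"s\",\"type\":\"string\"}]}");

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("s", "hello");

    // Write with GenericDatumWriter ...
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(rec, enc);
    enc.flush();

    // ... and read back with GenericDatumReader
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    return reader.read(null, dec).get("s").getClass();
  }

  public static void main(String[] args) throws Exception {
    // The Generic reader hands the string field back as a Utf8
    System.out.println(roundTripStringClass() == Utf8.class);
  }
}
```

Outside of MapReduce this is exactly the path you're seeing, which is why the non-mapreduce case hands you `Utf8`.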

In 1.7.4 and earlier: the `mapred` API defaults to the Specific
implementations (producing `Utf8`s), but may be configured to use the
Reflect implementations via the `...mapred.AvroJob.setReflect()` method.
The `mapreduce` API always uses the Reflect implementations and cannot be
configured otherwise – and thus always produces `String` instances.  So
no dice.
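
For the `mapred` side, that configuration is just a one-liner on the `JobConf` – a sketch (the `ReflectConfig` wrapper class is invented for illustration):

    import org.apache.avro.mapred.AvroJob;
    import org.apache.hadoop.mapred.JobConf;

    public class ReflectConfig {
      // Build a JobConf that uses the Reflect implementations instead
      // of the default Specific ones – i.e. java.lang.String, not Utf8.
      public static JobConf configure() {
        JobConf conf = new JobConf();
        AvroJob.setReflect(conf);
        return conf;
      }
    }

Note this only moves you *away* from `Utf8` – there's no knob in 1.7.4's `mapreduce` API to go the other direction.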

In 1.7.5 (and, I hope, later): both APIs allow you to specify the data
model as a sub-class of `GenericData`.  For example:

    import org.apache.avro.generic.GenericData;
    import org.apache.avro.mapreduce.AvroJob;

    AvroJob.setDataModelClass(job, GenericData.class);

Setting the job data model that way should yield the `Utf8` instances
you’re hoping for.
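
On the mapper side it would look something like this sketch (the `EchoMapper` class and the job shape are hypothetical; `AvroKey` wraps the datum, which under the `GenericData` model should arrive as a `Utf8`):

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper over Avro string keys; with GenericData set as
// the job's data model, key.datum() is a Utf8 rather than a String.
public class EchoMapper
    extends Mapper<AvroKey<CharSequence>, NullWritable, Text, NullWritable> {
  @Override
  protected void map(AvroKey<CharSequence> key, NullWritable value, Context ctx)
      throws IOException, InterruptedException {
    CharSequence datum = key.datum();
    assert datum instanceof Utf8;  // holds once the data model is GenericData
    ctx.write(new Text(datum.toString()), NullWritable.get());
  }
}
```

Declaring the key as `CharSequence` keeps the mapper agnostic, since both `Utf8` and `String` implement it.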


