hadoop-general mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to write a complex Writable
Date Sat, 22 May 2010 21:51:07 GMT
Thanks for sharing your precious experience.

Hadoop achieves higher efficiency with Writables than with Serializables
through object reuse. That's why a clearFields() function is needed.

In my job, I found it useful to write the size of each record to the
DataOutput before the record itself when a Writable instance holds more than
one record. This prepares for corrupt records (which we faced in our trial):
the reader can skip to the next record, which reduces data loss.
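
A minimal sketch of that length-prefix idea, assuming each record is a plain
String and that only the payload (not the length prefix itself) gets corrupted.
The class and method names are hypothetical and not part of Hadoop:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a Hadoop class.
final class LengthPrefixedRecords {

    /** Writes the record count, then each record as [int length][payload bytes]. */
    static void writeRecords(DataOutput out, List<String> records) throws IOException {
        out.writeInt(records.size());
        for (String record : records) {
            // Serialize the record into a buffer first so its size is known.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream recordOut = new DataOutputStream(buffer);
            recordOut.writeUTF(record);
            recordOut.flush();
            out.writeInt(buffer.size());
            out.write(buffer.toByteArray());
        }
    }

    /** Reads the records back, dropping any record whose payload fails to parse. */
    static List<String> readRecords(DataInput in) throws IOException {
        int count = in.readInt();
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int length = in.readInt();
            byte[] payload = new byte[length];
            // Always consume exactly `length` bytes, so the stream stays
            // aligned on the next record even if this payload is bad.
            in.readFully(payload);
            try {
                DataInputStream recordIn =
                        new DataInputStream(new ByteArrayInputStream(payload));
                records.add(recordIn.readUTF());
            } catch (IOException e) {
                // Corrupt payload: skip this record and continue with the next.
            }
        }
        return records;
    }
}
```

If the length prefix itself is damaged, this scheme cannot resynchronize; it
only limits the loss to one record when the damage is inside a payload.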

Also DataOutput.writeUTF() may surprise you. At least it gave us a little
surprise.
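
The message above does not say what the surprise was. One documented behaviour
that can catch people out is that writeUTF() stores a two-byte length followed
by modified UTF-8, so it throws UTFDataFormatException when the encoded string
exceeds 65535 bytes. A sketch of one possible workaround, assuming an int
length prefix plus standard UTF-8 is acceptable (the helper names are
hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not a Hadoop class.
final class LongStrings {

    /** Writes [int byte length][standard UTF-8 bytes]; no 65535-byte limit. */
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    /** Reads a string written by writeString(). */
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Returns false when writeUTF() rejects the string as too long. */
    static boolean writeUtfAccepts(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```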

Versioning is an interesting subject. I hope to see more suggestions on this
topic.

Cheers

On Sat, May 22, 2010 at 2:09 PM, Oded Rosen <oded@legolas-media.com> wrote:

> Since there are few sources on how to write a good Writable,
> I want to share some tips I've learned over the last few days, while
> writing a job that demanded some complex Writable object.
> I will be glad to hear more tips, corrections, etc.
>
>
> Writables are a way to transfer complex data types from the mapper to the
> reducer (/ combiner), or as a flexible output format for mapreduce jobs.
> However, Hadoop does not always use them the way you think it does:
> 1. Hadoop reuses writable objects (at least the old API does) - so strange
> data miraculously appears in your writables, if they are not cleaned right
> before being used.
> 2. Hadoop compares Writables and hashes them a lot - so writing good
> "hashCode" and "equals" functions is a necessity.
> 3. Hadoop needs an empty constructor for writables - so if you write
> another constructor, be sure to also implement the empty one.
>
> Any complex writable object you write (complex = more than just a couple of
> fields) should:
>
> *. Override Object's "*equals*": compare all available fields (deep
> compare), check unique fields first, to avoid checking the rest.
>
> *. Override Object's "*hashCode*": the simplest way is XORing (^) the hash
> codes of the most important fields.
>
> *. Create an *empty constructor* - even if you don't need one. Implementing
> a different constructor is ok, as long as the empty one is also available.
>
> *. Implement (the mandatory) Writable's *readFields()* and *write()*. Use
> versioning so that the serialized format can evolve over time.
> At the very beginning of readFields(), clear all available fields (lists,
> primitives, etc).
> The best way to do that is to create a clearFields() function that is
> called both from "readFields()" and from the empty constructor.
> Remember Hadoop reuses writables (again, at least the old API - "mapred" -
> does), so this is not just a good habit, but clearly a must.
>
> *. Implement "*read()*" - this isn't mandatory but it's simple and helpful:
>
> public static UserWritable read(DataInput in) throws IOException {
>   UserWritable u = new UserWritable();
>   u.readFields(in);
>   return u;
> }
>
>
> More golden tips are welcome, and so are remarks.
>
> --
> Oded
>
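
The checklist in the quoted message can be put together roughly as follows.
This is only a sketch under stated assumptions: the Writable interface below is
a local stand-in with the same two methods as org.apache.hadoop.io.Writable (so
the example compiles without Hadoop), and the fields and the version scheme
(a leading version byte, with "country" added in version 2) are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in mirroring org.apache.hadoop.io.Writable, declared here only so
// the sketch is self-contained. In a real job, import Hadoop's interface.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical example type; field names are illustrative assumptions.
final class UserWritable implements Writable {
    private static final byte VERSION = 2; // assumed: version 2 added "country"

    private long userId;
    private String country;
    private final List<Long> visits = new ArrayList<>();

    /** Empty constructor, needed so the framework can instantiate the type. */
    UserWritable() {
        clearFields();
    }

    UserWritable(long userId, String country, List<Long> visits) {
        this();
        this.userId = userId;
        this.country = country;
        this.visits.addAll(visits);
    }

    /** Resets every field so a reused instance carries no stale data. */
    private void clearFields() {
        userId = 0L;
        country = "";
        visits.clear();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeLong(userId);
        out.writeUTF(country);
        out.writeInt(visits.size());
        for (long visit : visits) {
            out.writeLong(visit);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        clearFields(); // first step: the instance may have been used before
        byte version = in.readByte();
        if (version < 1 || version > VERSION) {
            throw new IOException("Unsupported version: " + version);
        }
        userId = in.readLong();
        if (version >= 2) {
            country = in.readUTF(); // absent in version 1 data
        }
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            visits.add(in.readLong());
        }
    }

    /** Convenience factory, as suggested in the quoted message. */
    static UserWritable read(DataInput in) throws IOException {
        UserWritable u = new UserWritable();
        u.readFields(in);
        return u;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserWritable)) return false;
        UserWritable other = (UserWritable) o;
        // Cheap, most distinguishing field first; deep compare afterwards.
        return userId == other.userId
                && country.equals(other.country)
                && visits.equals(other.visits);
    }

    @Override
    public int hashCode() {
        // XOR of the field hashes, as the quoted tip suggests.
        return Long.hashCode(userId) ^ country.hashCode() ^ visits.hashCode();
    }
}
```

Calling readFields() on an instance that already holds data replaces its
contents entirely, which is the behaviour object reuse depends on.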
