hadoop-general mailing list archives

From Oded Rosen <o...@legolas-media.com>
Subject Re: How to write a complex Writable
Date Sat, 22 May 2010 22:11:21 GMT
Thanks,
The "record size" tip looks helpful,
but I do not like surprises (at least not while programming):
what do you mean by "DataOutput.writeUTF() may surprise you"?

On Sun, May 23, 2010 at 12:51 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Thanks for sharing your valuable experience.
>
> Hadoop achieves higher efficiency with Writables than with Serializables
> through object reuse. That's why a clearFields() function is needed.
>
> In my job, I found it useful to write the size of each record to the
> DataOutput before the record itself, if your Writable instance holds more
> than one record. This prepares you for corrupt records (which we hit in
> our trials): you can skip to the next record and reduce data loss.
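>
> Roughly like this, per record (serializeRecord/parseRecord are hypothetical
> helpers standing in for your own record format, not Hadoop API):
>
> public void write(DataOutput out) throws IOException {
>   byte[] payload = serializeRecord();  // one record's bytes, built elsewhere
>   out.writeInt(payload.length);        // the length prefix comes first
>   out.write(payload);
> }
>
> public void readFields(DataInput in) throws IOException {
>   int len = in.readInt();
>   byte[] payload = new byte[len];
>   in.readFully(payload);               // consume exactly one record
>   try {
>     parseRecord(payload);
>   } catch (IOException e) {
>     // corrupt record: its bytes are already consumed, so the stream
>     // is positioned at the next record and we lose only this one
>   }
> }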
>
> Also DataOutput.writeUTF() may surprise you. At least it gave us a little
> surprise.
>
> Versioning is an interesting subject. I hope to see more suggestions on
> this topic.
>
> Cheers
>
> On Sat, May 22, 2010 at 2:09 PM, Oded Rosen <oded@legolas-media.com> wrote:
>
> > Since there are few sources on how to write a good Writable, I want to
> > share some tips I've learned over the last few days, while writing a job
> > that demanded a complex Writable object.
> > I will be glad to hear more tips, corrections, etc.
> >
> >
> > Writables are a way to transfer complex data types from the mapper to
> > the reducer (/ combiner), or a flexible output format for mapreduce jobs.
> > However, Hadoop does not always use them the way you think it does:
> > 1. Hadoop reuses writable objects (at least the old API does) - so strange
> > data miraculously appears in your writables if they are not cleaned right
> > before being used (see the snippet right after this list).
> > 2. Hadoop compares Writables and hashes them a lot - so writing good
> > "hashCode" and "equals" functions is a necessity.
> > 3. Hadoop needs an empty constructor for writables - so if you write
> > another constructor, be sure to also implement the empty one.
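> >
> > To make point 1 concrete - a broken readFields() on a reused instance
> > (UserWritable's "events" list field is just for illustration):
> >
> > public void readFields(DataInput in) throws IOException {
> >   // BROKEN: never clears "events", so a reused instance keeps the
> >   // previous record's entries and strange data appears downstream
> >   int n = in.readInt();
> >   for (int i = 0; i < n; i++) {
> >     events.add(in.readUTF());
> >   }
> > }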
> >
> > Any complex writable object you write (complex = more than just a couple
> > of fields) should:
> >
> > *. Override Object's "*equals*": compare all available fields (a deep
> > compare); check unique fields first, to avoid checking the rest.
> >
> > *. Override Object's "*hashCode*": the simplest way is XORing (^) the
> > hash codes of the most important fields (a sketch of both follows below).
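> >
> > A sketch of both, assuming a hypothetical UserWritable with a unique
> > userId (long) plus name and events fields:
> >
> > @Override
> > public boolean equals(Object o) {
> >   if (this == o) return true;
> >   if (!(o instanceof UserWritable)) return false;
> >   UserWritable other = (UserWritable) o;
> >   // unique field first - it usually settles the comparison cheaply
> >   if (userId != other.userId) return false;
> >   return name.equals(other.name) && events.equals(other.events);
> > }
> >
> > @Override
> > public int hashCode() {
> >   // XOR the hash codes of the most important fields
> >   return (int) (userId ^ (userId >>> 32)) ^ name.hashCode();
> > }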
> >
> > *. Create an *empty constructor* - even if you don't need one.
> > Implementing a different constructor is ok, as long as the empty one is
> > also available.
> >
> > *. Implement (the mandatory) Writable's *readFields()* and *write()*. Use
> > versioning so the format can evolve over time.
> > At the very beginning of readFields(), clear all available fields (lists,
> > primitives, etc).
> > The best way to do that is to create a clearFields() function that is
> > called both from readFields() and from the empty constructor.
> > Remember Hadoop reuses writables (again, at least the old API - "mapred" -
> > does), so this is not just a good habit, but clearly a must. A full sketch
> > follows below.
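> >
> > A sketch under those rules (the field set and version byte are
> > illustrative, not a fixed recipe):
> >
> > private static final byte VERSION = 1;
> > private long userId;
> > private String name;
> > private final List<String> events = new ArrayList<String>();
> >
> > public UserWritable() {
> >   clearFields();
> > }
> >
> > private void clearFields() {
> >   userId = 0;
> >   name = "";
> >   events.clear();
> > }
> >
> > public void write(DataOutput out) throws IOException {
> >   out.writeByte(VERSION);            // version first, for future formats
> >   out.writeLong(userId);
> >   out.writeUTF(name);
> >   out.writeInt(events.size());
> >   for (String e : events) {
> >     out.writeUTF(e);
> >   }
> > }
> >
> > public void readFields(DataInput in) throws IOException {
> >   clearFields();                     // Hadoop reuses this object
> >   byte version = in.readByte();
> >   if (version > VERSION) {           // branch here when the format changes
> >     throw new IOException("cannot read future version " + version);
> >   }
> >   userId = in.readLong();
> >   name = in.readUTF();
> >   int n = in.readInt();
> >   for (int i = 0; i < n; i++) {
> >     events.add(in.readUTF());
> >   }
> > }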
> >
> > *. Implement "*read()*" - this isn't mandatory, but it's simple and
> > helpful:
> >
> > public static UserWritable read(DataInput in) throws IOException {
> >   // convenience factory: allocate, populate from the stream, return
> >   UserWritable u = new UserWritable();
> >   u.readFields(in);
> >   return u;
> > }
> >
> >
> > More golden tips are welcome, and so are remarks.
> >
> > --
> > Oded
> >
>



-- 
Oded
