hadoop-common-user mailing list archives

From Jimmy Wan <ji...@indeed.com>
Subject Re: Batching key/value pairs to map
Date Mon, 23 Feb 2009 22:19:56 GMT
Great, thanks Owen. I actually ran into the object reuse problem a
long time ago. The output of my MR processes gets turned into a series
of large INSERT statements that weren't performing well unless I
batched them into inserts of several thousand entries each. I'm not
sure if this is possible, but it would certainly be nice to either:
1) pass the OutputCollector and Reporter to the close() method, or
2) provide accessors to the OutputCollector and the Reporter.

Now every single one of my maps is going to have 1-2 extra no-ops.

I'll check to see if that's on the list of outstanding feature requests.

On Mon, Feb 23, 2009 at 15:30, Owen O'Malley <owen.omalley@gmail.com> wrote:
> On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan <jimmy@indeed.com> wrote:
>> part of my map/reduce process could be greatly sped up by mapping
>> key/value pairs in batches instead of mapping them one by one.
>> Can I safely hang onto my OutputCollector and Reporter from calls to map?
> Yes. You can even use them in the close, so that you can process the last
> batch of records. *smile* One problem that you will quickly hit is that
> Hadoop reuses the objects that are passed to map and reduce. So, you'll need
> to clone them before putting them into the collection.
>> I'm currently running Hadoop. Is this something I could do in
>> Hadoop 0.19.X?
> I don't think any of this changed between 0.17 and 0.19, other than that in
> 0.17 the reduce's inputs were always new objects. In 0.18 and after, the
> reduce's inputs are reused.
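
The object-reuse caveat Owen raises is the crux of the batching approach. Below is a minimal, self-contained sketch (plain Java, no Hadoop dependency; `ReusedValue`, `runBatch`, and `copy()` are illustrative stand-ins, with `ReusedValue` playing the role of a reusable Writable like `org.apache.hadoop.io.Text`) of why buffered references go stale when the framework recycles its input objects, and why cloning before buffering fixes it:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a reusable Writable such as org.apache.hadoop.io.Text.
// In Hadoop 0.18+ the framework recycles one such object across calls
// to map()/reduce(), overwriting it in place before each call.
class ReusedValue {
    private String payload = "";
    void set(String s) { payload = s; }   // framework refills in place
    String get() { return payload; }
    ReusedValue copy() {                  // the clone Owen recommends
        ReusedValue c = new ReusedValue();
        c.set(payload);
        return c;
    }
}

public class BatchingSketch {
    // Simulate the framework driving map() with ONE reused value object.
    // If cloneBeforeBuffer is false, the batch holds N references to the
    // same object, and every entry ends up equal to the last record.
    static List<String> runBatch(String[] records, boolean cloneBeforeBuffer) {
        ReusedValue reused = new ReusedValue();
        List<ReusedValue> batch = new ArrayList<>();
        for (String r : records) {
            reused.set(r);  // framework overwrites the same object
            batch.add(cloneBeforeBuffer ? reused.copy() : reused);
        }
        // The "close()" step: flush the accumulated batch, e.g. into
        // one large INSERT statement instead of one per record.
        List<String> flushed = new ArrayList<>();
        for (ReusedValue v : batch) flushed.add(v.get());
        return flushed;
    }

    public static void main(String[] args) {
        System.out.println(runBatch(new String[]{"a", "b", "c"}, false)); // [c, c, c]
        System.out.println(runBatch(new String[]{"a", "b", "c"}, true));  // [a, b, c]
    }
}
```

In a real mapper the same pattern applies: save the `OutputCollector` reference passed to `map()`, clone each value into the batch, and flush the remainder from `close()`.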
