hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Re: How to give consecutive numbers to output records?
Date Wed, 28 Oct 2009 04:34:39 GMT
Aaron, although your notes are not a ready-made solution, they are a great
help.

Thank you,
Mark

On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <aaron@cloudera.com> wrote:

> There is no in-MapReduce mechanism for cross-task synchronization. You'll
> need to use something like ZooKeeper for this, or an external database.
> Note that this will greatly complicate your life.
>
> If I were you, I'd try to either redesign my pipeline elsewhere to
> eliminate this need, or maybe get really clever. For example, do your
> numbers need to be sequential, or just unique?
>
> If the latter, then take the byte offset into the reducer's current output
> file and combine that with the reducer id (e.g.,
> <current-byte-offset><zero-padded-reducer-id>) so that each reducer
> generates IDs that no other reducer can produce. If the former... rethink
> your pipeline? :)
>
> - Aaron
>
> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <markkerzner@gmail.com> wrote:
>
> > Hi,
> >
> > I need to number all output records consecutively, like 1, 2, 3, ...
> >
> > This is no problem with one reducer: make recordId an instance variable
> > in the Reducer class and set conf.setNumReduceTasks(1).
> >
> > However, that is an architectural decision forced by a processing need,
> > and the single reducer becomes a bottleneck. Can I have a global variable
> > shared by all reducers, which would give each the next consecutive
> > recordId? In the database scenario, this would be the unique autokey. How
> > can I do this in MapReduce?
> >
> > Thank you
> >
>
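
The "unique but not sequential" scheme Aaron describes above,
<current-byte-offset><zero-padded-reducer-id>, can be sketched in plain Java.
This is only an illustration, not code from the thread: the class name
UniqueIdSketch, the method makeId, and the padding width of 5 are all
hypothetical, and the sketch does not show how to obtain the byte offset or
the reducer id inside a real Hadoop job. It also assumes every record is
non-empty, so that no two records from the same reducer start at the same
offset.

```java
import java.util.Locale;

// Sketch only: plain Java with no Hadoop dependency. All names are hypothetical.
final class UniqueIdSketch {

    // Assumed fixed width of the reducer-id suffix; it must be wide enough
    // for the largest reducer id in the job (5 digits covers 0..99999).
    static final int REDUCER_ID_WIDTH = 5;

    // Builds <current-byte-offset><zero-padded-reducer-id>.
    // Because the suffix has a fixed width, the id splits back into exactly
    // one (offset, reducerId) pair, so ids from different reducers never collide.
    static String makeId(long byteOffset, int reducerId) {
        if (byteOffset < 0 || reducerId < 0) {
            throw new IllegalArgumentException("offset and reducer id must be non-negative");
        }
        String suffix = String.format(Locale.ROOT, "%0" + REDUCER_ID_WIDTH + "d", reducerId);
        if (suffix.length() > REDUCER_ID_WIDTH) {
            throw new IllegalArgumentException("reducer id does not fit in the padded width");
        }
        return byteOffset + suffix;
    }

    public static void main(String[] args) {
        // The same offset in two different reducers still gives two different ids.
        System.out.println(makeId(1024L, 3));   // 102400003
        System.out.println(makeId(1024L, 12));  // 102400012
    }
}
```

The ids are unique across reducers but not consecutive, and they reveal
nothing about global record order, which matches the trade-off described in
the quoted message.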
