mahout-user mailing list archives

From Ted Dunning
Subject Re: Multiple data-local passes?
Date Thu, 28 Jan 2010 19:39:03 GMT
That is quite doable.  Typically, the way that you do this is to buffer the
data either in memory or on local disk.  Both work fine.  You can munch on
the data until the cows come home that way.  Hadoop will still schedule your
tasks and handle failures for you.
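
A minimal sketch of the local-disk variant, assuming the task's input
arrives as an iterable of numbers. The names buffer_to_local_disk, replay
and two_pass_variance are made up for illustration and are not part of
Hadoop or Mahout. It spools the records once, then makes two passes over
the local copy (mean first, then variance):

```python
import tempfile


def buffer_to_local_disk(records):
    """Spool the task's records to a local temp file so they can be re-read."""
    spool = tempfile.TemporaryFile(mode="w+")
    for rec in records:
        spool.write(f"{rec}\n")
    return spool


def replay(spool):
    """Rewind the spool and yield the buffered records again as floats."""
    spool.seek(0)
    for line in spool:
        yield float(line)


def two_pass_variance(records):
    """Two data-local passes: pass 1 computes the mean, pass 2 the variance.

    Assumes at least one record; an empty input would divide by zero.
    """
    spool = buffer_to_local_disk(records)
    try:
        count = 0
        total = 0.0
        for x in replay(spool):  # pass 1
            count += 1
            total += x
        mean = total / count
        squared = sum((x - mean) ** 2 for x in replay(spool))  # pass 2
        return mean, squared / count
    finally:
        spool.close()  # the temp file is deleted when closed
```

In a real job this would run inside a single map task (for example a
Hadoop Streaming mapper reading stdin), so the result only describes that
task's chunk of the data.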

The downside is that you lose communication between chunks of your data.
Sometimes that is fine.  Sometimes it isn't.  The specific case where it is
just fine is where you have multiple map functions that need to be applied
to individual input records.  These can trivially be smashed together into a
single map pass and that is just what frameworks like Pig and Cascading do.
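
To illustrate the smashing-together idea only (this is not how Pig or
Cascading implement it), here is a small sketch that fuses several
per-record functions into one. The helper fuse and the three example
steps are hypothetical names chosen for the example:

```python
from functools import reduce


def fuse(*map_fns):
    """Compose per-record functions, left to right, into one function."""
    def fused(record):
        return reduce(lambda value, fn: fn(value), map_fns, record)
    return fused


# Three illustrative per-record steps that would otherwise be separate passes.
def parse(line):
    return line.strip().split("\t")


def keep_two(fields):
    return fields[:2]


def upper_key(fields):
    return [fields[0].upper()] + fields[1:]


single_pass = fuse(parse, keep_two, upper_key)


def run_map(lines):
    """Apply the fused function once per input record."""
    return [single_pass(line) for line in lines]
```

Each record is read once and flows through all three steps before the
next record is touched, which is why no intermediate output has to be
written between the steps.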

This doesn't help you if you want to have lots of communication or global
summaries, but I think you know that.

On Thu, Jan 28, 2010 at 11:30 AM, Markus Weimer wrote:

> In a way, I want a sequential program scheduled through hadoop. I will
> lose the parallelism, but I want to keep data locality, scheduling
> and restart-on-failure.

Ted Dunning, CTO
