mahout-user mailing list archives

From Robin Anil
Subject Re: Multiple data-local passes?
Date Thu, 28 Jan 2010 19:11:24 GMT
Glad that you asked, because I have been asking myself the same question
while writing a Text->Vector converter, where I need to iterate over the same
data, converting it to vectors using one chunk of the dictionary at a time. If
I had the option of running multiple passes, it would have taken just a single
MapReduce job. As it is, I have to do one pass over the data for every chunk
of the dictionary held in memory. True, I can run n sequential jobs using an
HDFS client on different servers, but the network data transfer wasn't worth it.
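
To make the pass-per-chunk pattern concrete, here is a minimal sketch in plain
Python. It is not Hadoop or Mahout code: the function names, the in-memory
documents, and the term-count weighting are all assumptions made for
illustration. It only shows why the number of passes over the data grows with
the number of dictionary chunks.

```python
from collections import defaultdict


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def vectorize_in_passes(documents, dictionary, chunk_size):
    """Build sparse term-count vectors, holding one dictionary chunk at a time.

    documents:  dict mapping doc id -> list of tokens (hypothetical input)
    dictionary: dict mapping term -> vector index
    Returns (vectors, passes), where `passes` counts full scans of the data.
    """
    vectors = defaultdict(dict)
    passes = 0
    for chunk in chunked(sorted(dictionary), chunk_size):
        # Only this slice of the dictionary is assumed to fit in memory.
        chunk_index = {term: dictionary[term] for term in chunk}
        passes += 1
        # One full pass over the data for every dictionary chunk.
        for doc_id, tokens in documents.items():
            for token in tokens:
                idx = chunk_index.get(token)
                if idx is not None:
                    vec = vectors[doc_id]
                    vec[idx] = vec.get(idx, 0) + 1
    return dict(vectors), passes
```

With a dictionary of three terms and chunk_size=2, the sketch scans the
documents twice. If mappers could make multiple data-local passes, each of
those scans could stay next to the data inside a single job.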


On Fri, Jan 29, 2010 at 12:30 AM, Markus Weimer wrote:

> Hi,
> I have a question about Hadoop, which most likely someone in Mahout
> must have solved before:
> Many online ML algorithms require multiple passes over data for best
> performance. When putting these algorithms on Hadoop, one would want
> to run the code close to the data (same machine/rack). Mappers offer
> this data-local execution but do not offer means to run multiple times
> over the data. Of course, one could run the code outside of the Hadoop
> MapReduce framework as an HDFS client, but that does not offer the
> data-locality advantage, in addition to not being scheduled through
> the Hadoop schedulers.
> How is this solved in Mahout?
> Thanks for any pointer,
> Markus

Robin Anil

