mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Multiple data-local passes?
Date Thu, 28 Jan 2010 19:11:24 GMT
Glad you asked, because I have been asking myself the same question while
creating a Text->Vector converter, where I need to iterate over the same data
and convert it to vectors using one chunk of the dictionary at a time. If I had
the option of running multiple passes, it would have taken me just a single
MapReduce job. As it is, I have to do one pass over the data for every chunk of
the dictionary held in memory. True, I could run n sequential jobs using an HDFS
client on different servers, but the network data transfer wasn't worth it.
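
As a rough illustration of the pass-per-chunk pattern described above, here is
a minimal sketch in plain Python. It is not Mahout's converter and does not use
Hadoop: the function names, the toy dictionary, the whitespace tokenizer, and
the in-memory document list are all assumptions made for this example. Each
dictionary chunk triggers one complete pass over the documents, and the partial
vectors are merged at the end.

```python
from collections import Counter


def dictionary_chunks(dictionary, chunk_size):
    """Yield slices of the term -> index dictionary, chunk_size terms at a time."""
    items = sorted(dictionary.items(), key=lambda kv: kv[1])
    for start in range(0, len(items), chunk_size):
        yield dict(items[start:start + chunk_size])


def partial_vectors(documents, chunk):
    """One full pass over the documents, counting only terms in this chunk."""
    for doc_id, text in documents:
        counts = Counter(tok for tok in text.lower().split() if tok in chunk)
        # Sparse partial vector: dictionary index -> term frequency.
        yield doc_id, {chunk[term]: freq for term, freq in counts.items()}


def vectorize(documents, dictionary, chunk_size):
    """Build sparse vectors with one pass over the data per dictionary chunk."""
    vectors = {doc_id: {} for doc_id, _ in documents}
    passes = 0
    for chunk in dictionary_chunks(dictionary, chunk_size):
        passes += 1
        for doc_id, partial in partial_vectors(documents, chunk):
            vectors[doc_id].update(partial)
    return vectors, passes


# Hypothetical toy data, for illustration only.
documents = [
    ("d1", "hadoop runs mappers near data"),
    ("d2", "mappers read data once"),
]
dictionary = {"hadoop": 0, "mappers": 1, "data": 2, "once": 3}

vectors, passes = vectorize(documents, dictionary, chunk_size=2)
```

With a chunk size of 2 and four dictionary terms, the data is read twice. In a
real cluster each of those reads would be a separate job over the full input,
which is the cost being discussed.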

Robin

On Fri, Jan 29, 2010 at 12:30 AM, Markus Weimer
<mailinglists2008@weimo.de> wrote:

> Hi,
>
> I have a question about hadoop, which most likely someone in mahout
> must have solved before:
>
> Many online ML algorithms require multiple passes over data for best
> performance. When putting these algorithms on hadoop, one would want
> to run the code close to the data (same machine/rack). Mappers offer
> this data-local execution but do not offer means to run multiple times
> over the data. Of course, one could run the code outside of the hadoop
> mapreduce framework as a HDFS client, but that does not offer the
> data-locality advantage, in addition to not being scheduled through
> the hadoop schedulers.
>
> How is this solved in mahout?
>
> Thanks for any pointer,
>
> Markus
>



-- 
------
Robin Anil
Blog: http://techdigger.wordpress.com
-------

Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - Its Frrreeee
http://www.manning.com/owen

Try out Swipeball for iPhone
http://itunes.com/apps/swipeball
