mahout-user mailing list archives

From Sebastian Schelter <ssc.o...@googlemail.com>
Subject Re: Number of Clustering MR-Jobs
Date Thu, 28 Mar 2013 09:41:04 GMT
Sebastian,

For CPU-bound problems like matrix factorization with ALS, we have
recently seen good results with multithreaded mappers, where we had the
users specify the number of cores to use per mapper.
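
To illustrate the idea, here is a minimal sketch in plain Java rather than
the actual Mahout or Hadoop code. The class name MultithreadedMapSketch, the
method mapAll and the numThreads parameter are hypothetical names chosen for
this example. It fans the records of one mapper out over a fixed-size thread
pool and collects the results in input order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs a CPU-bound function over one mapper's records on several threads. */
final class MultithreadedMapSketch {

  /** Applies fn to every record using numThreads worker threads; results keep the input order. */
  static <I, O> List<O> mapAll(List<I> records, Function<I, O> fn, int numThreads) {
    if (numThreads <= 0) {
      throw new IllegalArgumentException("numThreads must be positive");
    }
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      // Submit one task per record; the pool limits how many run at once.
      List<Future<O>> futures = new ArrayList<>(records.size());
      for (I record : records) {
        Callable<O> task = () -> fn.apply(record);
        futures.add(pool.submit(task));
      }
      // Collect in submission order so the output lines up with the input.
      List<O> results = new ArrayList<>(records.size());
      for (Future<O> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("interrupted while waiting for worker threads", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("a worker task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

In a real job the thread count would come from a user-supplied option, and
the function passed in has to be thread-safe.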

On 28.03.2013 10:20, Ted Dunning wrote:
> This is a longstanding Hadoop issue.
> 
> Your suggestion is interesting, but only a few cases would benefit.  The
> problem is that splitting involves reading from a very small number of
> nodes and thus is not much better than just running the program with few
> mappers.  If the data is large enough to make splitting fast, then Hadoop
> will just do it.
> 
> The only win for splitting is when the cost per chunk is very high.  I
> think that only random forest might fit into that category.
> 
> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> sebastian.briesemeister@unister-gmbh.de> wrote:
> 
>> Splitting the files leads to multiple MR-tasks!
>>
>> Only changing the MR settings of Hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and split the
>> data according to the dataset size and the number of available MR slots.
>>
> 

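The sizing rule suggested in the quoted message can be sketched as follows.
This is only an illustration under stated assumptions, not something the
Mahout drivers do: the class SplitSizing and both method names are
hypothetical, and the count is approximate because Hadoop computes splits
per file. The result is the kind of value one would pass as the input
format's maximum split size. As noted above, this only pays off when the
cost per chunk is high.

```java
/** Hypothetical helper: derives a split size from the dataset size and the available slots. */
final class SplitSizing {

  /** Returns a maximum split size in bytes that yields roughly one split per available slot. */
  static long maxSplitSizeBytes(long inputBytes, int availableSlots) {
    if (inputBytes <= 0 || availableSlots <= 0) {
      throw new IllegalArgumentException("inputBytes and availableSlots must be positive");
    }
    // Ceiling division, so the number of splits never exceeds the number of slots.
    return (inputBytes + availableSlots - 1) / availableSlots;
  }

  /** Returns how many splits an input of inputBytes produces at the given split size. */
  static long numSplits(long inputBytes, long splitSizeBytes) {
    if (inputBytes <= 0 || splitSizeBytes <= 0) {
      throw new IllegalArgumentException("inputBytes and splitSizeBytes must be positive");
    }
    return (inputBytes + splitSizeBytes - 1) / splitSizeBytes;
  }
}
```

For example, a 64 MB input with 16 free map slots gives a 4 MB split size
and therefore 16 map tasks.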
