mahout-user mailing list archives

From "Fuhrmann Alpert, Galit" <galp...@ebay.com>
Subject Modify number of mappers for a mahout process?
Date Wed, 31 Jul 2013 08:57:27 GMT

Hi,

It sounds to me like this could be related to one of the questions I posted several days ago (is it?):

My Mahout clustering processes seem to be running very slowly (a good several hours on just ~1M items), and I'm wondering whether something in my settings/configuration needs to change, and how.

I'm running on a large cluster and could potentially use thousands of nodes (mappers/reducers). However, my Mahout processes (kmeans, canopy, etc.) only ever use at most 5 mappers (I tried several data sets).

I've tried to set the number of mappers with something like -Dmapred.map.tasks=100, but this didn't seem to have any effect; the jobs still use <=5 mappers.

Is there a different way to set the number of mappers/reducers for a Mahout process? Or is there another configuration issue I need to consider?
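
For concreteness, this is roughly the kind of invocation I am using (the paths and sizes here are placeholders, not my real ones):

    # kmeans through the installed Mahout CLI; -Dmapred.map.tasks appears to be
    # taken only as a hint, and the job still runs with <=5 mappers.
    mahout kmeans \
      -Dmapred.map.tasks=100 \
      -i /user/galit/vectors \
      -c /user/galit/initial-clusters \
      -o /user/galit/kmeans-output \
      -k 50 -x 10 -ow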

I'd definitely be happy to use such a parameter; does it not exist?
(I'm running Mahout as installed on the cluster.)

Is there currently a workaround, besides running a Mahout jar as a Hadoop job?
When I originally tried to run a Mahout jar that uses KMeansDriver (one that runs great on my local machine), it did not even initiate a job on the Hadoop cluster. It seemed to be running in parallel, but in fact it was running only on the local node. Is this a known issue? Is there a fix for it? (I ended up dropping that approach and calling Mahout step by step from the command line, but I'd be happy to know if there is a fix; see the sketch below for what the jar looked like.)
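
For reference, here is a stripped-down sketch of the kind of driver I was running, reworked to go through ToolRunner (the class name, paths, and parameter values are made up, and the KMeansDriver.run signature is the 0.7-era one, so it may differ in the version installed here). My understanding is that if the cluster's *-site.xml files aren't on the classpath when the jar starts (e.g., it is launched with plain java rather than hadoop jar), mapred.job.tracker defaults to "local" and everything silently runs on one node. Is that what's biting me here?

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class MyKMeansJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() carries whatever configuration the launcher loaded from
            // the classpath; building a bare "new Configuration()" deep inside
            // the jar is what can leave mapred.job.tracker at its "local" default.
            Configuration conf = getConf();
            KMeansDriver.run(conf,
                new Path(args[0]),               // input vectors
                new Path(args[1]),               // initial cluster centers
                new Path(args[2]),               // output directory
                new EuclideanDistanceMeasure(),
                0.001,                           // convergence delta
                10,                              // max iterations
                true,                            // run clustering after convergence
                0.0,                             // outlier removal threshold
                false);                          // runSequential: false = MapReduce
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyKMeansJob(), args));
        }
    }

I would then launch it with: hadoop jar my-kmeans-job.jar MyKMeansJob <input> <clusters> <output>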

Thanks,

Galit.

-----Original Message-----
From: Ryan Josal [mailto:rjosal@gmail.com] 
Sent: Monday, July 29, 2013 9:33 PM
To: Adam Baron
Cc: Ryan Josal; user@mahout.apache.org
Subject: Re: Run more than one mapper for TestForest?

If you're running Mahout from the CLI, you'll have to modify the Hadoop config file or your env manually for each job. The code quoted below is what I put into my custom job executions so I didn't have to calculate and set that up every time; maybe that's your best route in that position. You could also just provide your own Mahout jar and run it as you would any other Hadoop job, ignoring the installed Mahout. I do think this could be a useful parameter for a number of standard Mahout jobs, though; I know I would use it. Does anyone in the Mahout community see this as a generally useful feature for a Mahout job?
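
For example, from the CLI something like the following ought to work for the standard jobs, since their drivers accept the generic Hadoop -D options (the 16 MB figure is purely illustrative; pick roughly inputSizeBytes / desiredMappers):

    # Cap the split size so even a small input file fans out to more mappers.
    # mapred.map.tasks on its own is only a hint; for file-based input the
    # split size is what actually determines the mapper count.
    mahout kmeans -Dmapred.max.split.size=16777216 \
      -i /data/vectors -c /data/seed-clusters -o /data/kmeans-output \
      -x 10 -ow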

Ryan

On Jul 29, 2013, at 10:25, Adam Baron <adam.j.baron@gmail.com> wrote:

> Ryan,
> 
> Thanks for the fix; the code looks reasonable to me. Which version of Mahout
> will this be in? 0.9?
> 
> Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team, so I'm not in a position to push the latest from
> the Mahout dev trunk into our environment; the admins will only install
> official releases.
> 
> Regards,
>           Adam
> 
> On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <ryan@josal.com> wrote:
>> Late reply, but for what it's still worth: since I've seen a couple of
>> other threads here on the topic of too few mappers, I added a parameter to
>> set a minimum number of mappers. Some of my Mahout jobs needed more mappers
>> but were not given many because of the small input file size.
>> 
>>         // Expose the minimum mapper count as a job option (default 1).
>>         addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));
>> 
>>         // If the default split size would yield fewer mappers than requested,
>>         // shrink mapred.max.split.size until the input produces enough splits.
>>         int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>>         int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>>         log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
>>         if (minMapTasks > mapTasksThatWouldRun) {
>>             String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
>>             log.info("Forcing mapred.max.split.size to " + splitSizeBytes
>>                 + " to ensure minimum map tasks = " + minMapTasks);
>>             hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>>         }
>> 
>>     // There is actually a private method in Hadoop to calculate this;
>>     // effectively: split = max(minSplitSize, min(maxSplitSize, blockSize)).
>>     private long getSplitSize() {
>>         long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
>>         long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
>>         int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>>         long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>         log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
>>             minSize, blockSize, maxSize, splitSize));
>>         return splitSize;
>>     }
>> 
>> It seems like there should be a more straightforward way to do this, but it
>> works for me and I've used it on a lot of jobs to set a minimum number of
>> mappers.
>> 
>> Ryan
>> 
>> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>> 
>> > I'm attempting to run org.apache.mahout.classifier.df.mapreduce.TestForest
>> > on a CSV with 200,000 rows that have 500,000 features per row. However,
>> > TestForest is running extremely slowly, likely because only 1 mapper was
>> > assigned to the job. This seems strange because the
>> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the same
>> > data used 1772 mappers and took about 6 minutes. (BTW: I know I
>> > *shouldn't* use the same data set for the training and the testing steps;
>> > this is purely a technical experiment to see whether Mahout's Random
>> > Forest can handle the data sizes we typically deal with.)
>> >
>> > Any idea on how to get org.apache.mahout.classifier.df.mapreduce.TestForest
>> > to use more mappers? Glancing at the code (and thinking about what is
>> > happening intuitively), it should be ripe for parallelization.
>> >
>> > Thanks,
>> >        Adam
> 
