mahout-user mailing list archives

From deneche abdelhakim <>
Subject Re: Reg: Maximum Split size in Random Forest
Date Wed, 09 Jun 2010 04:19:33 GMT
mapred.max.split.size controls how many partitions will be generated from the data.
The current implementation of the random forest is quite memory intensive, and because all
the work is done in the mappers' close() method, Hadoop assumes the mappers have failed
when the data is big (I will solve this problem some day).
You should try to increase the number of partitions by reducing mapred.max.split.size.
A value of "3200000" should give you 10 partitions, which should be OK; if not, try reducing
it further, for example to "1000000".
In general, you should start with a large number of partitions, then reduce this number
as long as the job doesn't fail. Depending on your data, the number of partitions
can influence the quality of the generated random forest.
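To see where those numbers come from, here is a small sketch of the arithmetic: Hadoop caps each input split at mapred.max.split.size bytes, so the partition count is roughly the dataset size divided by the split size. The 32 MB dataset size below is an assumption, inferred from the "10 partitions at 3200000 bytes" figure above.

```python
import math

def num_partitions(dataset_bytes: int, max_split_size: int) -> int:
    """Approximate number of input splits Hadoop creates when each
    split is capped at max_split_size bytes."""
    return math.ceil(dataset_bytes / max_split_size)

# Hypothetical dataset of ~32 MB, matching the figures in this thread:
dataset = 32_000_000
print(num_partitions(dataset, 3_200_000))  # -> 10 partitions
print(num_partitions(dataset, 1_000_000))  # -> 32 partitions
```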
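For reference, a sketch of how the split size can be passed on the command line when building the forest. The jar name, driver class, paths, and flags below are assumptions based on the partial-implementation tutorial and may differ in your Mahout version; adjust them to your setup.

```shell
# Hypothetical invocation -- jar version, class name, flags, and paths
# are illustrative, not guaranteed to match your Mahout release.
hadoop jar mahout-core-0.3.job \
  org.apache.mahout.df.mapreduce.BuildForest \
  -Dmapred.max.split.size=3200000 \
  -d /path/to/data \
  -ds /path/to/dataset.info \
  -t 100 \
  -o forest-output
```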

I hope this solves your problem.

Thank you for choosing Mahout Air Lines ;)

--- On Tue, 8 Jun 2010, Karan Jindal <> wrote:

> From: Karan Jindal <>
> Subject: Reg: Maximum Split size in Random Forest
> To:
> Date: Tuesday, 8 June 2010, 13:21
> Hi all,
> In the following tutorial for running the random forest, a maximum
> split size of "1874231" is used. When I don't specify this on the
> command line and the block size of the data on HDFS is 32 MB, I get a
> "StackOverflow" error. To overcome this I increased the heap size of
> the child JVM to 2 GB, but then it either gives the same overflow
> error or the process hangs.
> Does anyone have any idea about this?
> Regards
> Karan

