mahout-user mailing list archives

From Chris Lu <>
Subject Re: LDA on single node is much faster than 20 nodes
Date Wed, 07 Sep 2011 17:27:39 GMT
Thanks for the suggestions!

I finally managed to get Hadoop to run the map tasks in parallel.

I changed not only the "mapred.max.split.size" setting, but also 
"dfs.block.size", because of how Hadoop computes the split size:

   protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
     return Math.max(minSize, Math.min(maxSize, blockSize));
   }

Now it seems all nodes are running in parallel!
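
To illustrate why a small input ends up in a single map task, here is a minimal sketch that applies the same formula to some made-up numbers. The class name, the approxSplits helper, the 40 MB file size, and the 2 MB settings are all assumptions for illustration, not values from my actual job, and the helper ignores the small slack Hadoop allows on the last split.

```java
public class SplitSizeSketch {
    // Same formula as the computeSplitSize method quoted above.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: approximate number of splits (ceiling division).
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 40 * mb; // assumed input size, for illustration only
        long minSize = 1L;

        // 64 MB blocks with no effective cap on the split size.
        long largeSplit = computeSplitSize(64 * mb, minSize, Long.MAX_VALUE);
        // Block size and max split size both lowered to 2 MB.
        long smallSplit = computeSplitSize(2 * mb, minSize, 2 * mb);

        System.out.println("splits with 64 MB blocks: " + approxSplits(fileSize, largeSplit));
        System.out.println("splits with 2 MB blocks:  " + approxSplits(fileSize, smallSplit));
    }
}
```

With these assumed numbers the first case yields one split (so one mapper) and the second yields 20, which is enough to keep a 20-node cluster busy.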


On 09/06/2011 04:44 PM, Jake Mannix wrote:
> On Tue, Sep 6, 2011 at 4:44 PM, Chris Lu<>  wrote:
>> I see, thanks!
>> It seems this should be built into the Mahout LDA algorithms, since the
>> input file is usually not too large but really needs parallel map tasks.
> If your input is not large, running a multithreaded in-memory algorithm on a
> relatively beefy box (16+ cores, enough RAM to fit your data + model + some
> spare) will be *much* faster than putting the same data on a cluster,
> actually.
>    -jake
