mahout-user mailing list archives

From David Hall <d...@cs.berkeley.edu>
Subject Re: LDA only executes a single map task per iteration when running in actual distributed mode?
Date Tue, 12 Jan 2010 05:53:30 GMT
On Mon, Jan 11, 2010 at 2:00 PM, Chad Hinton <chadhinton@gmail.com> wrote:
> I saw two comments related to an actual distributed run of the LDA example
> but no answer to this question. A previous message in the list confirms that
> at least one other person has experienced this issue. I am submitting a map
> reduce job to a 20 node Hadoop cluster as follows:
>
> hadoop jar /root/mahout-core-0.2.job
> org.apache.mahout.clustering.lda.LDADriver -i
> hdfs://master/lda/input/vectors -o hdfs://master/lda/output -k 20 -v 10000
> --maxIter 40
>
> where lda/input/vectors is the vectors file generated from the standalone
> build-reuters.sh example. I can only get a single map task to execute while
> approx. 57 task slots are available. Has anyone actually run distributed LDA
> successfully? This will help me figure out if I have a Hadoop config issue
> or if there is an actual algorithm implementation problem. The Hadoop
> examples run successfully in distributed mode utilizing all available map
> tasks. I'm not sure if there is an issue with the InputSplit for the
> SequenceFile or something else... Any help is appreciated.

I haven't actually run LDA distributed myself (though I've spoken with
someone who has). The Reuters example is pretty simplistic: it writes a
single vectors file and does nothing to split that file up, so the job
ends up with one input split and only runs on one machine. If you shard
the vectors into several files, it should just work. I can brush up on
my Hadoop foo to figure out how to have Hadoop split up a single file,
if you want.
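
To make the one-split behaviour concrete, here is a rough sketch of the
arithmetic Hadoop's file-based input formats use to decide how many splits
(and so how many map tasks) a single file gets. It is a simplified model,
not the Mahout or Hadoop source: the class and method names are made up
for illustration, and the 64 MB block size and the file sizes are assumed
values, not measurements from the Reuters run.

```java
// Hypothetical helper, written only to illustrate the split arithmetic.
class SplitEstimate {
    // Hadoop lets the final split run up to 10% over the target size.
    static final double SPLIT_SLOP = 1.1;

    // Target split size: the block size, clamped by the min/max settings.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits produced for one file of the given size.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb; // assumed HDFS block size

        // Default limits: split size equals the block size.
        long defaultSize = splitSize(1, Long.MAX_VALUE, blockSize);
        System.out.println(countSplits(20 * mb, defaultSize));  // prints 1
        System.out.println(countSplits(200 * mb, defaultSize)); // prints 4

        // Same 20 MB file with an assumed 1 MB cap on split size.
        long cappedSize = splitSize(1, 1 * mb, blockSize);
        System.out.println(countSplits(20 * mb, cappedSize));   // prints 20
    }
}
```

Under these assumptions, a vectors file smaller than one block always
comes out as a single split, which matches the single map task described
above. Sharding the vectors into several files gives at least one split
per file; capping the split size would have a similar effect, provided
the job's input format honors such a cap.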

-- David
