mahout-user mailing list archives

From Chad Hinton <chadhin...@gmail.com>
Subject Re: LDA only executes a single map task per iteration when running in actual distributed mode?
Date Tue, 12 Jan 2010 22:06:51 GMT
I tried it first with the smaller number and got a lot of maps! But
that's good... so long as I got past the single map task. I am
tweaking it now but this did the trick. Thanks again.
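The byte arithmetic behind the fix confirmed above can be sketched as follows. This is a minimal illustration; the helper name is made up for this sketch and is not a Hadoop or Mahout API:

```python
import math

# Hypothetical helper (not a Hadoop API): pick a mapred.max.split.size value
# that yields roughly the desired number of map tasks for a given input file.
def max_split_size(file_size_bytes: int, target_mappers: int) -> int:
    # Ceiling division: rounding down would leave a remainder split and
    # produce one mapper more than requested.
    return math.ceil(file_size_bytes / target_mappers)

file_size = int(8.8 * 1024 * 1024)   # 8.8 MB ~ 9227468 bytes
print(max_split_size(file_size, 10)) # -> 922747
```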

On Tue, Jan 12, 2010 at 12:33 PM, deneche abdelhakim <a_deneche@yahoo.fr> wrote:
> Oops, sorry, the size should be specified in bytes, not kB. So 8.8 MB ~ 9227468 bytes;
> to get 10 mappers, use mapred.max.split.size=922747
>
> --- On Tue, Jan 12, 2010, deneche abdelhakim <a_deneche@yahoo.fr> wrote:
>
>> From: deneche abdelhakim <a_deneche@yahoo.fr>
>> Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
>> To: mahout-user@lucene.apache.org
>> Date: Tuesday, January 12, 2010, 5:43 PM
>> Try using a small value for the Hadoop parameter
>> "mapred.max.split.size". For a file size of 8.8 MB
>> (~9000 kB), if you want 10 mappers you should use a max split
>> size of 9000/10 = 900.
>>
>> I don't know if LDADriver implements the Hadoop Tool interface,
>> but if it does you can pass the desired value on the command
>> line as follows:
>>
>> hadoop jar /root/mahout-core-0.2.job \
>>   org.apache.mahout.clustering.lda.LDADriver \
>>   -Dmapred.max.split.size=900 \
>>   -i hdfs://master/lda/input/vectors \
>>   -o hdfs://master/lda/output \
>>   -k 20 -v 10000 --maxIter 40
>>
>> Please note that it won't work if LDADriver is using a
>> fancy InputFormat other than FileInputFormat. The easiest
>> way to know is just to try it!
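Why a small mapred.max.split.size has this effect: the new-API FileInputFormat caps each split with the rule splitSize = max(minSize, min(maxSize, blockSize)). A sketch of that rule follows; treating LDADriver's input as going through this default path is an assumption, not confirmed by the thread:

```python
# Sketch of the FileInputFormat split-size rule (assumption: the job uses the
# default FileInputFormat path, not a custom InputFormat).
def compute_split_size(min_size: int, max_size: int, block_size: int) -> int:
    # The HDFS block size caps the split unless mapred.max.split.size is
    # smaller; mapred.min.split.size puts a floor under it.
    return max(min_size, min(max_size, block_size))

BLOCK = 64 * 1024 * 1024  # default HDFS block size, 64 MB

# With default min/max, an 8.8 MB file fits in one block -> one split/mapper.
print(compute_split_size(1, 2**63 - 1, BLOCK))  # -> 67108864
# Lowering mapred.max.split.size forces many smaller splits -> more mappers.
print(compute_split_size(1, 922747, BLOCK))     # -> 922747
```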
>>
>> --- On Tue, Jan 12, 2010, Chad Hinton <chadhinton@gmail.com> wrote:
>>
>> > From: Chad Hinton <chadhinton@gmail.com>
>> > Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
>> > To: "mahout-user" <mahout-user@lucene.apache.org>
>> > Date: Tuesday, January 12, 2010, 5:13 PM
>> > Ted, David - thanks for your replies.
>> > I thought Hadoop would automatically split the file, but it is not.
>> > The vectors file generated from build-reuters.sh (by using
>> > org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
>> > comes out to around 8.8 MB. Perhaps that is too small and won't be
>> > split if it's below the HDFS block size. I'm using the default 64 MB
>> > for HDFS. Perhaps a custom InputSplit/RecordReader is needed to
>> > split the sequence file. I'll investigate further. If anyone has
>> > further pointers or more info please chime in.
>> >
>> > Thanks,
>> > Chad
>> >
>> > > It should just happen if the file is large enough and the program
>> > > is configured for more than one mapper task and the file type is
>> > > correct.
>> > >
>> > > If you are reading an uncompressed sequence file you should be set.
>> > >
>> > > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <dlwh@cs.berkeley.edu> wrote:
>> > >
>> > >> I can brush up on my hadoop foo to figure out how to have hadoop
>> > >> split up a single file, if you want.
>> > >
>> > > --
>> > > Ted Dunning, CTO
>> > > DeepDyve
