mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Vector creation - out of memory error
Date Wed, 22 Jul 2009 05:22:47 GMT
My understanding of MR is embarrassingly low, but I am unable to see
how it can drastically improve the performance. Each node in the Hadoop
cluster has to load the document frequencies, consuming the same amount
of memory. They may operate on different document ranges, giving a
linear performance improvement.
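
To make that concern concrete, here is roughly how I picture a mapper
(a made-up sketch, not actual Mahout code; the class and helper names
are hypothetical):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper.  The point: every mapper still has to carry the
// complete term -> document-frequency map, so per-node memory does not
// go down; only the document range is split across nodes.
public class VectorizeMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Map<String, Integer> docFreq;

  @Override
  protected void setup(Context context) throws IOException {
    // Each node loads the same dictionary, e.g. shipped via DistributedCache.
    docFreq = loadDocFreq(context.getConfiguration());
  }

  @Override
  protected void map(LongWritable docId, Text docText, Context context)
      throws IOException, InterruptedException {
    // Compute term weights for this document using docFreq and emit the vector.
  }

  private Map<String, Integer> loadDocFreq(Configuration conf) throws IOException {
    return new HashMap<String, Integer>();  // placeholder for the real dictionary load
  }
}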

Here are a couple of thoughts. See if these are sensible.

a) The vector generation process is serial right now. During each pass,
some time is spent retrieving term vectors (which incurs disk access),
some calculating weights (CPU intensive), and finally some writing the
vector to disk (again disk access). If the process were multi-threaded,
there would be a nice pipeline that keeps the CPU and disk busy all the
time. I suspect the disk will hit saturation before the CPU.
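
Roughly what I have in mind (a toy sketch with stand-in data, not Mahout
code): one thread reads term frequencies from the index, another turns
them into weights, and the main thread writes the finished vectors out.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class VectorPipeline {

  private static final double[] POISON = new double[0];  // end-of-stream marker

  public static void main(String[] args) throws InterruptedException {
    final BlockingQueue<int[]> termFreqs = new ArrayBlockingQueue<int[]>(1000);
    final BlockingQueue<double[]> vectors = new ArrayBlockingQueue<double[]>(1000);

    // 1) Reader: pulls term frequencies from the index (disk bound).
    Thread reader = new Thread(new Runnable() {
      public void run() {
        try {
          for (int doc = 0; doc < 10000; doc++) {
            termFreqs.put(new int[] {1 + doc % 7, 3, 2});  // stand-in for real term freqs
          }
          termFreqs.put(new int[0]);  // signal end of input
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });

    // 2) Weigher: turns frequencies into weights (CPU bound).
    Thread weigher = new Thread(new Runnable() {
      public void run() {
        try {
          for (int[] tf = termFreqs.take(); tf.length > 0; tf = termFreqs.take()) {
            double[] v = new double[tf.length];
            for (int i = 0; i < tf.length; i++) {
              v[i] = Math.log(1 + tf[i]);  // stand-in for the real weighting
            }
            vectors.put(v);
          }
          vectors.put(POISON);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });

    // 3) Writer (main thread): would append finished vectors to disk.
    reader.start();
    weigher.start();
    for (double[] v = vectors.take(); v != POISON; v = vectors.take()) {
      // write v to the output file here
    }
    reader.join();
    weigher.join();
  }
}

With a pool of weigher threads the CPU side scales further, but my guess
is the reader saturates the disk first anyway.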

b) I am not sure if you are storing term positions in the index. To
calculate vectors you don't require term positions. If you only store
term frequencies during indexing, the index size will come down quite a
bit, in which case you can load the entire index in memory. Disk access
is completely eliminated and vector generation will be fast. But again,
this assumes you don't need term vectors.
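
For example, when adding the content field you can keep just the plain
term vector instead of positions and offsets (a rough sketch against the
Lucene 2.4-era API; adjust the constants to your version):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LeanIndexing {

  static Document makeDoc(String text) {
    Document doc = new Document();
    // Term frequencies only -- enough for building the Mahout vectors:
    doc.add(new Field("content", text,
                      Field.Store.NO,
                      Field.Index.ANALYZED,
                      Field.TermVector.YES));
    // instead of the much larger
    //   Field.TermVector.WITH_POSITIONS_OFFSETS
    return doc;
  }
}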

As Grant has pointed out, JSON is not useful for clustering. You may
switch it off.
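
For reference, with the pruning filters applied your invocation from the
other day would look something like this (I'm assuming the option names
are --minDF and --maxDFPercent in your Mahout build; please double-check):

java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind \
  --output /user/florian/index-vectors-01 --field content \
  --dictOut /user/florian/index-dict-01 --weight TF \
  --minDF 4 --maxDFPercent 50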

--shashi

On Tue, Jul 21, 2009 at 11:44 PM, Florian Leibert<flo@leibert.de> wrote:
> Hi Shashi,
> great - I'm trying the settings maxDFPercent 50 and minDF 4 - I have a lot
> of very short documents, some of which can be very descriptive.
> I'm thinking I should have used the StopWordAnalyzer in Lucene when creating
> the index - that way the creation of the vectors would be much faster.
>
> Yesterday it took about 8 hours to process these vectors on a quad core
> machine with 4 GB of heap - using the sequence file writer - I assume that
> the bottleneck might have been the constant transfer into HDFS - that's why
> I'm using the file writer now. It has been running on my 6 GB index for
> about 90 minutes now, and while the vector sequence file yesterday was 3 GB
> (without filtering), the JSON file is already at 16 GB (with filtering) -
> which I attribute to the compression of the sequence file...
>
> I'm trying to allot some time to transform the vector creation process to
> M/R if nobody else is working on that at the moment...
>
> Florian
>
>
> On Mon, Jul 20, 2009 at 10:46 PM, Shashikant Kore <shashikant@gmail.com>wrote:
>
>> You can restrict the term set by applying "minDf"  & "maxDFPercent"
>> filters.
>>
>> The idea behind the parameters is that terms occurring too frequently
>> or too rarely are not very useful. If you set the "minDf" parameter to 10,
>> a term has to appear in at least 10 documents in the index.
>> Similarly, if "maxDFPercent" is set to 50, all terms appearing in more
>> than 50% of the documents are ignored.
>>
>> These two parameters prune the term set drastically. I wouldn't be
>> surprised if the term set shrinks to less than 10% of the original set.
>> Since the vector generation code keeps a term->doc-freq map in memory,
>> the memory footprint is now at a "manageable" level. Also, vector
>> generation will be faster as there are fewer features per vector.
>>
>> BTW, how slow is vector generation? I don't have exact figures with
>> me, but on a single box, I recall it to be higher than 50 vectors per
>> second.
>>
>> --shashi
>>
>> On Tue, Jul 21, 2009 at 12:10 AM, Florian Leibert<flo@leibert.de> wrote:
>> > Hi,
>> > I'm trying to create vectors with Mahout as explained in
>> > http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
>> > however I keep running out of heap. My heap is set to 2 GB already and I use
>> > these parameters:
>> > "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
>> > /user/florian/index-vectors-01 --field content --dictOut
>> > /user/florian/index-dict-01 --weight TF".
>> >
>> > My index is currently about 6 GB. Is there any way to compute the
>> > vectors in a distributed manner? What's the largest index someone has
>> > created vectors from?
>> >
>> > Thanks!
>> >
>> > Florian
>> >
>>
