mahout-user mailing list archives

From Florian Leibert <...@leibert.de>
Subject Re: Vector creation - out of memory error
Date Tue, 21 Jul 2009 18:14:40 GMT
Hi Shashi,
great - I'm trying the settings maxDFPercent 50 and minDF 4 - I have a lot
of very short documents, some of which can be very descriptive.
I'm thinking I should have used the StopAnalyzer in Lucene when creating
the index - that way the creation of the vectors would be much faster.
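
For reference, a rough sketch of what I mean - this assumes the Lucene
2.x-style API (constructor and field flags differ slightly between versions),
and the term vectors are stored because, as far as I understand, the Mahout
vector Driver reads them back from the index:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch only - not the code that built the 6 GB index.
public class StopWordIndexingSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/LUCENE/ind");  // index path from this thread
    // StopAnalyzer drops common English stop words at index time, so they
    // never show up as features in the generated vectors.
    IndexWriter writer = new IndexWriter(dir, new StopAnalyzer(), true);

    Document doc = new Document();
    // Term vectors are stored so the "content" field can later be turned into vectors.
    doc.add(new Field("content", "a very short but descriptive document",
                      Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}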

Yesterday it took about 8 hours to process these vectors on a quad-core
machine with 4 GB of heap, using the sequence file writer - I assume that
the bottleneck might have been the constant transfer into HDFS - that's why
I'm using the file writer now. It has been running on my 6 GB index for about
90 minutes now, and while yesterday's vector sequence file was 3 GB
(without filtering), the JSON file is already at 16 GB (with filtering) -
which I attribute to the compression of the sequence file...
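
For comparison, the sequence-file route with block compression looks roughly
like this - just a sketch against the plain Hadoop SequenceFile API, with a
Text value standing in for whatever writable the vector code really uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: the value class here is a stand-in, not Mahout's vector writable.
public class SequenceFileVectorSink {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // HDFS if fs.default.name points there, else local
    Path out = new Path("/user/florian/index-vectors-01");

    // BLOCK compression batches many records and compresses them together,
    // which is presumably why the sequence file stays so much smaller than
    // an uncompressed JSON dump of the same vectors.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      writer.append(new LongWritable(0L), new Text("term:weight term:weight"));
    } finally {
      writer.close();
    }
  }
}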

I'm trying to allot some time to transform the vector creation process to
M/R if nobody else is working on that at the moment...
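
Something along these lines is what I have in mind for the map side - purely a
hypothetical skeleton (TfVectorMapper is not an existing Mahout class), and it
assumes the documents have already been dumped as one tokenized document per
line:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical skeleton, not Mahout code: each map call turns one tokenized
// document into a sparse "term:tf" record, so vector creation parallelizes
// across however many map tasks the cluster runs.
public class TfVectorMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  public void map(LongWritable docOffset, Text doc,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    // Count term frequencies for this document.
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String term : doc.toString().split("\\s+")) {
      if (term.length() == 0) {
        continue;
      }
      Integer count = tf.get(term);
      tf.put(term, count == null ? 1 : count + 1);
    }
    // Emit the document as a sparse text vector; a real job would use a binary
    // vector writable and would need a separate pass over the data to compute
    // the document frequencies for the minDf/maxDFPercent filters.
    StringBuilder vector = new StringBuilder();
    for (Map.Entry<String, Integer> entry : tf.entrySet()) {
      if (vector.length() > 0) {
        vector.append(' ');
      }
      vector.append(entry.getKey()).append(':').append(entry.getValue());
    }
    output.collect(docOffset, new Text(vector.toString()));
  }
}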

Florian


On Mon, Jul 20, 2009 at 10:46 PM, Shashikant Kore <shashikant@gmail.com> wrote:

> You can restrict the term set by applying the "minDf" & "maxDFPercent"
> filters.
>
> The idea behind the parameters is that terms occurring too frequently
> or too rarely are not very useful. If you set the "minDf" parameter to 10,
> a term has to appear in at least 10 documents in the index to be kept.
> Similarly, if "maxDFPercent" is set to 50, all terms appearing in more
> than 50% of the documents are ignored.
>
> These two parameters prune the term set drastically. I wouldn't be
> surprised if the term set shrinks to less than 10% of the original set.
> Since the vector generation code keeps the term->doc-freq map in memory,
> the memory footprint is now at a "manageable" level. Also, vector
> generation will be faster as there are fewer features per
> vector.
>
> BTW, how slow is vector generation? I don't have exact figures with
> me, but on a single box, I recall it being higher than 50 vectors per
> second.
>
> --shashi
>
> On Tue, Jul 21, 2009 at 12:10 AM, Florian Leibert <flo@leibert.de> wrote:
> > Hi,
> > I'm trying to create vectors with Mahout as explained in
> > http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
> > however I keep running out of heap. My heap is set to 2 GB already and
> > I use these parameters:
> > "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
> > /user/florian/index-vectors-01 --field content --dictOut
> > /user/florian/index-dict-01 --weight TF".
> >
> > My index is currently about 6 GB. Is there any way to compute the
> > vectors in a distributed manner? What's the largest index someone has
> > created vectors from?
> >
> > Thanks!
> >
> > Florian
> >
>
