mahout-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Who owns mahout bucket on s3?
Date Sun, 28 Feb 2010 20:48:22 GMT
I thought you were doing the secondary sort idea?  That's certainly the
way to make sure you need nothing significant kept in memory, and this
clearly won't scale without that optimization...

I'd say this should get fixed before we release 0.3

  -jake
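
The compound key / secondary sort idea discussed in this thread can be sketched in plain Java, without Hadoop, to show why it removes the need for an in-memory set. This is only an illustration under stated assumptions, not the Mahout implementation: `SecondarySortSketch`, `GramKey`, `ORDER` and `countSorted` are hypothetical names, and the explicit sort stands in for the ordering that the framework's shuffle would provide if the key carried both the (n-1)-gram head and the full ngram.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical compound key: the (n-1)-gram "head" plus the full ngram. */
    public static final class GramKey {
        public final String head;
        public final String ngram;

        public GramKey(String head, String ngram) {
            this.head = head;
            this.ngram = ngram;
        }
    }

    /** Head first, then ngram, so identical ngrams for one head arrive adjacent. */
    public static final Comparator<GramKey> ORDER =
            Comparator.comparing((GramKey k) -> k.head)
                      .thenComparing((GramKey k) -> k.ngram);

    /**
     * Streams over keys already sorted by ORDER and emits one
     * "head<TAB>ngram<TAB>count" line per distinct key. Only the previous
     * key and a running count are held, never a set of ngrams.
     */
    public static List<String> countSorted(List<GramKey> sortedKeys) {
        List<String> out = new ArrayList<>();
        GramKey prev = null;
        int count = 0;
        for (GramKey k : sortedKeys) {
            boolean sameAsPrev = prev != null
                    && prev.head.equals(k.head)
                    && prev.ngram.equals(k.ngram);
            if (prev != null && !sameAsPrev) {
                // Key changed: the count for the previous ngram is final.
                out.add(prev.head + "\t" + prev.ngram + "\t" + count);
                count = 0;
            }
            prev = k;
            count++;
        }
        if (prev != null) {
            out.add(prev.head + "\t" + prev.ngram + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<GramKey> keys = new ArrayList<>();
        keys.add(new GramKey("new", "new york"));
        keys.add(new GramKey("new", "new jersey"));
        keys.add(new GramKey("new", "new york"));
        keys.add(new GramKey("san", "san jose"));
        keys.sort(ORDER); // stand-in for the sort done during the shuffle
        for (String line : countSorted(keys)) {
            System.out.println(line);
        }
    }
}
```

Because equal keys are adjacent after sorting, the working state is one key and one counter regardless of how many distinct ngrams share a head. In a real job the same effect would presumably need a partitioner and grouping comparator keyed on the head alone, so that all ngrams for a head reach the same reduce call.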

On Sun, Feb 28, 2010 at 7:30 AM, Drew Farris <drew.farris@gmail.com> wrote:

> So one option would be to do the frequency counts in another pass, but
> I don't really like that idea. I think a compound key / secondary sort
> would work so that the ngrams don't have to be tracked in a set.
>
> I will give it a try later today.
>
> On Sunday, February 28, 2010, Drew Farris <drew.farris@gmail.com> wrote:
> > Bah, that's not correct. I do end up keeping each unique ngram for a
> > given n-1gram in memory in the CollocCombiner and CollocReducer to do
> > frequency counting. There's likely a more elegant solution to this.
> >
> > On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.farris@gmail.com> wrote:
> >> Argh, I'll look into it and see where Grams are kept in memory. There
> >> really shouldn't be any place where they're retained beyond what's
> >> needed for a single document. I doubt that there are documents in
> >> wikipedia that would blow the heap in this way, but I suppose it's
> >> possible. You're just doing bigrams, or did you end up going up to
> >> 5-grams?
> >>
> >> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.anil@gmail.com> wrote:
> >>> after 9 hours of compute, it failed. It never went past the colloc
> >>> combiner pass :(
> >>>
> >>> reason. I will have to tag drew along to identify the possible cause of
> >>> this out of memory error
> >>>
> >>>
> >>> java.lang.OutOfMemoryError: Java heap space
> >>>        at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
> >>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
> >>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
> >>>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)
> >>>
> >>
> >
>
