mahout-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: RowSimilarity ?'s
Date Thu, 14 Jul 2011 19:24:27 GMT
On Thu, Jul 14, 2011 at 8:00 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> > You need all cooccurrences since some implementations need that value,
> > and you're computing all-pairs.
>
> Can you explain the diffs from the cited paper?  (Per the comment in the
> top of the Job file)
>
> For the record, I'm currently running this on ~500K rows and ~150K terms
> (each vector is pretty sparse) and it is taking a long time, way longer than
> what is cited in the paper for what appears to be a bigger corpus with more
> terms on crappier hardware.
>
>
It's the same approach in the important aspects, as far as I can tell. One
thing they did was remove the top 1% longest posting lists entirely, which
addresses exactly the long-rows (here, 'columns') issue Ted mentioned.

There is no lever here for truncating these big rows/columns, which can
dominate the run time -- that could be useful. It just needs a rule for
tossing data -- you could simply throw away such columns (ouch), or at least
use only a sampled subset of each.
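
To make the two rules concrete, here is a rough sketch in plain Python. It is
not Mahout code and not what the job does today; the function names, the
dict-of-lists layout and the parameters are all made up for illustration. The
reason this matters is that a column with n entries contributes n*(n-1)/2
cooccurring pairs, so a handful of very long columns can account for most of
the work.

```python
import random


def cap_columns(columns, max_len, seed=0):
    """Down-sample every column longer than max_len to max_len entries.

    columns: dict mapping a column id to a list of (row_id, value) entries.
    This layout is an assumption for the sketch, not Mahout's representation.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs sample the same way
    capped = {}
    for col_id, entries in columns.items():
        if len(entries) > max_len:
            # Keep a uniform random subset instead of the whole column.
            entries = rng.sample(entries, max_len)
        capped[col_id] = list(entries)
    return capped


def drop_longest_columns(columns, fraction=0.01):
    """Throw away the given fraction of columns with the most entries."""
    n_drop = int(len(columns) * fraction)
    if n_drop == 0:
        return dict(columns)
    by_length = sorted(columns, key=lambda c: len(columns[c]), reverse=True)
    to_drop = set(by_length[:n_drop])
    return {c: e for c, e in columns.items() if c not in to_drop}
```

Dropping is the blunter rule (the 'ouch' option); capping keeps some signal
from each long column while bounding its pair count at
max_len*(max_len-1)/2.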

That's my guess.
