mahout-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: anybody want to set a record with Mahout?
Date Thu, 25 Feb 2010 21:43:23 GMT
On Thu, Feb 25, 2010 at 12:49 PM, Robin Anil <robin.anil@gmail.com> wrote:

> unigrams > 3 = 384 MB dictionary... with all ngrams(pruned by llr >1) we
> might hit some 5-10GB of entries. With some 25 char average for 5 grams it
> might be safe to say that we might say hit 100 million rows easily ?
>

Wait, what are you saying here?  You're looking at the Wikipedia set?  384 MB
for a dictionary doesn't tell us about our density.  We have 4 million rows
(one per document / page), and each page has some number of entries from this
set of ngrams.  If it's more than 1000 unique ngrams per page, we're up to 4B
nonzero entries in our corpus matrix, right?  Of course, at this point we've
got too many terms to properly do the decomposition directly on the input
matrix; we'd have to do it on the transpose, or on the gram (to overuse a
term) 4M x 4M similarity matrix.  The former would probably be much more
performant.
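
To make the arithmetic above concrete, here is a rough back-of-envelope sketch.
The 4 million rows and 1000 unique ngrams per page come from the estimate above;
the function name and the 12 bytes per stored entry (a 4-byte column index plus
an 8-byte double) are assumptions for illustration only, not how Mahout actually
lays out its vectors.

```python
def corpus_matrix_estimate(num_docs, unique_ngrams_per_doc, bytes_per_entry=12):
    """Back-of-envelope size of a sparse document-by-ngram corpus matrix.

    bytes_per_entry is an assumed cost per nonzero (4-byte index + 8-byte
    value); real storage formats will differ.
    """
    nonzeros = num_docs * unique_ngrams_per_doc
    return {
        # Nonzero entries in the documents x ngrams matrix.
        "nonzeros": nonzeros,
        # Approximate bytes needed to hold those entries sparsely.
        "sparse_bytes": nonzeros * bytes_per_entry,
        # Entries in the documents x documents gram matrix if stored densely.
        "gram_entries_dense": num_docs * num_docs,
    }


est = corpus_matrix_estimate(num_docs=4_000_000, unique_ngrams_per_doc=1_000)
print(est["nonzeros"])            # 4000000000
print(est["sparse_bytes"])        # 48000000000
print(est["gram_entries_dense"])  # 16000000000000
```

Under these assumptions the corpus matrix holds about 4 billion nonzeros (roughly
48 GB), while a dense 4M x 4M similarity matrix would have 1.6e13 entries.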

  -jake
