mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Tue, 03 Nov 2009 19:23:44 GMT
Another alternative is to simply store a boolean matrix.  That would require
4 bytes per term: just the int index, with no value stored.
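
As a rough check on the uncompressed sizes, here is a minimal Python sketch.
It assumes a boolean entry is a single 4-byte int index and a counted entry
is an int index plus an 8-byte double (the 12 bytes Jake mentions in the
quoted message below).  The names BOOLEAN_ENTRY, COUNTED_ENTRY and
uncompressed_bytes are made up for illustration and are not Mahout APIs.

```python
import struct

# Assumed layouts (illustrative only, not Mahout's actual serialization):
#   boolean entry: one 4-byte int index
#   counted entry: one 4-byte int index plus one 8-byte double value
BOOLEAN_ENTRY = struct.calcsize(">i")   # standard sizes, no padding
COUNTED_ENTRY = struct.calcsize(">id")  # standard sizes, no padding


def uncompressed_bytes(unique_terms, with_counts):
    """Uncompressed size of one document vector, in bytes."""
    per_entry = COUNTED_ENTRY if with_counts else BOOLEAN_ENTRY
    return unique_terms * per_entry


if __name__ == "__main__":
    print(BOOLEAN_ENTRY, COUNTED_ENTRY)
    print(uncompressed_bytes(100, with_counts=False))
    print(uncompressed_bytes(100, with_counts=True))
```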

Both forms would compress pretty well.  In the boolean case, I would expect
the average cost to be just under 2 bytes per term.  For the vector with
actual counts stored in doubles, the compression would probably be nearly
as good, adding about another byte or less per term for the value.

If we assume 2 bytes per term and 1 byte for the count on average after
compression, this should be about a quarter of the size of the original text
(assuming an average term count of 2).  Markup will be stripped, which will
allow a bit more savings.
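
The arithmetic behind that estimate can be sketched in a few lines of
Python.  The 2-byte index, 1-byte count and average term count of 2 come
from the paragraph above; the 6 bytes of raw text per token (roughly a
five-letter word plus a space) is my own assumption, chosen only because
it is consistent with the "about a quarter" figure.  The function name is
hypothetical.

```python
def compressed_vector_fraction(index_bytes=2.0, count_bytes=1.0,
                               avg_term_count=2.0, avg_token_bytes=6.0):
    """Estimated vector size as a fraction of the raw text size.

    avg_token_bytes is an assumed figure, not one given in the thread.
    """
    # Compressed cost of one unique term in the vector.
    vector_bytes = index_bytes + count_bytes
    # Raw text that one unique term accounts for, counting its repeats.
    text_bytes = avg_term_count * avg_token_bytes
    return vector_bytes / text_bytes


if __name__ == "__main__":
    print(compressed_vector_fraction())               # counts stored
    print(compressed_vector_fraction(count_bytes=0))  # boolean case
```

With these inputs the counted form comes to 3 / 12 of the text size, and
the boolean form to 2 / 12.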

These numbers are very much in line with Jake's estimates.

On Tue, Nov 3, 2009 at 10:14 AM, Jake Mannix <jake.mannix@gmail.com> wrote:

> Well the minimum size, for the IntDoubleVector which isn't yet in trunk
> (it's on Ted's patch which hasn't worked its way in yet) would entail one
> int and one double per unique term in the document, so that's 12 bytes
> each.  Typical documents have lots of repeat terms, but most terms are
> smaller than 12 bytes as well... so the fraction is probably more than 10%
> and less than 50% is my guess.  But I'm sure others around here have more
> experience producing large vector sets out of the text in Mahout.
>
>  -jake
>
> On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
> >
> > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> >
> >  Might be of interest to all you Mahouts out there...
> >> http://bixolabs.com/datasets/public-terabyte-dataset-project/
> >>
> >> Would be cool to get this converted over to our vector format so that we
> >> can cluster, etc.
> >>
> >
> >
> > How much additional space would be required for the vectors, in some
> > optimal compressed format? Say as a percentage of raw text size.
> >
> > I'm asking because I have some flexibility in the processing and
> > associated metadata I can store as part of the dataset.
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve
