mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Tue, 03 Nov 2009 18:14:11 GMT
Well, the minimum size for the IntDoubleVector (which isn't yet in trunk; it's
in Ted's patch, which hasn't worked its way in yet) would be one int and one
double per unique term in the document, so that's 12 bytes per term.  Typical
documents have lots of repeated terms, which shrinks the vector relative to
the text, but most terms are also shorter than 12 bytes, which pushes the
other way.  So my guess is that the fraction is probably more than 10% and
less than 50% of the raw text size.  But I'm sure others around here have more
experience producing large vector sets out of text in Mahout.
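
To make the arithmetic above concrete, here is a rough Python sketch of the
estimate: 12 bytes (a 4-byte int index plus an 8-byte double weight) per unique
term, divided by the raw text size. This is not Mahout code. The function name
vector_fraction and the simple lowercase word-split tokenizer are assumptions
made purely for illustration, and the sketch ignores any per-vector overhead or
compression.

```python
import re

# Assumption: each sparse entry costs one 4-byte int plus one 8-byte double.
BYTES_PER_ENTRY = 4 + 8


def vector_fraction(text: str) -> float:
    """Estimate sparse-vector size as a fraction of the raw UTF-8 text size.

    Hypothetical helper: tokenization is a plain lowercase word split,
    which is only a stand-in for whatever analyzer would really be used.
    """
    raw_bytes = len(text.encode("utf-8"))
    if raw_bytes == 0:
        return 0.0
    # One vector entry per unique term, however often the term repeats.
    unique_terms = set(re.findall(r"\w+", text.lower()))
    return len(unique_terms) * BYTES_PER_ENTRY / raw_bytes
```

On very short snippets nearly every term is unique, so the ratio comes out well
above 100%. The 10% to 50% guess only makes sense for full-length documents,
where terms repeat often.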

  -jake

On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>> Might be of interest to all you Mahouts out there...
>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>>
>
>
> How much additional space would be required for the vectors, in some
> optimal compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
