mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Wed, 04 Nov 2009 08:05:22 GMT
First, we need to create a Lucene index from this text. Typically, the
index size is close to 30% of the raw text (though I have seen cases
where it is as high as 45%). The vectors take about 25% of the index
size, or roughly 10% of the original text.

The space taken by the index can be reclaimed after creating the vectors,
so only the vectors need to be kept long term.
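
To make the arithmetic concrete, here is a rough sketch in Python. It assumes a 1 TB raw-text corpus, and the ratios are just the estimates quoted above, not measurements; the function name estimate_space and its parameters are made up for illustration.

```python
def estimate_space(raw_bytes, index_ratio=0.30, vector_ratio=0.25):
    """Estimate disk usage from rough ratios (assumed, not measured).

    index_ratio:  Lucene index size as a fraction of raw text.
    vector_ratio: vector size as a fraction of the index size.
    Returns (index_bytes, vector_bytes, peak_extra_bytes).
    """
    index_bytes = raw_bytes * index_ratio
    vector_bytes = index_bytes * vector_ratio
    # Peak extra space: index and vectors both exist until the index is deleted.
    peak_extra_bytes = index_bytes + vector_bytes
    return index_bytes, vector_bytes, peak_extra_bytes


TB = 10**12  # assumed corpus size: 1 TB of raw text

# Typical case (30% index) and the high case mentioned above (45% index).
for ratio in (0.30, 0.45):
    index_b, vector_b, peak_b = estimate_space(TB, index_ratio=ratio)
    print(f"index ratio {ratio:.0%}: "
          f"index = {index_b / TB:.2%}, "
          f"vectors = {vector_b / TB:.2%}, "
          f"peak extra = {peak_b / TB:.2%} of raw text")
```

With these inputs the vectors come out between 7.5% and 11.25% of the raw text, which is where the "roughly 10%" figure comes from. The peak figure matters only while the index and the vectors sit on disk together.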

--shashi

On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>> Might be of interest to all you Mahouts out there...
>>  http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>
>
> How much additional space would be required for the vectors, in some optimal
> compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
