mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Wed, 04 Nov 2009 19:46:55 GMT
In the intermediate representation, it is very good to keep String -> double
mappings in some form.  In memory, we probably need to separate this into
String -> index and index -> double representations so that we have
flexibility in how each one is represented.
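
A minimal sketch of the split described above, written in Python purely for illustration. The names `FeatureDictionary` and `to_indexed` are hypothetical and are not Mahout classes; the sketch only shows how a String -> double mapping can be separated into a String -> index dictionary and an index -> double sparse vector.

```python
from typing import Dict


class FeatureDictionary:
    """Hypothetical helper: assigns a stable integer index to each feature string."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def index_of(self, feature: str) -> int:
        # Hand out the next unused index the first time a feature is seen.
        if feature not in self._index:
            self._index[feature] = len(self._index)
        return self._index[feature]

    def __len__(self) -> int:
        return len(self._index)


def to_indexed(weights: Dict[str, float], dictionary: FeatureDictionary) -> Dict[int, float]:
    """Convert a String -> double mapping into an index -> double sparse vector."""
    return {dictionary.index_of(feature): weight for feature, weight in weights.items()}


# Two documents share one dictionary, so "vector" gets the same index in both.
dictionary = FeatureDictionary()
doc_a = to_indexed({"hadoop": 0.5, "vector": 1.25}, dictionary)
doc_b = to_indexed({"vector": 2.0, "cluster": 0.75}, dictionary)
```

Because the dictionary is kept apart from the vectors, the index -> double side could later be swapped for a dense array or a compressed format without touching the String -> index side.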

I am not sure which you intended.

On Wed, Nov 4, 2009 at 1:16 AM, Robin Anil <robin.anil@gmail.com> wrote:

> I had always thought we should be using Hadoop to number these features and
> create the vectors the way the Bayes classifier does it. In the Bayes
> classifier, I don't bother to number the features; instead I use a
> String=>double mapping. I will see if feature numbering can be done by a
> single map/reduce job. If that's the case, we can use the TfIdfDriver to
> generate the TF-IDF scores and then convert the docs into array(int=>double)
> vectors. That way it would be done in a distributed manner.
>
>
> Robin
>
>
> On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <shashikant@gmail.com> wrote:
>
> > First, we need to create a Lucene index from this text. Typically, the
> > index size is close to 30% of the raw text (though I have seen cases
> > where it is as high as 45%). The vectors take 25% of the index size
> > (or roughly 10% of the original text).
> >
> > The space taken by index could be reclaimed after creating the vectors.
> >
> > --shashi
> >
> > On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> > >
> > > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> > >
> > >> Might be of interest to all you Mahouts out there...
> > >>  http://bixolabs.com/datasets/public-terabyte-dataset-project/
> > >>
> > >> Would be cool to get this converted over to our vector format so that we
> > >> can cluster, etc.
> > >
> > >
> > > How much additional space would be required for the vectors, in some optimal
> > > compressed format? Say as a percentage of raw text size.
> > >
> > > I'm asking because I have some flexibility in the processing and associated
> > > metadata I can store as part of the dataset.
> > >
> > > -- Ken
> > >
> > > --------------------------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://bixolabs.com
> > > e l a s t i c   w e b   m i n i n g
> > >



-- 
Ted Dunning, CTO
DeepDyve
