mahout-dev mailing list archives

From Olivier Grisel <olivier.gri...@ensta.org>
Subject Re: producing vectors from composite documents
Date Tue, 08 Jun 2010 23:10:55 GMT
2010/6/8 Ted Dunning <ted.dunning@gmail.com>:
> Got it.
>
> This really needs to be done before vectorization, but you can segregate the
> output vector for different handling by passing in a view to different parts
> of the vector.
>
> My recommendation is that you apply IDF using the weight dictionary in the
> vectorizer.  That will let you have multiple text fields with different
> weighting schemes but still put all the results into a single result vector.
>  As a side effect, if you put everything into a vector of dimension 1, then
> you get multi-field weighted inputs for free.

Instead of storing the exact IDF values in an explicit dictionary,
one could use a counting Bloom filter data structure to reduce the
memory footprint and speed up the lookups (though Lucene is able to
handle millions of terms without any performance issues).
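
A minimal sketch of that idea in Python, not Mahout or Lucene code: the
class name, the parameters, the MD5-based double hashing and the
smoothed IDF formula are all assumptions made for illustration. Each
term increments a few counters in a fixed-size array, and the document
frequency is estimated as the minimum of those counters (the way
spectral Bloom filters and count-min sketches do it), so the estimate
can be too high on collisions but never too low.

```python
import hashlib
import math


class CountingBloomFilter:
    """Approximate term -> document-frequency counts in a fixed-size array."""

    def __init__(self, num_counters=1 << 20, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _indexes(self, term):
        # Derive the counter positions from a single digest (double hashing).
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        # A set avoids incrementing the same counter twice for one term.
        return {(h1 + i * h2) % self.num_counters for i in range(self.num_hashes)}

    def add(self, term):
        for idx in self._indexes(term):
            self.counters[idx] += 1

    def count(self, term):
        # Minimum over the term's counters: an upper bound on the true count.
        return min(self.counters[idx] for idx in self._indexes(term))


def approximate_idf(cbf, term, num_docs):
    # Smoothed IDF (an assumed variant); an overestimated df lowers the weight.
    return math.log((1 + num_docs) / (1 + cbf.count(term)))


# Hypothetical toy corpus: count each term once per document.
docs = [
    ["apache", "mahout", "vector"],
    ["mahout", "bloom", "filter"],
    ["lucene", "index"],
]
cbf = CountingBloomFilter(num_counters=1 << 16)
for doc in docs:
    for term in set(doc):
        cbf.add(term)

mahout_idf = approximate_idf(cbf, "mahout", len(docs))
```

The memory cost is fixed by num_counters regardless of vocabulary size,
at the price of approximate weights; how many counters and hashes are
needed for a given error rate would have to be tuned on real data.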

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
