mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Collocation and Seq2Sparse Questions
Date Thu, 27 May 2010 15:58:26 GMT

On May 27, 2010, at 11:52 AM, Drew Farris wrote:

> 
> Not at all.
> 
> The alternative that's been discussed here in the past would involve some
> custom analyzer work. The general idea is to load the output from the
> CollocDriver into a bloom filter and then when processing documents at
> indexing time, set up a field where you generate shingles and only index
> those that appear in the bloom filter. This way you wind up getting a set of
> ngrams indexed that are ranked high across the entire corpus instead of
> simply the best ones for each document.
> 
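The approach quoted above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Mahout or Lucene code: the class `CollocationShingleFilter`, the hand-rolled `SimpleBloomFilter`, and the hard-coded n-grams standing in for CollocDriver output are all hypothetical. A real setup would read the CollocDriver output from disk and do the filtering inside a custom analyzer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration of "only index shingles found in a Bloom filter".
// None of these names come from Mahout or Lucene.
class CollocationShingleFilter {

    /** Minimal Bloom filter: numHashes probes into a fixed-size bit set. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        SimpleBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** May return a false positive, but never a false negative. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }

        // Double hashing: the i-th probe is (h1 + i * h2) mod size.
        private int index(String key, int i) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            int h1 = Arrays.hashCode(bytes);
            int h2 = fnv1a(bytes);
            return Math.floorMod(h1 + i * h2, size);
        }

        // 32-bit FNV-1a hash, used as the second hash function.
        private static int fnv1a(byte[] data) {
            int hash = 0x811c9dc5;
            for (byte b : data) {
                hash ^= (b & 0xff);
                hash *= 0x01000193;
            }
            return hash;
        }
    }

    /** Generates word n-grams (shingles) of exactly n tokens each. */
    static List<String> shingles(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /** Keeps only the shingles that the filter reports as (probably) present. */
    static List<String> indexableShingles(List<String> tokens, int n,
                                          SimpleBloomFilter corpusNgrams) {
        List<String> kept = new ArrayList<>();
        for (String shingle : shingles(tokens, n)) {
            if (corpusNgrams.mightContain(shingle)) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the top-ranked n-grams from the CollocDriver output.
        SimpleBloomFilter corpusNgrams = new SimpleBloomFilter(1 << 16, 3);
        corpusNgrams.add("bloom filter");
        corpusNgrams.add("mailing list");

        List<String> tokens =
            Arrays.asList("load", "the", "bloom", "filter", "first");
        System.out.println(indexableShingles(tokens, 2, corpusNgrams));
    }
}
```

Because a Bloom filter can return false positives, an occasional shingle that was not in the collocation output may still get indexed. Sizing the bit set generously relative to the number of n-grams keeps that rate low.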

I'd be happy with the best ones for each doc at this point.