lucene-java-user mailing list archives

From Mathieu Lecarme
Subject Re: frequent phrases
Date Fri, 10 Aug 2007 16:40:36 GMT
Some tools exist for finding duplicated passages in documents.
You split the document into phrases and build word n-grams from each
one. If you want whole phrases, use all the words; for partial
phrases, use 5-word n-grams, for example. Each n-gram is converted to
a hash, and the hash is used as an indexed Field for the document.
With this trick, you can handle phrases the same way you handle
words, without using too much space.
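
As a rough illustration of the idea above (this is not code from the thread, and it does not use the Lucene API), here is a minimal Java sketch. The class name NgramHashSketch, its method names, the choice of 64-bit FNV-1a as the hash, and the naive sentence and word splitting are all assumptions made for the example. In a real index, each hex token would presumably be added to the document as an untokenized indexed field, and the tokens with the highest document frequency would then point to the most frequent phrases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Lucene.
final class NgramHashSketch {

    // Split text into sentences on '.', '!' and '?', then emit every run of
    // n consecutive lowercase words (assumes n >= 1).
    static List<String> wordNgrams(String text, int n) {
        List<String> ngrams = new ArrayList<>();
        for (String sentence : text.toLowerCase(Locale.ROOT).split("[.!?]+")) {
            List<String> words = new ArrayList<>();
            for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
                if (!w.isEmpty()) words.add(w);
            }
            for (int i = 0; i + n <= words.size(); i++) {
                ngrams.add(String.join(" ", words.subList(i, i + n)));
            }
        }
        return ngrams;
    }

    // 64-bit FNV-1a over the UTF-8 bytes, rendered as hex.
    // The hash function is an assumed choice; any stable hash would do.
    static String fnv1a64Hex(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return Long.toHexString(hash);
    }

    // Distinct hash tokens for one document; these are what would be indexed.
    static Set<String> fieldTokens(String text, int n) {
        Set<String> tokens = new HashSet<>();
        for (String g : wordNgrams(text, n)) tokens.add(fnv1a64Hex(g));
        return tokens;
    }

    // Top-k n-grams by number of documents containing them.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topK(List<String> docs, int n, int k) {
        Map<String, Integer> docFreq = new HashMap<>();
        Map<String, String> sample = new HashMap<>(); // hash -> readable n-gram
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String g : wordNgrams(doc, n)) {
                String h = fnv1a64Hex(g);
                if (seen.add(h)) {
                    docFreq.merge(h, 1, Integer::sum);
                    sample.putIfAbsent(h, g);
                }
            }
        }
        List<String> hashes = new ArrayList<>(docFreq.keySet());
        hashes.sort(Comparator.comparingInt((String h) -> -docFreq.get(h))
                .thenComparing((String h) -> sample.get(h)));
        List<String> top = new ArrayList<>();
        for (String h : hashes.subList(0, Math.min(k, hashes.size()))) {
            top.add(sample.get(h));
        }
        return top;
    }
}
```

Calling NgramHashSketch.topK(docs, 5, 10) on a list of document strings would return up to ten 5-word n-grams ranked by how many documents contain them. The in-memory maps stand in for what the index's term statistics would provide.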

I'm not sure I'm being as clear as I'd like.


On 9 Aug 2007, at 09:34, Akanksha Baid wrote:

> I was wondering if there is a "search based" method to find the top-k
> frequent phrases in a set of documents. (I do not have a particular
> phrase in mind so PhraseQuery can probably be ruled out.)
> I have implemented something that works using termvectors and
> termpositions but the performance is not great so far since I am
> basically iterating multiple times and hacking my way around. I was
> wondering if an API exists for finding frequent phrases and/or if
> someone could point me to some code for the same.
> Thanks.
