lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Dimulescu <adrian.dimule...@gmail.com>
Subject Re: number of hits of pages containing two terms
Date Tue, 17 Mar 2009 11:12:14 GMT
Michael McCandless wrote:
> I don't understand how this would address the "docFreq does
> not reflect deletions".
>
Bad mail-quoting, sorry. I am not interested by document deletion, I 
just index Wikipedia once, and want to get a co-occurrence-based 
similarity distance between words called NGD (normalized google 
distance). No documents get deleted. I wanted to do it the cheap way by 
indexing everything and than counting hits but it's not fast enough (I 
want to calculate all distances between all the words, that is approx 
10,000 x 10,000 comparisons).
> You can use the shingles analyzer (under contrib/analyzers)
> to create and index bigrams.  (But the docFreq would still not
> reflect deletions).
This similarity measure is based on co-occurence within a text window. 
If I'm not mistaken, bigrams require the words to be exactly next to 
each other. But thanks for pointing out that analyzer, I may be 
interested in some n-grams processing later.


Adrian.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message