lucene-java-user mailing list archives

From Morus Walter <>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Mon, 20 Sep 2004 13:06:46 GMT
David Spencer writes:
> > 
> > could you put the current version of your code on that website as a java
> Weblog entry updated:
> Great suggestion and thanks for that idiom - I should know such things 
> by now. To clarify the "issue", it's just a performance one, not other 
> functionality...anyway I put in the code - and to be scientific I 
> benchmarked it two times before the change and two times after - and the 
> results were surprisingly the same both times (1:45 to 1:50 with an index 
> that takes up > 200MB). Probably there are cases where this will run 
> faster, and the code seems more "correct" now so it's in.
Ahh, I see, you check the field later.
The logging made me think you index all fields you loop over, in which
case unwanted words might get into the ngram index.
> > 
> > 
> > An interesting application of this might be an ngram-Index enhanced version
> > of the FuzzyQuery. While this introduces more complexity on the indexing
> > side, it might be a large speedup for fuzzy searches.
> I am also thinking of reviewing the list to see if anyone had done a "Jaro 
> Winkler" fuzzy query yet and doing that....
I went in another direction and changed the ngram index and search
to use a similarity that computes

   m * m / ( n1 * n2)

where m is the number of matches and n1 is the number of ngrams in the
query and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure if I understand all parts
of the similarity class correctly)
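
To make the formula concrete, here is a minimal standalone sketch in plain
Java. It is not the attached similarity class and does not use the Lucene
Similarity API. The names NGramOverlap, ngrams and score are made up for
illustration, and counting each distinct ngram only once is an assumption,
since the message does not say how repeated ngrams are handled.

```java
import java.util.HashSet;
import java.util.Set;

final class NGramOverlap {
    // Collect the distinct character n-grams of a word
    // (assumption: duplicates count once).
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Compute m * m / (n1 * n2), where m is the number of shared n-grams,
    // n1 the number of n-grams in the query and n2 the number in the word.
    static double score(String query, String word, int n) {
        Set<String> q = ngrams(query, n);
        Set<String> w = ngrams(word, n);
        if (q.isEmpty() || w.isEmpty()) {
            return 0.0; // strings shorter than n have no n-grams
        }
        Set<String> common = new HashSet<>(q);
        common.retainAll(w);
        double m = common.size();
        return (m * m) / ((double) q.size() * w.size());
    }
}
```

With distinct ngrams the value lies between 0 and 1 and equals the square
of the cosine between the two binary ngram vectors, so identical words
score 1 and words without a common ngram score 0.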

After removing the document boost in the ngram index (which was based on
the word frequency in the original index), I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.

I still use the Levenshtein distance to modify the score and sort according
to  score / distance,  but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.
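
For anyone who wants to try the score / distance adjustment, a rough sketch
in plain Java follows. The class and method names are hypothetical, and
treating a distance of 0 as 1 to avoid dividing by zero is an assumption;
the message does not say how exact matches are handled.

```java
final class DistanceRerank {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Divide the ngram score by the edit distance
    // (assumption: a distance of 0 is treated as 1).
    static double adjusted(double score, String query, String candidate) {
        int distance = levenshtein(query, candidate);
        return score / Math.max(distance, 1);
    }
}
```

Candidates would then be sorted by the adjusted value in descending order.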

I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.

I'm not sure if the similarity I use is useful in general, but I have
attached it to this message in case someone is interested.
Note that you need to set the similarity for the index writer and searcher
and thus have to reindex in case you want to give it a try.

