lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Morus Walter <morus.wal...@tanto.de>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Thu, 16 Sep 2004 11:03:57 GMT
Hi David,
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
> phases. First you build a "fast lookup index" as mentioned above. Then 
> to correct a word you do a query in this index based on the ngrams in 
> the misspelled word.
> 
> Let's see.
> 
> [1] Source is attached and I'd like to contribute it to the sandbox, esp 
> if someone can validate that what it's doing is reasonable and useful.
> 
great :-)
> 
> [4] Here's source in HTML:
> 
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
> 
could you put the current version of your code on that website as a java
source also? At least until it's in the lucene sandbox.


I created an ngram index on one of my indexes and think I found an issue
in the indexing code:

There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of 
		final TermEnum te = r.terms();
i'd suggest
		final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
			Term t = te.term();
			if ( !t.field().equals(field) ) {
			    break;
			}

otherwise you loop over all terms in all fields.


An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.

Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message