lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Thu, 16 Sep 2004 18:58:32 GMT
Morus Walter wrote:

> Hi David,
> 
>>Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
>>phases. First you build a "fast lookup index" as mentioned above. Then 
>>to correct a word you do a query in this index based on the ngrams in 
>>the misspelled word.
>>
>>Let's see.
>>
>>[1] Source is attached and I'd like to contribute it to the sandbox, esp 
>>if someone can validate that what it's doing is reasonable and useful.
>>
> 
> great :-)
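
The two phases described above can be sketched roughly as follows. This is not the attached NGramSpeller source: a plain in-memory map stands in for the Lucene "fast lookup index", a single gram size n is assumed, and the names NGramSketch, buildIndex and suggest are hypothetical. Candidates are ranked only by how many distinct ngrams they share with the misspelled word.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: an in-memory model of the two phases, not Lucene code.
final class NGramSketch {

    // All substrings of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Phase 1: map each ngram to the set of known words that contain it.
    static Map<String, Set<String>> buildIndex(Collection<String> words, int n) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String word : words) {
            for (String gram : ngrams(word, n)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(word);
            }
        }
        return index;
    }

    // Phase 2: look up the misspelled word's ngrams and return the known word
    // sharing the most distinct ngrams with it, or null if none shares any.
    static String suggest(Map<String, Set<String>> index, String misspelled, int n) {
        Map<String, Integer> shared = new TreeMap<>();
        for (String gram : new HashSet<>(ngrams(misspelled, n))) {
            for (String word : index.getOrDefault(gram, Collections.emptySet())) {
                shared.merge(word, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            if (e.getValue() > bestCount) { // ties keep the alphabetically first word
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("lucene");
        words.add("lucent");
        words.add("search");
        Map<String, Set<String>> index = buildIndex(words, 2);
        System.out.println(suggest(index, "lucnee", 2));
    }
}
```

A real index would presumably store the ngrams as terms of a document per word and let the query scoring do the ranking; the counting loop here only approximates that.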
> 
>>[4] Here's source in HTML:
>>
>>http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>>
> 
> could you put the current version of your code on that website as a java

Weblog entry updated:

http://searchmorph.com/weblog/index.php?id=23

To link to source code:

http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java

> source also? At least until it's in the lucene sandbox.
> 
> 
> I created an ngram index on one of my indexes and think I found an issue
> in the indexing code:
> 
> There is an option -f to specify the field on which the ngram index will
> be created. 
> However there is no code to restrict the term enumeration on this field.
> 
> So instead of 
> 		final TermEnum te = r.terms();
> i'd suggest
> 		final TermEnum te = r.terms(new Term(field, ""));
> and a check within the loop over the terms if the enumerated term
> still has fieldname field, e.g.
> 			Term t = te.term();
> 			if ( !t.field().equals(field) ) {
> 			    break;
> 			}
> 
> otherwise you loop over all terms in all fields.

Great suggestion, and thanks for that idiom - I should know such things 
by now. To clarify the "issue": it's just a performance one, it doesn't 
affect other functionality. Anyway, I put in the code, and to be scientific 
I benchmarked it two times before the change and two times after. The 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up > 200MB). Probably there are cases where this will run 
faster, and the code seems more "correct" now, so it's in.
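
For anyone who hasn't seen the idiom, here is a minimal sketch of why it works, assuming only that terms are kept sorted by field name first and term text second. It deliberately does not use the Lucene classes: a TreeSet stands in for the term dictionary, and FieldTermScan, Term and termsOfField are hypothetical names for illustration. Seeking to (field, "") skips every earlier field, and the break stops the loop as soon as the next field starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a sorted term dictionary; this is NOT the Lucene API.
final class FieldTermScan {

    // A (field, text) pair, ordered by field first and then by text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // Collects the texts of all terms in one field.
    static List<String> termsOfField(TreeSet<Term> dictionary, String field) {
        List<String> texts = new ArrayList<>();
        // Seek: start at the first term >= (field, ""), skipping earlier fields.
        for (Term t : dictionary.tailSet(new Term(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // reached the next field; nothing further can match
            }
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        TreeSet<Term> dictionary = new TreeSet<>();
        dictionary.add(new Term("author", "spencer"));
        dictionary.add(new Term("body", "index"));
        dictionary.add(new Term("body", "ngram"));
        dictionary.add(new Term("title", "speller"));
        System.out.println(termsOfField(dictionary, "body"));
    }
}
```

If the requested field has no terms at all, the seek lands on the first term of the following field and the loop exits immediately with an empty list.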



> 
> 
> An interesting application of this might be an ngram-Index enhanced version
> of the FuzzyQuery. While this introduces more complexity on the indexing
> side, it might be a large speedup for fuzzy searches.

I'm also thinking of reviewing the list to see if anyone has done a "Jaro 
Winkler" fuzzy query yet, and doing that....

> 


Thanks,
  Dave

> Morus
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 



