lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Commented: (LUCENE-537) Refactor of spell check
Date Sat, 01 Apr 2006 01:13:37 GMT

Karl Wettin commented on LUCENE-537:

This just came in on Lucene-users and might explain what I thought was a thread safety problem.
I'll take a look at it and update my refactored code some time soon.

	Subject: 	Spellchecker bug (or feature?)
	Date: 	Saturday 1 Apr 2006 00:20:08 GMT+02:00
	Reply-To:

Not sure if this is the right place to report this issue:

  The accuracy value, which can be set via setAccuracy(), is being modified in
when a word is checked. As a result, the "min" may be pushed
  very high and will not suggest anything for later requests.

  One workaround would be to call setAccuracy() each time before a word is checked. I'm not
  sure whether this is a feature (intended behavior) or a bug.
  By the way, I'm using spellchecker 1.9.1 that comes with Lucene 1.9.1.
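
To make the report above concrete, here is a minimal sketch of how a configured threshold can leak between calls. It is not the Lucene SpellChecker source: ThresholdChecker, Candidate, suggestMutating and suggestLocal are hypothetical names invented for illustration, and the assumption is that the checker raises its cut-off to the lowest score it has kept once it holds enough suggestions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical candidate word with a similarity score (not a Lucene class).
class Candidate {
    final String word;
    final float score;
    Candidate(String word, float score) { this.word = word; this.score = score; }
}

// Hypothetical stand-in for a spell checker; NOT the Lucene implementation.
class ThresholdChecker {
    private float accuracy = 0.5f; // minimum score, configured via setAccuracy()

    void setAccuracy(float accuracy) { this.accuracy = accuracy; }
    float getAccuracy() { return accuracy; }

    // Assumed buggy variant: the field itself is used as the running cut-off,
    // so the raised value is still in place on the next call.
    List<String> suggestMutating(List<Candidate> candidates, int max) {
        PriorityQueue<Candidate> best = newQueue();
        for (Candidate c : candidates) {
            if (c.score < accuracy) continue;
            best.add(c);
            if (best.size() > max) best.poll();                    // drop the weakest
            if (best.size() == max) accuracy = best.peek().score;  // leaks into later calls
        }
        return words(best);
    }

    // Same logic, but the cut-off is a local copy; the configured value is untouched.
    List<String> suggestLocal(List<Candidate> candidates, int max) {
        float min = accuracy;
        PriorityQueue<Candidate> best = newQueue();
        for (Candidate c : candidates) {
            if (c.score < min) continue;
            best.add(c);
            if (best.size() > max) best.poll();
            if (best.size() == max) min = best.peek().score;
        }
        return words(best);
    }

    // Min-heap on score, so peek()/poll() return the weakest kept candidate.
    private static PriorityQueue<Candidate> newQueue() {
        return new PriorityQueue<Candidate>(
            (Candidate a, Candidate b) -> Float.compare(a.score, b.score));
    }

    // Drain the heap into a list ordered from highest to lowest score.
    private static List<String> words(PriorityQueue<Candidate> best) {
        List<String> out = new ArrayList<String>();
        while (!best.isEmpty()) out.add(0, best.poll().word);
        return out;
    }
}
```

With the mutating variant, a later request whose candidates all score below the raised cut-off gets no suggestions, which matches the symptom described. Calling setAccuracy() before each check resets the field, which is why the workaround helps; the local-copy variant would make that reset unnecessary.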



> Refactor of spell check
> -----------------------
>          Key: LUCENE-537
>          URL:
>      Project: Lucene - Java
>         Type: Improvement
>     Reporter: Karl Wettin
>  Attachments: lucene_spellcheck.tar.gz
> I use the same ngram index for multiple categories, but only want to spell check per
> category. The old implementation did not support this, as it used docFreq as its controlling source.
> The spell check returns suggestions with score and not just the suggested word.
> TokenFrequencyVector replaces the IndexReader used for docFreq.
> LuceneTokenFrequencyVector wraps an IndexReader and works just as the old implementation.
> LuceneQueryDictionary creates an ngram dictionary based on a query and not the whole
> MultiTokenFrequencyVector treats a number of TokenFrequencyVectors as one, i.e. for
> use when spell checking in multiple contexts.
> TokenFrequencyVectorMap is a HashMap facade. Comes with a static factory to create the
> vector based on the tokens in a specific field from a search.
> I use the TokenFrequencyVectorMap to build one vector per category and instantiate a
> MultiTokenFrequencyVector for each user query. Could probably save a couple of clock ticks
> by buffering MultiVectors rather than creating new ones all the time.
> Also, it seems the ngram code might not be thread safe. This also includes the source
> in CVS. I have not succeeded in proving it when testing, only in the live environment. Each
> instance of Spellchecker only suggests once. And it takes quite some resources to create new
> instances of the spellchecker as it is designed today. Might get back on that subject.
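
The attachment itself is not shown here, so the following is only a rough sketch of the idea described above: a frequency lookup abstracted behind an interface, a HashMap-backed implementation per category, and a composite that treats several of them as one. FrequencySource, MapFrequencySource and CombinedFrequencySource are hypothetical names standing in for the roles of TokenFrequencyVector, TokenFrequencyVectorMap and MultiTokenFrequencyVector; the frequency(String) method and the choice to sum the parts are assumptions, not the actual API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over "how often does this token occur?",
// playing the role the description gives to TokenFrequencyVector.
interface FrequencySource {
    int frequency(String token);
}

// HashMap-backed source, e.g. one per category (assumed shape).
class MapFrequencySource implements FrequencySource {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Record one more occurrence of the token.
    void add(String token) {
        counts.merge(token, 1, Integer::sum);
    }

    public int frequency(String token) {
        return counts.getOrDefault(token, 0);
    }
}

// Composite that presents several sources as one; summing is an assumption.
class CombinedFrequencySource implements FrequencySource {
    private final List<FrequencySource> parts;

    CombinedFrequencySource(FrequencySource... parts) {
        this.parts = Arrays.asList(parts);
    }

    public int frequency(String token) {
        int total = 0;
        for (FrequencySource part : parts) {
            total += part.frequency(token);
        }
        return total;
    }
}
```

Under these assumptions, a spell checker that asks only a FrequencySource for counts can be pointed at one category, or at a CombinedFrequencySource built per query from the categories in scope. The composite holds references to its parts rather than copying them, so caching one per category combination would be cheap.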
