lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Spellchecker design was Re: Solr 3.1 back compat
Date Tue, 26 Oct 2010 12:39:52 GMT
On Tue, Oct 26, 2010 at 8:19 AM, Andrzej Bialecki <ab@getopt.org> wrote:
> Sometimes you want a dictionary that is cleaned up and re-weighted by an
> external process (human-based or other), even if it originally came from
> your index. So it's not either/or - you can have a file-based dictionary
> that nonetheless gives you stuff that _is_ in your index.

Right, and I would like to support this in my spellchecker
via DFA intersection at runtime (intersect the cleaned-up DFA
with the Levenshtein query DFA).
The underlying "dictionary" (the Lucene index) stays unchanged;
the cleaned-up DFA just acts as a filter on it.
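
A minimal sketch of that filtering idea, in plain Python rather than against any Lucene API. Everything named here (LevenshteinDFA, WordSetDFA, Intersection, suggest) is hypothetical and only for illustration: the Levenshtein automaton is simulated lazily with edit-distance rows as states, the cleaned-up dictionary is a simple prefix set, and a plain list stands in for the terms of the index.

```python
class LevenshteinDFA:
    """Accepts strings within max_edits of word; states are DP rows, built lazily."""
    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        return tuple(range(len(self.word) + 1))

    def step(self, state, ch):
        row = [state[0] + 1]
        for i, wc in enumerate(self.word):
            cost = 0 if wc == ch else 1
            row.append(min(row[i] + 1, state[i] + cost, state[i + 1] + 1))
        # A row whose minimum exceeds max_edits can never recover: dead state.
        return tuple(row) if min(row) <= self.max_edits else None

    def accepts(self, state):
        return state[-1] <= self.max_edits


class WordSetDFA:
    """Accepts exactly the cleaned-up words; states are prefixes (a trie)."""
    def __init__(self, words):
        self.words = set(words)
        self.prefixes = {w[:i] for w in self.words for i in range(len(w) + 1)}

    def start(self):
        return ""

    def step(self, state, ch):
        nxt = state + ch
        return nxt if nxt in self.prefixes else None

    def accepts(self, state):
        return state in self.words


class Intersection:
    """Product automaton: runs both automata in lockstep, accepts if both do."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def start(self):
        return (self.a.start(), self.b.start())

    def step(self, state, ch):
        sa = self.a.step(state[0], ch)
        sb = self.b.step(state[1], ch)
        return None if sa is None or sb is None else (sa, sb)

    def accepts(self, state):
        return self.a.accepts(state[0]) and self.b.accepts(state[1])


def run(dfa, term):
    state = dfa.start()
    for ch in term:
        state = dfa.step(state, ch)
        if state is None:  # dead state: reject early
            return False
    return dfa.accepts(state)


def suggest(index_terms, query, cleaned_words, max_edits=2):
    # The index terms are untouched; the intersection only filters them.
    dfa = Intersection(LevenshteinDFA(query, max_edits), WordSetDFA(cleaned_words))
    return [t for t in index_terms if run(dfa, t)]
```

For example, suggest(["lucene", "lucenee", "lucent", "solr"], "lucen", ["lucene", "lucent", "solr"]) keeps only the terms that are both close to the query and present in the cleaned-up set; "lucenee" is within two edits but is filtered out because the cleaned-up dictionary does not contain it.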

It would be nice if the concept were somehow more general, and
*implemented* via Dictionary for the other spellcheckers, but that
shouldn't be the only way.

>
> (Yeah, and sorted vs. unsorted ... I tried to hack it by tagging some
> classes with a SortedIterator, but it was indeed a half-hearted
> attempt... it needs to be fixed, not worked around).
>

It would be cool to add this to Lucene in the short term, so we could
mark the LuceneDictionary as being in sorted order. Then we could
explore the TermEnum optimization I spoke of, rather than calling
IndexReader.docFreq() on the spellcheck index for every term in the
dictionary to see if it already exists.
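
A rough sketch of the difference, again in plain Python and not Lucene code. The function names are hypothetical; a set membership test stands in for the per-term IndexReader.docFreq() check, and a plain iterator over a sorted list stands in for a TermEnum on the spellcheck index. The merge version assumes both sequences are sorted in the same order.

```python
def missing_terms_by_lookup(dictionary_terms, spell_index_terms):
    """One independent lookup per dictionary term (the docFreq-style approach)."""
    existing = set(spell_index_terms)
    return [t for t in dictionary_terms if t not in existing]


def missing_terms_by_merge(sorted_dictionary, sorted_spell_index):
    """Single forward pass over both sorted sequences (the TermEnum-style approach)."""
    it = iter(sorted_spell_index)
    current = next(it, None)
    missing = []
    for term in sorted_dictionary:
        # Advance the spellcheck-index cursor until it reaches or passes term.
        while current is not None and current < term:
            current = next(it, None)
        if current != term:
            missing.append(term)  # term is not yet in the spellcheck index
    return missing
```

Both functions return the dictionary terms that still need to be added to the spellcheck index. The merge version never seeks backwards, which is the property that a sorted-order marker on the dictionary would make safe to rely on.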

Yeah, I know that if they are sorted they will tend to be in the same TII
block, and the term dictionary cache will generally work, but I think
it would still end up faster... and there would be no need to completely
hose the term dictionary cache just to rebuild a spellcheck index.
