lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Spellchecker design was Re: Solr 3.1 back compat
Date Tue, 26 Oct 2010 12:27:38 GMT
On Tue, Oct 26, 2010 at 8:11 AM, Grant Ingersoll <gsingers@apache.org> wrote:

>>
>> And, let's say i have a hunspell dictionary for my language... how do i
>> plug this in? I don't want it to implement Dictionary, because I'm not
>> stupid enough to return something that's not in my index (see below),
>> maybe i only want to use it as a 'filter' to prevent suggestions that
>> are spelled incorrectly...
>
> Implement an Index backed Dictionary that filters by Hunspell and feeds
> into the Spellchecker. I've seen that done on more than one occasion.
>

again though, i don't think it should be at the Dictionary level. For
example, my spellchecker (DirectSpellChecker) uses no dictionary at
all, and wanting to filter its results with Hunspell is perfectly
reasonable. And maybe i want to filter the results from AutoSuggest
with Hunspell too?!

Certainly i can add Hunspell support to DirectSpellChecker myself, but
you see how this is sorta silly: if someone wants it with the
IndexBasedSpellChecker then it has to be implemented there too.
Instead, I think we could add some idea like SpellCheckFilter (filters
spellcheck results) where people could plug this stuff in themselves,
and it would work with all these checkers/suggesters/whatever.
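
To make the SpellCheckFilter idea concrete, here is a minimal sketch of what such a hook might look like. Everything in it is hypothetical: SpellCheckFilter, WordListFilter, and applyFilters are illustrative names, not existing Lucene or Solr APIs, and the word-list filter only stands in for a real Hunspell lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical interface, not an existing Lucene/Solr API: decides whether
// a suggestion produced by any checker/suggester should be kept.
interface SpellCheckFilter {
    boolean accept(String suggestion);
}

// Stand-in for a Hunspell-backed filter (assumption): a real implementation
// would ask a Hunspell dictionary whether the word is spelled correctly.
final class WordListFilter implements SpellCheckFilter {
    private final Set<String> knownWords;

    WordListFilter(Set<String> knownWords) {
        this.knownWords = knownWords;
    }

    @Override
    public boolean accept(String suggestion) {
        return knownWords.contains(suggestion);
    }
}

final class SpellCheckFilterSketch {
    // Applies every filter to the raw suggestions, independent of which
    // checker produced them; a suggestion survives only if all filters accept it.
    static List<String> applyFilters(List<String> suggestions,
                                     List<SpellCheckFilter> filters) {
        List<String> kept = new ArrayList<>();
        for (String suggestion : suggestions) {
            boolean accepted = true;
            for (SpellCheckFilter filter : filters) {
                if (!filter.accept(suggestion)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(suggestion);
            }
        }
        return kept;
    }
}
```

Because the filter only sees the suggestions, the same instance could in principle sit behind DirectSpellChecker, IndexBasedSpellChecker, or a suggester without any of them knowing about Hunspell.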

I felt other things were at the dictionary level and shouldn't be, for
example "HighFrequencyDictionary" (which is only in Solr, and should
probably be factored into Lucene).
In my case i wanted to provide this to Lucene users, so i just do it
at runtime via thresholdFrequency, since the docfreq is free from the
TermsEnum anyway.
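
As a rough illustration of doing the frequency cut at runtime rather than in a separate Dictionary, the sketch below drops candidates whose document frequency falls under a threshold. It is self-contained and does not call Lucene: the map of candidate terms to docfreqs stands in for values read from the TermsEnum, and treating thresholdFrequency as a fraction of maxDoc is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ThresholdFrequencySketch {
    // candidateDocFreqs: candidate term -> document frequency (stand-in for
    // the docfreq the terms enumeration already provides).
    // thresholdFrequency: assumed here to be a fraction of maxDoc in [0, 1].
    static List<String> filterByFrequency(Map<String, Integer> candidateDocFreqs,
                                          int maxDoc,
                                          float thresholdFrequency) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive");
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : candidateDocFreqs.entrySet()) {
            float fraction = (float) entry.getValue() / maxDoc;
            // Keep only terms that appear in enough documents.
            if (fraction >= thresholdFrequency) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

The point of the design is that no second index or filtered dictionary has to be built: the check is a comparison on a number that is already at hand while enumerating candidates.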

> Sounds great.  I also think the notion of onlyMorePopular is screwed up
> too and needs to be revisited.

yes, i don't really understand this... and some of the behavior around it!


>> PlainTextDictionary? useless... why the hell would you return
>> something that isn't in your index?!
>
> It can be quite useful to have an external source for tokens and I've
> seen it in action on several occasions.  Just because they are fed in
> from an external source doesn't mean they aren't in the index.  For
> instance, dump your terms from the index, do some downstream processing
> according to user logs or whatever (or Hunspell if you want) and then
> load them back into the Spell checker.

Right, but see above: in my case i don't "load anything" since i have
no data structure... So i think the API can/should be flexible enough
to do these kinda things without the notion of taking data from one
index and shoving it into another.

And, this special use case shouldn't slow down the common use case
where it's a LuceneDictionary.

In general, i know i sound like a big whiner, but i actually think we
have a huge opportunity here. It looked to me (at a glance) that now
that Lucene/Solr are merged we can fix this stuff across both Lucene
and Solr more easily.


