lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Spellchecker design was Re: Solr 3.1 back compat
Date Tue, 26 Oct 2010 12:11:16 GMT
Some thoughts inline...

On Oct 26, 2010, at 7:24 AM, Robert Muir wrote:

> On Tue, Oct 26, 2010 at 6:59 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>> I felt the entire framework in Solr is built around the idea of  "take
>>> stuff from one field in an index, shove it into another field of an
>>> index", but my spellchecker doesn't need any of this.
>>> 
>> 
>> Not really, but...
> 
> I think really? I can only "see" part of the query (i think one field
> at once) via Tokens...
> 
>> 
>> I guess no one has upgraded it yet.  This is 1.3 stuff.  I don't have any problem
>> with upgrading it.
> 
> I'm not saying we have to use the Attributes API, it was just an idea.

I think that is a reasonable idea.  I was just pointing out that this stuff predates the move
away from Token.  At the time, I would argue Token made sense.  FWIW, I still dread the day
when I have to start explaining BytesRefs to new Lucene programmers when I really mean a Token,
but heh, I'll get over it.  For all of its inflexibility, Token was quite nice in that the
word conveys its meaning clearly to most programmers.

> but we really have to move the stuff from this component from
> "solr-makes-the-decisions" into "user-makes-the-decisions". This is
> the number 1 problem with the current spellchecker (ok, maybe #2, #1
> being the index-based one doesnt close its indexreader).

I would suggest that the current architecture was aimed at making it easy for users to plug
in their own capabilities and it allows you to do so at pretty much every step.  Did it hit
that mark 100%?  Of course not.  But, I do know there are plenty of people who have implemented
their own pieces to it using their own logic.

> 
>> 
>>> 
>>> Even the input format that comes into the spellchecker in
>>> getSuggestions(SpellingOptions options) is just Tokens, but this is
>>> pretty limiting. For instance, I think it makes way more sense for a
>>> spellchecker API to take Query and return corrected Querys, and in my
>>> situation i could give better results, but the Solr APIs stop me.
>> 
>> And you are then going to do Query.toString() to display that back to the user?
> 
> why do you care?

I don't.  The SpellCheckComponent was meant for spellchecking a string and rendering it back
to the user in a meaningful way, i.e. something that they would recognize.  To me, at the
time, that meant operating on the string that the user passed in, not a Query object that
has potentially been rewritten and is not mappable back to a user in a meaningful way.  Given
my requirements at the time, I thought it was a reasonable decision.  In light of your requirements,
we can likely satisfy both.  In fact, with the proposal I'm putting forth about refactoring
this stuff, I think it would likely make it easier for you to implement your own Component
that does what you need it to do, while reusing as much as you want.

> maybe that works fine for me, i don't use the dismax
> parser that generates horrific queries so everything is fine... and
> thats my point... something more like a pipeline/attributes-based
> thing woudl work much better here, its up to the user.
> 
> certainly it makes sense to keep the original query around... why hide
> it?

Let's just add it to the SpellingOptions. 
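
A minimal sketch of what "add it to the options" could look like. SpellingRequest and its
fields are hypothetical names made up for illustration and are not the existing Solr
SpellingOptions class; the point is only that the raw user query travels alongside the
tokens derived from it, so an implementation can use either.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical options holder: NOT Solr's SpellingOptions, just an illustration. */
final class SpellingRequest {
    /** The query string exactly as the user typed it. */
    final String originalQuery;
    /** The tokens extracted from that query (what a checker sees today). */
    final List<String> tokens;

    SpellingRequest(String originalQuery, List<String> tokens) {
        this.originalQuery = originalQuery;
        // Defensive copy so later changes by the caller do not leak in.
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
    }
}
```

A checker that only cares about tokens ignores originalQuery; one that wants to re-parse
or rewrite the whole query has what it needs.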

> and the hairy mess of code that converts it into tokens, this
> needs to be something like a pipeline, because some people don't want
> it, or want to do it their own way.

The QueryConverter was designed to be pluggable right from the get go.  I don't see this as
not fitting in that model, other than the Token issue, which we can change.

> 
> And, lets say i have a hunspell dictionary for my language... how do i
> plug this in? I don't want it to implement Dictionary, because I'm not
> stupid enough to return something thats not in my index (see below),
> maybe i only want to use it as a 'filter' to prevent suggestions that
> are spelled incorrectly...

Implement an index-backed Dictionary that filters by Hunspell and feeds into the Spellchecker.
I've seen that done on more than one occasion.
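
Roughly, the filtering idea can be sketched as below. This is plain Java with no Lucene or
Hunspell calls: FilteredWordSource is a hypothetical name, the Iterable<String> stands in
for terms read from the index, and the Predicate stands in for whatever "is this spelled
correctly" check Hunspell would provide.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical word source: yields only the index terms the filter accepts. */
final class FilteredWordSource implements Iterable<String> {
    private final Iterable<String> indexTerms;   // stand-in for terms from the index
    private final Predicate<String> accepts;     // stand-in for a Hunspell lookup

    FilteredWordSource(Iterable<String> indexTerms, Predicate<String> accepts) {
        this.indexTerms = indexTerms;
        this.accepts = accepts;
    }

    @Override
    public Iterator<String> iterator() {
        final Iterator<String> in = indexTerms.iterator();
        return new Iterator<String>() {
            private String next = advance();

            // Skip forward to the next accepted term, or null when exhausted.
            private String advance() {
                while (in.hasNext()) {
                    String candidate = in.next();
                    if (accepts.test(candidate)) {
                        return candidate;
                    }
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public String next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String result = next;
                next = advance();
                return result;
            }
        };
    }
}
```

Everything that comes out is in the index by construction; the external dictionary only
removes candidates, it never adds any.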

> 
> 
> we really need to seriously clean house on the spellchecker stuff
> (lucene too)

+1.  

> and to answer your question, if we can fix these APIs in
> any way, I'm all for just doing a backwards break, because I think the
> existing APIs are completely broken.
> 
> For example, the whole index-based spellchecker in lucene has bad
> performance because its APIs were made overly generic:
> I think its important that it doesn't call docFreq() on every single
> term in the Dictionary when rebuilding, it should walk a TermEnum in
> parallel.

Sounds great.  I also think the notion of onlyMorePopular is screwed up and needs to be
revisited.
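
For reference, the "walk in parallel" idea amounts to a merge over two sorted streams
instead of one lookup per word. A sketch under stated assumptions: both inputs are sorted
in the same order (String.compareTo here), and SortedTermWalk and TermFreq are hypothetical
stand-ins, not Lucene classes.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class SortedTermWalk {
    /** Hypothetical (term, docFreq) pair: one position of a term enumeration. */
    static final class TermFreq {
        final String term;
        final int docFreq;

        TermFreq(String term, int docFreq) {
            this.term = term;
            this.docFreq = docFreq;
        }
    }

    /**
     * Returns the document frequency of each dictionary word (0 if absent),
     * advancing through the index terms once rather than seeking per word.
     * Assumes both iterators are sorted by String.compareTo.
     */
    static Map<String, Integer> docFreqs(Iterator<String> sortedWords,
                                         Iterator<TermFreq> sortedIndexTerms) {
        Map<String, Integer> result = new LinkedHashMap<>();
        TermFreq current = sortedIndexTerms.hasNext() ? sortedIndexTerms.next() : null;
        while (sortedWords.hasNext()) {
            String word = sortedWords.next();
            // Move the index side forward until it reaches or passes the word.
            while (current != null && current.term.compareTo(word) < 0) {
                current = sortedIndexTerms.hasNext() ? sortedIndexTerms.next() : null;
            }
            boolean found = current != null && current.term.equals(word);
            result.put(word, found ? current.docFreq : 0);
        }
        return result;
    }
}
```

This only works if the dictionary promises sorted order, which is exactly the guarantee
the generic interface does not give today.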

> But, it can't do this because it can't assume the Dictionary is in
> sorted order!?
> I guess thats because the "Dictionary" idea was made overly generic,
> abstracted into useless PlainTextDictionary and LuceneDictionary.
> 
> PlainTextDictionary? useless... why the hell would you return
> something that isn't in your index?!

It can be quite useful to have an external source for tokens and I've seen it in action on
several occasions.  Just because they are fed in from an external source doesn't mean they
aren't in the index.  For instance, dump your terms from the index, do some downstream processing
according to user logs or whatever (or Hunspell if you want) and then load them back into
the Spell checker.


