lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Lucene and Phrase Correction
Date Mon, 06 Apr 2009 19:46:15 GMT
On 6 Apr 2009, at 14:59, Glyn Darkin wrote:

Hi Glyn,

> to be able to spell check phrases
>
> E.g.
>
> "Harry Poter" is converted to "Harry Potter"
>
> We have a fixed dataset so can build indexes/ dictionaries from our
> own data.

The most obvious solution is to index your contrib/spell checker  
dictionary with shingles. This will, however, probably only help with  
exact phrases. Perhaps that is enough for you.
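
Here is a rough sketch of what that could look like, assuming the  
Lucene 2.x contrib classes (the "title" field name is just an example):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

import java.io.IOException;
import java.io.Reader;

public class ShingleSpellIndexer {

  /** Analyzer that emits word bigram shingles in addition to unigrams. */
  static class ShingleAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();
    public TokenStream tokenStream(String field, Reader reader) {
      return new ShingleFilter(delegate.tokenStream(field, reader), 2);
    }
  }

  /** Feed the shingled "title" field to the contrib spell checker. */
  static void buildSpellIndex(IndexReader reader, Directory spellDir)
      throws IOException {
    SpellChecker spellChecker = new SpellChecker(spellDir);
    spellChecker.indexDictionary(new LuceneDictionary(reader, "title"));
  }

  /** Suggest corrections for a whole phrase treated as one token. */
  static String[] suggest(Directory spellDir, String phrase)
      throws IOException {
    // e.g. "harry poter" should suggest "harry potter"
    return new SpellChecker(spellDir).suggestSimilar(phrase, 5);
  }
}

The main index would of course have to be built with the shingle  
analyzer on that field, so that the dictionary actually contains the  
phrase shingles.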

If your example is a real one that you came up with by analyzing  
query logs, then you might want to consider creating an index  
"stemmed" to handle various problems associated with reading and  
writing disorders. Dyslexic people often drop vowels, people who  
suffer from dysgraphia have problems with q/p/d/b, others have  
problems with recurring characters, etc. A combination of these  
degradations could end up in a secondary "fuzzy" index that contains  
weighted shingles like these for the document that points at "harry  
potter" (a sketch of how to produce them follows the list):

"hary poter"^0.9
"harry #otter"^0.8
"hary #oter"^0.7
"hrry pttr"^0.7
"hry ptr"^0.5

In order to get good precision/recall, your query against such an  
index would have to produce a boolean query containing all of the  
"stems" above, even if the input was spelled correctly.
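
On the query side, a minimal sketch (again assuming the field is  
called "fuzzy" and reusing the hypothetical degrade() helper above)  
could look like this:

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class FuzzyPhraseQueryBuilder {

  static BooleanQuery buildQuery(String userInput) {
    BooleanQuery query = new BooleanQuery();
    // the input itself, at full weight
    query.add(new TermQuery(new Term("fuzzy", userInput)),
              BooleanClause.Occur.SHOULD);
    // each degraded form, down-weighted by its heuristic boost
    for (Map.Entry<String, Float> e : FuzzyForms.degrade(userInput).entrySet()) {
      TermQuery tq = new TermQuery(new Term("fuzzy", e.getKey()));
      tq.setBoost(e.getValue());
      query.add(tq, BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}

That way a correctly spelled input still matches its own degraded  
forms in the fuzzy index.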


One alternative to the contrib/spell checker is Spelt (http://groups.google.com/group/spelt/), 
which I believe is supposed to handle phrases.


Note the difference between spell checking and suggestion schemes:  
something can be wrong even though the spelling is correct. Consider  
the game "Heroes of Might and Magic"; people might have forgotten  
what it was called and search for "Heroes of light and magic"  
instead. Hopefully your query would still yield a fairly good result  
for the correct document if the latter was entered, but if you  
require all terms, or something similar, then it might return no hits.


More advanced strategies for contextual spell checking of phrases  
usually involve statistical models such as neural networks, hidden  
Markov models, etc. LingPipe contains such an implementation.


You can also take a look at reinforcement learning: learning from the  
mistakes and corrections made by your users. It requires a lot of  
data (user query logs) in order to work, but it can yield very cool  
results, such as handling acronyms.

LUCENE-626 is a multi-layered spell checker with reinforcement  
learning at the top, backed by an a priori corpus (which can be  
compiled from old user queries) used to find context. It also uses a  
refactored version of the contrib/spell checker as a second-level  
suggester when there is nothing to pick up from previous user  
behaviour. I never deployed this in a real system, but it does seem  
to work great when trained with a few hundred thousand query sessions.


Finally, I recommend that you take some time to analyze user query  
sessions to find the most common problems your users have, and try  
to find the solution that best fits those problems. Too often,  
features are implemented because they are listed in a specification  
and not because the users need them.


I hope this helps.

      karl


