lucene-java-user mailing list archives

From Glyn Darkin <g...@darkinsystems.com>
Subject Re: Lucene and Phrase Correction
Date Tue, 07 Apr 2009 13:31:57 GMT
Karl,

Thank you for your in-depth reply. This has given me good grounds to go on.

Regards

Glyn


2009/4/6 Karl Wettin <karl.wettin@gmail.com>:
> On 6 Apr 2009, at 14:59, Glyn Darkin wrote:
>
> Hi Glyn,
>
>> to be able to spell check phrases
>>
>> E.g
>>
>> "Harry Poter" is converted to "Harry Potter"
>>
>> We have a fixed dataset so can build indexes/ dictionaries from our
>> own data.
>
> The most obvious solution is to index your contrib/spell checker with shingles.
> This will, however, probably only help with exact phrases. Perhaps that is
> enough for you.
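
As a rough illustration of what "shingles" means here: a shingle is a run of
consecutive words from the indexed text, so a phrase like "harry potter"
becomes a single dictionary entry. The sketch below is plain Java and does not
use Lucene at all; the class name ShingleSketch and the shingles() helper are
made up for this example. Lucene has a ShingleFilter that does this as part of
an analysis chain, and that is probably what you would use in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: builds word shingles from a phrase.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive words, lowercased.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out; // nothing to shingle
        }
        String[] words = trimmed.split("\\s+");
        for (int size = minSize; size <= maxSize; size++) {
            for (int start = 0; start + size <= words.length; start++) {
                String[] window = Arrays.copyOfRange(words, start, start + size);
                out.add(String.join(" ", window));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each printed shingle would become one entry in the spelling dictionary.
        System.out.println(shingles("Harry Potter and the Goblet of Fire", 2, 3));
    }
}
```

Feeding such shingles into the spell checker dictionary instead of single
words is what lets it propose "harry potter" as one unit.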
>
> If your example is a real one that you came up with by analyzing query logs,
> then you might want to consider creating an index "stemmed" to handle various
> problems associated with reading and writing disorders. Dyslexic people
> often leave out vowels, those who suffer from dysgraphia have problems with
> q/p/d/b, others have problems with recurring characters, etc. A combination
> of these problems could end up in a secondary "fuzzy" index that contains
> weighted shingles like this for the document that points at "harry potter":
>
> "hary poter"^0.9
> "harry #otter"^0.8
> "hary #oter"^0.7
> "hrry pttr"^0.7
> "hry ptr"^0.5
>
> In order to get good precision/recall, your query to such an index would
> have to produce a boolean query containing all of the "stems" above if the
> input was spelled correctly.
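
One way to read the list of "stems" above is as three small transformations
(collapse repeated characters, replace q/p/d/b with "#", drop vowels) applied
alone or in combination. The sketch below reproduces the five example entries
for "harry potter" in plain Java. The weights are simply copied from the
example, not derived from anything, and the class and method names are invented
for this illustration. The in-memory map stands in for the secondary index, and
score() stands in for the boolean query: it applies the same transformations to
the user input and sums the weights of the entries that match.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the "fuzzy stem" idea; nothing here is Lucene API.
class FuzzyStemSketch {

    // "harry potter" -> "hary poter"
    static String collapseRepeats(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != c) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // "harry potter" -> "harry #otter"
    static String maskConfusables(String s) {
        return s.replaceAll("[qpdb]", "#");
    }

    // "harry potter" -> "hrry pttr" (y is not treated as a vowel here)
    static String dropVowels(String s) {
        return s.replaceAll("[aeiou]", "");
    }

    // Variant -> weight, in descending weight order. putIfAbsent keeps the
    // higher weight when two transformations produce the same string.
    static Map<String, Double> variants(String phrase) {
        String p = phrase.trim().toLowerCase(Locale.ROOT);
        Map<String, Double> out = new LinkedHashMap<>();
        out.putIfAbsent(collapseRepeats(p), 0.9);
        out.putIfAbsent(maskConfusables(p), 0.8);
        out.putIfAbsent(collapseRepeats(maskConfusables(p)), 0.7);
        out.putIfAbsent(dropVowels(p), 0.7);
        out.putIfAbsent(collapseRepeats(dropVowels(p)), 0.5);
        return out;
    }

    // Stand-in for the boolean query: every variant of the input is an
    // optional clause, and matching clauses add the indexed weight.
    static double score(String query, Map<String, Double> indexed) {
        double total = 0.0;
        for (String variant : variants(query).keySet()) {
            total += indexed.getOrDefault(variant, 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> indexed = variants("harry potter");
        System.out.println(indexed.keySet());
        System.out.println(score("Harry Poter", indexed));
        System.out.println(score("Harry Potter", indexed));
    }
}
```

A misspelled input such as "Harry Poter" still matches some of the indexed
variants, just fewer than the correctly spelled phrase does, which is why the
query has to carry all of the variants rather than only one.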
>
>
> One alternative to the contrib/spell checker is Spelt:
> http://groups.google.com/group/spelt/ and I believe it is supposed to handle
> phrases.
>
>
> Note the difference between spell checking and suggestion schemes. Something
> can be wrong even though the spelling is correct. Consider the game "Heroes
> of Might and Magic": people might have forgotten what it was called and
> search for "Heroes of light and magic" instead. Hopefully your query would
> still yield a fairly good result for the correct document if the latter was
> entered, but if you require all terms or something similar then it might
> return no hits.
>
>
> More advanced strategies for contextual spell checking of phrases usually
> involve statistical models such as neural networks, hidden Markov models,
> etc. LingPipe contains such an implementation.
>
>
> You can also take a look at reinforcement learning, learning from the
> mistakes and corrections made by your users. It requires a lot of data
> (user query logs) in order to work but will yield very cool results, such as
> learning acronyms.
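
To make the idea of learning from user corrections a little more concrete,
here is a deliberately simple stand-in. It is not reinforcement learning
proper and it is not how LUCENE-626 works; it only counts which query users
typed next after a given query in the same session and suggests the most
frequent follow-up once it has been seen often enough. The class name, the
observe()/suggest() methods and the minSupport threshold are all assumptions
made for this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: counts query -> follow-up query pairs from sessions.
class SessionCorrectionSketch {

    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    private static String norm(String q) {
        return q.trim().toLowerCase(Locale.ROOT);
    }

    // Record that a user issued followUp right after query in one session.
    void observe(String query, String followUp) {
        counts.computeIfAbsent(norm(query), k -> new HashMap<>())
              .merge(norm(followUp), 1, Integer::sum);
    }

    // Suggest the most frequent follow-up seen at least minSupport times.
    Optional<String> suggest(String query, int minSupport) {
        Map<String, Integer> followUps = counts.get(norm(query));
        if (followUps == null) {
            return Optional.empty();
        }
        return followUps.entrySet().stream()
                .filter(e -> e.getValue() >= minSupport)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        SessionCorrectionSketch learner = new SessionCorrectionSketch();
        learner.observe("hp", "harry potter");
        learner.observe("hp", "harry potter");
        learner.observe("hp", "hewlett packard");
        System.out.println(learner.suggest("HP", 2));
    }
}
```

Because the mapping comes from behaviour rather than from string similarity,
it can pick up pairs like an acronym and its expansion, which no edit-distance
based checker would find. The threshold is there because a handful of sessions
is far too little evidence to act on.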
>
> LUCENE-626 is a multi-layered spell checker with reinforcement learning at
> the top, backed by an a priori corpus (that can be compiled from old user
> queries) used to find context. It also uses a refactored version of the
> contrib/spell checker as a second-level suggester when there is nothing to
> pick up from previous user behaviour. I never deployed this in a real
> system; it does, however, seem to work great when trained with a few hundred
> thousand query sessions.
>
>
> Finally, I recommend that you take some time to analyze user query sessions
> to find the most common problems your users have, and try to find a
> solution that best fits those problems. Too often features are implemented
> because they are listed in a specification and not because the users need
> them.
>
>
> I hope this helps.
>
>     karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Glyn Darkin

Darkin Systems Ltd
Mob: 07961815649
Fax: 08717145065
Web: www.darkinsystems.com

Company No: 6173001
VAT No: 906350835
