lucene-dev mailing list archives

From: c...@alias-i.com
Subject: Re: Contextual suggestions (for spelling)
Date: Tue, 11 Apr 2006 00:20:12 GMT
karl wettin wrote:
> I've been working a bit with the spell checker. It does a pretty good
> job when it comes to finding a simple typo.
> I was thinking it would be nice if I could turn "heros light and  magic" 
> to "did you mean: heroes of might and magic?".
> 
> My strategy is to combine Markov, A* and Levenshtein.
> 
> Algorithm:
> 
> First I have to train the Markov chain with the token offsets from  Lucene.

You really need to train on the token sequence (or char sequence).

> At query time I choose the cheapest A* path through the Markov chain
> with as short a Levenshtein distance as possible.

You really need to balance Levenshtein (edit) distance
and Markov chain score (query likelihood) to get good
performance here.
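
To make that concrete: one standard way to balance the two
is a weighted (log-linear) combination of the channel score
and the source score.  A minimal sketch in Java -- the names
and the weight are illustrative, not any particular API:

/** Sketch: combine channel (edit) and source (language model) log scores. */
final class NoisyChannelScore {
    // lambda trades edit faithfulness against query likelihood;
    // lambda = 1.0 recovers the plain noisy channel combination.
    static double score(double channelLogProb,  // log P(observed | candidate)
                        double sourceLogProb,   // log P(candidate)
                        double lambda) {
        return channelLogProb + lambda * sourceLogProb;
    }
}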

> Any comments on this? Questions?

You've sketched a slightly degenerate version of Shannon's
noisy channel model.  Here are three refs to help
you sort out the usual approach:

- Kukich, Karen. 1992. Techniques for automatically correcting words in text. ACM Computing
Surveys 24(4):377-437.
- Jurafsky, Dan and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
Chapter 5.
- Shannon, Claude. 1949. Communication in the presence of noise. Proceedings of the IRE. 37(1):10-21.

There've been some nice recent papers from Microsoft
Research (by Bob Moore, Eric Brill, Silviu Cucerzan,
and Kristina Toutanova).

We've implemented the standard noisy channel model
in LingPipe (available online with source),
with a tutorial to get you started:

http://www.alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html

We've run this on data sets as large as 50GB.  You need
a fair amount of data to make this approach work --
at least a few megabytes.  (In other words, the trivial
amount of data in the demo won't give you a good basis
for evaluation.)  If the data gets too big,
you'll need to prune the models.  There are also
a number of parameters that need careful tuning
to make it work well.  Having said that, this is
the approach with the best accuracy of which we're aware.
But it's fairly expensive at run time compared to
simpler methods, and can require a lot of memory to
store the Markov models.

The noisy channel approach is built on two sub-models:

source:   what you expect in the way of queries

              Prob(query)

channel:  what you expect to observe from the user
          in the way of typos/brainos given a query

              Prob(observation|query)

We implement a Markov chain at the character level
and generalize Levenshtein in the usual way to weighted
edit distances (with transposition).  If these are normalized
to a probability, you get a proper noisy channel model.
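
As a concrete illustration of the source side, here's a
minimal character-level Markov chain in Java -- a bigram
model with add-one smoothing.  It's a sketch of the idea
only, not LingPipe's implementation; a real model would use
higher-order n-grams and better smoothing:

import java.util.HashMap;
import java.util.Map;

/** Minimal character-level Markov chain: a bigram model, add-one smoothed. */
final class CharBigramModel {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> contextCounts = new HashMap<>();
    private static final int ALPHABET_SIZE = 256;  // crude alphabet-size estimate

    /** Accumulate counts from one training string (a query, a document, ...). */
    void train(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            contextCounts.merge(text.charAt(i), 1, Integer::sum);
            bigramCounts.merge(text.substring(i, i + 2), 1, Integer::sum);
        }
    }

    /** log P(text) under the chain: sum of log P(next char | previous char). */
    double logProb(String text) {
        double logProb = 0.0;
        for (int i = 0; i + 1 < text.length(); i++) {
            int bigram = bigramCounts.getOrDefault(text.substring(i, i + 2), 0);
            int context = contextCounts.getOrDefault(text.charAt(i), 0);
            // add-one smoothing so unseen bigrams get non-zero probability
            logProb += Math.log((bigram + 1.0) / (context + ALPHABET_SIZE));
        }
        return logProb;
    }
}

Training that on the corpus character sequence gives you an
estimate of Prob(query).  (A sketch of the weighted edit
distance side follows further below.)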

The best guess at the intended query, given the
observed query, is:

    ARGMAX_query Prob(query|obs)                  [problem]
  = ARGMAX_query Prob(query,obs) / Prob(obs)      [def. of cond. prob]
  = ARGMAX_query Prob(query,obs)                  [Prob(obs) constant]
  = ARGMAX_query Prob(obs|query) * Prob(query)    [chain rule]

In words, you find the query that was most likely
to produce the observation.
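
In code, that argmax is just: score each candidate by the
sum of its channel and source log probabilities and keep the
best.  A sketch, with candidate generation and the two
models assumed to live elsewhere (the interfaces here are
made up for the example):

import java.util.Collection;

/** Sketch of the noisy channel argmax over a set of candidate queries. */
final class NoisyChannelDecoder {
    interface Channel { double logProb(String observed, String candidate); }  // log P(obs|query)
    interface Source  { double logProb(String candidate); }                   // log P(query)

    static String bestCorrection(String observed, Collection<String> candidates,
                                 Channel channel, Source source) {
        String best = observed;  // fall back to the observed query itself
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            // log P(obs|query) + log P(query): proportional to P(query|obs),
            // since P(obs) is constant across candidates
            double score = channel.logProb(observed, candidate)
                         + source.logProb(candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}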

You can find this using A*, but the problem is
coming up with a useful completion estimate (the
A* heuristic).  We use
a Viterbi-style search with a pretty tight beam.
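
The beam itself is simple: after each extension step, keep
only the k best-scoring partial hypotheses.  A sketch of
just the pruning step -- the Hypothesis record is made up
for the example; the real search extends hypotheses
character by character and merges ones that share the same
model state, which is what makes it Viterbi-style:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of beam pruning: keep only the top-k partial hypotheses. */
final class Beam {
    record Hypothesis(String correctedSoFar, double logScore) { }

    static List<Hypothesis> prune(List<Hypothesis> hypotheses, int beamWidth) {
        List<Hypothesis> sorted = new ArrayList<>(hypotheses);
        // best (highest log score) first
        sorted.sort(Comparator.comparingDouble(Hypothesis::logScore).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(beamWidth, sorted.size())));
    }
}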

We also allow you to restrict edits to tokens
that appear in the original data.  And to provide
a set of do-not-edit tokens.  And there are
a few spelling-specific weights that have been shown
to help (for us and others), such as penalizing edits of
the first and second characters in a token.  You can
think of these as tweaks to the channel.
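
To make those channel tweaks concrete, here's a sketch of a
weighted edit distance that charges extra for edits at the
first two positions of a token and refuses to edit tokens in
a do-not-edit set.  The weights are illustrative,
transposition is left out for brevity, and none of this is
LingPipe's actual code:

import java.util.Set;

/** Sketch: weighted edit distance with extra cost for early-position edits. */
final class PositionWeightedEditDistance {
    private static final double EDIT = 1.0;           // base substitute/insert/delete cost
    private static final double EARLY_PENALTY = 0.5;  // extra cost at token positions 0 and 1

    private final Set<String> doNotEdit;

    PositionWeightedEditDistance(Set<String> doNotEdit) { this.doNotEdit = doNotEdit; }

    /** Cost of turning the observed token into the candidate token (lower is better). */
    double cost(String observed, String candidate) {
        if (doNotEdit.contains(observed)) {
            // do-not-edit tokens may only "correct" to themselves
            return observed.equals(candidate) ? 0.0 : Double.POSITIVE_INFINITY;
        }
        int n = observed.length(), m = candidate.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + editCost(i - 1);
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + editCost(0);
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = observed.charAt(i - 1) == candidate.charAt(j - 1);
                double sub = d[i - 1][j - 1] + (same ? 0.0 : editCost(i - 1));
                double del = d[i - 1][j] + editCost(i - 1);  // drop an observed char
                double ins = d[i][j - 1] + editCost(i - 1);  // insert a candidate char
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[n][m];
    }

    // edits near the start of a token are penalized more heavily
    private static double editCost(int position) {
        return EDIT + (position < 2 ? EARLY_PENALTY : 0.0);
    }
}

Exponentiating the negated cost and normalizing is one way
to turn a cost like this into the Prob(observation|query)
channel above.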

Now whether you could correct your query above is
another question.  "light" is a common word in its
own right, so the correction to "might" has to come
from context rather than from the channel alone.

You'll also have problems if your corpus contains
a lot of misspellings.  This problem dogs the big
search engines' spelling correctors because the
web is very noisy.  To make matters worse, they're
also usually multi-lingual.

> I choose A* over breadth-first to allow zero-cost for stop words and
> future contextual boosting.

I don't see how this relates to zero-cost for stop words.
Our API includes a set of "do not edit" tokens.  It's easy
to do in a breadth-first (synchronous) search.

Further contextual boosting is a good motivation for A*
if you can make it work online.  Otherwise, it'll just
devolve to an n-best rescoring, which you can do with
the output of a synchronous search, such as ours (and which is pretty
standard in these kinds of decoding efforts).
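
The rescoring step itself is trivial: take the n-best list
from the synchronous search, add whatever contextual score
you have in mind, and re-sort.  A sketch, with the
contextual boost supplied by the caller:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch: rescore an n-best list with a contextual boost and re-sort. */
final class NBestRescorer {
    record Scored(String correction, double logScore) { }

    static List<Scored> rescore(List<Scored> nBest, ToDoubleFunction<String> contextScore) {
        List<Scored> rescored = new ArrayList<>();
        for (Scored s : nBest) {
            // decoder score plus contextual boost
            rescored.add(new Scored(s.correction(),
                                    s.logScore() + contextScore.applyAsDouble(s.correction())));
        }
        rescored.sort(Comparator.comparingDouble(Scored::logScore).reversed());
        return rescored;
    }
}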

- Bob Carpenter  http://www.colloquial.com/carp
   Alias-i        http://www.alias-i.com/
