lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Created: (LUCENE-626) Adaptive, user query session analyzing spell checker.
Date Thu, 13 Jul 2006 09:20:29 GMT
Adaptive, user query session analyzing spell checker.

         Key: LUCENE-626
     Project: Lucene - Java
        Type: New Feature

  Components: Search  
    Reporter: Karl Wettin
    Priority: Minor
 Attachments: spellcheck_0.0.1.tar.gz

From javadocs:

This is an adaptive, user query session analyzing spell checker. In plain words, a word and
phrase dictionary that learns from how users act while searching.

Be aware, this is a beta version. It is not finished, but yields great results if you have
enough user activity, RAM and a fairly narrow document corpus. The RAM problem can be fixed
if you implement your own subclass of SpellChecker, as the abstract methods of this class are
the CRUD methods. This will most probably change to a strategy class in a future version.


1. Gram up results to detect composite words that should not be composite words, and vice versa.

2. Train a grammed token (Markov) chain with output from an expectation maximization algorithm
(Weka clusters?) parallel to a closest path (A* or breadth first?) to allow contextual suggestions
on queries that were never placed.



At user query time, create an instance of QueryResults containing the query string, number of
hits and a time stamp. Add it to a chronologically ordered list in the user session (LinkedList
makes sense) that you pass on to train(sessionQueries) when the session times out.

You also want to call the bootstrap() method every 100000 queries or so.
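
The training flow above can be sketched roughly as follows. This is only an illustration under stated assumptions: QueryResultsSketch, SessionTrainerSketch and UserSessionSketch are hypothetical stand-ins, not the classes from the attachment, and the real signatures of train() and bootstrap() may differ. The bootstrap interval is a constructor argument so that the "every 100000 queries or so" advice can be tuned.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for QueryResults: query string, number of hits, time stamp.
final class QueryResultsSketch {
    final String query;
    final int hits;
    final long timestampMillis;

    QueryResultsSketch(String query, int hits, long timestampMillis) {
        this.query = query;
        this.hits = hits;
        this.timestampMillis = timestampMillis;
    }
}

// Hypothetical stand-in for the spell checker's training side.
// It only counts calls; a real implementation would update the dictionary.
class SessionTrainerSketch {
    private final long bootstrapInterval;
    private long queriesSinceBootstrap = 0;
    int trainedSessions = 0;
    int bootstrapCalls = 0;

    SessionTrainerSketch(long bootstrapInterval) {
        this.bootstrapInterval = bootstrapInterval;
    }

    // Receives the chronologically ordered queries of one timed-out session.
    void train(List<QueryResultsSketch> sessionQueries) {
        trainedSessions++;
        queriesSinceBootstrap += sessionQueries.size();
        // Trigger bootstrap() once enough queries have been seen.
        if (queriesSinceBootstrap >= bootstrapInterval) {
            bootstrap();
            queriesSinceBootstrap = 0;
        }
    }

    void bootstrap() {
        bootstrapCalls++;
    }
}

// Hypothetical per-user session holding the query history.
class UserSessionSketch {
    private final LinkedList<QueryResultsSketch> queries = new LinkedList<>();

    // Called at user query time; appending keeps the list in chronological order.
    void recordQuery(String query, int hits) {
        queries.add(new QueryResultsSketch(query, hits, System.currentTimeMillis()));
    }

    // Called when the session times out: hand a copy of the history to the trainer.
    void onTimeout(SessionTrainerSketch trainer) {
        trainer.train(new ArrayList<>(queries));
        queries.clear();
    }
}
```

In a web application, onTimeout would typically be wired to whatever session expiry callback the container provides, and the trainer would be constructed with an interval of about 100000.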

Spell checking

Call getSuggestions(query) and look at the results. Don't modify them! This method call will
be hidden behind a facade in a future version.

Note that the spell checker is case sensitive, so you want to clean up the query the same way
when you train as when you request suggestions.

I recommend something like query = query.toLowerCase().replaceAll("\\s+", " ").trim(), which
lowercases the query and collapses runs of whitespace into a single space.
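
A minimal sketch of such a clean-up helper, assuming the intent is to lowercase the query and collapse runs of whitespace. QueryNormalizer is a hypothetical name, not part of the attachment. The point is to route both training input and suggestion requests through the same method so the two always agree.

```java
import java.util.Locale;

// Hypothetical helper; apply it both before train(...) and before getSuggestions(...).
final class QueryNormalizer {
    private QueryNormalizer() {
    }

    static String normalize(String query) {
        // Locale.ROOT avoids locale-specific case mappings (an assumption beyond the
        // original recommendation, which used the default locale).
        return query.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ") // collapse any whitespace run to one space
                .trim();                 // drop leading and trailing space
    }
}
```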
