lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.
Date Sat, 03 Feb 2007 02:19:05 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

      Description: 
Some minor changes to how the single-token ngram spell checker in contrib/spellcheck works, but
nothing that breaks any old implementation, I think. Also fixed the broken test.

NgramPhraseSuggestier tokenizes a query and suggests combinations drawn from the matrix of
single-token suggestions.

Each combination must match as a query against an a priori index. With a span near query (the
default) you get features like this:

    assertEquals("lost in translation", ngramSuggester.didYouMean("lost on translation"));
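As a rough illustration of what combining the single-token suggestion matrix could look like,
the sketch below enumerates every combination of per-token candidates and keeps the first one
accepted by a predicate. The class PhraseCombinations, its method names, and the predicate that
stands in for the span near query against the a priori index are assumptions made for
illustration; they are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch, not the code from spellchecker.diff.
final class PhraseCombinations {

    // Builds every phrase that takes one candidate from each row of the matrix,
    // in row order (the Cartesian product of the rows).
    static List<String> combinations(List<List<String>> matrix) {
        List<String> phrases = new ArrayList<>();
        if (matrix.isEmpty()) {
            return phrases; // no tokens, no phrases
        }
        phrases.add("");
        for (List<String> row : matrix) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String candidate : row) {
                    next.add(prefix.isEmpty() ? candidate : prefix + " " + candidate);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    // Returns the first combination accepted by matchesIndex, which stands in
    // for running the phrase as a query against the a priori index.
    static Optional<String> firstMatch(List<List<String>> matrix, Predicate<String> matchesIndex) {
        return combinations(matrix).stream().filter(matchesIndex).findFirst();
    }
}
```

Each row holds the candidates for one query token, so a three-token query with two candidates
for the middle token yields two phrases to test against the index.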

If term position vectors are available, the suggester can be made context sensitive (or whatever
one may call it) and suggest a new term order.

    assertEquals("heroes might magic", ngramSuggester.didYouMean("magic light heros"));
    assertEquals("heroes of might and magic", ngramSuggester.didYouMean("heros on light and magik"));
    assertEquals("best game made", ngramSuggester.didYouMean("game best made"));
    assertEquals("game made", ngramSuggester.didYouMean("made game"));
    assertEquals("game made", ngramSuggester.didYouMean("made lame"));
    assertEquals("the game", ngramSuggester.didYouMean("the game"));
    assertEquals("in the fame", ngramSuggester.didYouMean("in the game"));
    assertEquals("game", ngramSuggester.didYouMean("same"));
    assertEquals(0, ngramSuggester.suggest("may game").size());

SessionAnalyzedDictionary is the adaptive layer that learns from how users changed their
queries, what data they inspected, etc. It will automagically find and suggest synonyms,
decomposed words, and probably a lot of other neat features I have not yet detected.

Depending somewhat on the situation, ignored suggestions get suppressed and followed suggestions
get suggested even more.

    assertEquals("the da vinci code", dictionary.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", dictionary.didYouMean("the davinci code"));

    assertEquals("homm", dictionary.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", dictionary.didYouMean("homm"));

    assertEquals("heroes of might and magic 2", dictionary.didYouMean("heroes of might and magic ii"));
    assertEquals("heroes of might and magic ii", dictionary.didYouMean("heroes of might and magic 2"));
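
One way to picture suppression and promotion of suggestions is a per-query score that goes up
when a suggestion is followed and down when it is ignored. The sketch below is only that: the
class SuggestionFeedback, its methods, and the plus-one/minus-one scoring rule are assumptions
for illustration, and the real SessionAnalyzedDictionary may work quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of feedback-weighted suggestions; the scoring rule is an assumption.
final class SuggestionFeedback {
    // query -> (suggestion -> score)
    private final Map<String, Map<String, Integer>> scores = new HashMap<>();

    // The user accepted the suggestion: raise its score.
    void followed(String query, String suggestion) {
        adjust(query, suggestion, 1);
    }

    // The user was shown the suggestion but did not use it: lower its score.
    void ignored(String query, String suggestion) {
        adjust(query, suggestion, -1);
    }

    private void adjust(String query, String suggestion, int delta) {
        scores.computeIfAbsent(query, q -> new HashMap<>()).merge(suggestion, delta, Integer::sum);
    }

    // Returns the suggestion with the highest positive score, or null if none
    // is known or every suggestion for the query has been suppressed.
    String didYouMean(String query) {
        Map<String, Integer> forQuery = scores.get(query);
        if (forQuery == null) {
            return null;
        }
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Integer> entry : forQuery.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }
}
```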


The adaptive layer is not yet(tm) persistent, but it is soft referenced so that the dictionary
doesn't eat up all your RAM.
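
For readers unfamiliar with soft references, here is a minimal sketch of the general idea using
java.lang.ref.SoftReference from the JDK. The class SoftSuggestionCache and its methods are
hypothetical and do not come from the patch; they only show how learned entries can be held so
that the garbage collector is free to reclaim them under memory pressure.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: learned suggestions held through soft references, so the
// garbage collector may reclaim them when memory runs low.
final class SoftSuggestionCache {
    private final Map<String, SoftReference<String>> entries = new ConcurrentHashMap<>();

    void put(String query, String suggestion) {
        entries.put(query, new SoftReference<>(suggestion));
    }

    // Returns the learned suggestion, or null if none was stored or it has been collected.
    String get(String query) {
        SoftReference<String> ref = entries.get(query);
        if (ref == null) {
            return null;
        }
        String suggestion = ref.get();
        if (suggestion == null) {
            entries.remove(query, ref); // drop the cleared reference
        }
        return suggestion;
    }
}
```

The trade-off is that anything learned can vanish at any time, which is why callers must treat
a null result as "nothing known" rather than as an error.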


  was:
From javadocs:

 This is an adaptive, user query session analyzing spell checker. In plain words, a word and
phrase dictionary that will learn from how users act while searching.

Be aware, this is a beta version. It is not finished, but yields great results if you have
enough user activity, RAM and a fairly narrow document corpus. The RAM problem can be fixed
if you implement your own subclass of SpellChecker, as the abstract methods of this class are
the CRUD methods. This will most probably change to a strategy class in a future version.

TODO:

1. Gram up results to detect composite words that should not be composite words, and vice versa.

2. Train a grammed token (Markov) chain with output from an expectation maximization algorithm
(weka clusters?) parallel to a closest path (A* or breadth first?) to allow contextual suggestions
on queries that were never placed.

Usage:

Training

At user query time, create an instance of QueryResults containing the query string, number of
hits and a time stamp. Add it to a chronologically ordered list in the user session (LinkedList
makes sense) that you pass on to train(sessionQueries) as the session times out.
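
The training flow described above can be sketched roughly as follows. Everything here is a
stand-in: the fields and constructor of QueryResults, the Trainable interface, and the
UserSession class are assumptions for illustration, and the real signatures in the patch may
differ.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in: the real QueryResults class in the patch may differ.
final class QueryResults {
    final String query;
    final int hits;
    final long timestamp;

    QueryResults(String query, int hits, long timestamp) {
        this.query = query;
        this.hits = hits;
        this.timestamp = timestamp;
    }
}

// Hypothetical stand-in for the spell checker's train(sessionQueries) method.
interface Trainable {
    void train(List<QueryResults> sessionQueries);
}

// Collects one user's queries in chronological order and trains on session timeout.
final class UserSession {
    private final LinkedList<QueryResults> sessionQueries = new LinkedList<>();

    // Call at user query time.
    void record(String query, int hits) {
        sessionQueries.add(new QueryResults(query, hits, System.currentTimeMillis()));
    }

    // Call when the session times out: hand a copy of the ordered list to the
    // spell checker, then start over with an empty session.
    void onTimeout(Trainable spellChecker) {
        spellChecker.train(new ArrayList<>(sessionQueries));
        sessionQueries.clear();
    }
}
```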

You also want to call the bootstrap() method every 100000 queries or so.

Spell checking

Call getSuggestions(query) and look at the results. Don't modify them! This method call will
be hidden in a facade in a future version.

Note that the spell checker is case sensitive, so you want to clean up the query the same way
when you train as when you request the suggestions.

I recommend something like query = query.toLowerCase().replaceAll("\\s+", " ").trim()
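
A minimal sketch of such a clean-up step, assuming the intent is to lower-case the query and
collapse runs of whitespace into single spaces. The class and method names are hypothetical;
the point is only that training and suggestion requests should share one normalization routine.

```java
import java.util.Locale;

// Hypothetical helper: apply the same normalization when training and when
// requesting suggestions.
final class QueryNormalizer {

    // Lower-cases the query, collapses each run of whitespace to one space,
    // and trims leading and trailing whitespace.
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```

Locale.ROOT is used so that lower-casing does not vary with the default locale of the JVM.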

    Lucene Fields: [Patch Available]
         Assignee: Karl Wettin
       Issue Type: Improvement  (was: New Feature)
          Summary: Extended spell checker with phrase support and adaptive user session analysis.
 (was: Adaptive, user query session analyzing spell checker.)

All of the old comments were obsolete, so I re-initialized the whole issue.

> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>            Priority: Minor
>         Attachments: spellchecker.diff
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

