lucene-java-user mailing list archives

From Christian Reuschling <christian.reuschl...@gmail.com>
Subject Re: FuzzySuggester EXACT_FIRST criteria
Date Wed, 20 Nov 2013 17:40:50 GMT
I created a test class for rapid testing that should be runnable out of the box with the
LUCENE-SUGGEST-5.0-SNAPSHOT (Maven) dependency (see attachment).

Because FuzzySuggester is final and can't be subclassed, I subclassed AnalyzingSuggester instead,
delegating the constructor and the three methods convertAutomaton, getFullPrefixPaths and
getTokenStreamToAutomaton to an internal FuzzySuggester member.

Then I overrode AnalyzingSuggester.lookup(..) by copying its body from AnalyzingSuggester, sadly
invoking two methods and reading some fields through the reflection API because they are declared
private (the alternative would be to copy everything). Everything worked as expected so far.
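
For readers unfamiliar with the reflection workaround mentioned above, here is a minimal sketch of
the general technique. The class Sealed and its members are hypothetical stand-ins, not Lucene
classes; the real field and method names in AnalyzingSuggester are not shown here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Locale;

/** Hypothetical stand-in for a library class with private members; NOT a Lucene class. */
class Sealed {
    private final int maxResults = 5;

    private String normalize(String key) {
        return key.trim().toLowerCase(Locale.ROOT);
    }
}

class ReflectionAccess {

    /**
     * Reads a private field. The declaring class is passed explicitly because
     * getDeclaredField does not search superclasses, which matters when the
     * target object is an instance of a subclass.
     */
    static Object readField(Class<?> declaring, Object target, String name)
            throws ReflectiveOperationException {
        Field field = declaring.getDeclaredField(name);
        field.setAccessible(true); // lift the 'private' access check
        return field.get(target);
    }

    /** Invokes a private method declared on the given class. */
    static Object call(Class<?> declaring, Object target, String name,
                       Class<?>[] parameterTypes, Object... args)
            throws ReflectiveOperationException {
        Method method = declaring.getDeclaredMethod(name, parameterTypes);
        method.setAccessible(true);
        return method.invoke(target, args);
    }

    public static void main(String[] args) throws Exception {
        Sealed sealed = new Sealed();
        System.out.println(readField(Sealed.class, sealed, "maxResults"));
        System.out.println(call(Sealed.class, sealed, "normalize",
                new Class<?>[] {String.class}, "  Screen "));
    }
}
```

This kind of access is brittle: it breaks silently if the private members are renamed in a later
release, which is one reason copying the code may be the more robust alternative.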

I then added our slight modification: moving the getFullPrefixPaths invocation up to the first
prefix path creation.

The main class checks a simple scenario with KeywordAnalyzer, a three-term dictionary and some
query term variations.
Here is the output. Sadly, some results are unexpected (for me):


Dictionary: [screen, screensaver, mouse]

query: 'screan' - exact result as expected (correct). But not in every case! This is when one
letter is changed that is neither the first nor the last one.
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screew' - last letter changed: exact result empty (incorrect).
Exact results:
All results:
  screen/1
  screensaver/1

query: 'wcreen' - first letter changed: nothing found at all.
Exact results:
All results:

query: 'scree' - last letter removed.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'scren' - 5th letter removed. Same as with last removed letter.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'sreen' - 2nd letter removed. Why different?
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screen' - correct query: screen not found at all?
Exact results:
All results:
  screensaver/1
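
As a reference point for the output above, here is a minimal brute-force sketch of the hoped-for
"fuzzy exact" behaviour: a dictionary entry counts as an exact hit when its whole form lies within
the edit distance of the query. This is not Lucene code and does not use the FST or Levenshtein
automata; the class name FuzzyExactOracle and a maximum of 1 edit are assumptions made for
illustration. It also ignores FuzzySuggester settings such as nonFuzzyPrefix and minFuzzyLength,
so a first-letter change like 'wcreen' counts as a match here even though a configured non-fuzzy
prefix could rule it out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Brute-force model of the desired "fuzzy exact" matches; NOT Lucene code. */
class FuzzyExactOracle {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = cur[j - 1] + 1;
                cur[j] = Math.min(substitute, Math.min(delete, insert));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Dictionary entries whose complete form is within maxEdits of the query. */
    static List<String> fuzzyExact(String query, List<String> dictionary, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String entry : dictionary) {
            if (editDistance(query, entry) <= maxEdits) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("screen", "screensaver", "mouse");
        String[] queries = {"screan", "screew", "wcreen", "scree", "scren", "sreen", "screen"};
        for (String query : queries) {
            System.out.println(query + " -> " + fuzzyExact(query, dictionary, 1));
        }
    }
}
```

Under these assumptions every query in the list is within one edit of 'screen' and far from
'screensaver', so the model returns only 'screen' for each of them. Comparing this with the exact
results above shows which cases deviate.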



Now, my Latin is at the end (as we say in Germany ;) ). I don't know how to proceed further, as
the deeper code starts to become very complex.

Thanks a lot!

Christian Reuschling



On 15.11.2013 18:49, Michael McCandless wrote:
> Hmm, I'm not sure offhand why that change gives you no results.
> 
> The fullPrefixPaths should have been a super-set of the original
> prefix paths, since the LevA just adds further paths.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
> <christian.reuschling@gmail.com> wrote:
>> I tried it by changing the first prefixPath initialization to
>>
>> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>>     FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
>> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>>
>> inside AnalyzingSuggester.lookup(..). (simply copied the line from below)
>>
>> Sadly, FuzzySuggester now gives no hits at all, even with a correctly spelled query.
>>
>> Correctly spelled query:
>> prefixPaths size == 1
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>>   (without getFullPrefixPath: non-null)
>>
>> Query within edit distance - the same:
>> prefixPaths size == 1   (without getFullPrefixPath: 0)
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>>
>> Query outside of edit distance:
>> prefixPaths size = 0
>>
>> Seems like the fuzziness is there, but getFullPrefixPaths kicks all END_BYTEs ?
>>
>>
>>
>> On 14.11.2013 17:05, Michael McCandless wrote:
>>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling <christian.reuschling@gmail.com> wrote:
>>>> We started to implement named entity recognition on the basis of AnalyzingSuggester, which
>>>> offers great support for synonyms, stopwords, etc. For this, we slightly modified
>>>> AnalyzingSuggester.lookup() to only return the exactFirst hits (considering only the
>>>> exactFirst code block and skipping the 'sameSurfaceForm' check and break, to get the
>>>> synonym hits too).
>>>>
>>>> This works pretty well, and our next step would be to bring in some fuzziness against
>>>> spelling mistakes. For this, the idea was to do exactly the same, but with FuzzySuggester
>>>> instead.
>>>>
>>>> Now we have the problem that 'EXACT_FIRST' in FuzzySuggester does not only rely on sharing
>>>> the same prefix: different/misspelled terms inside the edit distance are also considered
>>>> 'not exact', which means we get the same results as with AnalyzingSuggester.
>>>>
>>>>
>>>> query: "screen"
>>>> misspelled query: "screan"
>>>> dictionary: "screen", "screensaver"
>>>>
>>>> AnalyzingSuggester hits: screen, screensaver
>>>> AnalyzingSuggester hits on misspelled query: <empty>
>>>> AnalyzingSuggester EXACT_FIRST hits: screen
>>>> AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>
>>>>
>>>> FuzzySuggester hits: screen, screensaver
>>>> FuzzySuggester hits on misspelled query: screen, screensaver
>>>> FuzzySuggester EXACT_FIRST hits: screen
>>>> FuzzySuggester EXACT_FIRST hits on misspelled query: <empty> => TARGET: screen
>>>>
>>>>
>>>> Is there a possibility to distinguish? I see that the 'exact' criterion relies on an FST
>>>> aspect, 'END_BYTE arc leaving'. Maybe these can be set differently when building the
>>>> Levenshtein automata? I have no clue.
>>>
>>> It seems like the problem is that AnalyzingSuggester checks for exactFirst before calling
>>> .getFullPrefixPaths (which, in the FuzzySuggester subclass, applies the fuzziness)?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
> 
