lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: FuzzySuggester EXACT_FIRST criteria
Date Fri, 15 Nov 2013 17:49:07 GMT
Hmm, I'm not sure offhand why that change gives you no results.

The fullPrefixPaths should have been a superset of the original
prefix paths, since the Levenshtein automaton (LevA) only adds further paths.
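
To make that expectation concrete, here is a rough sketch in plain Java. It does not use Lucene's FST or automaton classes at all: the class FuzzyExactFirstSketch and the methods editDistance, prefixPaths and exactHits are hypothetical names invented for illustration. A "prefix path" is modelled as a dictionary prefix within maxEdits of the query, and the END_BYTE check is modelled as "a dictionary entry ends exactly at that prefix". The real FuzzySuggester works on analyzed bytes and has further settings that this toy ignores.

```java
import java.util.ArrayList;
import java.util.List;

public class FuzzyExactFirstSketch {

    /** Levenshtein distance (insert, delete, substitute) using two-row dynamic programming. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Toy stand-in for prefix paths: every distinct prefix of a dictionary entry
     * that lies within maxEdits of the query. maxEdits == 0 models the non-fuzzy lookup.
     */
    static List<String> prefixPaths(String query, List<String> dictionary, int maxEdits) {
        List<String> paths = new ArrayList<>();
        for (String entry : dictionary) {
            for (int len = 0; len <= entry.length(); len++) {
                String prefix = entry.substring(0, len);
                if (editDistance(query, prefix) <= maxEdits && !paths.contains(prefix)) {
                    paths.add(prefix);
                }
            }
        }
        return paths;
    }

    /** Toy stand-in for the END_BYTE check: a path is "exact" if a dictionary entry ends there. */
    static List<String> exactHits(List<String> paths, List<String> dictionary) {
        List<String> hits = new ArrayList<>();
        for (String path : paths) {
            if (dictionary.contains(path)) {
                hits.add(path);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("screen", "screensaver");
        for (String query : List.of("screen", "screan")) {
            List<String> plain = prefixPaths(query, dict, 0);
            List<String> fuzzy = prefixPaths(query, dict, 1);
            System.out.println(query + ": plain paths=" + plain + ", fuzzy paths=" + fuzzy
                    + ", superset=" + fuzzy.containsAll(plain)
                    + ", exact hits after fuzzy expansion=" + exactHits(fuzzy, dict));
        }
    }
}
```

In this toy model the fuzzy paths always contain the plain ones, and running the exact check after the fuzzy expansion yields "screen" for the misspelled query "screan". Whether the real FST paths behave the same way is exactly the open question here.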

Mike McCandless

http://blog.mikemccandless.com


On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
<christian.reuschling@gmail.com> wrote:
> I tried it by changing the first prefixPaths initialization to
>
> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>     FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>
> inside AnalyzingSuggester.lookup(..) (I simply copied the line from further down).
>
> Sadly, FuzzySuggester now gives no hits at all, even with a correctly spelled query.
>
> Correctly spelled query:
> prefixPaths size == 1
> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>   (without getFullPrefixPaths: non-null)
>
> Query within edit distance, same result:
> prefixPaths size == 1   (without getFullPrefixPaths: 0)
> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>
> Query outside of edit distance:
> prefixPaths size == 0
>
> It seems the fuzziness is applied, but getFullPrefixPaths drops all END_BYTE arcs?
>
>
>
> On 14.11.2013 17:05, Michael McCandless wrote:
>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling
>> <christian.reuschling@gmail.com> wrote:
>>> We started to implement named entity recognition on the basis of
>>> AnalyzingSuggester, which offers great support for synonyms, stopwords,
>>> etc. For this, we slightly modified AnalyzingSuggester.lookup() to return
>>> only the exactFirst hits (using the exactFirst code block only, and
>>> skipping the 'sameSurfaceForm' check and break, to get the synonym hits
>>> too).
>>>
>>> This works pretty well, and our next step would be to bring in some
>>> fuzziness to handle spelling mistakes. For this, the idea was to do
>>> exactly the same, but with FuzzySuggester instead.
>>>
>>> Now we have the problem that 'EXACT_FIRST' in FuzzySuggester still
>>> requires the query to share the same prefix: different/misspelled terms
>>> inside the edit distance are considered 'not exact', which means we get
>>> the same results as with AnalyzingSuggester.
>>>
>>>
>>> query: "screen"
>>> misspelled query: "screan"
>>> dictionary: "screen", "screensaver"
>>>
>>> AnalyzingSuggester hits: screen, screensaver
>>> AnalyzingSuggester hits on misspelled query: <empty>
>>> AnalyzingSuggester EXACT_FIRST hits: screen
>>> AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>
>>>
>>> FuzzySuggester hits: screen, screensaver
>>> FuzzySuggester hits on misspelled query: screen, screensaver
>>> FuzzySuggester EXACT_FIRST hits: screen
>>> FuzzySuggester EXACT_FIRST hits on misspelled query: <empty> => TARGET: screen
>>>
>>>
>>> Is there a way to distinguish these cases? I see that the 'exact'
>>> criterion relies on an FST property, namely whether an END_BYTE arc
>>> leaves the node. Maybe these could be set differently when building the
>>> Levenshtein automata? I have no clue.
>>
>> It seems like the problem is that AnalyzingSuggester checks for exactFirst
>> before calling .getFullPrefixPaths (which, in FuzzySuggester subclass,
>> applies the fuzziness)?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
