lucene-java-user mailing list archives

From Christian Reuschling <christian.reuschl...@gmail.com>
Subject FuzzySuggester EXACT_FIRST criteria
Date Wed, 13 Nov 2013 17:04:30 GMT
We have started to implement named entity recognition on the basis of
AnalyzingSuggester, which offers great support for synonyms, stopwords, etc.
For this, we slightly modified AnalyzingSuggester.lookup() to return only the
exactFirst hits (we run the exactFirst code block only and skip the
'sameSurfaceForm' check and break, so that we get the synonym hits too).

This works pretty well, and our next step is to add some fuzziness to tolerate
spelling mistakes. The idea was to do exactly the same, but with FuzzySuggester
instead.

Now we have the problem that 'EXACT_FIRST' in FuzzySuggester is not purely a
matter of sharing the same prefix: terms that differ from the query by a
spelling mistake, even within the edit distance, are also considered 'not
exact'. This means that for EXACT_FIRST we get the same results as with
AnalyzingSuggester.


query: "screen"
misspelled query: "screan"
dictionary: "screen", "screensaver"

AnalyzingSuggester hits: screen, screensaver
AnalyzingSuggester hits on misspelled query: <empty>
AnalyzingSuggester EXACT_FIRST hits: screen
AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>

FuzzySuggester hits: screen, screensaver
FuzzySuggester hits on misspelled query: screen, screensaver
FuzzySuggester EXACT_FIRST hits: screen
FuzzySuggester EXACT_FIRST hits on misspelled query: <empty> => TARGET: screen
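
To make the FuzzySuggester rows above concrete, here is a minimal plain-Java
sketch of the distinction we are after. It uses no Lucene classes at all: the
class and method names (ExactFirstSketch, fuzzyHits, fuzzyExactHits) are made
up for illustration, the edit distance is a plain Levenshtein distance, and
analysis (synonyms, stopwords), weights, and the suggester's tuning options
are ignored. It is only meant to show "fuzzy prefix match" versus "fuzzy match
on the whole term", not how FuzzySuggester works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExactFirstSketch {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Fuzzy prefix hits: some prefix of the entry is within maxEdits of the query.
    static List<String> fuzzyHits(String query, List<String> dictionary, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String entry : dictionary) {
            for (int len = 0; len <= entry.length(); len++) {
                if (editDistance(query, entry.substring(0, len)) <= maxEdits) {
                    hits.add(entry);
                    break;
                }
            }
        }
        return hits;
    }

    // Hypothetical "fuzzy exact" hits: the whole entry is within maxEdits of the query.
    static List<String> fuzzyExactHits(String query, List<String> dictionary, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String entry : dictionary) {
            if (editDistance(query, entry) <= maxEdits) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("screen", "screensaver");
        System.out.println(fuzzyHits("screan", dictionary, 1));      // [screen, screensaver]
        System.out.println(fuzzyExactHits("screan", dictionary, 1)); // [screen]
    }
}
```

With one allowed edit, the misspelled query "screan" still reaches both
dictionary entries as a fuzzy prefix, but only "screen" as a fuzzy whole-term
match, which is the TARGET row above.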


Is there a way to distinguish these cases? I see that the 'exact' criterion
relies on an aspect of the FST ('END_BYTE arc leaving'). Maybe this could be
set differently when building the Levenshtein automata? I have no clue.
