lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4481) AnalyzingSuggester may fail to return correct topN suggestions
Date Sat, 20 Oct 2012 11:08:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480698#comment-13480698 ]

Michael McCandless commented on LUCENE-4481:
--------------------------------------------

Another possible fix for AnalyzingSuggester would be to "guess" an appropriate maxQueueDepth,
run the search, and if the pruning becomes inadmissible (you can easily detect this by counting
how many dup paths were actually rejected), then re-run the search with a larger guess, and
iterate until you succeed.
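
A rough sketch of that retry loop, in Java. This is a toy model and not Lucene code: the
class RetryingTopN, the methods search and suggest, and the flat list of surface forms sorted
by cost are all hypothetical stand-ins for the real FST search. It assumes an attempt is
admissible when the rejected dups plus topN still fit within the queue depth.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names come from Lucene.
final class RetryingTopN {

    // Outcome of one bounded search attempt.
    static final class Attempt {
        final List<String> results;
        final int rejectedDups;

        Attempt(List<String> results, int rejectedDups) {
            this.results = results;
            this.rejectedDups = rejectedDups;
        }
    }

    // One attempt: only the queueDepth cheapest paths survive pruning.
    // surfacesByCost is assumed to be sorted cheapest-first already.
    static Attempt search(List<String> surfacesByCost, int topN, int queueDepth) {
        int limit = Math.min(queueDepth, surfacesByCost.size());
        Set<String> seen = new LinkedHashSet<>();
        int rejected = 0;
        for (String surface : surfacesByCost.subList(0, limit)) {
            if (seen.size() == topN) {
                break; // collected enough distinct suggestions
            }
            if (!seen.add(surface)) {
                rejected++; // same surface form already accepted: a dup path
            }
        }
        return new Attempt(new ArrayList<>(seen), rejected);
    }

    // Guess a queue depth, and retry with a larger guess while pruning was inadmissible.
    static List<String> suggest(List<String> surfacesByCost, int topN) {
        int queueDepth = topN; // initial guess
        while (true) {
            Attempt attempt = search(surfacesByCost, topN, queueDepth);
            // Assumed admissibility test: the rejected dups did not crowd out
            // competitive paths, or the queue already held every path.
            boolean admissible = attempt.rejectedDups + topN <= queueDepth
                    || queueDepth >= surfacesByCost.size();
            if (admissible) {
                return attempt.results;
            }
            queueDepth *= 2; // larger guess, then re-run
        }
    }
}
```

With the cheapest-first input [a, a, a, b, c] and topN = 2, the first attempt (depth 2) rejects
one dup and is retried at depth 4, which yields [a, b].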

For syn-heavy (or otherwise graph-heavy) analyzers this could be a win over the current patch.

Though if the analyzer is doing that much expansion, presumably the app would have set the limit
on max expansions, which would then make the current patch fast(er) again.

But I think we should separately explore that ...

                
> AnalyzingSuggester may fail to return correct topN suggestions
> --------------------------------------------------------------
>
>                 Key: LUCENE-4481
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4481
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4481.patch, LUCENE-4481.patch, LUCENE-4481.patch, LUCENE-4481.patch,
LUCENE-4481.patch
>
>
> I hit this when working on LUCENE-4480.
> Because AnalyzingSuggester may prune some of the topN paths found by FST's Util.TopNSearcher,
this means the queue size limit of topN makes the overall search inadmissible, i.e. it may incorrectly
prune paths that would have led to a competitive path.
> However, such pruning is rare: it happens only for graph token streams, and even then
only when competitive analyzed forms share the same surface forms.
> The simplest way to fix this is to make the queue unbounded, but this is likely a sizable
performance hit ... I haven't tested yet.  It's even possible the way the dups happen (always
at the "end" of the suggestion, because we tack on 0 byte followed by ord dedup byte) prevents
this bug from even occurring and so this could all be a false alarm!  I have to try to make
a test case showing it ...
> A cop-out solution would be to expose a separate queueSize or queueMultiplier (over the
topN) so that if users are affected by this they could crank up the queue size or multiplier.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

