lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5354) Blended score in AnalyzingInfixSuggester
Date Tue, 17 Dec 2013 17:45:11 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850713#comment-13850713 ]

Michael McCandless commented on LUCENE-5354:
--------------------------------------------

Thanks Remi, patch looks great!

Can you move that {{boolean finished}} inside the {{if (lastToken != null)}}?  (If there was
no lastToken then we should not be calling {{offsetEnd.endOffset}}).

Can we leave AnalyzingInfixSuggester with DOCS_ONLY?  I.e., open up a method (maybe getTextFieldType?)
that the subclass would override and set to DOCS_AND_FREQS_AND_POSITIONS.
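
A minimal sketch of the overridable-hook idea, in plain Java with no Lucene dependency. The enum and both classes are hypothetical stand-ins (only the method name getTextFieldType comes from the suggestion above); a real version would presumably return a Lucene field type rather than this enum.

```java
// Hypothetical stand-ins; these are not Lucene classes.
enum PostingsDetailSketch { DOCS_ONLY, DOCS_AND_FREQS_AND_POSITIONS }

class BaseSuggesterSketch {
  /** Hook for subclasses; the base class keeps the cheapest setting. */
  protected PostingsDetailSketch getTextFieldType() {
    return PostingsDetailSketch.DOCS_ONLY;
  }

  /** The indexing path asks the hook instead of hard-coding the option. */
  final String describeTextField() {
    return "text:" + getTextFieldType();
  }
}

class BlendedSuggesterSketch extends BaseSuggesterSketch {
  /** Only the blended subclass pays for frequencies and positions. */
  @Override
  protected PostingsDetailSketch getTextFieldType() {
    return PostingsDetailSketch.DOCS_AND_FREQS_AND_POSITIONS;
  }
}
```

With this shape the base suggester's index stays as small as before, and only users of the blended subclass index positions.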

In createCoefficient, instead of splitting the incoming key on space, I think you should ask
the analyzer to do so?  In fact, since the lookup (in super) already did that (break into
tokens, figure out if last token is a "prefix" or not), maybe we can just pass that down to
createResult?

If the query has more than one term, it looks like you only use the first?  Maybe instead
we should visit all the terms and record which one has the lowest position?
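
A rough sketch of "visit all the terms and record the lowest position", in plain Java. The class name and the positionsByTerm map are assumptions made for illustration; in the patch the positions would come from term vectors or postings rather than a map.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: positionsByTerm holds, for one suggestion,
// the positions at which each indexed term occurs.
final class LowestPositionSketch {
  /** Returns the smallest position of any query term, or -1 if no term occurs. */
  static int lowestPosition(List<String> queryTerms, Map<String, int[]> positionsByTerm) {
    int lowest = Integer.MAX_VALUE;
    for (String term : queryTerms) {          // visit every query term, not just the first
      int[] positions = positionsByTerm.get(term);
      if (positions == null) {
        continue;                             // term does not occur in this suggestion
      }
      for (int position : positions) {
        if (position < lowest) {
          lowest = position;
        }
      }
    }
    return lowest == Integer.MAX_VALUE ? -1 : lowest;
  }
}
```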

Have you done any performance testing?  Visiting term vectors for each hit can be costly.
It should be faster to pull a DocsAndPositionsEnum up front and then call .advance
on each (sorted) docID to get the position ... but this is likely more complex (it inverts
the "stride", so you'd go term by term in the outer loop and doc by doc in the inner loop,
the opposite of what you have now).
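
The inverted stride can be sketched with a toy in-memory postings cursor. Everything here is an assumption for illustration: the Postings class only imitates the forward-only advance behavior of a postings enum and is not the Lucene API, and it records just the first position per document.

```java
import java.util.HashMap;
import java.util.Map;

final class StrideInversionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical postings for one term: sorted docIDs plus the first position in each doc. */
  static final class Postings {
    private final int[] docs;
    private final int[] firstPositions;
    private int idx = -1;

    Postings(int[] docs, int[] firstPositions) {
      this.docs = docs;
      this.firstPositions = firstPositions;
    }

    /** Moves forward to the first doc >= target; never moves backward. */
    int advance(int target) {
      idx++;
      while (idx < docs.length && docs[idx] < target) {
        idx++;
      }
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    int firstPosition() {
      return firstPositions[idx];
    }
  }

  /** Outer loop over terms, inner loop over the sorted hit docIDs. */
  static Map<Integer, Integer> lowestPositions(int[] sortedHits, Postings[] termPostings) {
    Map<Integer, Integer> best = new HashMap<>();
    for (Postings postings : termPostings) {
      int current = -1;
      for (int hit : sortedHits) {
        if (current < hit) {
          current = postings.advance(hit);   // only advance when the cursor is behind
        }
        if (current == NO_MORE_DOCS) {
          break;                             // this term has no further docs
        }
        if (current == hit) {
          best.merge(hit, postings.firstPosition(), Math::min);
        }
      }
    }
    return best;
  }
}
```

Each term's postings are walked once in docID order, which is why the hits have to be sorted first.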

key.toString() can be pulled out of the while loop and done once up front.

Why do you use key.toString().contains(docTerm) for the finished case? Won't that result in
false positives, e.g. if key is "foobar" and docTerm is "oba"?
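
To make the false positive concrete, here is a small plain-Java comparison. The method names are hypothetical, and keyTokens is assumed to be the list of tokens the analyzer already produced for the key; this is one possible stricter check, not necessarily what the patch should end up with.

```java
import java.util.List;

final class FinishedMatchSketch {
  /** Substring test on the raw key: also matches fragments inside a token. */
  static boolean matchesByContains(String key, String docTerm) {
    return key.contains(docTerm);
  }

  /** Stricter alternative: docTerm must equal one whole token of the key. */
  static boolean matchesWholeToken(List<String> keyTokens, String docTerm) {
    return keyTokens.contains(docTerm);
  }
}
```

With key "foobar" and docTerm "oba", matchesByContains returns true while matchesWholeToken on the single token "foobar" returns false.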

Can you rewrite the embedded ternary operator in the LookUpComparator to just use simple if
statements?  I think that's more readable...
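
For illustration only, this is the kind of rewrite meant. The Result class and the ordering (higher weight first, ties broken by key) are assumptions and may not match what the patch's LookUpComparator actually compares.

```java
import java.util.Comparator;

final class ComparatorSketch {
  /** Hypothetical result holder: a suggestion key and its weight. */
  static final class Result {
    final String key;
    final long value;

    Result(String key, long value) {
      this.key = key;
      this.value = value;
    }
  }

  // Nested-ternary form, for comparison:
  //   return a.value > b.value ? -1 : a.value < b.value ? 1 : a.key.compareTo(b.key);
  static final Comparator<Result> BY_WEIGHT_THEN_KEY = new Comparator<Result>() {
    @Override
    public int compare(Result a, Result b) {
      if (a.value > b.value) {
        return -1;                      // higher weight sorts first
      }
      if (a.value < b.value) {
        return 1;
      }
      return a.key.compareTo(b.key);    // tie: order by key
    }
  };
}
```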


> Blended score in AnalyzingInfixSuggester
> ----------------------------------------
>
>                 Key: LUCENE-5354
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5354
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spellchecker
>    Affects Versions: 4.4
>            Reporter: Remi Melisson
>            Priority: Minor
>              Labels: suggester
>         Attachments: LUCENE-5354.patch
>
>
> I'm working on a custom suggester derived from the AnalyzingInfix. I require what is
> called a "blended score" (//TODO ln.399 in AnalyzingInfixSuggester) to transform the suggestion
> weights depending on the position of the searched term(s) in the text.
> Right now, I'm using an easy solution:
> If I want 10 suggestions, then I search against the current ordered index for the first 100
> results and transform the weight:
>   a) by using the term position in the text (found with TermVector and DocsAndPositionsEnum)
> or
>   b) by multiplying the weight by the score of a SpanQuery that I add when searching
> and return the updated 10 most weighted suggestions.
> Since we usually don't need to suggest so many things, the bigger search + rescoring
> overhead is not so significant, but I agree that this is not the most elegant solution.
> We could include this factor (here the position of the term) directly into the index.
> So, I can contribute to this if you think it's worth adding it.
> Do you think I should tweak AnalyzingInfixSuggester, subclass it or create a dedicated
> class?



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
