lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1420) Similarity.lengthNorm and positionIncrement=0
Date Wed, 15 Oct 2008 00:28:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639668#action_12639668 ]

Hoss Man commented on LUCENE-1420:
----------------------------------

1) i only skimmed this quickly, but i don't think the changes to SweetSpotSimilarity are back
compatible ... setLengthNormFactors has a new arg list.

2) ditto for the public "Info" constructor in MemoryIndex.java

3) as long as we are adding a new lengthNorm method that has access to new data about the
stream, would it also make sense to pass in fieldState.position?  and/or a new count of the
number of times getPositionIncrementGap(fieldInfo.name) is called?  Those also seem like they
could be useful, and should be just as cheap to keep track of as numOverlap and length.  (this
occurred to me because of recent threads on solr-user asking about lengthNorm and multivalued
fields ... there may only be one fieldNorm per field name, but with stats like that we could
at least do some interesting things based on the average length of each field value.)

4) independent of #3, we may want to consider making FieldInvertState a public class and passing
it directly to lengthNorm ... that way lengthNorm can utilize whatever data it wants, and
we can add more available data later without changing the API again.  We could even deprecate
lengthNorm entirely and add a new FieldInvertState.norm property that a new Similarity.computeNorm(FieldInvertState)
could set directly so it could choose to ignore the doc & field boosts altogether if it
wanted to.
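
To make #3 and #4 concrete, here is a rough, self-contained sketch of what passing a state object to the norm calculation could look like. Everything in it is an assumption for illustration: FieldStateSketch, NormPolicy, gapCount and both implementations are hypothetical stand-ins, not classes or methods from the patch or from Lucene. It also assumes the position increment gap is requested exactly once between consecutive values of a field, so the number of values is gapCount + 1.

```java
// Hypothetical sketch only: illustrative stand-ins, not Lucene classes.
class ComputeNormSketch {
    public static void main(String[] args) {
        // 12 tokens, 2 of them overlaps, 3 values (so 2 gaps), boost 2.0
        FieldStateSketch state = new FieldStateSketch(12, 2, 2, 2.0f);
        System.out.println(new OverlapDiscountingNorm().computeNorm("body", state));
        System.out.println(new AverageValueLengthNorm().computeNorm("body", state));
    }
}

/** Per-field statistics collected while a field is inverted (names are assumptions). */
final class FieldStateSketch {
    final int length;      // tokens seen for the field, overlaps included
    final int numOverlap;  // tokens with positionIncrement == 0
    final int gapCount;    // assumed: number of gaps inserted between values
    final float boost;     // combined document and field boost

    FieldStateSketch(int length, int numOverlap, int gapCount, float boost) {
        this.length = length;
        this.numOverlap = numOverlap;
        this.gapCount = gapCount;
        this.boost = boost;
    }
}

/** A single entry point that receives the whole state, so new stats need no new signature. */
interface NormPolicy {
    float computeNorm(String field, FieldStateSketch state);
}

/** Ignores overlap tokens when measuring length, but still applies the boost. */
final class OverlapDiscountingNorm implements NormPolicy {
    public float computeNorm(String field, FieldStateSketch state) {
        int counted = Math.max(1, state.length - state.numOverlap);
        return state.boost * (float) (1.0 / Math.sqrt(counted));
    }
}

/** Normalizes by the average length of each value and ignores the boost entirely. */
final class AverageValueLengthNorm implements NormPolicy {
    public float computeNorm(String field, FieldStateSketch state) {
        int values = state.gapCount + 1;
        float averageLength = Math.max(1f, (float) state.length / values);
        return (float) (1.0 / Math.sqrt(averageLength));
    }
}
```

The point of the design is that adding another statistic later only touches the state class, while existing NormPolicy implementations keep compiling.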

> Similarity.lengthNorm and positionIncrement=0
> ---------------------------------------------
>
>                 Key: LUCENE-1420
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1420
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3.3, 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Michael McCandless
>             Fix For: 2.3.3, 2.9
>
>         Attachments: similarity.patch
>
>
> Calculation of lengthNorm factor should in some cases take into account the number of
> tokens with positionIncrement=0. This should be made optional, to support two different scenarios:
> * when analyzers insert artificially constructed tokens into TokenStream (e.g. ASCII-fied
>   versions of accented terms, stemmed terms), and it's unlikely that users submit queries containing
>   both versions of tokens: in this case lengthNorm calculation should ignore the tokens with
>   positionIncrement=0.
> * when analyzers insert synonyms, and it's likely that users may submit queries that
>   contain multiple synonymous terms: in this case the lengthNorm should be calculated as it
>   is now, i.e. it should take into account all terms no matter what is their positionIncrement.
> The default should be backward-compatible, i.e. it should count all tokens.
> (See also the discussion here: http://markmail.org/message/vfvmzrzhr6pya22h )
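
The two scenarios in the description can be sketched as a single flag on the length calculation. This is a minimal illustration, not the attached patch: LengthNormSketch, lengthNorm(int[], boolean) and the discountOverlaps parameter are hypothetical names, and the 1/sqrt(numTokens) formula is assumed as the baseline norm. Passing false reproduces the backward-compatible behavior of counting every token.

```java
// Hypothetical sketch only: not the code from similarity.patch.
class LengthNormSketch {
    public static void main(String[] args) {
        // 4 "real" tokens plus 5 injected tokens with positionIncrement == 0
        int[] increments = {1, 0, 0, 1, 0, 1, 0, 0, 1};
        System.out.println(lengthNorm(increments, false)); // counts all 9 tokens
        System.out.println(lengthNorm(increments, true));  // counts only the 4 non-overlap tokens
    }

    /**
     * Computes 1/sqrt(n), where n is either the total token count or the
     * count of tokens whose position increment is non-zero.
     */
    static float lengthNorm(int[] positionIncrements, boolean discountOverlaps) {
        int numTokens = positionIncrements.length;
        int numOverlap = 0;
        for (int increment : positionIncrements) {
            if (increment == 0) {
                numOverlap++;
            }
        }
        int counted = discountOverlaps ? numTokens - numOverlap : numTokens;
        // Guard against an empty field so the result stays finite.
        counted = Math.max(1, counted);
        return (float) (1.0 / Math.sqrt(counted));
    }
}
```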


