lucene-solr-user mailing list archives

From Martin J <martinj.eng...@gmail.com>
Subject Re: copyFields, multiple terms -- IDF?
Date Wed, 02 Feb 2011 22:04:05 GMT
On closer review, I am noticing that the fieldNorm is what is killing
document A.
If I reindex with omitNorms=true, will this problem be "solved"?
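
For reference, the numbers in the explain output quoted below can be reproduced by hand. This is only a sketch, assuming the stock Lucene DefaultSimilarity of that era (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), fieldWeight = tf * idf * fieldNorm) and assuming that a field indexed with omitNorms=true is scored as if fieldNorm were 1.0. The function names are made up for illustration and are not Solr or Lucene API.

```python
import math

# Hypothetical helpers mirroring the classic DefaultSimilarity formulas
# (an assumption; a custom Similarity would give different numbers).
def tf(term_freq):
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm

# Values copied from the explain output for documents B and A.
score_b = field_weight(1, 135688, 2919285, 1.0)
score_a = field_weight(5, 143689, 3061697, 0.09375)

# Same inputs for A, but with the norm treated as 1.0 (the omitNorms case).
score_a_no_norm = field_weight(5, 143689, 3061697, 1.0)

print(f"B as indexed:        {score_b:.4f}")
print(f"A as indexed:        {score_a:.4f}")
print(f"A with fieldNorm=1:  {score_a_no_norm:.4f}")
```

Under those assumptions A would outrank B once the norm is out of the picture, since A's tf * idf product is larger than B's. Note that omitting norms also drops any index-time boosts on that field.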


On Wed, Feb 2, 2011 at 4:54 PM, Martin J <martinj.engine@gmail.com> wrote:

> Hi, I'm seeing some odd behavior when indexing multiple terms into a
> single field using a copyField. An example:
>
> For document A
> field:contents_1 is a multivalued field containing "cat", "dog" and "duck"
> field:contents_2 is a multivalued field containing "cat", "horse", and
> "flower"
>
> For document B
> field:contents_1 is a multivalued field containing "cat" and "fish"
> field:contents_2 is a multivalued field containing "bear" and "turkey"
>
> I have a copyField in my schema:
>
>  <copyField source="contents_*" dest="combined"/>
>
> A query like contents_1:cat contents_2:cat returns document A first, and
> then document B. I think that is the way it should work.
>
> But a query like combined:cat returns document B first. In my mind, when I
> am doing a copyField I am copying each of the terms in the multivalued
> fields of contents_1 and contents_2 into combined, so that combined
> internally has "cat", "dog", "duck", "cat", "horse", "flower" for document
> A.
>
> An explain on the query says something like (this is from a real query not
> the fake one above)
>
> <lst name="explain">
> <str name="B">
> 4.0687284 = (MATCH) fieldWeight(combined:cat in 1663089), product of:
>   1.0 = tf(termFreq(combined:cat)=1)
>   4.0687284 = idf(docFreq=135688, maxDocs=2919285)
>   1.0 = fieldNorm(field=combined, doc=1663089)
> </str>
> <str name="A">
> 0.8509077 = (MATCH) fieldWeight(combined:cat in 913171), product of:
>   2.236068 = tf(termFreq(combined:cat)=5)
>   4.0590663 = idf(docFreq=143689, maxDocs=3061697)
>   0.09375 = fieldNorm(field=combined, doc=913171)
> </str>
>
> If I am reading this right, it is finding the higher TF in A (5 in this
> case) but still scoring B higher. Shouldn't idf be exactly the same?
>
> (Both fields are a solr.TextField:
>
>  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>             <tokenizer class="solr.StandardTokenizerFactory"/>
>             <filter class="solr.StandardFilterFactory"/>
>             <filter class="solr.ISOLatin1AccentFilterFactory"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true"/>
>             <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>       </analyzer>
>     </fieldtype>
> )
>
> Another piece of perhaps relevant information is that this is a query over
> 16 shards using distributed Solr.
>
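
On the "shouldn't idf be exactly the same?" question: the two explain entries report different maxDocs values (2919285 vs 3061697), which suggests A and B live on different shards and that each shard computed idf from its own local statistics. The sketch below, again assuming the DefaultSimilarity idf formula and using an illustrative function name, shows how small that difference is compared with the fieldNorm gap.

```python
import math

# Assumed DefaultSimilarity-style idf; not taken from the Solr source.
def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Per-shard statistics as reported in the two explain entries.
idf_shard_b = idf(135688, 2919285)
idf_shard_a = idf(143689, 3061697)

idf_ratio = idf_shard_b / idf_shard_a   # how much idf favors B
norm_ratio = 1.0 / 0.09375              # how much fieldNorm favors B

print(f"idf on B's shard: {idf_shard_b:.4f}")
print(f"idf on A's shard: {idf_shard_a:.4f}")
print(f"idf ratio (B/A):  {idf_ratio:.4f}")
print(f"norm ratio (B/A): {norm_ratio:.2f}")
```

If that reading is right, the idf mismatch shifts the scores by well under one percent, while the fieldNorm difference is roughly a factor of ten, so the norm is what decides the ordering here.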
