lucene-solr-user mailing list archives

From Martin J <martinj.eng...@gmail.com>
Subject copyFields, multiple terms -- IDF?
Date Wed, 02 Feb 2011 21:54:10 GMT
Hi, I'm seeing some odd behavior when indexing multiple terms into a single
field using a copyField. An example:

For document A
field:contents_1 is a multivalued field containing "cat", "dog" and "duck"
field:contents_2 is a multivalued field containing "cat", "horse", and
"flower"

For document B
field:contents_1 is a multivalued field containing "cat" and "fish"
field:contents_2 is a multivalued field containing "bear" and "turkey"

I have a copyField in my schema:

 <copyField source="contents_*" dest="combined"/>

A query like contents_1:cat contents_2:cat returns document A first, and
then document B. I think that is the way it should work.

But a query like combined:cat returns document B first. In my mind, when I
am doing a copyField I am copying each of the terms in the multivalued
fields of contents_1 and contents_2 into combined, so that combined
internally has "cat", "dog", "duck", "cat", "horse", "flower" for document
A.
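To make that mental model concrete, here is a minimal Python sketch of what I assume the copyField does. It is only a toy model, not Solr's actual implementation: the `docs` dictionary and the `copy_field` helper are hypothetical names made up for illustration, and no analysis (stemming, stopwords) is applied.

```python
from collections import Counter

# Toy documents mirroring the example above (hypothetical structure).
docs = {
    "A": {"contents_1": ["cat", "dog", "duck"],
          "contents_2": ["cat", "horse", "flower"]},
    "B": {"contents_1": ["cat", "fish"],
          "contents_2": ["bear", "turkey"]},
}

def copy_field(doc, source_prefix="contents_"):
    """Concatenate the values of every field whose name starts with source_prefix."""
    combined = []
    for name in sorted(doc):
        if name.startswith(source_prefix):
            combined.extend(doc[name])
    return combined

# What I expect "combined" to hold for each document.
combined = {doc_id: copy_field(doc) for doc_id, doc in docs.items()}

# Term frequency of "cat" in the combined field of each document.
term_freq = {doc_id: Counter(values)["cat"] for doc_id, values in combined.items()}

print(combined["A"])  # ['cat', 'dog', 'duck', 'cat', 'horse', 'flower']
print(term_freq)      # {'A': 2, 'B': 1}
```

Under this model, document A has both the higher term frequency for "cat" and the longer combined field.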

An explain on the query says something like the following (this is from a
real query, not the fake one above):

<lst name="explain">
<str name="B">
4.0687284 = (MATCH) fieldWeight(combined:cat in 1663089), product of:
  1.0 = tf(termFreq(combined:cat)=1)
  4.0687284 = idf(docFreq=135688, maxDocs=2919285)
  1.0 = fieldNorm(field=combined, doc=1663089)
</str>
<str name="A">
0.8509077 = (MATCH) fieldWeight(combined:cat in 913171), product of:
  2.236068 = tf(termFreq(combined:cat)=5)
  4.0590663 = idf(docFreq=143689, maxDocs=3061697)
  0.09375 = fieldNorm(field=combined, doc=913171)
</str>
</lst>

If I am reading this right, it is finding the higher TF in A (5 in this
case) but still scoring B higher. Shouldn't idf be exactly the same?
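
For reference, the numbers in the explain output above can be reproduced with a short Python sketch. It assumes the classic Lucene DefaultSimilarity formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are made up for illustration, and the fieldNorm values are simply copied from the explain output rather than derived.

```python
import math

def tf(term_freq):
    # Assumed classic Lucene tf: square root of the raw term frequency.
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    # Assumed classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of tf, idf and fieldNorm.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm

# Inputs copied from the explain output for documents B and A.
score_b = field_weight(1, 135688, 2919285, 1.0)
score_a = field_weight(5, 143689, 3061697, 0.09375)

print(round(score_b, 4))  # approximately 4.0687
print(round(score_a, 4))  # approximately 0.8509
```

Two things stand out from the arithmetic. The two entries report different docFreq and maxDocs values, so the idf inputs are not the same for A and B. The larger factor is fieldNorm: A's 0.09375 against B's 1.0 outweighs A's higher tf.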

(Both fields are a solr.TextField:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
)

Another piece of perhaps relevant information is that this is a query over 16
shards using distributed Solr.
