lucene-solr-user mailing list archives

From Patrick Plante <patrick.pla...@semeon.com>
Subject TermVectors and ExactStatsCache
Date Wed, 20 Sep 2017 14:03:15 GMT
Hi!

I have a SolrCloud 6.6 collection with a 3-shard setup where I need the TermVectors TF and
DF values when querying.

I have configured the ExactStatsCache in solrconfig.xml:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

When I query "detector works" in my collection, it returns different docfreq values based
on the shard the document comes from:

"termVectors":[
    "27504103",[
      "uniqueKey","27504103",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]],
    "27507925",[
      "uniqueKey","27507925",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]],
    "27504105",[
      "uniqueKey","27504105",
      "kc",[
        "detector works",[
          "tf",1,
          "df",2,
          "tf-idf",0.5]]],
    "27507927",[
      "uniqueKey","27507927",
      "kc",[
        "detector works",[
          "tf",1,
          "df",2,
          "tf-idf",0.5]]],
    "27507929",[
      "uniqueKey","27507929",
      "kc",[
        "detector works",[
          "tf",1,
          "df",1,
          "tf-idf",1.0]]],
    "27504107",[
      "uniqueKey","27504107",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]]]}

I expect the DF value to be 6 for every document (3 + 2 + 1 across the three shards) and the
TF-IDF to be adjusted accordingly. I can see in the debug logs that the cache was active.
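
To make the expectation concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes that the three distinct DF values above (3, 2 and 1) each come from a different shard, and that the component computes tf-idf as tf / df, which is what the output suggests. The shard names and function names are made up for illustration; they are not part of any Solr API.

```python
from typing import Dict


def merged_df(df_by_shard: Dict[str, int]) -> int:
    """Sum the per-shard document frequencies of one term."""
    return sum(df_by_shard.values())


def tf_idf(tf: int, df: int) -> float:
    """tf / df, matching the ratio seen in the termVectors output above."""
    return tf / df


# Hypothetical shard names; the DF values are the ones returned above.
df_by_shard = {"shard1": 3, "shard2": 2, "shard3": 1}

total_df = merged_df(df_by_shard)
print(total_df)             # 6
print(tf_idf(1, total_df))  # 0.16666666666666666
```

With merged statistics, every document in the response would report "df",6 and the same tf-idf of about 0.1667 for "detector works".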

I have found an open bug (pending since Solr 5.5: https://issues.apache.org/jira/browse/SOLR-8893)
which explains that the ExactStatsCache is used to compute the correct TF-IDF for the query
but not for the TermVectors component.

Is there any way to get the correctly merged DF values (and TF-IDF) from multiple shards?

Is there a way to find out which shard a document comes from, so that I could compute the
correct DF myself?

Thank you,
Patrick


