lucene-solr-user mailing list archives

From Jorge Luis Betancourt González <jlbetanco...@uci.cu>
Subject Re: [MASSMAIL]Re: High fieldNorm values causing really odd results
Date Fri, 15 May 2015 03:49:18 GMT
Regarding the experiment, sorry if I explained myself poorly: the indexed document
doesn't have 119669 terms. It has far fewer (fewer than 1000 terms; I don't have the
exact number here now). Instead, 119669 is the number of distinct terms reported by Luke
(Top-terms total in the admin interface) on the title field.
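
To put the two numbers side by side, here is a rough sketch of the unencoded length norm for each case. It assumes DefaultSimilarity with no index-time boost, where the length norm is 1 / sqrt(numTerms); the function name is made up for illustration, and the value actually stored in the index is further quantized into a single byte, so explain output will show coarser numbers.

```python
import math


def raw_length_norm(num_terms, boost=1.0):
    """Unencoded length norm, assuming DefaultSimilarity: boost / sqrt(numTerms)."""
    return boost / math.sqrt(num_terms)


# Fewer than 1000 terms (the real document) vs. 119669 terms (the distinct-term
# count for the whole title field, which is not a per-document length).
for n in (1000, 119669):
    print(n, raw_length_norm(n))
```

Under these assumptions a field with about 1000 terms gets a raw norm of roughly 0.0316, while one with 119669 terms gets roughly 0.0029, about an order of magnitude lower.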

This index was built from scratch using 4.10.3, if I'm remembering correctly. Perhaps
part of the data was indexed using 4.10.2, but we updated our box quite some time ago
and this problem didn't appear until recently. The strangest part is that this was working
fine until a week or so ago. The only odd thing I found is that the root partition on
our Solr box ran out of space. We have Solr deployed in Tomcat, which is installed
on the root partition, but the cores and all Solr-related data are stored on a separate
partition mounted at /opt with plenty of space to grow. Could this be the cause of this behavior?

We're thinking of rebuilding our index, but would love to avoid it if possible and, more
importantly, find the root cause of this issue (if that is possible at all).

As I said before, I'm very grateful for your responses.

----- Original Message -----
From: "Chris Hostetter" <hossman_lucene@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thursday, May 14, 2015 7:11:08 PM
Subject: Re: [MASSMAIL]Re: High fieldNorm values causing really odd results


: Sorry for leaving the Solr version out in my previous email, I'm using 
: Solr 4.10.3 running on Centos7, with the following JRE: Oracle 
: Corporation OpenJDK 64-Bit Server VM (1.7.0_75 24.75-b04)

I can't reproduce this using Solr 4.10.3 (or 4.10.4 - I misread your email the
first time)

Are you certain you didn't *build* this index with a different Similarity
configured? Or did you perhaps build it with an older version of Solr that
might have had a bug in it?

Here's what i tried...

applied this patch to the example configs based on the fieldType you 
specified...

hossman@tray:~/lucene/lucene_solr_4_10_3_tag$ svn diff
Index: solr/example/solr/collection1/conf/schema.xml
===================================================================
--- solr/example/solr/collection1/conf/schema.xml	(revision 1679472)
+++ solr/example/solr/collection1/conf/schema.xml	(working copy)
@@ -46,6 +46,21 @@
 -->
 
 <schema name="example" version="1.5">
+
+        <fieldType name="hoss_type" class="solr.TextField" sortMissingLast="true">
+            <analyzer>
+                <charFilter class="solr.HTMLStripCharFilterFactory"/>
+                <tokenizer class="solr.StandardTokenizerFactory"/>
+                <filter class="solr.ASCIIFoldingFilterFactory"/>
+                <filter class="solr.StopFilterFactory"
+                    ignoreCase="true" words="stopwords.txt"/>
+                <filter class="solr.LowerCaseFilterFactory"/>
+                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+            </analyzer>
+        </fieldType>
+
+        <field name="hoss_test" type="hoss_type" stored="true" indexed="true" multiValued="true"/>
+  
   <!-- attribute "name" is the name of this schema and is only used for display purposes.
        version="x.y" is Solr's version number for the schema syntax and 
        semantics.  It should not normally be changed by applications.

...started up "java -jar start.jar" and then wrote & ran this script to
generate a doc containing the number of unique terms you mentioned in my
field, & indexed it...

hossman@tray:~/tmp$ cat make-big-field.pl
#!/usr/bin/perl

print qq{<add><doc><field name="id">hoss</field><field name="hoss_test">\n};
for (1..119669) {
    print "term${_} ";
}
print qq{</field></doc></add>\n};
hossman@tray:~/tmp$ perl make-big-field.pl > tmp.xml
hossman@tray:~/tmp$ curl -X POST -H 'Content-Type: application/xml' --data-binary @tmp.xml \
    "http://localhost:8983/solr/collection1/update?commit=true"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">877</int></lst>
</response>


Then confirmed i got a very small fieldNorm when querying against this 
field...

hossman@tray:~/tmp$ curl \
    'http://localhost:8983/solr/collection1/select?q=hoss_test:term1&debug=results&wt=json&indent=true&fl=id&omitHeader=true'
{
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"hoss"}]
  },
  "debug":{
    "explain":{
      "hoss":"\n7.491524E-4 = (MATCH) weight(hoss_test:term1 in 0) [DefaultSimilarity], result of:\n  7.491524E-4 = fieldWeight in 0, product of:\n    1.0 = tf(freq=1.0), with freq of:\n      1.0 = termFreq=1.0\n    0.30685282 = idf(docFreq=1, maxDocs=1)\n    0.0024414062 = fieldNorm(doc=0)\n"}}}
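
The numbers in the explain output above can be re-derived by hand. What follows is a minimal Python sketch, not Lucene code. It assumes DefaultSimilarity in the 4.x line uses tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)) and a length norm of 1 / sqrt(numTerms) with no index-time boost, and that the norm is stored lossily in one byte. The two encode/decode helpers are a hand port modeled on Lucene's SmallFloat.floatToByte315 / byte315ToFloat; treat the port and all Python names as assumptions.

```python
import math
import struct

# Offset between the float32 exponent and the byte encoding's zero exponent (15).
ZERO_POINT = (63 - 15) << 3


def float_to_byte315(f):
    """Lossily encode a float into one byte (assumed port of floatToByte315)."""
    # Reinterpret the float32 bit pattern as a signed 32-bit integer.
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21  # drop the low-order mantissa bits
    if small <= ZERO_POINT:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> 1
    if small >= ZERO_POINT + 0x100:
        return 255  # overflow maps to the largest value
    return small - ZERO_POINT


def byte315_to_float(b):
    """Decode a norm byte back into a float (assumed port of byte315ToFloat)."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def idf(doc_freq, max_docs):
    """Classic idf, assuming DefaultSimilarity: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


num_terms = 119669                       # terms generated by make-big-field.pl
raw_norm = 1.0 / math.sqrt(num_terms)    # length norm before encoding
stored = float_to_byte315(raw_norm)      # what would be written to the index
field_norm = byte315_to_float(stored)    # what explain reports as fieldNorm
tf = math.sqrt(1.0)                      # termFreq=1.0
score = tf * idf(1, 1) * field_norm

print(raw_norm, stored, field_norm, score)
```

Under these assumptions the decoded norm is 0.00244140625 and the product is about 7.4915e-4, in line with the fieldNorm and fieldWeight values in the explain output. The arithmetic here is done in double precision, so the last digits can differ slightly from Lucene's float results.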


-Hoss
http://www.lucidworks.com/
