lucene-solr-user mailing list archives

From Frederico Azeiteiro <Frederico.Azeite...@cision.com>
Subject RE: Search differences between solr 1.4.0 and 3.6.1
Date Wed, 28 Nov 2012 17:31:56 GMT
Also, I'm having issues with searching for "RoC". It returns thousands of matches on 3.6.1 against
just a few on Solr 1.4.0.
Looking at the analysis output, I see no differences...

Should I add "RoC" to the protected keywords, or can I tweak something in the schema to achieve exact
"RoC" matches?
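
A rough sketch of the protected-keywords option, assuming the query-time filter quoted further down the thread and assuming the entry has to match the token's case exactly. Whether this fixes the "RoC" results on 3.6.1 is not established in this thread.

```xml
<!-- Sketch only: the filter definition itself stays as it is.
     The assumed change is in protwords.txt (the file already named by
     the "protected" attribute), which would gain one line:

         RoC

     Tokens listed in that file are passed through
     WordDelimiterFilter without being split. -->
<filter class="solr.WordDelimiterFilterFactory"
        protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
        catenateNumbers="0" catenateWords="0" generateNumberParts="0"
        generateWordParts="1"/>
```

If the index-time filter reads the same protwords.txt, documents containing "RoC" would presumably need reindexing before the change shows up in results.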


-----Original Message-----
From: Frederico Azeiteiro [mailto:Frederico.Azeiteiro@cision.com]
Sent: Wednesday, November 28, 2012 17:19
To: solr-user@lucene.apache.org
Subject: RE: Search differences between solr 1.4.0 and 3.6.1

Ok, I'll test that and let you know.

Is there some test I can easily do to confirm that it really was a side effect of the bug?

____________________________________________
Frederico Azeiteiro
Developer
 


-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Wednesday, November 28, 2012 13:39
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

You need to add the generateNumberParts=1 attribute - assuming you actually want the number
generated.

The fact that your schema worked in 1.4 was probably simply a side effect of this bug:
https://issues.apache.org/jira/browse/SOLR-1706
"wrong tokens output from WordDelimiterFilter depending upon options"
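
As a sketch of the change suggested above, here is the query-time filter from the original message with only generateNumberParts switched from "0" to "1". All other attributes are copied as posted; nothing else is assumed to change.

```xml
<!-- Sketch only: query analyzer filter with generateNumberParts="1",
     so the numeric part of a token such as GAMES12 is emitted
     instead of being dropped. -->
<filter class="solr.WordDelimiterFilterFactory"
        protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
        catenateNumbers="0" catenateWords="0" generateNumberParts="1"
        generateWordParts="1"/>
```

Whether the index-time filter needs the same change depends on which tokens it emits on 3.6.1, which this thread does not show. Any index-time change would also require reindexing.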

-- Jack Krupansky

-----Original Message-----
From: Frederico Azeiteiro
Sent: Monday, November 26, 2012 9:06 AM
To: solr-user@lucene.apache.org
Subject: Search differences between solr 1.4.0 and 3.6.1

Hi,



While updating our Solr to 3.6.1 I noticed some differences in results when using search strings
that combine letters and numbers.

For a text field defined as:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
          catenateNumbers="1" catenateWords="1" generateNumberParts="0"
          generateWordParts="1" stemEnglishPossessive="0"/>
</analyzer>

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" ignoreCase="true"
          expand="true" synonyms="synonyms.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
          catenateNumbers="0" catenateWords="0" generateNumberParts="0"
          generateWordParts="1"/>
</analyzer>



Searching for string GAMES12 returns a lot of results on 3.6.1 that are not returned on 1.4.0.



It looks like WordDelimiterFilterFactory behaves differently in 3.6.1: the numeric part
of the keyword is ignored and the search is performed using only GAMES.



Analysis output for 1.4.0:

org.apache.solr.analysis.WordDelimiterFilterFactory
{protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, generateWordParts=1,
catenateAll=0, catenateNumbers=0}

term position      1      2
term text          GAMES  12
term type          word   word
source start,end   0,5    5,7
payload


And for 3.6.1:



org.apache.solr.analysis.WordDelimiterFilterFactory
{protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, luceneMatchVersion=LUCENE_36,
generateWordParts=1, catenateAll=0, catenateNumbers=0}

position         1
term text        GAMES
startOffset      0
endOffset        5
type             word
positionLength   1





Is this something that can be modified/fixed to return the same results?



Thank you.



Regards,

Frederico





