lucene-solr-user mailing list archives

From Elizabeth Haubert <ehaub...@opensourceconnections.com>
Subject PF, PF2, PF3 clauses missing in solr7 with query-time synonyms?
Date Wed, 18 Apr 2018 16:38:53 GMT
I'm seeing pf and pf3 clauses (and in some cases pf2) fail to generate in
long queries containing synonyms.  Has anyone else run into this, or should
it be filed as a bug in Jira?  It is a showstopper for the current project,
as the pf and pf3 boosts were pretty heavily tuned.

Using Solr 7.1; all fields use the following type:

With query-time synonyms:
<fieldType name="my_text_general" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)\b(anti|hypo|hyper|non)[-\\/ ](\w+)\b" replacement="$1$2"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0" stemEnglishPossessive="1"
 protected="protwords_wdff.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_nostem.txt"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)\b(anti|hypo|hyper|non)[-\\/ ](\w+)\b" replacement="$1$2"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0" stemEnglishPossessive="1"
 protected="protwords_wdff.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"  managed="synonyms_all"
/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_nostem.txt"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
<similarity class="solr.ClassicSimilarityFactory" />
</fieldType>

Without query-time synonyms (synonyms applied at index time instead):
<fieldType name="my_text_general" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)\b(anti|hypo|hyper|non)[-\\/ ](\w+)\b" replacement="$1$2"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0" stemEnglishPossessive="1"
 protected="protwords_wdff.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"  managed="synonyms_all"
/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_nostem.txt"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?i)\b(anti|hypo|hyper|non)[-\\/ ](\w+)\b" replacement="$1$2"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0" stemEnglishPossessive="1"
 protected="protwords_wdff.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_nostem.txt"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
<similarity class="solr.ClassicSimilarityFactory" />
</fieldType>

The synonyms file is pretty long, so I'll just include the relevant bits for
an example:

allergic, hypersensitive
aspirin, acetylsalicylic acid
dog, canine, canis familiris, k 9
rat, rattus
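
Two of the four rules above contain multi-word entries, which SynonymGraphFilter turns into a token graph rather than a flat stack of tokens, so it may be worth knowing which rules in the full file are of that kind. The following is only a rough sketch: it assumes the plain comma-separated synonyms format shown in the excerpt (no `=>` mappings), and `multiword_rules` is a made-up helper name, not part of Solr.

```python
def multiword_rules(synonym_lines):
    """Return (rule, multiword_entries) pairs for rules that contain
    at least one entry made of more than one whitespace-separated token."""
    flagged = []
    for line in synonym_lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        entries = [entry.strip() for entry in line.split(",")]
        multi = [entry for entry in entries if len(entry.split()) > 1]
        if multi:
            flagged.append((line, multi))
    return flagged


# The excerpt from the message, copied as given.
EXCERPT = [
    "allergic, hypersensitive",
    "aspirin, acetylsalicylic acid",
    "dog, canine, canis familiris, k 9",
    "rat, rattus",
]

for rule, multi in multiword_rules(EXCERPT):
    print(rule, "->", multi)
```

This only inspects the file; it says nothing about how the parser treats those rules.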


The problem seems to occur when part of the query has a synonym, but the
whole phrase does not.  Whitespace has been added to piece out what is going
on; I believe any parenthesis errors are due to my tinkering around.  Beyond
that, the output is as it came from Solr.  To make the PF/PF2/PF3 clauses
identifiable, the slops were set so that pf fields have a slop ending in 0,
pf2 a slop ending in 1, and pf3 a slop ending in 2, e.g. ~10, ~11, ~12, etc.
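
The slop convention described above can be checked mechanically against a parsed query string. This is a minimal sketch, assuming the debug output is available as a raw string with backslash-escaped quotes as in the dumps below; the mapping from last digit to parameter is purely an artifact of this test setup, and `count_phrase_clauses` is a hypothetical helper name.

```python
import re
from collections import Counter

# Last digit of the slop -> parameter, per the convention used in this message.
SLOP_DIGIT_TO_PARAM = {"0": "pf", "1": "pf2", "2": "pf3"}

# A quoted phrase (quotes optionally backslash-escaped) followed by ~<slop>.
PHRASE_WITH_SLOP = re.compile(r'\\?"([^"\\]*)\\?"~(\d+)')


def count_phrase_clauses(parsed_query: str) -> Counter:
    """Count sloppy phrase clauses per parameter, keyed by the slop's last digit."""
    counts = Counter()
    for _phrase, slop in PHRASE_WITH_SLOP.findall(parsed_query):
        counts[SLOP_DIGIT_TO_PARAM.get(slop[-1], "unknown")] += 1
    return counts


sample = (
    r'(title:\"dose ? rats\"~22)^1000.0 | (kw1:\"aspirin dose\"~12)^100.0 '
    r'(authors:\"aspirin dose\"~11) (text:\"a b\"~100)^500.0 (authors:\"x y\")^250.0'
)
print(count_phrase_clauses(sample))
```

Phrase clauses with no explicit slop (such as the `authors` and `description` pf clauses) are not counted, so a zero for one parameter is a hint to look closer rather than proof that the clause is missing.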

=============
Example 1:  "aspirin dose in rats"
==============

With query-time synonyms:
===============
/// Q terms generate as expected ///
+((((kw1:\"acetylsalicylic acid\" kw1:aspirin)^100.0 |
(species:\"acetylsalicylic acid\" species:aspirin) |
(keywords_bm25_no_norms:\"acetylsalicylic acid\"
keywords_bm25_no_norms:aspirin)^50.0 | (description:\"acetylsalicylic
acid\" description:aspirin) | (kw1ranked:\"acetylsalicylic acid\"
kw1ranked:aspirin)^100.0 | (text:\"acetylsalicylic acid\" text:aspirin) |
(title:\"acetylsalicylic acid\" title:aspirin)^100.0 |
(keywordsranked_bm25_no_norms:\"acetylsalicylic acid\"
keywordsranked_bm25_no_norms:aspirin)^50.0 | (authors:\"acetylsalicylic
acid\" authors:aspirin))~0.4 ((Synonym(kw1:dosage kw1:dose kw1:dose
kw1:dose))^100.0 | Synonym(species:dosage species:dose species:dose
species:dose) | (Synonym(keywords_bm25_no_norms:dosage
keywords_bm25_no_norms:dose keywords_bm25_no_norms:dose
keywords_bm25_no_norms:dose))^50.0 | Synonym(description:dosage
description:dose description:dose description:dose) |
(Synonym(kw1ranked:dosage kw1ranked:dose kw1ranked:dose
kw1ranked:dose))^100.0 | Synonym(text:dosage text:dose text:dose text:dose)
| (Synonym(title:dosage title:dose title:dose title:dose))^100.0 |
(Synonym(keywordsranked_bm25_no_norms:dosage
keywordsranked_bm25_no_norms:dose keywordsranked_bm25_no_norms:dose
keywordsranked_bm25_no_norms:dose))^50.0 | Synonym(authors:dosage
authors:dose authors:dose authors:dose))~0.4 ((Synonym(kw1:rat
kw1:rattu))^100.0 | Synonym(species:rat species:rattu) |
(Synonym(keywords_bm25_no_norms:rat keywords_bm25_no_norms:rattu))^50.0 |
Synonym(description:rat description:rattu) | (Synonym(kw1ranked:rat
kw1ranked:rattu))^100.0 | Synonym(text:rat text:rattu) | (Synonym(title:rat
title:rattu))^100.0 | (Synonym(keywordsranked_bm25_no_norms:rat
keywordsranked_bm25_no_norms:rattu))^50.0 | Synonym(authors:rat
authors:rattu))~0.4)~3)

/// PF and PF2 are missing. ///
 () () () () ()

/// This is actually PF3 with a missing ? where the stopword 'in' belonged. ///
 ((title:\"(dosage dose dose dose) (rattu rat)\"~22)^1000.0 |
(keywordsranked_bm25_no_norms:\"(dosage dose dose dose) (rattu
rat)\"~22)^1000.0 | (text:\"(dosage dose dose dose) (rattu
rat)\"~22)^100.0)~0.4 ((keywords_bm25_no_norms:\"(dosage dose dose dose)
(rattu rat)\"~12)^500.0 | (kw1ranked:\"(dosage dose dose dose) (rattu
rat)\"~12)^100.0 | (kw1:\"(dosage dose dose dose) (rattu
rat)\"~12)^100.0)~0.4,product(max(10.0/(3.16E-11*float(ms(const(1555545600000),date(dateint)))+6.0),int(documentdatefix)),scale(map(int(rank),-1.0,-1.0,const(0.5),null),0.5,2.0)))",

With index-time synonyms:
===============

/// Q ///
 "boost(+((((kw1:aspirin)^100.0 | species:aspirin |
(keywords_bm25_no_norms:aspirin)^50.0 | description:aspirin |
(kw1ranked:aspirin)^100.0 | text:aspirin | (title:aspirin)^100.0 |
(keywordsranked_bm25_no_norms:aspirin)^50.0 | authors:aspirin)~0.4
((kw1:dose)^100.0 | species:dose | (keywords_bm25_no_norms:dose)^50.0 |
description:dose | (kw1ranked:dose)^100.0 | text:dose | (title:dose)^100.0
| (keywordsranked_bm25_no_norms:dose)^50.0 | authors:dose)~0.4
((kw1:rats)^100.0 | species:rats | (keywords_bm25_no_norms:rats)^50.0 |
description:rats | (kw1ranked:rats)^100.0 | text:rats | (title:rats)^100.0
| (keywordsranked_bm25_no_norms:rats)^50.0 | authors:rats)~0.4)~3)
/// PF  ///
  ((title:\"aspirin dose ? rats\"~20)^5000.0 |
(keywordsranked_bm25_no_norms:\"aspirin dose ? rats\"~20)^5000.0 |
(keywords_bm25_no_norms:\"aspirin dose ? rats\"~20)^1500.0 |
(text:\"aspirin dose ? rats\"~20)^1000.0)~0.4 ((kw1ranked:\"aspirin dose ?
rats\"~10)^5000.0 | (kw1:\"aspirin dose ? rats\"~10)^500.0)~0.4
((authors:\"aspirin dose ? rats\")^250.0 | description:\"aspirin dose ?
rats\")~0.4

/// PF2 ///
  ((text:\"aspirin dose ? rats\"~100)^500.0)~0.4 (authors:\"aspirin
dose\"~11 | species:\"aspirin dose\"~11)~0.4

/// PF3 ///
(((title:\"aspirin dose\"~22)^1000.0 |
(keywordsranked_bm25_no_norms:\"aspirin dose\"~22)^1000.0 | (text:\"aspirin
dose\"~22)^100.0)~0.4 ((title:\"dose ? rats\"~22)^1000.0 |
(keywordsranked_bm25_no_norms:\"dose ? rats\"~22)^1000.0 | (text:\"dose ?
rats\"~22)^100.0)~0.4) (((keywords_bm25_no_norms:\"aspirin dose\"~12)^500.0
| (kw1ranked:\"aspirin dose\"~12)^100.0 | (kw1:\"aspirin
dose\"~12)^100.0)~0.4 ((keywords_bm25_no_norms:\"dose ? rats\"~12)^500.0 |
(kw1ranked:\"dose ? rats\"~12)^100.0 | (kw1:\"dose ?
rats\"~12)^100.0)~0.4),product(max(10.0/(3.16E-11*float(ms(const(1555545600000),date(dateint)))+6.0),int(documentdatefix)),scale(map(int(rank),-1.0,-1.0,const(0.5),null),0.5,2.0)))",


===============
Example 2: "allergic reaction dogs"
The underlying issue isn't specific to any one of PF, PF2, or PF3.  The
following example picks up PF2, but not PF or PF3.
===============

With query-time synonyms:
///  Q ///
parsedquery_toString":"boost(
+((((Synonym(kw1:allergic kw1:allergy kw1:hypersensitive
kw1:hypersensitive))^100.0 | Synonym(species:allergic species:allergy
species:hypersensitive species:hypersensitive) |
(Synonym(keywords_bm25_no_norms:allergic keywords_bm25_no_norms:allergy
keywords_bm25_no_norms:hypersensitive
keywords_bm25_no_norms:hypersensitive))^50.0 | Synonym(description:allergic
description:allergy description:hypersensitive description:hypersensitive)
| (Synonym(kw1ranked:allergic kw1ranked:allergy kw1ranked:hypersensitive
kw1ranked:hypersensitive))^100.0 | Synonym(text:allergic text:allergy
text:hypersensitive text:hypersensitive) | (Synonym(title:allergic
title:allergy title:hypersensitive title:hypersensitive))^100.0 |
(Synonym(keywordsranked_bm25_no_norms:allergic
keywordsranked_bm25_no_norms:allergy
keywordsranked_bm25_no_norms:hypersensitive
keywordsranked_bm25_no_norms:hypersensitive))^50.0 |
Synonym(authors:allergic authors:allergy authors:hypersensitive
authors:hypersensitive))~0.4 ((kw1:reaction)^100.0 | species:reaction |
(keywords_bm25_no_norms:reaction)^50.0 | description:reaction |
(kw1ranked:reaction)^100.0 | text:reaction | (title:reaction)^100.0 |
(keywordsranked_bm25_no_norms:reaction)^50.0 | authors:reaction)~0.4
((kw1:\"cani familiari\" kw1:canine kw1:\"k 9\" kw1:\"cani lupu familiari\"
kw1:dog)^100.0 | (species:\"cani familiari\" species:canine species:\"k 9\"
species:\"cani lupu familiari\" species:dog) |
(keywords_bm25_no_norms:\"cani familiari\" keywords_bm25_no_norms:canine
keywords_bm25_no_norms:\"k 9\" keywords_bm25_no_norms:\"cani lupu
familiari\" keywords_bm25_no_norms:dog)^50.0 | (description:\"cani
familiari\" description:canine description:\"k 9\" description:\"cani lupu
familiari\" description:dog) | (kw1ranked:\"cani familiari\"
kw1ranked:canine kw1ranked:\"k 9\" kw1ranked:\"cani lupu familiari\"
kw1ranked:dog)^100.0 | (text:\"cani familiari\" text:canine text:\"k 9\"
text:\"cani lupu familiari\" text:dog) | (title:\"cani familiari\"
title:canine title:\"k 9\" title:\"cani lupu familiari\" title:dog)^100.0 |
(keywordsranked_bm25_no_norms:\"cani familiari\"
keywordsranked_bm25_no_norms:canine keywordsranked_bm25_no_norms:\"k 9\"
keywordsranked_bm25_no_norms:\"cani lupu familiari\"
keywordsranked_bm25_no_norms:dog)^50.0 | (authors:\"cani familiari\"
authors:canine authors:\"k 9\" authors:\"cani lupu familiari\"
authors:dog))~0.4)~3)

/// PF ///
() () () ()

/// PF2 ///
(authors:\"(hypersensitive allergy hypersensitive allergic) reaction\"~11 |
species:\"(hypersensitive allergy hypersensitive allergic)
reaction\"~11)~0.4

/// PF3 ///
() (),
product(max(10.0/(3.16E-11*float(ms(const(1555545600000),date(dateint)))+6.0),int(documentdatefix)),scale(map(int(rank),-1.0,-1.0,const(0.5),null),0.5,2.0)))",

With index-time synonyms:
/// Q ///
+((((kw1:allergic)^100.0 | species:allergic |
(keywords_bm25_no_norms:allergic)^50.0 | description:allergic |
(kw1ranked:allergic)^100.0 | text:allergic | (title:allergic)^100.0 |
(keywordsranked_bm25_no_norms:allergic)^50.0 | authors:allergic)~0.4
((kw1:reaction)^100.0 | species:reaction |
(keywords_bm25_no_norms:reaction)^50.0 | description:reaction |
(kw1ranked:reaction)^100.0 | text:reaction | (title:reaction)^100.0 |
(keywordsranked_bm25_no_norms:reaction)^50.0 | authors:reaction)~0.4
((kw1:dog)^100.0 | species:dog | (keywords_bm25_no_norms:dog)^50.0 |
description:dog | (kw1ranked:dog)^100.0 | text:dog | (title:dog)^100.0 |
(keywordsranked_bm25_no_norms:dog)^50.0 | authors:dog)~0.4)~3)

/// PF ///
((title:\"allergic reaction dog\"~20)^5000.0 |
(keywordsranked_bm25_no_norms:\"allergic reaction dog\"~20)^5000.0 |
(keywords_bm25_no_norms:\"allergic reaction dog\"~20)^1500.0 |
(text:\"allergic reaction dog\"~20)^1000.0)~0.4 ((kw1ranked:\"allergic
reaction dog\"~10)^5000.0 | (kw1:\"allergic reaction dog\"~10)^500.0)~0.4
((authors:\"allergic reaction dog\")^250.0 | description:\"allergic
reaction dog\")~0.4 ((text:\"allergic reaction dog\"~100)^500.0)~0.4

/// PF2 ///
((authors:\"allergic reaction\"~11 | species:\"allergic reaction\"~11)~0.4
(authors:\"reaction dog\"~11 | species:\"reaction dog\"~11)~0.4)

/// PF3 ///
((title:\"allergic reaction dog\"~22)^1000.0 |
(keywordsranked_bm25_no_norms:\"allergic reaction dog\"~22)^1000.0 |
(text:\"allergic reaction dog\"~22)^100.0)~0.4
((keywords_bm25_no_norms:\"allergic reaction dog\"~12)^500.0 |
(kw1ranked:\"allergic reaction dog\"~12)^100.0 | (kw1:\"allergic reaction
dog\"~12)^100.0)~0.4,product(max(10.0/(3.16E-11*float(ms(const(1555545600000),date(dateint)))+6.0),int(documentdatefix)),scale(map(int(rank),-1.0,-1.0,const(0.5),null),0.5,2.0)))",


Working on getting this rigged up in the debugger, but would appreciate any
feedback.
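
For anyone who wants to try reproducing this, a debug request along the following lines should return the parsed query. This is only a sketch: the host, the collection name, and the shortened `qf`/`pf`/`pf2`/`pf3` lists are placeholders loosely transcribed from the debug output above, not the actual request used in the project. `defType`, `qf`, `pf`, `pf2`, `pf3`, and `debugQuery` are standard edismax/common query parameters; `build_debug_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_debug_url(base_url: str, query: str) -> str:
    """Build a /select URL that asks Solr for the edismax parsed query."""
    params = {
        "defType": "edismax",
        "q": query,
        # Placeholder field lists; the real project uses many more fields.
        "qf": "title^100 text",
        "pf": "title~20^5000 text~20^1000",   # slop ending in 0
        "pf2": "authors~11 species~11",       # slop ending in 1
        "pf3": "title~22^1000 text~22^100",   # slop ending in 2
        "debugQuery": "true",
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host and collection name.
print(build_debug_url("http://localhost:8983/solr/mycollection",
                      "aspirin dose in rats"))
```

The function only builds the URL; sending it and comparing the `parsedquery_toString` values with and without query-time synonyms is left to whatever HTTP client is at hand.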

Thank you,
Elizabeth
