lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Find part of long query in shorter fields
Date Sat, 16 Jul 2016 17:41:23 GMT
Hi Chantal,

Please see https://issues.apache.org/jira/browse/LUCENE-7148


ahmet



On Saturday, July 16, 2016 3:48 PM, CA <ca@it-agenten.com> wrote:
Hello all,

our index contains product offers from online shops. The fields we index all have rather
short values: the name of the product, the brand, the price, the category, and some fields containing
identifiers like ASIN, GTIN etc., if available. We do not index the description texts.

The regular user search uses "edismax" and queries the above-mentioned fields, which
works fine for short inputs like "iphone 6s".

Now we have to support a different kind of query, which won’t be user input but will use complete
product names like those we store ourselves, though not necessarily names that are actually part
of our data set. This means that the input query can be relatively long. The output of the
query is planned to be a More Like This list. So, in effect, the query should have
at least one hit that is hopefully close enough, and the actual result will be a More Like
This list sourced from that one hit.

I have tried to get this to work based on the "edismax" setup for the regular user search,
but this does not work well when the input is longer than the name we have stored for a similar product.
Here is an example:


## Step 1: Input (not stored in our index):
"Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew Charger" (input
to edismax without quotes)

(a) This input does not produce any results with our current edismax config (details at the
end of the e-mail).
(b) When I relax the "mm" parameter to "2<-1 5<-30% 8<10%", I get one hit with
the following name:
=> "Braun Series Clean&Renew CCR2 Cleansing Dock Cartridges Lemonfresh Formula Cartrige
(Compatible with Series 7,5,3) 2 pc"


## Step 2: When I reduce the input manually to the following:
"Braun Series 9 9095CC Men's Electric Shaver"

The above shortened input returns a very good hit with the name:
=> "Braun 9095cc Series 9 Electric Shaver"


My Question:

Is it possible, and if so, how, to have the query input:
"Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew Charger" (input
to edismax without quotes)
return (also or only) the hit with the name:
=> "Braun 9095cc Series 9 Electric Shaver"
and maybe even give it a high score?

I have tried to use "explainOther" (see the output at the end of this e-mail) but I have a
really hard time reading it. In some cases, I’m not even able to tell where one clause
ends and the next one starts (is it possible to have it returned on several lines?). Maybe
someone can give me a hint on how to use that output, or knows of some documentation on the
internet that explains how to make good use of it?


Looking at the input string, I was wondering:

(A) Is relaxing the "mm" parameter really the way to go?
(B) Should I create another name field in schema.xml that has a different query
chain, discarding the last words of a query input if it is too long? Or maybe it’s possible to
make tokens in the first part of the input more "important" (though I’m not sure this
is generally the case)? Should I remove some of the filters from the query chain (like the
ShingleFilter)?
(C) Can I configure something else, or should I not use edismax for this?


Thank you for reading this,
any insight is highly appreciated!

Chantal


***

Below are the field configuration for the name field, the configuration of the edismax
handler, and the output of "explainOther" for the above example.



SCHEMA.XML — "name" field:

<field name="name" type="name_split" indexed="true" stored="true" required="true" multiValued="false"/>

<fieldType name="name_split" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"
                splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="255"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>
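
As a rough sketch of option (B) from the questions above: a second name field whose analysis chain mirrors "name_split" but leaves out the ShingleFilter. The names "name_plain" and "name_noshingle" are hypothetical and not part of the schema shown here, and whether this actually improves matching for long inputs is untested.

```xml
<!-- Hypothetical second field, filled from "name" via copyField -->
<field name="name_plain" type="name_noshingle" indexed="true" stored="false" multiValued="false"/>
<copyField source="name" dest="name_plain"/>

<!-- Same chain as "name_split", minus the ShingleFilterFactory -->
<fieldType name="name_noshingle" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"
                splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="255"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>
```

The field would still have to be added to "qf" and/or "pf" of the handler before it has any effect on scoring.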



SOLRCONFIG.XML — MLT/EDISMAX

<requestHandler name="/mlt" class="solr.SearchHandler">
     <lst name="defaults">
         <str name="echoParams">all</str>
         <str name="defType">edismax</str>

         <str name="q.alt">*:*</str>
         <str name="fl">id,brand,name,price,score,popularity</str>
         <str name="tie">0.1</str>
         <str name="qf">brand_split^6 name</str>
         <str name="pf">brand_split^10 name^10</str>
         <str name="mm">2&lt;-1 5&lt;-30% 8&lt;10%</str>
         <int name="qs">10</int>
         <int name="ps">20</int>

         <str name="wt">xml</str>

         <str name="mlt">false</str>
         <str name="mlt.qf">brand_split^6 name price</str>
         <str name="mlt.fl">brand_split name price</str>
         <str name="mlt.interestingTerms">details</str>
     </lst>
</requestHandler>
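
Related to questions (A) and (C), here is a sketch of a separate handler for the long product-name lookups, with the "mm" expression spelled out in comments. In the parsedquery output further down, "clean" and "renew" are the only clauses prefixed with "+", and the word "and" between them does not show up as a clause of its own. One possible reading (an assumption, not something the output states) is that edismax interpreted the lowercase "and" as the AND operator; the "lowercaseOperators" parameter controls that behaviour. The handler name "/namelookup" is hypothetical.

```xml
<!-- Hypothetical handler for long product-name inputs; "/namelookup" is an assumed name -->
<requestHandler name="/namelookup" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">brand_split^6 name</str>
        <str name="pf">brand_split^10 name^10</str>
        <str name="tie">0.1</str>
        <!--
          mm, read left to right by number of optional clauses:
            1 to 2 clauses: all are required
            3 to 5 clauses: all but one are required
            6 to 8 clauses: all but 30% are required
            9 or more clauses: 10% are required
        -->
        <str name="mm">2&lt;-1 5&lt;-30% 8&lt;10%</str>
        <!-- Assumption: lowercase "and"/"or" inside product names should be plain words, not operators -->
        <bool name="lowercaseOperators">false</bool>
    </lst>
</requestHandler>
```

With "and" no longer acting as an operator, no clause would be forced to be required, so the shorter shaver name would at least not be excluded for lacking "clean" and "renew"; how it then ranks would still need to be checked with explainOther.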



DEBUG — EXPLAIN OTHER

The "other" document with id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b has the title
"Braun 9095cc Series 9 Electric Shaver"

<response>
    <lst name="responseHeader">
        <lst name="params"><!-- shortened for better overview -->
            <str name="defType">edismax</str>
            <str name="qf">brand_split^6 name</str>
            <str name="pf">brand_split^10 name^10</str>
            <str name="mm">2&lt;-1 5&lt;-30% 8&lt;10%</str>
            <str name="qs">10</str>
            <str name="ps">20</str>
            <str name="tie">0.1</str>
            <str name="q">
                Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew Charger
            </str>
            <str name="explainOther">id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0" maxScore="97.122955">
        <doc>
            <str name="name">
                Braun Series Clean&amp;Renew CCR2 Cleansing Dock Cartridges Lemonfresh Formula
                Cartrige (Compatible with Series 7,5,3) 2 pc
            </str>
            <str name="id">773d4bdb341c4dc438c481ac80de5abde08d85bf</str>
            <str name="brand">Braun</str>
            <float name="score">97.122955</float>
        </doc>
    </result>
    <lst name="debug">
        <str name="rawquerystring">
            Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew Charger
        </str>
        <str name="querystring">
            Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew Charger
        </str>
        <str name="parsedquery">
            (+(DisjunctionMaxQuery((name:braun | (brand_split:braun)^6.0)~0.1)
               DisjunctionMaxQuery((name:series | (brand_split:series)^6.0)~0.1)
               DisjunctionMaxQuery((name:"(9095cc 9095) cc"~10 | (brand_split:"(9095cc 9095) cc"~10)^6.0)~0.1)
               DisjunctionMaxQuery((Synonym(name:men name:men's) | (Synonym(brand_split:men brand_split:men's))^6.0)~0.1)
               DisjunctionMaxQuery((name:electric | (brand_split:electric)^6.0)~0.1)
               DisjunctionMaxQuery((name:shaver | (brand_split:shaver)^6.0)~0.1)
               DisjunctionMaxQuery((name:"(wet/dry wet wetdry) dry"~10 | (brand_split:"(wet/dry wet wetdry) dry"~10)^6.0)~0.1)
               DisjunctionMaxQuery((name:with | (brand_split:with)^6.0)~0.1)
               +DisjunctionMaxQuery((name:clean | (brand_split:clean)^6.0)~0.1)
               +DisjunctionMaxQuery((name:renew | (brand_split:renew)^6.0)~0.1)
               DisjunctionMaxQuery((name:charger | (brand_split:charger)^6.0)~0.1))
             DisjunctionMaxQuery((
               (brand_split:"(braun braun series braunseries) series (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) (cc 9095cc men's 9095 9095ccmen) (cc ccmen) men (men's men men's electric menelectric) electric (electric electric shaver electricshaver) shaver (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) (dry wet/dry with wet wetdrywith) dry with (with with clean withclean) clean (clean clean and cleanand) and (and and renew andrenew) renew (renew renew charger renewcharger) charger charger"~20)^10.0
               | (name:"(braun braun series braunseries) series (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) (cc 9095cc men's 9095 9095ccmen) (cc ccmen) men (men's men men's electric menelectric) electric (electric electric shaver electricshaver) shaver (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) (dry wet/dry with wet wetdrywith) dry with (with with clean withclean) clean (clean clean and cleanand) and (and and renew andrenew) renew (renew renew charger renewcharger) charger charger"~20)^10.0
             )~0.1))/no_coord
        </str>
        <str name="parsedquery_toString">
            +((name:braun | (brand_split:braun)^6.0)~0.1
              (name:series | (brand_split:series)^6.0)~0.1
              (name:"(9095cc 9095) cc"~10 | (brand_split:"(9095cc 9095) cc"~10)^6.0)~0.1
              (Synonym(name:men name:men's) | (Synonym(brand_split:men brand_split:men's))^6.0)~0.1
              (name:electric | (brand_split:electric)^6.0)~0.1
              (name:shaver | (brand_split:shaver)^6.0)~0.1
              (name:"(wet/dry wet wetdry) dry"~10 | (brand_split:"(wet/dry wet wetdry) dry"~10)^6.0)~0.1
              (name:with | (brand_split:with)^6.0)~0.1
              +(name:clean | (brand_split:clean)^6.0)~0.1
              +(name:renew | (brand_split:renew)^6.0)~0.1
              (name:charger | (brand_split:charger)^6.0)~0.1)
            (
              (brand_split:"(braun braun series braunseries) series (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) (cc 9095cc men's 9095 9095ccmen) (cc ccmen) men (men's men men's electric menelectric) electric (electric electric shaver electricshaver) shaver (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) (dry wet/dry with wet wetdrywith) dry with (with with clean withclean) clean (clean clean and cleanand) and (and and renew andrenew) renew (renew renew charger renewcharger) charger charger"~20)^10.0
              | (name:"(braun braun series braunseries) series (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) (cc 9095cc men's 9095 9095ccmen) (cc ccmen) men (men's men men's electric menelectric) electric (electric electric shaver electricshaver) shaver (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) (dry wet/dry with wet wetdrywith) dry with (with with clean withclean) clean (clean clean and cleanand) and (and and renew andrenew) renew (renew renew charger renewcharger) charger charger"~20)^10.0
            )~0.1
        </str>
        <lst name="explain">
            <str name="773d4bdb341c4dc438c481ac80de5abde08d85bf">
                97.122955 = sum of:
                  97.122955 = sum of:
                    61.102264 = max plus 0.1 times others of:
                      6.80276 = weight(name:braun in 477314) [], result of:
                        6.80276 = score(doc=477314,freq=1.0 = termFreq=1.0 ), product of:
                          8.171213 = idf(docFreq=324, docCount=1147961)
                          0.8325276 = tfNorm, computed from:
                            1.0 = termFreq=1.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            27.458092 = avgFieldLength
                            40.96 = fieldLength
                      60.42199 = weight(brand_split:braun in 477314) [], result of:
                        60.42199 = score(doc=477314,freq=1.0 = termFreq=1.0 ), product of:
                          6.0 = boost
                          8.11682 = idf(docFreq=305, docCount=1023531)
                          1.2406745 = tfNorm, computed from:
                            1.0 = termFreq=1.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            1.9018271 = avgFieldLength
                            1.0 = fieldLength
                    8.663414 = max plus 0.1 times others of:
                      8.663414 = weight(name:series in 477314) [], result of:
                        8.663414 = score(doc=477314,freq=4.0 = termFreq=4.0 ), product of:
                          5.5549765 = idf(docFreq=4440, docCount=1147961)
                          1.5595771 = tfNorm, computed from:
                            4.0 = termFreq=4.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            27.458092 = avgFieldLength
                            40.96 = fieldLength
                    4.0527744 = max plus 0.1 times others of:
                      4.0527744 = weight(name:with in 477314) [], result of:
                        4.0527744 = score(doc=477314,freq=2.0 = termFreq=2.0 ), product of:
                          3.355103 = idf(docFreq=40070, docCount=1147961)
                          1.2079433 = tfNorm, computed from:
                            2.0 = termFreq=2.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            27.458092 = avgFieldLength
                            40.96 = fieldLength
                    8.542337 = max plus 0.1 times others of:
                      8.542337 = weight(name:clean in 477314) [], result of:
                        8.542337 = score(doc=477314,freq=3.0 = termFreq=3.0 ), product of:
                          6.008829 = idf(docFreq=2820, docCount=1147961)
                          1.421631 = tfNorm, computed from:
                            3.0 = termFreq=3.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            27.458092 = avgFieldLength
                            40.96 = fieldLength
                    14.762168 = max plus 0.1 times others of:
                      14.762168 = weight(name:renew in 477314) [], result of:
                        14.762168 = score(doc=477314,freq=3.0 = termFreq=3.0 ), product of:
                          10.383966 = idf(docFreq=35, docCount=1147961)
                          1.421631 = tfNorm, computed from:
                            3.0 = termFreq=3.0
                            1.2 = parameter k1
                            0.75 = parameter b
                            27.458092 = avgFieldLength
                            40.96 = fieldLength
            </str>
        </lst>
        <str name="otherQuery">id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b</str>
        <lst name="explainOther">
            <str name="2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b">
                0.0 = Failure to meet condition(s) of required/prohibited clause(s)
                  0.0 = no match on required clause ((name:braun | (brand_split:braun)^6.0)~0.1 (name:series | (brand_split:series)^6.0)~0.1 (name:"(9095cc 9095) cc"~10 | (brand_split:"(9095cc 9095) cc"~10)^6.0)~0.1 (Synonym(name:men name:men's) | (Synonym(brand_split:men brand_split:men's))^6.0)~0.1 (name:electric | (brand_split:electric)^6.0)~0.1 (name:shaver | (brand_split:shaver)^6.0)~0.1 (name:"(wet/dry wet wetdry) dry"~10 | (brand_split:"(wet/dry wet wetdry) dry"~10)^6.0)~0.1 (name:with | (brand_split:with)^6.0)~0.1 +(name:clean | (brand_split:clean)^6.0)~0.1 +(name:renew | (brand_split:renew)^6.0)~0.1 (name:charger | (brand_split:charger)^6.0)~0.1)
                    0.0 = Failure to meet condition(s) of required/prohibited clause(s)
                      61.40732 = max plus 0.1 times others of:
                        9.853278 = weight(name:braun in 113560) [], result of:
                          9.853278 = score(doc=113560,freq=1.0 = termFreq=1.0 ), product of:
                            8.171213 = idf(docFreq=324, docCount=1147961)
                            1.2058525 = tfNorm, computed from:
                              1.0 = termFreq=1.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              27.458092 = avgFieldLength
                              16.0 = fieldLength
                        60.42199 = weight(brand_split:braun in 113560) [], result of:
                          60.42199 = score(doc=113560,freq=1.0 = termFreq=1.0 ), product of:
                            6.0 = boost
                            8.11682 = idf(docFreq=305, docCount=1023531)
                            1.2406745 = tfNorm, computed from:
                              1.0 = termFreq=1.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              1.9018271 = avgFieldLength
                              1.0 = fieldLength
                      8.6537285 = max plus 0.1 times others of:
                        8.6537285 = weight(name:series in 113560) [], result of:
                          8.6537285 = score(doc=113560,freq=2.0 = termFreq=2.0 ), product of:
                            5.5549765 = idf(docFreq=4440, docCount=1147961)
                            1.5578334 = tfNorm, computed from:
                              2.0 = termFreq=2.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              27.458092 = avgFieldLength
                              16.0 = fieldLength
                      52.67099 = max plus 0.1 times others of:
                        52.67099 = weight(name:"(9095cc 9095) cc"~10 in 113560) [], result of:
                          52.67099 = score(doc=113560,freq=3.0 = phraseFreq=3.0 ), product of:
                            30.520727 = idf(), sum of:
                              13.037208 = idf(docFreq=2, docCount=1147961)
                              10.796498 = idf(docFreq=23, docCount=1147961)
                              6.687021 = idf(docFreq=1431, docCount=1147961)
                            1.725745 = tfNorm, computed from:
                              3.0 = phraseFreq=3.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              27.458092 = avgFieldLength
                              16.0 = fieldLength
                      8.592838 = max plus 0.1 times others of:
                        8.592838 = weight(name:electric in 113560) [], result of:
                          8.592838 = score(doc=113560,freq=2.0 = termFreq=2.0 ), product of:
                            5.51589 = idf(docFreq=4617, docCount=1147961)
                            1.5578334 = tfNorm, computed from:
                              2.0 = termFreq=2.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              27.458092 = avgFieldLength
                              16.0 = fieldLength
                      13.669254 = max plus 0.1 times others of:
                        13.669254 = weight(name:shaver in 113560) [], result of:
                          13.669254 = score(doc=113560,freq=2.0 = termFreq=2.0 ), product of:
                            8.7745285 = idf(docFreq=177, docCount=1147961)
                            1.5578334 = tfNorm, computed from:
                              2.0 = termFreq=2.0
                              1.2 = parameter k1
                              0.75 = parameter b
                              27.458092 = avgFieldLength
                              16.0 = fieldLength
                      0.0 = no match on required clause ((name:clean | (brand_split:clean)^6.0)~0.1)
                        0.0 = No matching clause
                      0.0 = no match on required clause ((name:renew | (brand_split:renew)^6.0)~0.1)
                        0.0 = No matching clause
            </str>
        </lst>
    </lst>
</response>
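
Regarding the question of getting the explain output on several lines: a possible approach, assuming the Solr version in use supports the "debug.explain.structured" parameter, is to request the explanation as nested elements rather than one flat string. The fragment below is a sketch of handler defaults to that effect; the parameters can equally be passed on the request URL.

```xml
<!-- Sketch only: defaults to merge into the handler used for debugging -->
<lst name="defaults">
    <!-- Include the debug section (parsedquery, explain) in the response -->
    <str name="debugQuery">true</str>
    <!-- Return each explain node as a nested element instead of one long string -->
    <str name="debug.explain.structured">true</str>
    <!-- Pretty-print the response -->
    <str name="indent">true</str>
</lst>
```

In the explainOther output above, the only non-matching entries for the other document are the two "no match on required clause" lines for "clean" and "renew"; every other listed clause has a positive score.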
