lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Combined Dismax and Block Join Scoring on nested documents
Date Mon, 21 Nov 2016 13:00:40 GMT
A blog article about what you learned would be very welcome. These
edge cases are something other people could certainly learn from.
Share the knowledge forward etc.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 21 November 2016 at 23:57, Mike Allen
<mike.allen@thecommercepartnership.com> wrote:
> Hi Mikhail,
>
> Thanks for your advice, it went a long way towards helping me get the right documents
> in the first place, especially parameterising the block join with an explicit v, as otherwise
> it was a nightmare of parser errors. Not to mention I'm still figuring out the nuances of
> where I need whitespace and where I don't! However, I spent part of the weekend fiddling
> around with spaces and +'s and I believe I've got it working as I'd hoped.
>
> Again, many thanks,
>
> Mike
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhl@apache.org]
> Sent: 18 November 2016 12:58
> To: solr-user
> Subject: Re: Combined Dismax and Block Join Scoring on nested documents
>
> Hello Mike,
> Structured queries in Solr are rather cumbersome.
> Start from:
> q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product score=min v=$childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...
>
> Besides "explain", there is a parsed query entry in the debug output that is more useful for
> troubleshooting purposes.
> Please also make sure that + is properly encoded as %2B so that it survives the HTTP request.
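
The encoding advice above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive client: the endpoint URL (host, port, core name) is hypothetical, and the `$` in `v=$childq` assumes Solr's parameter dereferencing syntax was intended for the `childq` parameter. `urlencode` percent-encodes each literal `+` as `%2B`, so the plus signs are not decoded as spaces on the server side.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"

# Query parameters modelled on the suggestion in the thread.
# "v=$childq" assumes Solr parameter dereferencing was intended.
params = {
    "q": '+{!dismax v="skirt" qf="name"} '
         '+{!parent which=content_type:product score=min v=$childq}',
    "childq": "+in_stock:true^=0 {!func}list_price_gbp",
    "sort": "score asc",
    "debugQuery": "on",
}

# urlencode percent-encodes '+' as %2B and writes spaces as '+'.
query_string = urlencode(params)
request_url = f"{SOLR_SELECT_URL}?{query_string}"

# Round trip: decoding the query string recovers the original parameters.
decoded = {key: values[0] for key, values in parse_qs(query_string).items()}

print(request_url)
```

Decoding the generated query string gives back the original parameter values, which is a quick way to check that no `+` was lost before the request is sent.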
>
> On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen <mike.allen@thecommercepartnership.com> wrote:
>
>> Apologies if I'm doing something incredibly stupid as I'm new to Solr.
>> I am having an issue with scoring child documents in a block join
>> query when including a dismax query. I'm actually a little unclear on
>> whether or not that's a complete oxymoron, combining dismax and block join.
>>
>>
>>
>> Problem statement: Given a set of Product documents - which contain
>> the product names and descriptions - which contain nested variant
>> documents (see below for abridged example) - which contain the boolean
>> stock status
>> (in_stock) and the variant prices (list_price_gbp) - I want to do a
>> Dismax query of, say, "skirt" on the product name (name) and sort the
>> resulting product documents by the minimum price (list_price_gbp) of
>> their child variant documents. Note that, although the abridged
>> document doesn't show them, there are a number of other arbitrary
>> fields which may be used as filter queries on the child documents, for
>> example size or colour, which will in effect change the "active"
>> minimum price of a product. Hence, denormalizing, or flattening, the
>> documents is not really an option I want to pursue.
>>
>>
>>
>> An abridged example document returned by the Solr Admin Query console
>> which I am querying:
>>
>>
>>
>> <doc>
>>   <str name="id">12345</str>
>>   <str name="content_type">product</str>
>>   <str name="name">black flared skirt</str>
>>   <float name="min_list_price_gbp">40.0</float>
>>   <result name="doc" numFound="2" start="0">
>>     <doc>
>>       <str name="skuid">12345abcd</str>
>>       <str name="productid">12345</str>
>>       <str name="content_type">variant</str>
>>       <float name="list_price_gbp">65.0</float>
>>       <bool name="in_stock">true</bool>
>>     </doc>
>>     <doc>
>>       <str name="skuid">12345fghi</str>
>>       <str name="productid">12345</str>
>>       <str name="content_type">variant</str>
>>       <float name="list_price_gbp">40.0</float>
>>       <bool name="in_stock">true</bool>
>>     </doc>
>>   </result>
>> </doc>
>>
>>
>>
>> So I am familiar with the block join score mode; setting aside the
>> dismax aspect for now, this query, using the Function Query
>> {!func}list_price_gbp, with score ascending, returns documents ordered
>> correctly, with a £2.00 (cheapest) product first:
>>
>>
>>
>> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>
>>
>>
>> The "explain" for this is:
>>
>>
>>
>> 2.0000184 = Score based on 1 child docs in range from 26752 to 26752, best match:
>>   2.0000184 = sum of:
>>     1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>>       1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0), product of:
>>         1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>>         1.0 = tfNorm, computed from:
>>           1.0 = termFreq=1.0
>>           1.2 = parameter k1
>>           0.0 = parameter b (norms omitted for field)
>>     2.0 = FunctionQuery(float(list_price_gbp)), product of:
>>       2.0 = float(list_price_gbp)=2.0
>>       1.0 = boost
>>       1.0 = queryNorm
>>
>>
>>
>> Even though this is doing what I want, I have a slight niggle that the
>> overall score is not just the result of the Function Query. However,
>> as all results get the same tiny fraction added, it doesn't matter.
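
The "tiny fraction" in the explain output can be modelled with a few lines of Python. This is only a sketch of the arithmetic visible in the explain, not of Solr's internals: each child scores as the sum of the in_stock clause weight and the function query value, and score=min takes the smallest child score. The function names are hypothetical. Setting the clause weight to zero mirrors what the `^=0` constant score in the reply at the top of this thread appears intended to achieve.

```python
def child_score(list_price_gbp, filter_clause_score):
    """Score of one child: filter clause weight plus the function query value."""
    return filter_clause_score + list_price_gbp


def parent_score_min(child_prices, filter_clause_score):
    """Model of score=min: the parent takes the lowest score among its children."""
    return min(child_score(price, filter_clause_score) for price in child_prices)


# Weight of in_stock:T taken from the explain output in this message.
IN_STOCK_WEIGHT = 1.8374416e-5

# Single child priced at 2.0, as in the explain output.
with_weight = parent_score_min([2.0], IN_STOCK_WEIGHT)

# Two children priced 65.0 and 40.0, as in the abridged example document,
# with the filter clause contributing nothing to the score.
without_weight = parent_score_min([65.0, 40.0], 0.0)

print(with_weight, without_weight)
```

With a zero clause weight the parent score is exactly the minimum child price, so sorting by score ascending sorts by cheapest variant.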
>>
>>
>>
>> However, when I prepend my dismax query:
>>
>>
>>
>> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>
>>
>>
>> The scoring is only dependent on the dismax scoring, where the "explain"
>> for this is:
>>
>>
>>
>> 2.7600822 = sum of:
>>   2.7600822 = weight(name:skirt in 13406) [], result of:
>>     2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0), product of:
>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>       0.76987 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         4.108818 = avgFieldLength
>>         7.111111 = fieldLength
>>
>>
>>
>> So in actual fact, with score ascending, it is ordering the results by
>> least matching first and the nested document list_price_gbp is
>> irrelevant. I strongly suspect I am being totally dumb and that this
>> is expected behaviour for an obvious reason that escapes me, apart
>> from perhaps it's because the two scoring methods are just plainly
>> incompatible.
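
For readers following the explain output, the numbers in the dismax score can be reproduced with the standard BM25 formulas. The sketch below assumes the usual Lucene BM25 definitions of idf and tfNorm; the function names are hypothetical and the inputs are copied from the explain above. It shows that the score is purely a text relevance value for name:skirt, with no price component.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency (assumed Lucene BM25 form)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """BM25 term frequency normalisation with length normalisation b."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


# Inputs copied from the explain output for doc 13406.
idf = bm25_idf(doc_freq=103, doc_count=3731)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=7.111111, avg_field_length=4.108818)
score = idf * tf_norm

print(round(idf, 5), round(tf_norm, 5), round(score, 5))
```

The computed values agree with the idf, tfNorm and final score in the explain to within rounding, so a longer product name simply yields a lower score, which is what an ascending sort then puts first.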
>>
>>
>>
>> I have additionally tried just doing a lucene query instead:
>>
>>
>>
>> q=+name:skirt +{!parent which=content_type:product score=min} (in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>
>>
>>
>> The "explain" of this indicates it's scoring products, for which
>> list_price_gbp simply does not exist, as the Function Query always
>> returns zero.
>>
>>
>>
>> 4.6243963 = sum of:
>>   3.624396 = weight(name:skirt in 18113) [], result of:
>>     3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0), product of:
>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>       1.0109531 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         4.108818 = avgFieldLength
>>         4.0 = fieldLength
>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>     1.0 = boost
>>     1.0 = queryNorm
>>   0.0 = FunctionQuery(float(list_price_gbp)), product of:
>>     0.0 = float(list_price_gbp)=0.0
>>     1.0 = boost
>>     1.0 = queryNorm
>>
>>
>>
>> Indeed, if I change the Function Query field to a product scoped
>> field, min_list_price_gbp, like so:
>>
>>
>>
>> q=+name:skirt +{!parent which=content_type:product score=min}+(in_stock:(true)){!func}min_list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>
>>
>>
>> then the "explain" certainly does show the Function Query evaluating:
>>
>>
>>
>> 18.624397 = sum of:
>>   3.624396 = weight(name:skirt in 17890) [], result of:
>>     3.624396 = score(doc=17890,freq=1.0 = termFreq=1.0), product of:
>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>       1.0109531 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         4.108818 = avgFieldLength
>>         4.0 = fieldLength
>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>     1.0 = boost
>>     1.0 = queryNorm
>>   14.0 = FunctionQuery(float(min_list_price_gbp)), product of:
>>     14.0 = float(min_list_price_gbp)=14.0
>>     1.0 = boost
>>     1.0 = queryNorm
>>
>>
>>
>> My grasp of the syntax is pretty flakey, so I would be immensely
>> grateful if someone could point out if I'm just doing something
>> incredibly dumb. In my head, I see what I am trying to do as
>>
>>
>>
>> (some dismax or lucene query on parent document [e.g. "skirt"])
>>   => (get a subset of these parent docs based on a block join)
>>     => (where the children match a bunch of arbitrary filter queries [e.g. "colour:red"])
>>       => (then subquery the child docs that match the same filter queries [e.g. "colour:red"])
>>         => (then score this subset of child documents)
>>           => (and order by that score)
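
The intended pipeline above can be expressed as a small in-memory model, which may help when checking what a Solr query ought to return. This is a sketch of the desired semantics only and makes no claim about how Solr evaluates block joins. The class and function names are hypothetical, as are the second and third products and all colour values; the first product reuses the prices from the abridged example document.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    skuid: str
    list_price_gbp: float
    in_stock: bool
    colour: str = ""  # hypothetical filter field


@dataclass
class Product:
    id: str
    name: str
    variants: list = field(default_factory=list)


def rank_by_min_child_price(products, term, child_filter):
    """Match parents on name, filter children, order by cheapest matching child."""
    ranked = []
    for product in products:
        if term not in product.name.split():
            continue  # parent does not match the text query
        matching = [v for v in product.variants if child_filter(v)]
        if not matching:
            continue  # block join: at least one child must match
        ranked.append((min(v.list_price_gbp for v in matching), product.id))
    return [product_id for _, product_id in sorted(ranked)]


products = [
    Product("12345", "black flared skirt", [
        Variant("12345abcd", 65.0, True, "red"),
        Variant("12345fghi", 40.0, True, "black"),
    ]),
    Product("67890", "denim skirt", [Variant("67890aaaa", 50.0, True, "red")]),
    Product("11111", "wool jumper", [Variant("11111aaaa", 10.0, True, "red")]),
]

in_stock_order = rank_by_min_child_price(products, "skirt", lambda v: v.in_stock)
red_order = rank_by_min_child_price(
    products, "skirt", lambda v: v.in_stock and v.colour == "red")
print(in_stock_order, red_order)
```

Adding the colour filter changes the "active" minimum price of product 12345 from 40.0 to 65.0, which reverses the order of the two skirts. This is the behaviour that rules out a precomputed minimum price on the parent.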
>>
>>
>>
>>
>> Is this actually possible? I've been googling about this for a day or
>> so and can't quite find anything definitive. I may try diving into the
>> Solr source code, but I'm a C# guy, not Java, and I don't have a
>> debuggable environment set up as I haven't needed one yet, so that
>> could prove pretty painful.
>>
>>
>>
>> Any help would be appreciated, even if it is just "can't be done", as
>> at least I could stop chasing my tail.
>>
>>
>>
>> Mike
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
