Date: Wed, 30 Jan 2013 17:06:43 -0200
Message-ID: 
 <CAFx0R5etsew74KxBd3BBUE-Xd=s-sfXCXm0gzdKetQ9nip04vg@mail.gmail.com>
Subject: Re: Possible issue in edismax?
From: Felipe Lahti <flahti@thoughtworks.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=bcaec51718fd0823c304d4863669

--bcaec51718fd0823c304d4863669
Content-Type: text/plain; charset=ISO-8859-1

If you compare the score explanations for the first and last documents, you
will see that the last one matches more fields than the first one. So you may
be wondering why it still ranks lower. The first doc matches only the
"contributions" field, while the last matches a bunch of fields, but boosts
are not the only factor: idf and fieldNorm are multiplied in as well. So if
you want the ranking to behave more like your boosts (<str
name="qf">series_title^500 title^100 description^15 contribution</str>) you
have to override the corresponding methods of DefaultSimilarity.
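
To make the comparison concrete, here is a rough Python sketch that recomputes
both scores from the numbers in the explain output quoted below. The helper
names (clause_score, dismax, doc_score) are mine, purely for illustration, and
are not Solr or Lucene APIs. The "flat" variant forces idf and fieldNorm to 1.0
while keeping queryNorm fixed, which is a simplification; it only approximates
what a custom similarity would do. Note how much of the first score comes from
fieldNorm = 40960.0, compared with 14.0 or 1.0 on the other fields.

```python
# Constants copied from the debugQuery explain output in this thread.
QUERY_NORM = 3.093982e-4
TIE = 0.01


def clause_score(boost, idf, tf, field_norm):
    """queryWeight * fieldWeight, as broken down in the explain output."""
    query_weight = boost * idf * QUERY_NORM
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def dismax(scores):
    """'max plus 0.01 times others', the DisjunctionMaxQuery combination."""
    best = max(scores)
    return best + TIE * (sum(scores) - best)


# Each clause is (boost, idf, tf, fieldNorm), taken from the explain output.
top_clauses = [
    (1.0, 14.530705, 1.0, 40960.0),   # contributions:news (doc 63298)
]
expected_clauses = [
    (25.0, 5.5240083, 1.0, 14.0),     # description:news^25.0
    (50.0, 5.2252855, 1.0, 14.0),     # pg_series_title:news^50.0
    (1.0, 6.5669904, 1.0, 14.0),      # p_programme_title:news
    (500.0, 6.4641423, 1.0, 1.0),     # pg_series_title_ci:news^500.0
    (100.0, 7.2153096, 1.0, 1.0),     # title_ci:news^100.0
]


def doc_score(clauses, flat=False):
    """Score a document; flat=True assumes idf and fieldNorm are both 1.0."""
    scores = []
    for boost, idf, tf, norm in clauses:
        if flat:
            idf, norm = 1.0, 1.0  # hypothetical: similarity ignores idf and norms
        scores.append(clause_score(boost, idf, tf, norm))
    return dismax(scores)


top_hit = doc_score(top_clauses)             # roughly the 2675.78 reported
expected_hit = doc_score(expected_clauses)   # roughly the 6.57 reported
flat_top = doc_score(top_clauses, flat=True)
flat_expected = doc_score(expected_clauses, flat=True)

print("as indexed:", top_hit, "vs", expected_hit)
print("idf and fieldNorm flattened:", flat_top, "vs", flat_expected)
```

With idf and fieldNorm flattened, only the boosts are left, so the document
matching series title comes out on top, which is the behaviour the qf boosts
suggest.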


On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry <sanmestry@gmail.com> wrote:

> I have pasted it below. It varies slightly from the dismax configuration I
> mentioned above, as I was playing with all sorts of boost values, but it
> looks more like the following:
>
> <str name="c208c2ca-4270-27b8-e040-a8c00409063a">
> 2675.7844 = (MATCH) sum of:
>   2675.7844 = (MATCH) max plus 0.01 times others of:
>     2675.7844 = (MATCH) weight(contributions:news in 63298) [DefaultSimilarity], result of:
>       2675.7844 = score(doc=63298,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         595177.7 = fieldWeight in 63298, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           40960.0 = fieldNorm(doc=63298)
> </str>
> <str name="c208c2a9-66bc-27b8-e040-a8c00409063a">
> 2317.297 = (MATCH) sum of:
>   2317.297 = (MATCH) max plus 0.01 times others of:
>     2317.297 = (MATCH) weight(contributions:news in 9826415) [DefaultSimilarity], result of:
>       2317.297 = score(doc=9826415,freq=3.0 = termFreq=3.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         515439.0 = fieldWeight in 9826415, product of:
>           1.7320508 = tf(freq=3.0), with freq of:
>             3.0 = termFreq=3.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           20480.0 = fieldNorm(doc=9826415)
> </str>
> <str name="c208c2aa-1806-27b8-e040-a8c00409063a">
> 2140.6274 = (MATCH) sum of:
>   2140.6274 = (MATCH) max plus 0.01 times others of:
>     2140.6274 = (MATCH) weight(contributions:news in 9882325) [DefaultSimilarity], result of:
>       2140.6274 = score(doc=9882325,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         476142.16 = fieldWeight in 9882325, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           32768.0 = fieldNorm(doc=9882325)
> </str>
> <str name="c208c2b0-5165-27b8-e040-a8c00409063a">
> 1605.4707 = (MATCH) sum of:
>   1605.4707 = (MATCH) max plus 0.01 times others of:
>     1605.4707 = (MATCH) weight(contributions:news in 220007) [DefaultSimilarity], result of:
>       1605.4707 = score(doc=220007,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         357106.62 = fieldWeight in 220007, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           24576.0 = fieldNorm(doc=220007)
> </str>
> <str name="c208c2cc-d01b-27b8-e040-a8c00409063a">
> 1605.4707 = (MATCH) sum of:
>   1605.4707 = (MATCH) max plus 0.01 times others of:
>     1605.4707 = (MATCH) weight(contributions:news in 241151) [DefaultSimilarity], result of:
>       1605.4707 = score(doc=241151,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         357106.62 = fieldWeight in 241151, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           24576.0 = fieldNorm(doc=241151)
> </str>
> </lst>
> <str name="otherQuery">id:c208c2b4-1b3e-27b8-e040-a8c00409063a</str>
> <lst name="explainOther">
> <str name="*c208c2b4-1b3e-27b8-e040-a8c00409063a*"> <!-- this should rank
> higher -->
> 6.5742764 = (MATCH) sum of:
>   6.5742764 = (MATCH) max plus 0.01 times others of:
>     3.304414 = (MATCH) weight(description:news^25.0 in 967895) [DefaultSimilarity], result of:
>       3.304414 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.042727955 = queryWeight, product of:
>           25.0 = boost
>           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         77.33611 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     5.913381 = (MATCH) weight(pg_series_title:news^50.0 in 967895) [DefaultSimilarity], result of:
>       5.913381 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.080834694 = queryWeight, product of:
>           50.0 = boost
>           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         73.154 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     0.18680073 = (MATCH) weight(p_programme_title:news in 967895) [DefaultSimilarity], result of:
>       0.18680073 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.002031815 = queryWeight, product of:
>           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         91.93787 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     6.464123 = (MATCH) weight(pg_series_title_ci:news^500.0 in 967895) [DefaultSimilarity], result of:
>       6.464123 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.99999696 = queryWeight, product of:
>           500.0 = boost
>           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         6.4641423 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
>           1.0 = fieldNorm(doc=967895)
>     1.6107484 = (MATCH) weight(title_ci:news^100.0 in 967895) [DefaultSimilarity], result of:
>       1.6107484 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.22324038 = queryWeight, product of:
>           100.0 = boost
>           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         7.2153096 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
>           1.0 = fieldNorm(doc=967895)
> </str>
>
>
> On 30 January 2013 17:55, Felipe Lahti <flahti@thoughtworks.com> wrote:
>
> > Let me see if I understood your problem:
> >
> > From your first e-mail, I think you are worried about the order in which
> > Solr returns documents. Is that correct? If so, as I said before, it's not
> > only the boosting that influences the order of returned documents. There
> > are also term frequency, IDF (inverse document frequency)... If I
> > understood your first e-mail correctly, you are interested in getting rid
> > of IDF. For that, you can create a NoIDFSimilarity class to override the
> > default similarity.
> >
> > Can you paste the score calculation for one document here?
> >
> >
> > On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry <sanmestry@gmail.com> wrote:
> >
> >> (Sorry for the incomplete reply in my previous mail, I didn't know Ctrl F
> >> sends an email in Gmail.. ;-))
> >>
> >> Thanks Felipe, yes I have seen that, and my requirement falls under:
> >>
> >> How can I make exact-case matches score higher
> >>
> >> Example: a query of "Penguin" should score documents containing "Penguin"
> >> higher than docs containing "penguin".
> >>
> >> The general strategy is to index the content twice, using different fields
> >> with different fieldTypes (and different analyzers associated with those
> >> fieldTypes). One analyzer will contain a lowercase filter for
> >> case-insensitive matches, and one will preserve case for exact-case
> >> matches.
> >>
> >> Use copyField <http://wiki.apache.org/solr/SchemaXml#copyField> commands
> >> in the schema to index a single input field multiple times.
> >>
> >> Once the content is indexed into multiple fields that are analyzed
> >> differently, query across both fields
> >> <http://wiki.apache.org/solr/SolrRelevancyFAQ#multiFieldQuery>.
> >>
> >> I have added a case-insensitive field too, so that exact matches score
> >> higher; however, the results are not even reflecting the matches in the
> >> field, never mind the exact-matching part.
> >>
> >> I have also tried the debugQuery option, as mentioned in my previous mail,
> >> and I have posted the parsed queries. From the debug output, I see that
> >> the field with the lower boost factor (contribution) still scores higher
> >> than the one with the higher boost factor (series_title).
> >>
> >>
> >> Thanks,
> >>
> >> Sandeep
> >>
> >>
> >>
> >>
> >> On 30 January 2013 16:02, Sandeep Mestry <sanmestry@gmail.com> wrote:
> >>
> >> > Thanks Felipe, yes I have seen that and my requirement somewhere falls for
> >> >
> >> >
> >> > On 30 January 2013 15:53, Felipe Lahti <flahti@thoughtworks.com> wrote:
> >> >
> >> >> Hi Sandeep,
> >> >>
> >> >> The quick answer is that the boost you define in your requestHandler is
> >> >> not the only thing used to calculate each document's score. Other
> >> >> factors contribute to the score calculation as well. You can read about
> >> >> them at http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, with
> >> >> debugQuery=true you can see the score calculation for each document
> >> >> returned.
> >> >>
> >> >> Let me know if you need anything else.
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry <sanmestry@gmail.com> wrote:
> >> >>
> >> >> > Hi All,
> >> >> >
> >> >> > I'm facing an issue with relevancy calculation in the dismax query
> >> >> > parser. The boost factor does not work as expected in certain cases
> >> >> > when the keyword is generic. By generic I mean that the keyword appears
> >> >> > many times in the document as well as in the index.
> >> >> >
> >> >> > I have parser configuration as below:
> >> >> >
> >> >> > <requestHandler name="querydismax" class="solr.SearchHandler" >
> >> >> >         <lst name="defaults">
> >> >> >             <str name="defType">edismax</str>
> >> >> >             <str name="echoParams">explicit</str>
> >> >> >             <float name="tie">0.01</float>
> >> >> >             <str name="qf">series_title^500 title^100 description^15 contribution</str>
> >> >> >             <str name="pf">series_title^200</str>
> >> >> >             <int name="ps">0</int>
> >> >> >             <str name="q.alt">*:*</str>
> >> >> >         </lst>
> >> >> > </requestHandler>
> >> >> >
> >> >> > As you can see above, I'd expect documents with matches in series
> >> >> > title to rank higher than those with matches in contribution.
> >> >> >
> >> >> > This works well if I type in a query like 'wonderworld', which is a
> >> >> > rarely occurring term: the series titles rank higher. But if I type in
> >> >> > a keyword like 'news', which is the most common term in the index, I
> >> >> > get hits in contributions, even though I have lots of documents with
> >> >> > the word news in the series title.
> >> >> >
> >> >> > The field definition is as below:
> >> >> >
> >> >> > <field name="series_title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> >> >> > <field name="title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> >> >> > <field name="description" type="text_wc" indexed="true" stored="true" multiValued="false" />
> >> >> > <field name="contribution" type="text" indexed="true" stored="true" multiValued="true" />
> >> >> >
> >> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100" compressThreshold="10">
> >> >> >             <analyzer type="index">
> >> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >> >                         generateWordParts="1" generateNumberParts="1"
> >> >> >                         catenateWords="1" catenateNumbers="1"
> >> >> >                         catenateAll="0" splitOnCaseChange="1"/>
> >> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >> >             </analyzer>
> >> >> >             <analyzer type="query">
> >> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >> >                         generateWordParts="1" generateNumberParts="1"
> >> >> >                         catenateWords="0" catenateNumbers="0"
> >> >> >                         catenateAll="0" splitOnCaseChange="1"/>
> >> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >> >             </analyzer>
> >> >> >         </fieldType>
> >> >> >
> >> >> > <fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100" >
> >> >> >             <analyzer type="index">
> >> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >> >                         stemEnglishPossessive="0" generateWordParts="1"
> >> >> >                         generateNumberParts="1" catenateWords="1"
> >> >> >                         catenateNumbers="1" catenateAll="1"
> >> >> >                         splitOnCaseChange="1" splitOnNumerics="0"
> >> >> >                         preserveOriginal="1" />
> >> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >> >             </analyzer>
> >> >> >             <analyzer type="query">
> >> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >> >                         stemEnglishPossessive="0" generateWordParts="1"
> >> >> >                         generateNumberParts="1" catenateWords="1"
> >> >> >                         catenateNumbers="1" catenateAll="1"
> >> >> >                         splitOnCaseChange="1" splitOnNumerics="0"
> >> >> >                         preserveOriginal="1" />
> >> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >> >             </analyzer>
> >> >> >  </fieldType>
> >> >> >
> >> >> > I have tried debugging, and when I use the query term news, I see that
> >> >> > matches for contributions are ranked higher than series title. The
> >> >> > parsed queries look like the ones below.
> >> >> > (Note that I have edited the query: in reality I have a lot of
> >> >> > searchable fields, and I have only mentioned the fields containing text
> >> >> > data. The rest all contain uuids.)
> >> >> >
> >> >> > <str name="parsedquery">
> >> >> > (+DisjunctionMaxQuery((description:news^15.0 | title:news^100.0 |
> >> >> > contributions:news | series_title:news^500.0)~0.01) () () () () ()
> ()
> >> >> () ()
> >> >> > () () () () () () () () () () () () () () () () () () ()
> ())/no_coord
> >> >> > </str>
> >> >> > <str name="parsedquery_toString">
> >> >> > +(description:news^15 | title:news^100.0 | contributions:news |
> >> >> > series_title:news^500.0)~0.01 () () () () () () () () () () () ()
> ()
> >> ()
> >> >> ()
> >> >> > () () () () () () () () () () () () ()
> >> >> >
> >> >> >
> >> >> > Could you guide me in right direction please?
> >> >> >
> >> >> > Many Thanks,
> >> >> > Sandeep
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Felipe Lahti
> >> >> Consultant Developer - ThoughtWorks Porto Alegre
> >> >>
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Felipe Lahti
> > Consultant Developer - ThoughtWorks Porto Alegre
> >
>


-- 
Felipe Lahti
Consultant Developer - ThoughtWorks Porto Alegre

--bcaec51718fd0823c304d4863669--