Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
Date: Wed, 30 Jan 2013 17:06:43 -0200
Subject: Re: Possible issue in edismax?
From: Felipe Lahti
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=bcaec51718fd0823c304d4863669

--bcaec51718fd0823c304d4863669
Content-Type: text/plain; charset=ISO-8859-1

If you compare the scores of the first and last documents, you will see
that the last one matches more fields than the first one. You may be
wondering why. The first document matches only the "contributions" field,
while the last one matches a bunch of fields. So if you want the ranking to
behave more like your boosts (series_title^500 title^100 description^15
contribution), you have to override the relevant method of
DefaultSimilarity.

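To make the comparison concrete, here is a small sketch that redoes the
arithmetic from the debugQuery output quoted below. It is plain Python, not
Solr or Lucene code; the function names term_score and dismax are made up for
illustration, and every number is copied from the quoted explain output.

```python
# Recompute two scores from the quoted debugQuery output.
# term_score and dismax are hypothetical helper names, not Lucene APIs.

def term_score(idf, query_norm, field_norm, tf=1.0, boost=1.0):
    """Per-term score as laid out in the explain tree: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm   # "queryWeight, product of:"
    field_weight = tf * idf * field_norm      # "fieldWeight in <doc>, product of:"
    return query_weight * field_weight

def dismax(scores, tie):
    """'max plus <tie> times others of': best clause plus tie * the rest."""
    best = max(scores)
    return best + tie * (sum(scores) - best)

QUERY_NORM = 3.093982e-4  # same queryNorm in every clause of the output

# First document (doc=63298): matches only contributions:news, no query boost.
top = term_score(idf=14.530705, query_norm=QUERY_NORM, field_norm=40960.0)

# Last document (doc=967895): matches five fields.
last_clauses = [
    term_score(5.5240083, QUERY_NORM, 14.0, boost=25.0),   # description
    term_score(5.2252855, QUERY_NORM, 14.0, boost=50.0),   # pg_series_title
    term_score(6.5669904, QUERY_NORM, 14.0),               # p_programme_title
    term_score(6.4641423, QUERY_NORM, 1.0, boost=500.0),   # pg_series_title_ci
    term_score(7.2153096, QUERY_NORM, 1.0, boost=100.0),   # title_ci
]
last = dismax(last_clauses, tie=0.01)

print(round(top, 2), round(last, 3))
```

Up to rounding, these come out at the 2675.7844 and 6.5742764 shown in the
output. In these numbers the biggest single difference between the two
documents is fieldNorm (40960.0 for contributions against 1.0 and 14.0 for
the other fields), followed by the idf of the rare contributions match.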
On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry wrote:

> I have pasted it below. It varies slightly from the dismax configuration
> I mentioned above, as I was playing with all sorts of boost values;
> however, it looks more like below:
>
>
> 2675.7844 = (MATCH) sum of:
>   2675.7844 = (MATCH) max plus 0.01 times others of:
>     2675.7844 = (MATCH) weight(contributions:news in 63298) [DefaultSimilarity], result of:
>       2675.7844 = score(doc=63298,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         595177.7 = fieldWeight in 63298, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           40960.0 = fieldNorm(doc=63298)
>
>
> 2317.297 = (MATCH) sum of:
>   2317.297 = (MATCH) max plus 0.01 times others of:
>     2317.297 = (MATCH) weight(contributions:news in 9826415) [DefaultSimilarity], result of:
>       2317.297 = score(doc=9826415,freq=3.0 = termFreq=3.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         515439.0 = fieldWeight in 9826415, product of:
>           1.7320508 = tf(freq=3.0), with freq of:
>             3.0 = termFreq=3.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           20480.0 = fieldNorm(doc=9826415)
>
>
> 2140.6274 = (MATCH) sum of:
>   2140.6274 = (MATCH) max plus 0.01 times others of:
>     2140.6274 = (MATCH) weight(contributions:news in 9882325) [DefaultSimilarity], result of:
>       2140.6274 = score(doc=9882325,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         476142.16 = fieldWeight in 9882325, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           32768.0 = fieldNorm(doc=9882325)
>
>
> 1605.4707 = (MATCH) sum of:
>   1605.4707 = (MATCH) max plus 0.01 times others of:
>     1605.4707 = (MATCH) weight(contributions:news in 220007) [DefaultSimilarity], result of:
>       1605.4707 = score(doc=220007,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         357106.62 = fieldWeight in 220007, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           24576.0 = fieldNorm(doc=220007)
>
>
> 1605.4707 = (MATCH) sum of:
>   1605.4707 = (MATCH) max plus 0.01 times others of:
>     1605.4707 = (MATCH) weight(contributions:news in 241151) [DefaultSimilarity], result of:
>       1605.4707 = score(doc=241151,freq=1.0 = termFreq=1.0 ), product of:
>         0.004495774 = queryWeight, product of:
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         357106.62 = fieldWeight in 241151, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           14.530705 = idf(docFreq=14, maxDocs=11282414)
>           24576.0 = fieldNorm(doc=241151)
>
>
> id:c208c2b4-1b3e-27b8-e040-a8c00409063a
>
>
> 6.5742764 = (MATCH) sum of:
>   6.5742764 = (MATCH) max plus 0.01 times others of:
>     3.304414 = (MATCH) weight(description:news^25.0 in 967895) [DefaultSimilarity], result of:
>       3.304414 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.042727955 = queryWeight, product of:
>           25.0 = boost
>           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         77.33611 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     5.913381 = (MATCH) weight(pg_series_title:news^50.0 in 967895) [DefaultSimilarity], result of:
>       5.913381 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.080834694 = queryWeight, product of:
>           50.0 = boost
>           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         73.154 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     0.18680073 = (MATCH) weight(p_programme_title:news in 967895) [DefaultSimilarity], result of:
>       0.18680073 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.002031815 = queryWeight, product of:
>           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         91.93787 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
>           14.0 = fieldNorm(doc=967895)
>     6.464123 = (MATCH) weight(pg_series_title_ci:news^500.0 in 967895) [DefaultSimilarity], result of:
>       6.464123 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.99999696 = queryWeight, product of:
>           500.0 = boost
>           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         6.4641423 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
>           1.0 = fieldNorm(doc=967895)
>     1.6107484 = (MATCH) weight(title_ci:news^100.0 in 967895) [DefaultSimilarity], result of:
>       1.6107484 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
>         0.22324038 = queryWeight, product of:
>           100.0 = boost
>           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
>           3.093982E-4 = queryNorm
>         7.2153096 = fieldWeight in 967895, product of:
>           1.0 = tf(freq=1.0), with freq of:
>             1.0 = termFreq=1.0
>           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
>           1.0 = fieldNorm(doc=967895)
>
>
>
> On 30 January 2013 17:55, Felipe Lahti wrote:
>
> > Let me see if I understood your problem:
> >
> > From your first e-mail I think you are worried about the order of the
> > documents returned by Solr. Is that correct? If so, as I said before,
> > it is not only the boosting that influences the order of the returned
> > documents. There is also term frequency, IDF (inverse document
> > frequency)... If I understood your first e-mail correctly, you are
> > interested in getting rid of IDF. For that, you can create a
> > NoIDFSimilarity class to override the default similarity.
> >
> > Can you paste here the score calculation for one document?
> >
> >
> > On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry wrote:
> >
> >> (Sorry for the incomplete reply in my previous mail, I didn't know
> >> Ctrl F sends an email in Gmail.. ;-))
> >>
> >> Thanks Felipe, yes I have seen that and my requirement falls under
> >>
> >> How can I make exact-case matches score higher
> >>
> >> Example: a query of "Penguin" should score documents containing
> >> "Penguin" higher than docs containing "penguin".
> >>
> >> The general strategy is to index the content twice, using different
> >> fields with different fieldTypes (and different analyzers associated
> >> with those fieldTypes). One analyzer will contain a lowercase filter
> >> for case-insensitive matches, and one will preserve case for
> >> exact-case matches.
> >>
> >> Use copyField commands in the schema to index a single input field
> >> multiple times.
> >>
> >> Once the content is indexed into multiple fields that are analyzed
> >> differently, query across both fields.
> >>
> >> I have added a case insensitive field too, to score the exact matches
> >> higher; however, the result is not even considering the matches in
> >> that field - forget the exact matching part.
> >>
> >> And I have tried the debugQuery option as mentioned in my previous
> >> mail, and I have also posted the parsed queries. From the debug query,
> >> I see that the field boosted with the lesser factor (contribution)
> >> still scores higher than the one with the higher boost factor
> >> (series_title).
> >>
> >>
> >> Thanks,
> >>
> >> Sandeep
> >>
> >>
> >> On 30 January 2013 16:02, Sandeep Mestry wrote:
> >>
> >> > Thanks Felipe, yes I have seen that and my requirement somewhere
> >> > falls for
> >> >
> >> >
> >> > On 30 January 2013 15:53, Felipe Lahti wrote:
> >> >
> >> >> Hi Sandeep,
> >> >>
> >> >> The quick answer is that the boost you define in your
> >> >> requestHandler is not the only thing used to calculate the score
> >> >> of each document. There are other factors that contribute to the
> >> >> score calculation. You can take a look at
> >> >> http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, with
> >> >> debugQuery=true you can see the score calculation for each
> >> >> document returned.
> >> >>
> >> >> Let me know if you need something else.
> >> >>
> >> >>
> >> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry wrote:
> >> >>
> >> >> > Hi All,
> >> >> >
> >> >> > I'm facing an issue in the relevancy calculation by the dismax
> >> >> > query parser. The boost factor applied does not work as expected
> >> >> > in certain cases when the keyword is generic, and by generic I
> >> >> > mean that the keyword appears many times in the document as well
> >> >> > as in the index.
> >> >> >
> >> >> > I have the parser configuration as below:
> >> >> >
> >> >> > edismax
> >> >> > explicit
> >> >> > 0.01
> >> >> > series_title^500 title^100 description^15 contribution
> >> >> > series_title^200
> >> >> > 0
> >> >> > *:*
> >> >> >
> >> >> > As you can see above, I'd expect the documents containing
> >> >> > matches in series title to rank higher than the ones matching in
> >> >> > contribution.
> >> >> >
> >> >> > This works well if I type in a query like 'wonderworld', which
> >> >> > is a less frequent term, and the series titles rank higher. But
> >> >> > if I type in a keyword like 'news', which is the most common
> >> >> > term in the index, I get hits in contributions even though I
> >> >> > have lots of documents with the word news in the series title.
> >> >> >
> >> >> > The field definition is as below:
> >> >> >
> >> >> > stored="true" multiValued="false" />
> >> >> > multiValued="false" />
> >> >> > stored="true" multiValued="false" />
> >> >> > multiValued="true" />
> >> >> >
> >> >> > positionIncrementGap="100" compressThreshold="10">
> >> >> >
> >> >> > class="solr.WhitespaceTokenizerFactory"/>
> >> >> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> >> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >> >> >
> >> >> > class="solr.WhitespaceTokenizerFactory"/>
> >> >> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> >> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >> >> >
> >> >> > positionIncrementGap="100" >
> >> >> >
> >> >> > class="solr.WhitespaceTokenizerFactory"/>
> >> >> > stemEnglishPossessive="0" generateWordParts="1"
> >> >> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> >> > catenateAll="1" splitOnCaseChange="1" splitOnNumerics="0"
> >> >> > preserveOriginal="1" />
> >> >> >
> >> >> > class="solr.WhitespaceTokenizerFactory"/>
> >> >> > stemEnglishPossessive="0" generateWordParts="1"
> >> >> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> >> > catenateAll="1" splitOnCaseChange="1" splitOnNumerics="0"
> >> >> > preserveOriginal="1" />
> >> >> >
> >> >> > I have tried debugging, and when I use the query term news, I
> >> >> > see that matches for contributions are ranked higher than series
> >> >> > title. The parsed queries look like below:
> >> >> > (Note that I have edited the query, as in reality I have a lot
> >> >> > of searchable fields and I have only mentioned the fields
> >> >> > containing text data - the rest all contain uuids)
> >> >> >
> >> >> > (+DisjunctionMaxQuery((description:news^15.0 | title:news^100.0 |
> >> >> > contributions:news | series_title:news^500.0)~0.01) () () () ()
> >> >> > () () () () () () () () () () () () () () () () () () () () () ()
> >> >> > () ())/no_coord
> >> >> >
> >> >> > +(description:news^15 | title:news^100.0 | contributions:news |
> >> >> > series_title:news^500.0)~0.01 () () () () () () () () () () () ()
> >> >> > () () () () () () () () () () () () () () () ()
> >> >> >
> >> >> > Could you guide me in the right direction please?
> >> >> >
> >> >> > Many Thanks,
> >> >> > Sandeep
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Felipe Lahti
> >> >> Consultant Developer - ThoughtWorks Porto Alegre
> >> >>
> >> >
> >
> >
> > --
> > Felipe Lahti
> > Consultant Developer - ThoughtWorks Porto Alegre
>

--
Felipe Lahti
Consultant Developer - ThoughtWorks Porto Alegre

--bcaec51718fd0823c304d4863669--