From: Yuval Kesten
To: "java-user@lucene.apache.org"
Date: Wed, 22 Feb 2012 17:28:42 +0000
Subject: RE: Custom lucene scoring - Dot product between field boost and query boost

Hi all,

Inspired by another thread here (Question about CustomScoreQuery), I am now using the following solution, which works really well (with one drawback).

I discovered that some of my problems came from a wrong assumption: I did in fact have many fields/query terms with the same field ID.
This ruined my approach, because the query boosts were aggregated and my calculations came out wrong.

What I did: during indexing I appended the field value to the field ID (concatenated with '_') and used the desired score as the field value.
At search time I use a plain FieldScoreQuery (as-is, no modifications needed) on the composite field ID.
I can still use setBoost to set the query-side score, because my fields are now unique.
Logic-wise this is perfect - a dot product using Lucene.
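
To make the arithmetic concrete, here is a minimal plain-Java sketch of the composite-key idea described above. It does not call Lucene at all: the class name, the maps and the helper methods are made up for illustration, and the figures are the example query and documents from my original question quoted further down.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of the composite-key dot product; hypothetical names, no Lucene calls. */
public class CompositeKeyDotProduct {

    /** Joins field id and field value with '_', e.g. (7, "BB") becomes "7_BB". */
    static String compositeKey(int fieldId, String fieldValue) {
        return fieldId + "_" + fieldValue;
    }

    /** Sums queryWeight * docWeight over the composite keys present on both sides. */
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> term : query.entrySet()) {
            Double docWeight = doc.get(term.getKey());
            if (docWeight != null) {
                sum += term.getValue() * docWeight;
            }
        }
        return sum;
    }

    /** Example query from the original question. */
    static Map<String, Double> exampleQuery() {
        Map<String, Double> q = new HashMap<>();
        q.put(compositeKey(1, "AA"), 0.1);
        q.put(compositeKey(7, "BB"), 0.2);
        q.put(compositeKey(8, "CC"), 0.3);
        return q;
    }

    /** Document 1 from the original question. */
    static Map<String, Double> exampleDoc1() {
        Map<String, Double> d = new HashMap<>();
        d.put(compositeKey(1, "AA"), 0.2);
        d.put(compositeKey(2, "DD"), 0.8);
        d.put(compositeKey(7, "CC"), 0.999); // same field id as the query, different value: no match
        d.put(compositeKey(10, "FFF"), 0.1);
        return d;
    }

    /** Document 2 from the original question. */
    static Map<String, Double> exampleDoc2() {
        Map<String, Double> d = new HashMap<>();
        d.put(compositeKey(7, "BB"), 0.3);
        d.put(compositeKey(8, "CC"), 0.5);
        return d;
    }

    public static void main(String[] args) {
        // Prints 0.02 and 0.21, the two scores expected in the original question.
        System.out.println(String.format(Locale.ROOT, "%.2f", score(exampleQuery(), exampleDoc1())));
        System.out.println(String.format(Locale.ROOT, "%.2f", score(exampleQuery(), exampleDoc2())));
    }
}
```

Because the value is part of the key, a field with the same ID but a different value (field 7 in Document 1) simply does not match, which is the behaviour I was after.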

Drawback: lots and lots of distinct fields, which affects memory usage dramatically.
If anyone has better ideas - please share!

-----Original Message-----
From: Alan Woodward [mailto:alan.woodward@romseysoftware.co.uk]
Sent: Wednesday, February 22, 2012 4:00 PM
To: java-user@lucene.apache.org
Subject: Re: Custom lucene scoring - Dot product between field boost and query boost

Hi Yuval,

You can just override Similarity, rather than DefaultSimilarity - that way you don't burn any CPU cycles on TF/IDF calculations.

Alan

On 22 Feb 2012, at 07:17, Yuval Kesten wrote:

> Hi Em,
> 1. Regarding performance - the Similarity class (and my subclass as well) receives the IDF, TF and sum-of-squared-weights calculations as inputs - it just factors them differently. Even though I ignore the values, they are still being computed.
> 2. I have written this code:
> static {
>     Similarity.setDefault(new MySimilarity());
> }
> which means that I am setting the default similarity before doing the indexing, and obviously before the searching.
> Thanks!
>
> -----Original Message-----
> From: Em [mailto:mailformailinglists@yahoo.de]
> Sent: Tuesday, February 21, 2012 6:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: Custom lucene scoring - Dot product between field boost and query boost
>
> Hi Yuval,
>
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for nothing...
> You aren't calculating that much, since you declared all those values as constants. What are you worried about?
>
>> 2. The score I get from the TopScoreDocCollector is not the same as I get from the Explanation.
>> Here is part of my code:
> Could you provide us the code where you are setting the Similarity, please?
>
> Kind regards,
> Em
>
> On 21.02.2012 16:18, Yuval Kesten wrote:
>> Hi,
>> I want to use Lucene with the following scoring logic:
>> When I index my documents I want to set for each field a score/weight.
>> When I query my index I want to set for each query term a score/weight.
>>
>> I will NEVER index or query with many instances of the same field - in each query (document) there will be 0-1 instances with the same field name.
>> My fields/query terms are not analyzed - they already consist of a single token.
>>
>> I want the score to be simply the dot product between the fields of the query and the fields of the document, counting a field only if it has the same value on both sides.
>>
>> For example:
>> Query:
>> Field Name   Field Value   Field Score
>> 1            AA            0.1
>> 7            BB            0.2
>> 8            CC            0.3
>>
>> Document 1:
>> Field Name   Field Value   Field Score
>> 1            AA            0.2
>> 2            DD            0.8
>> 7            CC            0.999
>> 10           FFF           0.1
>>
>> Document 2:
>> Field Name   Field Value   Field Score
>> 7            BB            0.3
>> 8            CC            0.5
>>
>> The scores should be:
>> Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
>> Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q * FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21
>>
>> What would be the best way to implement it, in terms of accuracy and performance (I don't need TF and IDF calculations)?
>>
>> I currently implemented it by setting boosts on the fields and query terms.
>> Then I overrode the DefaultSimilarity class:
>>
>> import org.apache.lucene.index.FieldInvertState;
>> import org.apache.lucene.search.DefaultSimilarity;
>>
>> public class MySimilarity extends DefaultSimilarity {
>>
>>     // Use the index-time boost as the norm, ignoring field length.
>>     @Override
>>     public float computeNorm(String field, FieldInvertState state) {
>>         return state.getBoost();
>>     }
>>
>>     // Neutralize every other scoring factor by returning 1.
>>     @Override
>>     public float queryNorm(float sumOfSquaredWeights) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float tf(float freq) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float idf(int docFreq, int numDocs) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float coord(int overlap, int maxOverlap) {
>>         return 1;
>>     }
>>
>> }
>>
>> And based on http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html this should work.
>> Problems:
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for nothing...
>> 2. The score I get from the TopScoreDocCollector is not the same as I get from the Explanation.
>> Here is part of my code:
>>
>> indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
>> TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
>> indexSearcher.search(query, collector);
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> for (int i = 0; i < hits.length; ++i) {
>>     int docId = hits[i].doc;
>>     Document d = indexSearcher.doc(docId);
>>     double score = hits[i].score;
>>     String id = d.get(FIELD_ID);
>>     Explanation explanation = indexSearcher.explain(query, docId);
>> }
>>
>> Thanks!
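
For reference, the reasoning behind the quoted MySimilarity overrides can be written out as follows. This is only a sketch, assuming the conceptual scoring formula given in the Lucene 3.5 documentation linked in the quoted message; the symbols b_q and b_d for the query-term boost and the index-time field boost are my own shorthand.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot b_q(t)\cdot\mathrm{norm}(t,d)
\]
% With coord = queryNorm = tf = idf = 1 and norm(t,d) = b_d(t), summing over matching terms only:
\[
\mathrm{score}(q,d) \;=\; \sum_{t \in q \cap d} b_q(t)\, b_d(t)
\]
```

One caveat, as far as I understand the 3.x Similarity documentation: norms are encoded into a single byte at index time, so a boost such as 0.2 may not be read back exactly, and the result is then only approximately the dot product.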

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org