From: Yuval Kesten <ykesten@yahoo-inc.com>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Date: Wed, 22 Feb 2012 17:28:42 +0000
Subject: RE: Custom lucene scoring - Dot product between field boost and
 query boost

Hi all,
Inspired by another thread here (Question about CustomScoreQuery), I am now using the solution below, which works really well (with one drawback).
I discovered that some of my problems came from a wrong assumption:
I did have many fields/query terms with the same field ID.
This ruined my approach because the query boost was aggregated and my calculations were wrong.

What I did: during indexing I appended the field value to the field ID (joined with '_') and used the desired score as the field value.

At search time I use a plain FieldScoreQuery (as-is, no modifications needed) with the composite field ID.
Here I can still use setBoost to set the score because now my fields are unique.

Logic-wise this is perfect - dot product using Lucene.
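
To make the arithmetic concrete, here is a minimal plain-Java sketch of the composite-field idea, using the example weights from the original question quoted below. It does not call Lucene at all: the class name CompositeFieldDotProduct and the helpers compositeKey and score are hypothetical names made up for illustration, and the maps only stand in for what the index and the boosted query terms would hold.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of the composite-field scheme; no Lucene classes are used. */
class CompositeFieldDotProduct {

    /** Builds the composite field name, joining field ID and value with '_'. */
    static String compositeKey(String fieldId, String fieldValue) {
        return fieldId + "_" + fieldValue;
    }

    /** Sums queryWeight * docWeight over the composite keys present on both sides. */
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> term : query.entrySet()) {
            Double docWeight = doc.get(term.getKey());
            if (docWeight != null) {
                sum += term.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Query and documents taken from the example in the original question.
        Map<String, Double> query = new LinkedHashMap<>();
        query.put(compositeKey("1", "AA"), 0.1);
        query.put(compositeKey("7", "BB"), 0.2);
        query.put(compositeKey("8", "CC"), 0.3);

        Map<String, Double> doc1 = new LinkedHashMap<>();
        doc1.put(compositeKey("1", "AA"), 0.2);
        doc1.put(compositeKey("2", "DD"), 0.8);
        doc1.put(compositeKey("7", "CC"), 0.999); // same field ID as the query, different value
        doc1.put(compositeKey("10", "FFF"), 0.1);

        Map<String, Double> doc2 = new LinkedHashMap<>();
        doc2.put(compositeKey("7", "BB"), 0.3);
        doc2.put(compositeKey("8", "CC"), 0.5);

        System.out.println(String.format(Locale.ROOT, "d1: %.3f", score(query, doc1)));
        System.out.println(String.format(Locale.ROOT, "d2: %.3f", score(query, doc2)));
    }
}
```

Because the value is part of the key, field 7 in document 1 (value CC) does not match the query term for field 7 (value BB), so only field 1 contributes to that document's score.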

Drawback - lots and lots of distinct field names, which affects memory usage dramatically.

If anyone has better ideas - please share!

-----Original Message-----
From: Alan Woodward [mailto:alan.woodward@romseysoftware.co.uk]
Sent: Wednesday, February 22, 2012 4:00 PM
To: java-user@lucene.apache.org
Subject: Re: Custom lucene scoring - Dot product between field boost and query boost

Hi Yuval,

You can just override Similarity, rather than DefaultSimilarity - that way you don't burn any CPU cycles on TF/IDF calculations.

Alan

On 22 Feb 2012, at 07:17, Yuval Kesten wrote:

> Hi Em,
> 1. Regarding performance - the Similarity class (and my subclass as well) gets the IDF, TF and sum-of-squared-weights calculations as inputs - it just factors them differently. Even though I ignore the values, they are still being computed.
> 2. I have written this code:
>    static {
>        Similarity.setDefault(new MySimilarity());
>    }
> Which means that I am setting the default similarity before doing the indexing and obviously before the searching.
> Thanks!
>
> -----Original Message-----
> From: Em [mailto:mailformailinglists@yahoo.de]
> Sent: Tuesday, February 21, 2012 6:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: Custom lucene scoring - Dot product between field boost and query boost
>
> Hi Yuval,
>
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for
>> nothing...
> You aren't calculating that much, since you declared all those values as constants. What are you worried about?
>
>> 2. The score I get from the TopScoreDocCollector is not the same as I
>> get from the Explanation.
>> Here is part of my code:
> Could you provide us the code where you are setting the Similarity, please?
>
> Kind regards,
> Em
>
> On 21.02.2012 16:18, Yuval Kesten wrote:
>> Hi,
>> I want to use Lucene with the following scoring logic:
>> When I index my documents I want to set for each field a score/weight.
>> When I query my index I want to set for each query term a score/weight.
>>
>> I will NEVER index or query with many instances of the same field - in each query (document) there will be 0-1 instances with the same field name.
>> My fields/query terms are not analyzed - they already consist of a single token.
>>
>> I want the score to be simply the dot product between the fields of the query and the fields of the document, counting a field only if it has the same value on both sides.
>>
>> For example:
>> Query:
>> Field Name | Field Value | Field Score
>> 1          | AA          | 0.1
>> 7          | BB          | 0.2
>> 8          | CC          | 0.3
>>
>> Document 1:
>> Field Name | Field Value | Field Score
>> 1          | AA          | 0.2
>> 2          | DD          | 0.8
>> 7          | CC          | 0.999
>> 10         | FFF         | 0.1
>>
>> Document 2:
>> Field Name | Field Value | Field Score
>> 7          | BB          | 0.3
>> 8          | CC          | 0.5
>>
>> The scores should be:
>> Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
>> Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q * FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21
>>
>> What would be the best way to implement it, in terms of accuracy and performance (I don't need TF and IDF calculations)?
>>
>> I currently implement it by setting boosts on the fields and query terms.
>> Then I overrode the DefaultSimilarity class:
>>
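>> import org.apache.lucene.index.FieldInvertState;
>> import org.apache.lucene.search.DefaultSimilarity;
>>
>> public class MySimilarity extends DefaultSimilarity {
>>
>>    // Use only the index-time boost as the norm (no length normalization).
>>    @Override
>>    public float computeNorm(String field, FieldInvertState state) {
>>        return state.getBoost();
>>    }
>>
>>    // Neutralize query normalization.
>>    @Override
>>    public float queryNorm(float sumOfSquaredWeights) {
>>        return 1;
>>    }
>>
>>    // Ignore term frequency.
>>    @Override
>>    public float tf(float freq) {
>>        return 1;
>>    }
>>
>>    // Ignore inverse document frequency.
>>    @Override
>>    public float idf(int docFreq, int numDocs) {
>>        return 1;
>>    }
>>
>>    // Ignore the coordination factor.
>>    @Override
>>    public float coord(int overlap, int maxOverlap) {
>>        return 1;
>>    }
>>
>> }
>>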
>> And based on http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html this should work.
>> Problems:
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for nothing...
>> 2. The score I get from the TopScoreDocCollector is not the same as I get from the Explanation.
>> Here is part of my code:
>>
>> indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
>> TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
>> indexSearcher.search(query, collector);
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> for (int i = 0; i < hits.length; ++i) {
>>     int docId = hits[i].doc;
>>     Document d = indexSearcher.doc(docId);
>>     double score = hits[i].score;
>>     String id = d.get(FIELD_ID);
>>     Explanation explanation = indexSearcher.explain(query, docId);
>> }
>>
>> Thanks!
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org