lucene-java-user mailing list archives

From Dennis Hendriksen <dennis.hendrik...@kalooga.com>
Subject combine query score with external score
Date Thu, 28 Jan 2010 11:01:03 GMT
Hi,

I'm struggling to create a performant query in Lucene 3.0.0 in which I
want to combine 'regular' scoring with scores derived from external
sources.

For each document, a fixed set of scores is calculated in the range [0.0,
1.0). Each score represents the confidence that the document falls into a
given category. So for example document #1 has a score of 0.3 for cat=boys,
0.2 for cat=girls, 0.1 for cat=toys, 0.05 for cat=animals.

The 'regular' scoring is calculated using a BooleanQuery with TermQuerys
similar to: -type:H +(title:dna body:dna^1.5)

In the current naive approach I'm combining the scores as follows:
- at index time, for each document, store the three best categories in the
following fields:
name=cat1st value=boys fieldboost=0.3
name=cat2nd value=girls fieldboost=0.2
name=cat3rd value=toys fieldboost=0.1
- at search time, use the following query if you're interested in 'girls':
-type:H +(title:dna body:dna^1.5) cat1st:girls cat2nd:girls cat3rd:girls
or if you're interested in 'boys':
-type:H +(title:dna body:dna^1.5) cat1st:boys cat2nd:boys cat3rd:boys
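The scheme above can be sketched in plain Java. This is only an illustration of the bookkeeping, not Lucene code: the class `CategoryFields` and its methods are hypothetical names, and actually adding the boosted fields to a document is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; none of these names come from Lucene. */
final class CategoryFields {
    // Field names as used in the message, ordered by rank.
    static final String[] FIELD_NAMES = {"cat1st", "cat2nd", "cat3rd"};

    /** A field name, its category value, and the confidence used as field boost. */
    static final class BoostedField {
        final String name;
        final String value;
        final float boost;

        BoostedField(String name, String value, float boost) {
            this.name = name;
            this.value = value;
            this.boost = boost;
        }
    }

    /** Index time: keep the (up to) three highest-scoring categories of one document. */
    static List<BoostedField> topThree(Map<String, Float> confidences) {
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(confidences.entrySet());
        ranked.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        List<BoostedField> fields = new ArrayList<>();
        for (int i = 0; i < FIELD_NAMES.length && i < ranked.size(); i++) {
            Map.Entry<String, Float> e = ranked.get(i);
            fields.add(new BoostedField(FIELD_NAMES[i], e.getKey(), e.getValue()));
        }
        return fields;
    }

    /** Search time: append one optional clause per rank field to the base query. */
    static String queryFor(String baseQuery, String category) {
        StringBuilder sb = new StringBuilder(baseQuery);
        for (String field : FIELD_NAMES) {
            sb.append(' ').append(field).append(':').append(category);
        }
        return sb.toString();
    }
}
```

For document #1 this keeps boys, girls and toys and drops animals, so a search for 'animals' gets no category contribution from that document at all.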

Disadvantages of the current approach:
- loss of precision when encoding/decoding the boosts (performance is
important, so this might be acceptable)
- using TermQuery for the cat fields doesn't make a lot of sense, since
the external scores are multiplied by the idf of 'boys'/'girls' and by the
querynorm
- the resulting score from the cat fields is added to the rest of the query
score instead of being multiplied with it
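To put a number on the first point, here is a small standalone sketch of the precision loss. It assumes the field boost ends up in the one-byte norm and that the norm is encoded the way Lucene's `SmallFloat.floatToByte315` does it. The two methods below are a reimplementation of that encoding from memory, not a call into Lucene, and the length normalization that is multiplied into the norm is ignored (i.e. taken to be 1.0). The class name `BoostPrecision` is made up.

```java
/** Hypothetical standalone sketch; reimplements the one-byte float encoding. */
final class BoostPrecision {
    // Offset between the float exponent and the byte exponent (assumed "315" variant).
    private static final int FZERO = (63 - 15) << 3;

    /** Encode a float into one byte, dropping all but the top mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= FZERO) {
            // Zero and negatives map to 0; positive underflow maps to the smallest value.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return -1; // overflow maps to the largest representable value
        }
        return (byte) (small - FZERO);
    }

    /** Decode the byte back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** The value that survives one encode/decode cycle. */
    static float roundTrip(float f) {
        return decode(encode(f));
    }

    public static void main(String[] args) {
        for (float boost : new float[] {0.3f, 0.2f, 0.1f}) {
            System.out.println(boost + " -> " + roundTrip(boost));
        }
        // prints:
        // 0.3 -> 0.25
        // 0.2 -> 0.1875
        // 0.1 -> 0.09375
    }
}
```

Under these assumptions a confidence of 0.3 is read back as 0.25, so nearby confidence values can collapse into the same stored boost.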

Just to give you an idea: the index I'm using is growing over time and
currently contains about 50 million documents.

Do you have an idea how I can improve my query and still keep high
performance? 
Or should I combine the scores in the Collector (but this doesn't seem
the right place to retrieve the category scores from the index)?
Is it possible to use a different float->byte encoder per field to
reduce the loss of precision?
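To make the Collector idea concrete, this is roughly the combination I have in mind, written as plain Java without any Lucene classes. It assumes the category confidences are held in memory as one float array per category, indexed by document id. `ExternalScoreCombiner` and its methods are hypothetical names, and how the arrays would be filled from the index is exactly the open question.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: multiply the query score by an external confidence. */
final class ExternalScoreCombiner {
    // One array per category; element i holds the confidence for document id i.
    private final Map<String, float[]> confidenceByCategory = new HashMap<>();

    /** Register the per-document confidences of one category. */
    void load(String category, float[] confidencePerDoc) {
        confidenceByCategory.put(category, confidencePerDoc);
    }

    /**
     * Combine the regular query score with the external confidence by
     * multiplication. Unknown categories or documents count as confidence 0.
     */
    float combine(int docId, float queryScore, String category) {
        float[] confidences = confidenceByCategory.get(category);
        if (confidences == null || docId < 0 || docId >= confidences.length) {
            return 0.0f;
        }
        return queryScore * confidences[docId];
    }
}
```

This keeps the full float precision and avoids the idf and querynorm factors, but at 4 bytes per document each category array would take about 200 MB for 50 million documents.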

Thanks for your time,
Dennis
  

