lucene-java-user mailing list archives

From Anshum <ansh...@gmail.com>
Subject Re: Choosing boosting in Lucene
Date Mon, 18 Apr 2011 09:41:03 GMT
Hi Cristina,
Lucene scores each doc for a given search using its scoring formula. Because
the formula includes query-dependent normalization and other query-related
components, the score of the same doc changes as the query changes.
To understand in detail how boosting affects the score, you can read
about *lucene scoring* at
http://lucene.apache.org/java/3_1_0/scoring.html
And the *scoring formula* at:
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html
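
To make the query normalization concrete, here is a rough sketch in plain
Java (no Lucene dependency) of a simplified form of the formula documented
in the Similarity class: queryNorm * sum(tf * idf^2 * boost * norm). The
class name, the method name and all numbers are made up for illustration,
and coord and the query-level boost are assumed to be 1. It is not Lucene's
actual implementation, only a way to see how the term boosts interact with
queryNorm.

```java
public class BoostScaleSketch {

    /**
     * Simplified score of one document for one query:
     * queryNorm * sum over terms of (tf * idf^2 * boost * norm).
     * coord and the query-level boost are assumed to be 1 in this sketch.
     */
    static double score(double[] tf, double[] idf, double[] boost, double[] norm) {
        double sumOfSquaredWeights = 0.0;
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double weight = idf[i] * boost[i];          // per-term query weight
            sumOfSquaredWeights += weight * weight;
            sum += tf[i] * idf[i] * idf[i] * boost[i] * norm[i];
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        return queryNorm * sum;
    }

    public static void main(String[] args) {
        // Illustrative values for a two-term query against one document.
        double[] tf   = {1.0, 1.41};
        double[] idf  = {1.5, 2.0};
        double[] norm = {0.5, 0.5};   // stands in for index-time boost and length norm

        double small  = score(tf, idf, new double[] {1.0, 2.0}, norm);
        double large  = score(tf, idf, new double[] {100.0, 200.0}, norm);
        double skewed = score(tf, idf, new double[] {1.0, 4.0}, norm);

        System.out.println("boosts (1, 2):     " + small);
        System.out.println("boosts (100, 200): " + large);
        System.out.println("boosts (1, 4):     " + skewed);
    }
}
```

In this simplified model the first two scores agree up to floating-point
rounding, because queryNorm cancels a common factor applied to all term
boosts; only the ratio between the boosts changes the result. The norm
values are not normalized in this way.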

As for the difference between index-time and search-time boost: a search-time
boost is applied at the term level, while an index-time boost is, generally
speaking, applied at the field/doc level.
Having a look at the scoring formula in the Similarity class (link
provided above) will put you in a better position to understand the
difference (and there is one).
You should also use the *IndexSearcher's explain method*:
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

Choosing the boost again depends on what you want to achieve; these are
subjective questions. You should try different sets of values and look at the
scores using the explain method to figure out what fits you best.
Relevance, or an apt method for choosing boost values, can likewise be worked
out by varying the boosts *via* *trial and error*. That is pretty much the
general practice.
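
Since you already know the relevant documents for each query, the trial
and error can be automated. Below is a minimal sketch in plain Java that
tries every combination of candidate boosts and keeps the one with the
best average precision. Everything in it is hypothetical: the class and
method names, and the fieldScores array, which stands in for whatever
per-field scores you would really get by running the boosted query through
Lucene. It also looks at a single query, whereas a real experiment would
average over all of your queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BoostSweepSketch {

    /** Average precision of a ranking, given the ids of the relevant docs. */
    static double averagePrecision(List<Integer> ranking, Set<Integer> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int rank = 0; rank < ranking.size(); rank++) {
            if (relevant.contains(ranking.get(rank))) {
                hits++;
                sum += (double) hits / (rank + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    /** Rank doc ids by the boosted sum of their per-field scores, best first. */
    static List<Integer> rank(double[][] fieldScores, double[] boosts) {
        final double[] total = new double[fieldScores.length];
        List<Integer> ids = new ArrayList<Integer>();
        for (int d = 0; d < fieldScores.length; d++) {
            for (int f = 0; f < boosts.length; f++) {
                total[d] += boosts[f] * fieldScores[d][f];
            }
            ids.add(d);
        }
        // Stable sort: docs with equal totals keep their original order.
        ids.sort((a, b) -> Double.compare(total[b], total[a]));
        return ids;
    }

    /** Try every combination of candidate boosts; return the first one with the best AP. */
    static double[] bestBoosts(double[][] fieldScores, Set<Integer> relevant,
                               double[] candidates) {
        int numFields = fieldScores[0].length;
        int combos = (int) Math.pow(candidates.length, numFields);
        double[] best = null;
        double bestAp = -1.0;
        for (int c = 0; c < combos; c++) {
            // Decode the combination index into one candidate boost per field.
            double[] boosts = new double[numFields];
            int rest = c;
            for (int f = 0; f < numFields; f++) {
                boosts[f] = candidates[rest % candidates.length];
                rest /= candidates.length;
            }
            double ap = averagePrecision(rank(fieldScores, boosts), relevant);
            if (ap > bestAp) {
                bestAp = ap;
                best = boosts;
            }
        }
        return best;
    }
}
```

The number of combinations grows quickly with the number of fields and
candidate values, so a coarse grid is a reasonable place to start.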

Hope this helps you figure out a reasonable solution and boost values.

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Sat, Apr 16, 2011 at 9:13 PM, HAIDUC SONIA <haiduc_sonia@yahoo.com> wrote:

> Hello,
>
> I have a few questions about boosting in Lucene. I am running a research
> project where I have, for each document, 4 fields: f1, f2, f3, f4. I also
> have a set of queries for my corpus, and I know the relevant documents for
> each of these queries. What I want to study is how boosting affects the
> search results of these queries. Basically, I want to show that by boosting
> some of these fields the results are better (I hope).
> I have, though, a few essential questions that I cannot figure out and I
> would really appreciate some help...
>
> 1. Is there any difference between boosting the fields at index time and
> boosting the terms in the queries which appear in these fields at search
> time?
> Again, I know beforehand the set of queries and also the terms in these
> queries which appear in the documents in the corpus in each of the fields.
>
> 2. In what range are boosting values usually chosen? I.e., should I choose
> boosts in a 0.5-2 range (say 0.5, 1, 1.5, 2), like I have seen in some
> examples, or is it the same if I choose boosts in a range like 50-200
> (respectively 50, 100, 150, 200)?
>
> 3. How sensitive is boosting in Lucene? For example, if I know
> approximately the importance of each field, and I want to assign boosting
> values accordingly, what would be good differences between the values of
> the boosting factor for the different fields? More precisely, if the
> importance order is f1<f2<f3<f4, will it matter if I choose the boosts as
> (1, 2, 3, 4), or (1, 5, 10, 15)?
>
> 4. Is there any method besides trial and error for finding the boosts for
> each field that work the best for a particular corpus?
>
> Thank you very much,
> Cristina
>
