lucene-java-user mailing list archives

From Ghinwa Choueiter
Subject Re: Scoring a query with OR's
Date Wed, 19 Mar 2008 22:12:27 GMT

I emailed a question earlier about the difference between OR and AND in a 
Boolean query. For what I am trying to do, I need AND to behave like an 
OR (what I like to call a "soft AND"), and I need OR to behave like a 
logical OR, meaning that I don't want to reward documents that match more of 
the OR operands. It is easy for me to fix the AND, but is there a 
straightforward way of fixing the OR?

Many thanks!
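
To make the "logical OR" idea concrete, here is a minimal sketch in plain
Java. It is a toy model and does not call Lucene: the class name, method
names, and per-clause scores are made up for illustration. It compares a
default-like OR (sum of the matching clause scores, scaled by coord) with an
OR that keeps only the best matching operand. Within Lucene, the usual
suggestions are a custom Similarity whose coord() returns 1, or a
DisjunctionMaxQuery with a tie breaker of 0, but check both against the
Javadoc for your version.

```java
// Toy model of OR scoring; illustrative only, not Lucene code.
final class OrScoringSketch {

    // Same shape as the default coord: fraction of query clauses that matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Default-like OR: sum the scores of matching clauses, then scale by coord.
    // A clause score of 0 means the clause did not match the document.
    static float sumWithCoord(float[] clauseScores) {
        if (clauseScores.length == 0) {
            return 0f;
        }
        float sum = 0f;
        int overlap = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                overlap++;
            }
        }
        return sum * coord(overlap, clauseScores.length);
    }

    // "Logical OR": only the best matching operand counts, so matching
    // additional operands earns no extra reward.
    static float maxOnly(float[] clauseScores) {
        float best = 0f;
        for (float s : clauseScores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-clause scores for the query (life OR place OR time).
        float[] matchesOne = {0.5f, 0f, 0f};
        float[] matchesAll = {0.5f, 0.25f, 0.25f};
        System.out.println("sum+coord: " + sumWithCoord(matchesOne)
                + " vs " + sumWithCoord(matchesAll));
        System.out.println("max only:  " + maxOnly(matchesOne)
                + " vs " + maxOnly(matchesAll));
    }
}
```

With the max-only variant, the two documents tie as long as their best
operand scores the same, which is the behavior I am after for OR.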

On Sun, 9 Mar 2008, Mark Miller wrote:

> I have been trying to understand all of this better myself, so while I am no 
> expert, here is my take:
> Lucene is really a combined Vector Space / Boolean Model search engine.
> At its core, Lucene is essentially a Vector Space Model search engine: 
> scoring is done by comparing a query term vector to each of the document term 
> vectors. However, on top of this, Lucene allows a Boolean Model by 
> constraining results using a BooleanQuery.
> So when Lucene finds the score for "mark OR mandy", the idea is the same as 
> for "mark AND mandy". The difference is that the BooleanQuery will treat the 
> Must and Should clauses differently: if a term is labeled Must but is not in 
> the document, the document won't match. If a Should term is not in the 
> document, the BooleanQuery excludes no extra documents on that account, but 
> the term may contribute 0 towards the similarity score. The BooleanQuery kind 
> of clamps down on top of the Vector Space TermVector similarity scoring, 
> allowing for a hybrid system.
> The coord factor essentially juices the term vector similarity score based on 
> how many query terms are in the document. Term overlap is already taken into 
> account during the term vector similarity part, but apparently users don't 
> like how that ranks results: e.g., users intuitively think that sharing more terms 
> between document and query is more important than sharing fewer very highly 
> weighted terms. So basically, coord is just trying to reorder things a bit 
> based on reported user expectations.
> - Mark
> Ghinwa Choueiter wrote:
>> but shouldn't the coord factor kick in with AND instead of OR? I understand 
>> why you would want to use coord in the case of AND, where you give a higher 
>> reward to documents that contain more of the terms in the query. However, in 
>> the case of OR, it should not matter whether all the OR operands are in the 
>> document?
>> -Ghinwa
>> ----- Original Message ----- From: "Erik Hatcher"
>> To: <>
>> Sent: Sunday, March 09, 2008 1:22 PM
>> Subject: Re: Scoring a query with OR's
>>> On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
>>>> but what exactly happens when there are OR's, e.g. (life OR place OR time)?
>>>> The scoring equation can get a score for life, place, time separately, 
>>>> but what does it do with them then? Does it also add them?
>>> The coord factor kicks in then:
>>> < 
>>> apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>
>>> the formula listed here should help too:
>>> < 
>>> apache/lucene/search/Similarity.html>
>>> Erik
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:

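The reordering effect of coord that Mark describes in the quoted message can
be worked through with small numbers. The sketch below is plain Java with
made-up term scores and hypothetical names; it does not use Lucene. It assumes
a document's score is the sum of its matching term scores, optionally
multiplied by coord = overlap / maxOverlap, the form of the default coord
linked in Erik's reply.

```java
// Toy illustration of how coord can reorder results; not Lucene code.
final class CoordReorderSketch {

    // Fraction of the query's terms found in the document.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sum of matching term scores, optionally scaled by coord.
    // A term score of 0 means the term is absent from the document.
    static float score(float[] termScores, boolean useCoord) {
        if (termScores.length == 0) {
            return 0f;
        }
        float sum = 0f;
        int overlap = 0;
        for (float s : termScores) {
            if (s > 0f) {
                sum += s;
                overlap++;
            }
        }
        return useCoord ? sum * coord(overlap, termScores.length) : sum;
    }

    public static void main(String[] args) {
        // Hypothetical three-term query.
        float[] oneHeavyTerm = {1.0f, 0f, 0f};     // one highly weighted term
        float[] twoLightTerms = {0.5f, 0.25f, 0f}; // two lower-weighted terms

        // Without coord, the single heavy term ranks first.
        System.out.println("no coord:   " + score(oneHeavyTerm, false)
                + " vs " + score(twoLightTerms, false));
        // With coord, the document sharing more terms ranks first.
        System.out.println("with coord: " + score(oneHeavyTerm, true)
                + " vs " + score(twoLightTerms, true));
    }
}
```

In this example the ranking flips once coord is applied: the document sharing
two of the three query terms overtakes the one that shares a single highly
weighted term.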