lucene-java-user mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject AW: AW: Real Boolean Model in Lucene?
Date Mon, 01 Dec 2003 14:19:08 GMT

Hello Ralf,

>>
According to your description, Lucene basically maps the boolean query into the vector space
and measures the cosine similarity to other documents in the vector space. If I understood
you correctly, you mean that if a document is found by Lucene for a boolean query, it is relevant
(boolean true). If it is not returned, it was boolean false. The score sits on top of that and
can be used for ranking. If I wanted to use a true Boolean model, I would therefore just
need to ignore the score of the Hits document. Did I understand correctly?
>>

Yes, I think this is indeed pretty close to a theoretical foundation: the Boolean Model
determines which documents match a query, while some appropriate similarity function in
vector space (Lucene's is good!) yields the ranking.
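
The split described above can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not Lucene code: the class name, the method names and the TF-IDF style weight are all assumptions made for illustration. The Boolean step decides membership in the result set; the score only orders the members, so ignoring it leaves a plain Boolean result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Toy model only; every name here is hypothetical and none of this is Lucene's API.
class BooleanThenRank {

    // Boolean model: a document satisfies "t1 AND t2 ..." iff it contains every term.
    static boolean matchesAll(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    // A simple TF-IDF style weight, assumed for illustration (not Lucene's formula).
    static double score(List<String> doc, List<String> terms, List<List<String>> corpus) {
        double sum = 0.0;
        for (String t : terms) {
            int tf = Collections.frequency(doc, t);
            long df = corpus.stream().filter(d -> d.contains(t)).count();
            double idf = Math.log((double) corpus.size() / (df + 1)) + 1.0;
            sum += Math.sqrt(tf) * idf;
        }
        return sum;
    }

    // Returns the indices of the matching documents, highest score first.
    static List<Integer> search(List<List<String>> corpus, List<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            if (matchesAll(corpus.get(i), terms)) {
                hits.add(i); // membership is decided here, before any scoring
            }
        }
        hits.sort(Comparator.comparingDouble(
                (Integer i) -> score(corpus.get(i), terms, corpus)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("dog", "horse"),
                Arrays.asList("dog", "dog", "horse", "horse"),
                Arrays.asList("dog", "cat"));
        List<Integer> ranked = search(corpus, Arrays.asList("dog", "horse"));
        System.out.println(ranked);                // ranked view: [1, 0]
        System.out.println(new TreeSet<>(ranked)); // Boolean view, scores ignored: [0, 1]
    }
}
```

Document 2 never enters the result, however it would have scored, because it fails the Boolean test; that is the sense in which the ranking sits on top of the Boolean model.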

Now hell would be the place for me where I would have to prove that Lucene's ranking is
exactly equivalent to some transformation into vector space followed by the *cosine* for
the ranking. It can't really be, as Lucene sometimes returns scores > 1.0, and only some ruthless
normalisation keeps them within 0.0 to 1.0. In other words, there are still some rough corners
in Lucene where a good theorist could find some work.
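
To make the point about normalisation concrete, here is a minimal sketch of one way raw scores could be forced into the range 0.0 to 1.0: divide everything by the top score, but only when the top score exceeds 1.0. This is an assumption about how such a step might look, not a quotation of Lucene's source; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a "divide by the top score" normalisation; not Lucene code.
class ScoreNormalization {

    // Expects raw scores sorted in descending order.
    // Scales by 1/top only when the top score is above 1.0; otherwise returns a plain copy.
    static float[] normalize(float[] rawDescending) {
        float norm = 1.0f;
        if (rawDescending.length > 0 && rawDescending[0] > 1.0f) {
            norm = 1.0f / rawDescending[0];
        }
        float[] out = new float[rawDescending.length];
        for (int i = 0; i < rawDescending.length; i++) {
            out[i] = rawDescending[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Top score is above 1.0, so every score is scaled by 1/2.0.
        System.out.println(Arrays.toString(normalize(new float[] {2.0f, 1.0f, 0.5f})));
        // prints [1.0, 0.5, 0.25]

        // Top score is already within range, so nothing changes.
        System.out.println(Arrays.toString(normalize(new float[] {0.8f, 0.4f})));
        // prints [0.8, 0.4]
    }
}
```

A score rescaled like this depends on whichever document happens to rank first, so it is only meaningful relative to the other hits of the same query, which is one more reason it cannot be read as a cosine.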

Could we leave this topic aside until some suicid.. err, I mean enthusiastic fellow
tries to work out a really good theory?

Regards,

Karsten





-----Original Message-----
From: Ralf B [mailto:ambiesense@gmx.de]
Sent: Monday, 1 December 2003 14:28
To: Lucene Users List
Subject: Re: AW: Real Boolean Model in Lucene?


Hi Karsten,

I want to thank you for your well-informed answer, as well as for your answer of 14 November,
where you agreed with me that Lucene is basically a VSM implementation. Sometimes it is difficult
to make the link between the clear theory and its implementation.

According to your description, Lucene basically maps the boolean query into the vector space
and measures the cosine similarity to other documents in the vector space. If I understood
you correctly, you mean that if a document is found by Lucene for a boolean query, it is relevant
(boolean true). If it is not returned, it was boolean false. The score sits on top of that and
can be used for ranking. If I wanted to use a true Boolean model, I would therefore just
need to ignore the score of the Hits document. Did I understand correctly?

I agree that nobody really wants to do that. My question was intended to find out more about the
theory implemented within Lucene.

Cheers,
Ralph


> 
> Hi,
> 
> >>
> My question: Does Lucene use TF/IDF for getting this? (which would mean
> it does not use the boolean model for the boolean query...)
> >>
> 
> Lucene indeed uses TF/IDF with length normalization for fields and
> documents. 
> 
> However, Lucene is "downward compatible" with the Boolean Model, where
> documents are represented as 0/1-vectors in Vector Space. Ranking just
> adds weights to the elements of the result set, so the underlying
> interpretation of a query result can still be that of a
> Propositional/Boolean model. If a document appears in the result, its
> tokens evaluate the query (which actually is a propositional formula
> formed over words and phrases) to true. The representation of
> documents is more complex in Lucene than required for the Boolean
> Model, and as a result, Lucene can efficiently handle phrases and
> proximity searches, but these seem to be compatible extensions - if
> you can do it in the Boolean Model, you can do it in Lucene :)
> 
> One place where Lucene is not 100% compatible with a basic Boolean
> Model is that full negation is a bit tricky - you cannot simply ask
> for all documents that do not contain a certain term unless you also
> have some term that appears in all documents. Not a big deal, really.
> 
> If TF/IDF weighting is a problem for you, the Similarity interface
> implementation allows you to remove all references to length
> normalization and document frequencies.
> 
> Regards,
> 
> Kind regards from Saarbrücken
> 
> --
> 
> Dr.-Ing. Karsten Konrad
> Head of Artificial Intelligence Lab
> 
> XtraMind Technologies GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Phone: +49 (681) 3025113
> Fax: +49 (681) 3025109
> konrad@xtramind.com
> www.xtramind.com
> 
> 
> 
> -----Original Message-----
> From: ambiesense@gmx.de [mailto:ambiesense@gmx.de]
> Sent: Monday, 1 December 2003 13:11
> To: lucene-user@jakarta.apache.org
> Subject: Real Boolean Model in Lucene?
> 
> 
> Hi,
> 
> Is it possible to use a real Boolean model in Lucene for searching?
> When one uses the QueryParser with a boolean query (e.g. "dog AND horse"),
> one gets a list of documents from the Hits object. However, these
> documents have a ranking (score).
> 
> My question: Does Lucene use TF/IDF for getting this? (which would mean
> it does not use the boolean model for the boolean query...)
> 
> How can one run a Boolean model search in which every result has
> score=1? Example?
> 
> Cheers,
> Ralph
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> 




