From: Paul Elschot
To: java-user@lucene.apache.org
Subject: Re: Lucene retrieval model
Date: Tue, 30 Dec 2008 12:09:17 +0100

On Tuesday 30 December 2008 10:03:03, Claudia Santos wrote:
> Hello,
>
> I would like to know more about Lucene's retrieval model, more
> specifically about the boolean model.
> Is that a standard model or an extended model? I mean, does it return
> only documents that match the boolean expression, or does it include
> in the search result all documents that correspond to the given
> conditions, regardless of the boolean connectors (AND, OR, NOT), and
> calculate a weight between 0 and 1 for every search result that
> contains at least one of the terms? The extended model gives a
> document containing only one of the terms a smaller value than one
> that contains both.
>
> On the Apache Lucene Scoring page I did not find much about this:
> "Lucene scoring uses a combination of the Vector Space Model (VSM) of
> Information Retrieval and the Boolean model to determine how relevant
> a given Document is to a User's query. In general, the idea behind
> the VSM is the more times a query term appears in a document relative
> to the number of times the term appears in all the documents in the
> collection, the more relevant that document is to the query. It uses
> the Boolean model to first narrow down the documents that need to be
> scored based on the use of boolean logic in the Query specification.
> Lucene also adds some capabilities and refinements onto this model to
> support boolean and fuzzy searching, but it essentially remains a VSM
> based system at the heart."
>

A somewhat refined Boolean model is used to determine a set of
documents, and a score value is calculated only for documents in that
set, according to the Lucene VSM model.

The Boolean model in Lucene does not directly use the standard boolean
connectors. Instead, each clause (term, subquery) is either required,
optional or prohibited. The required and prohibited clauses determine
a set of documents to be scored in the normal Boolean AND/NOT way.
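
To make the AND/NOT step concrete, here is a minimal sketch in plain Python rather than Lucene code. The toy index, the `postings` helper and the `candidates` function are hypothetical names invented for illustration; the sketch also assumes at least one required term is given.

```python
# Hypothetical toy inverted index: term -> ids of documents containing it.
INDEX = {
    "lucene": {1, 2, 3, 4},
    "boolean": {2, 3, 5},
    "fuzzy": {3, 4},
}


def postings(term):
    """Return the set of document ids containing the term (empty if unknown)."""
    return INDEX.get(term, set())


def candidates(required, prohibited):
    """Documents matching every required term and no prohibited term.

    Assumption of this sketch: at least one required term is supplied,
    so the AND has something to start from.
    """
    if not required:
        raise ValueError("this sketch needs at least one required term")
    # AND: intersect the postings of all required terms.
    matched = set.intersection(*(postings(t) for t in required))
    # NOT: remove every document that contains a prohibited term.
    excluded = set().union(*(postings(t) for t in prohibited))
    return matched - excluded


print(candidates(["lucene", "boolean"], ["fuzzy"]))  # prints {2}
```

Only the documents returned by `candidates` would go on to be scored; document 3 matches both required terms but is dropped because it also contains the prohibited one.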

The refinement in the Boolean model is for the optional clauses: a
minimum number of optional clauses may be required for documents to be
part of the set that is scored. The normal Boolean OR operator has 1
as that minimum number, and in Lucene this minimum defaults to 1 when
no required clauses are present.

The required clauses and the optional clauses contribute to the score.
One might consider the scoring of the optional clauses to be an
implementation of the extended Boolean model.

Fuzzy searching is implemented by constructing a Boolean query with
optional terms, taken from the terms actually present in the index,
that are similar enough to the fuzzy query term.

Regards,
Paul Elschot
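
The rule for optional clauses described above can be sketched as follows. This is plain Python, not Lucene code: `matches` and `toy_score` are hypothetical helpers, a document is reduced to a set of terms, and the score is a simple count of matching clauses standing in for the real VSM score. Treating the minimum as 0 when required clauses are present is this sketch's reading of the message, not something it states outright.

```python
def matches(doc_terms, required=(), optional=(), prohibited=(), min_optional=None):
    """Decide whether a document (a set of terms) enters the scored set."""
    if min_optional is None:
        # Default from the message: 1 when there are no required clauses.
        # Using 0 otherwise is an assumption of this sketch.
        min_optional = 0 if required else 1
    if any(t in doc_terms for t in prohibited):
        return False
    if not all(t in doc_terms for t in required):
        return False
    hits = sum(1 for t in optional if t in doc_terms)
    return hits >= min_optional


def toy_score(doc_terms, required=(), optional=()):
    """Count matching required and optional clauses (stand-in for real scoring)."""
    return sum(1 for t in tuple(required) + tuple(optional) if t in doc_terms)


doc = {"lucene", "boolean"}
print(matches(doc, optional=["lucene", "fuzzy"]))                  # prints True
print(matches(doc, optional=["lucene", "fuzzy"], min_optional=2))  # prints False
print(toy_score(doc, optional=["lucene", "boolean"]))              # prints 2
```

With only optional clauses, one matching term is enough to be scored, and a document matching more of them gets a higher toy score, which is the behaviour the extended Boolean model asks for.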