From: Erik Hatcher
Subject: Re: Extending the similarity class
Date: Sat, 23 Jul 2005 08:21:59 -0400
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org

On Jul 23, 2005, at 4:45 AM, Ahmed El-dawy wrote:

>> Only terms returned from the Analyzer are considered, so if a stop
>> word is removed it does not count for tf or idf.
>
> But I need to compare according to non indexed words also. By the way,
> Google does this.

Please provide an example or reference to support this claim.

Perhaps Google is doing something like what Nutch does by default: a
bi-gram technique that joins a common term with the term that follows
it and overlaps the joined term position-increment-wise. This technique
keeps searches fast when stop words need to be considered, while still
avoiding searches by stop words when the query is not a phrase query.

>> This will happen automatically with PhraseQuery with a slop factor.
>> The closer the words, the better the score. However, with a pure
>> boolean query, proximity is not considered at all (nor should it
>> be). You can use a large slop factor for phrases such as "quick
>> fox"~100 and see how the scores work then.
>
> This means that all words must be in the result. This is not always
> the case in my application. If I am searching for "quick brown fox",
> "quick fox" is an acceptable result.

In the case of single-term queries boolean OR'd together, Similarity's
coord factor boosts results that match more of the clauses. This does
not take the proximity of the words into consideration.

> I just need to know whether I need to resort the search results
> according to my criteria, or there are some methods to override which
> will bring results already sorted.

It seems like you're asking for a different type of Query than
currently exists that can do a boolean OR but score based on proximity
of the matching terms. Without looking it up, perhaps SpanOrQuery
already does this sort of thing - though I don't think so.

    Erik

> On 7/22/05, Erik Hatcher wrote:
>
>> On Jul 22, 2005, at 9:59 AM, Ahmed El-dawy wrote:
>>
>>> Hello,
>>>    I am using lucene to search plain text, but the order of the
>>> search results is not satisfying to my needs. First, I want to know
>>> how the similarity works. Then, I need to extend it.
>>
>> Use IndexSearcher.explain() to see how each individual hit is scored
>> against a Query - this will be the clearest way to see why things
>> score the way they do.
>>
>>>    First, does the similarity class work on analyzed text or original
>>> search text? To be precise, does it count the stop words as found
>>> terms or not?
>>
>> Only terms returned from the Analyzer are considered, so if a stop
>> word is removed it does not count for tf or idf.
>>
>>>    Second, I want to add a factor of how relative are the terms of
>>> the query found in text. For example, when I search for "quick fox",
>>> "fox quick" and "quick brown fox" will be less ranked than
>>> "quick fox".
>>
>> This will happen automatically with PhraseQuery with a slop factor.
>> The closer the words, the better the score. However, with a pure
>> boolean query, proximity is not considered at all (nor should it
>> be). You can use a large slop factor for phrases such as "quick
>> fox"~100 and see how the scores work then.
>>
>>     Erik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
> --
> Regards,
> Ahmed Saad
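
For illustration only, here is a minimal sketch of the "re-sort the
results yourself" option raised in this thread: terms are matched as a
boolean OR, and each hit is then scored by a coord-like fraction of
matched query terms multiplied by how tightly those terms cluster. It is
plain Java and does not use Lucene at all. The class ProximityRescorer,
its methods, the naive tokenizer, and the scoring formula are all
assumptions made up for this example; they are not Lucene's Similarity
or any existing Query. Note that word order is ignored, so "fox quick"
scores the same as "quick fox".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Lucene): re-scores already retrieved
// text by how many query terms it contains and how close together they are.
class ProximityRescorer {

    // Naive tokenizer: lowercase, split on non-alphanumerics, no stop-word removal.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Length in tokens of the smallest window of doc that contains every
    // term in matched. Assumes each matched term occurs in doc at least once.
    static int smallestWindow(List<String> doc, Set<String> matched) {
        Map<String, Integer> counts = new HashMap<>();
        int best = Integer.MAX_VALUE;
        int covered = 0;
        int left = 0;
        for (int right = 0; right < doc.size(); right++) {
            String term = doc.get(right);
            if (!matched.contains(term)) {
                continue;
            }
            if (counts.merge(term, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still covers all matched terms.
            while (covered == matched.size()) {
                best = Math.min(best, right - left + 1);
                String leftTerm = doc.get(left++);
                if (matched.contains(leftTerm)
                        && counts.merge(leftTerm, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }

    // Score = (matched terms / query terms) * (matched terms / window length).
    // The second factor is 1.0 when the matched terms are adjacent.
    static double score(String query, String document) {
        Set<String> queryTerms = new HashSet<>(tokenize(query));
        List<String> doc = tokenize(document);
        Set<String> matched = new HashSet<>(queryTerms);
        matched.retainAll(new HashSet<>(doc));
        if (matched.isEmpty()) {
            return 0.0;
        }
        double coord = (double) matched.size() / queryTerms.size();
        double proximity = (double) matched.size() / smallestWindow(doc, matched);
        return coord * proximity;
    }

    public static void main(String[] args) {
        String query = "quick brown fox";
        List<String> hits = new ArrayList<>(Arrays.asList(
                "a quick and very lazy fox", "quick fox", "the fox"));
        // Re-sort the hits, best score first.
        hits.sort(Comparator.comparingDouble((String d) -> score(query, d)).reversed());
        for (String hit : hits) {
            System.out.println(score(query, hit) + "  " + hit);
        }
    }
}
```

With this formula a document containing only "quick fox" is still an
acceptable hit for the query "quick brown fox", and it outranks a
document in which the same two words are far apart. Inside Lucene
itself, IndexSearcher.explain(), mentioned earlier in the thread,
remains the way to see how a hit was actually scored.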