From: Paul Elschot
To: lucene-user@jakarta.apache.org
Subject: Re: How not to show results with the same score?
Date: Wed, 25 Aug 2004 20:16:31 +0200
In-Reply-To: <412C6839.5060803@eastbeam.com>
Message-Id: <200408252016.31960.paul.elschot@xs4all.nl>

On Wednesday 25 August 2004 12:21, B.
Grimm [Eastbeam GmbH] wrote:
> hi there,
>
> i browsed through the list and tried several different searches, but i did
> not find what i'm looking for.
>
> i have an index which is generated by a bot collecting websites. there
> are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1
> these different urls have the same content, and when you search for a
> matching word, both are returned, which is correct.
>
> they have exactly the same score because of their content and so on, so
> i would like to know if it's possible "to group by" (mysql, of course)
> the returned score, so that only the first match is collected into
> "Hits" and all following matches with the same score are ignored.
>
> it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to
IndexSearcher.search(). Have a look at the javadocs of the
org.apache.lucene.search package; it's quite straightforward.

The PriorityQueue from the util package is useful to collect results.
For every distinct score you could store an int[] of document numbers
in there while collecting the hits.
Basically you'll end up implementing your own Hits class.

For URLs that have the same content, it's better to store multiple
URLs for the same document. However, this merging is normally done by
a crawler, because the same content means the same outgoing URLs.
Crawlers also keep track of multiple host names resolving to the same
IP address.

In case you need to crawl and index an intranet or more, have a look
at Nutch.

Regards,
Paul Elschot
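
To illustrate the HitCollector suggestion above, here is a minimal sketch of the "one hit per distinct score" idea. It is deliberately written without a Lucene dependency so that it runs on its own: the class name DistinctScoreCollector and the method docsByDescendingScore() are hypothetical, and only the collect(int doc, float score) callback mirrors the HitCollector method. It also simplifies the advice above by keeping just the first document number per score in a sorted map, instead of an int[] per score in a PriorityQueue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch. With Lucene on the classpath, this class would
// extend org.apache.lucene.search.HitCollector and an instance would be
// passed to IndexSearcher.search(query, collector).
class DistinctScoreCollector {

    // Score -> first document number seen with that score,
    // ordered so that the highest score comes first.
    private final Map<Float, Integer> firstDocByScore =
            new TreeMap<Float, Integer>(Collections.<Float>reverseOrder());

    // Same shape as HitCollector.collect(int doc, float score):
    // called once for every matching document.
    public void collect(int doc, float score) {
        // Keep only the first document for each distinct score;
        // later documents with an identical score are ignored.
        firstDocByScore.putIfAbsent(score, doc);
    }

    // Document numbers of the retained hits, best score first.
    public List<Integer> docsByDescendingScore() {
        return new ArrayList<Integer>(firstDocByScore.values());
    }

    public static void main(String[] args) {
        DistinctScoreCollector collector = new DistinctScoreCollector();
        collector.collect(7, 0.5f);  // e.g. www.domain.de/article/1
        collector.collect(9, 0.5f);  // same score, treated as a duplicate
        collector.collect(3, 0.9f);
        System.out.println(collector.docsByDescendingScore()); // prints [3, 7]
    }
}
```

Note that an identical score is only a rough proxy for identical content: two unrelated documents can tie on score and one of them would be dropped, which is one more reason to merge duplicate URLs at crawl time as described above.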