Subject: Re: Removing similar documents from search results
From: Miles Barr
To: Lucene User
Organization: Runtime Collective Ltd.
Date: Mon, 21 Mar 2005 11:06:18 +0000

On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content. Most
> multipage documents have very different text on subsequent pages, so I'm
> not sure that a programmatic solution is going to be able to spot that.

Actually I added that in after I saw that Google does it. You're right
that the context is likely to be completely different, so I guess they do
it through some URL matching.

> I may also be reading too much into your message, but it sounds like you
> aren't trying to index generic content -- it sounds like you are trying to
> index content under your control (i.e. content on your own web site).
>
> If that's the case, then presumably you know something about the
> source data and the URL structure -- maybe you could solve this problem
> when you build your index.
>
> For example, if I look at a site like perl.com, I can see a pattern in the
> way the article URLs look...
>
> page 1...
>   http://www.perl.com/pub/a/2005/02/17/3d_engine.html
> page 2, etc...
>   http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
> printable...
>   http://www.perl.com/lpt/a/2005/02/17/3d_engine.html
>
> So instead of putting all of those URLs in the index as separate docs, why
> not create a single doc, with all of those URLs?

I have to index several sites, and I used some examples of the problems
I've come across so far. I don't control the content for any of them, and
they get picked up by a spider, so excluding pages requires adding special
cases.

I'll probably adopt a two-stage approach:

1. Prevent duplicate documents from getting into the index in the first
   place, e.g. compare MD5 hashes and file sizes, maybe make the spider
   configurable to spot certain URL patterns, etc.
2. Try out the various techniques suggested in this thread to spot
   similar pages at query time and hide them.

-- 
Miles Barr
Runtime Collective Ltd.
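
The first stage mentioned above (comparing MD5 hashes and file sizes, plus
spotting URL patterns) could be sketched roughly as follows. This is only an
illustration, not code from the thread: the names `canonical_url` and
`DuplicateFilter` are hypothetical, and the URL rule is an assumption
modelled on the perl.com examples quoted earlier. A real spider would need
its own per-site rules.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Map page-N and printable variants of an article to a single URL.

    Assumed rule (taken from the quoted perl.com examples only):
    printable pages live under /lpt/ instead of /pub/, and later pages
    add a ?page=N query parameter.
    """
    parts = urlsplit(url)
    path = re.sub(r"^/lpt/", "/pub/", parts.path)
    # Drop only the paging parameter; keep any other query parameters.
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k != "page"])
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


class DuplicateFilter:
    """Remembers (size, MD5) fingerprints of document bodies already seen."""

    def __init__(self):
        self._seen = set()

    def is_new(self, body):
        """Return True the first time a given byte string is offered."""
        fingerprint = (len(body), hashlib.md5(body).hexdigest())
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

A spider would call `canonical_url` before deciding whether to fetch or
index a page, and `is_new` on the fetched bytes before adding the document.
Note that an exact hash only catches byte-identical copies; pages that
differ by a banner or timestamp still need the query-time techniques from
the second stage.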