Subject: Re: Removing similar documents from search results
From: Miles Barr <miles@runtime-collective.com>
To: Lucene User <java-user@lucene.apache.org>
In-Reply-To: <Pine.LNX.4.58.0503200031550.21116@hal.rescomp.berkeley.edu>
References: <1110820764.11418.36.camel@saturn>
	 <4235D01D.9020301@cs.put.poznan.pl> <1110823739.11418.45.camel@saturn>
	 <4235EA99.9000703@cs.put.poznan.pl> <1110883855.11418.75.camel@saturn>
	 <Pine.LNX.4.58.0503200031550.21116@hal.rescomp.berkeley.edu>
Content-Type: text/plain
Organization: Runtime Collective Ltd.
Date: Mon, 21 Mar 2005 11:06:18 +0000
Message-Id: <1111403178.11418.320.camel@saturn>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit

On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content.  Most
> multipage documents have very different text on subsequent pages, so i'm
> not sure that a programmatic solution is going to be able to spot that.

Actually, I added that in after I saw that Google does it. You're right
that the content is likely to be completely different, so I guess they
do it through some kind of URL matching.

> I may also be reading too much into your message, but it sounds like you
> aren't trying to index generic content -- it sounds like you are trying to
> index content under your control (ie: content on your own web site).
> 
> if that's the case, then presumably you know something about the
> source data and the URL structure -- maybe you could solve this problem
> when you build your index.
> 
> for example, if i look at a site like perl.com, i can see a pattern in the
> way the article URLs look...
> 
> page 1...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html
> page 2, etc...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
> printable...
> http://www.perl.com/lpt/a/2005/02/17/3d_engine.html
> 
> 
> So instead of putting all of those URLs in the index as separate docs, why
> not create a single doc, with all of those URLs?

I have to index several sites, and the examples I gave are problems I've
come across so far. I don't control the content for any of them, and the
pages get picked up by a spider, so excluding pages requires adding
special cases.
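
As an illustration of what one such special case might look like, here
is a minimal Java sketch that folds the three perl.com URL variants
quoted above into a single key. The class and method names are
hypothetical, and the rules (treat /lpt/ as the printable twin of /pub/,
ignore a query string that is only "page=N") are assumptions taken from
those three example URLs, not anything the site documents.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: maps page-N and printable URL variants to one key. */
final class UrlCanonicalizer {

    private UrlCanonicalizer() {}

    /**
     * Returns a grouping key for the URL. URLs that cannot be parsed, or
     * that have no host, are returned unchanged. Any port is dropped.
     */
    static String canonicalKey(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url;
        }
        String path = uri.getPath();
        if (uri.getHost() == null || path == null) {
            return url;
        }
        // Assumption: the printable version lives under /lpt/ instead of /pub/.
        if (path.startsWith("/lpt/")) {
            path = "/pub/" + path.substring("/lpt/".length());
        }
        String key = uri.getScheme() + "://" + uri.getHost() + path;
        // Assumption: a query that is only "page=N" selects a page of the
        // same article, so it is dropped; any other query is kept.
        String query = uri.getQuery();
        boolean pagingOnly = query == null || query.matches("page=\\d+");
        return pagingOnly ? key : key + "?" + query;
    }
}
```

The spider could then collect every URL with the same key into one
document, along the lines Chris suggests. Each site would need its own
rules, which is exactly the special-casing cost mentioned above.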

I'll probably adopt a two stage approach.

1. Prevent duplicate documents from getting into the index in the first
place, e.g. compare MD5 hashes and file sizes, maybe make the spider
configurable to spot certain URL patterns, etc.

2. Try out the various techniques suggested in this thread to spot
similar pages at query time and hide them.
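
The exact-duplicate check in stage 1 could be sketched roughly as
follows. This is only an illustration: DuplicateFilter, markIfNew and
fingerprint are made-up names, and the sketch assumes the spider hands
over each fetched page as a byte array before it is indexed. It only
catches byte-for-byte identical pages; near-duplicates are left to
stage 2.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical filter that remembers a size + MD5 fingerprint per page. */
final class DuplicateFilter {

    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time this exact content is seen, false afterwards. */
    boolean markIfNew(byte[] content) {
        return seen.add(fingerprint(content));
    }

    /** Builds "<length>:<md5 hex>" so both size and hash must match. */
    static String fingerprint(byte[] content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content);
            StringBuilder sb = new StringBuilder();
            sb.append(content.length).append(':');
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

A page would be added to the index only when markIfNew returns true. The
set lives in memory here; a real crawl would probably need to persist
the fingerprints between runs.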


-- 
Miles Barr <miles@runtime-collective.com>
Runtime Collective Ltd.

