lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: Removing similar documents from search results
Date Sun, 20 Mar 2005 08:49:05 GMT

: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
: Basically identical content, presented in different ways.

Actually, your "Split across several pages" comment implies that you want
a system which can tell that page 1 of a multipage article should be
grouped with page 2 -- which may be radically different content.  Most
multipage documents have very different text on subsequent pages, so i'm
not sure that a programmatic solution is going to be able to spot that.

I may also be reading too much into your message, but it sounds like you
aren't trying to index generic content -- it sounds like you are trying to
index content under your control (ie: content on your own web site).

if that's the case, then presumably you know something about the
source data and the URL structure -- maybe you could solve this problem
when you build your index.

for example, if i look at a site like, i can see a pattern in the
way the article URLs look...

page 1...
page 2, etc...

So instead of putting all of those URLs in the index as separate docs, why
not create a single doc, with all of those URLs?
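
A rough sketch of that grouping step, in Python for brevity. The URL scheme
is an assumption made up for illustration (a trailing "page=N" query
parameter on an example.com address), since the real pattern depends on the
site; canonical_key and group_pages are hypothetical helper names, not part
of Lucene.

```python
import re

# ASSUMPTION: page N of an article is marked by a trailing "page" query
# parameter, e.g. http://example.com/articles/42?page=2
# Adjust this pattern to whatever the site's URLs actually look like.
PAGE_PARAM = re.compile(r"[?&]page=\d+$")


def canonical_key(url):
    """Strip the trailing page marker so all pages of an article share a key."""
    return PAGE_PARAM.sub("", url)


def group_pages(pages):
    """Merge (url, text) pairs into one record per article.

    Returns a dict mapping canonical key -> {"urls": [...], "text": "..."},
    with URLs and text kept in the order the pages were supplied.
    """
    grouped = {}
    for url, text in pages:
        record = grouped.setdefault(canonical_key(url), {"urls": [], "text": []})
        record["urls"].append(url)
        record["text"].append(text)
    # Join the per-page text so each article becomes a single body of text.
    return {
        key: {"urls": rec["urls"], "text": "\n".join(rec["text"])}
        for key, rec in grouped.items()
    }
```

Each merged record would then be added to the index as one document, with
the combined text in a searchable field and every URL kept in a stored
field, so a hit on any page comes back as a single result.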

