lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Removing similar documents from search results
Date Sun, 20 Mar 2005 08:49:05 GMT

: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
:
: Basically identical content, presented in different ways.

Actually, your "Split across several pages" comment implies that you want
a system which can tell that page 1 of a multipage article should be
grouped with page 2 -- which may be radically different content.  Most
multipage documents have very differnet text on subsequent pages, so i'm
not sure that a progromatic solution is going to be bale to spot that.

I may also be reading too much into your message, but it sounds like you
aren't trying to index generic content -- it sounds like you are trying to
index content under your control (ie: content on your own web site).

if that's the case, then presumably you know somethign about the
source data and the URL strucutre -- maybe you could solve this problem
when you build your index.

for example, if i look at a site like perl.com, i can see a pattern in the
way the article URLs look...

page 1...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html
page 2, etc...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
printable...
http://www.perl.com/lpt/a/2005/02/17/3d_engine.html


So instead of putting all of those URLs in the index as seperate docs, why
not createa single doc, with all of those URLs?




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message