lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: subclassing of IndexReader
Date Fri, 31 Oct 2003 13:25:22 GMT
I am also involved in a small project that deals with crawling. :)
I have not done this yet, but I have thought about the same problem
you are asking about: detecting small changes in web pages.
Have you considered using Nilsimsa?
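
To make the idea concrete, here is a rough Java sketch of a locality-sensitive
digest. It is not Nilsimsa itself and does not produce Nilsimsa-compatible
codes; it only borrows the general approach (count character trigrams into 256
buckets, set a bit for each bucket above the mean count, compare digests by the
number of differing bits). The class name, method names, and hash mixing
constant are assumptions made up for illustration.

```java
import java.util.BitSet;
import java.util.Locale;

/** Illustrative near-duplicate digest; a simplified stand-in, not real Nilsimsa. */
public class NearDuplicateSketch {
    static final int BITS = 256;

    /** Builds a 256-bit digest from the character trigrams of the text. */
    public static BitSet digest(String text) {
        // Normalize case and whitespace so layout-only changes do not matter.
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        int[] acc = new int[BITS];
        int total = 0;
        for (int i = 0; i + 3 <= s.length(); i++) {
            // Hash the trigram and keep the top 8 bits as the bucket index.
            int h = (s.charAt(i) * 31 + s.charAt(i + 1)) * 31 + s.charAt(i + 2);
            h *= 0x9E3779B1;
            acc[h >>> 24]++;
            total++;
        }
        // A bit is set when its bucket is hit more often than the mean.
        double mean = total / (double) BITS;
        BitSet bits = new BitSet(BITS);
        for (int b = 0; b < BITS; b++) {
            if (acc[b] > mean) {
                bits.set(b);
            }
        }
        return bits;
    }

    /** Hamming distance between two digests: small means "probably the same page". */
    public static int distance(BitSet a, BitSet b) {
        BitSet x = (BitSet) a.clone();
        x.xor(b);
        return x.cardinality();
    }

    public static void main(String[] args) {
        String page = "Lucene is a full-text search library written in Java.";
        String fixed = "Lucene is a full-text search libary written in Java.";
        String other = "Stock markets closed higher on Friday after a quiet week.";
        System.out.println("edited:    " + distance(digest(page), digest(fixed)));
        System.out.println("unrelated: " + distance(digest(page), digest(other)));
    }
}
```

The cutoff below which two pages count as duplicates is not something the
sketch decides; it would have to be tuned on real crawled pages.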

Otis


--- Christoph Goller <goller@detego-software.de> wrote:
> Hi Doug
> 
> 
> > 
> >>*)I am just curious. What is IndexReader.undeleteAll needed for?
> > 
> > 
> > In Nutch we have a rotating set of indexes.  For example, we might
> > create a new index every day.  Our crawler guarantees that pages will
> > be re-indexed every 30 days, so we can, e.g., every day merge (or
> > search w/o merging) the most recent 30 indexes.  So far so good.  But
> > many pages are clones of other pages: different urls with the same
> > content.  So, each time we deploy a new set of indexes we need to
> > first perform duplicate detection to make sure that, for each unique
> > content, only a single url is present, that with the highest link
> > analysis score.  I implement this by first calling undeleteAll(), then
> > perform the global duplicate detection, deleting duplicates from
> > their index.  Does this make sense?  Each day duplicate detection must
> > be repeated when a new index is added, but first all of the previously
> > detected duplicates must be cleared.
> > 
> 
> That's quite interesting. I am currently involved in a small crawling
> project. We only crawl a very limited number of news pages, some of
> them several times per day. We found that there are often tiny changes
> on these pages (spelling corrections, banner changes) which we would
> like to ignore (classify as duplicate) while we want to recognize
> bigger changes. For such a setting MD5 keys are not very helpful. How
> do you detect duplicates in Nutch?
> 
> Christoph
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 




