lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: URL Stemmer
Date Wed, 27 Jul 2005 23:31:07 GMT
Hm, not sure why you're emailing java-user@lucene.  nutch-user@lucene
may be better.  Here are 2 ancient classes from 2003 that I once used
to normalize URLs, to help me identify URL duplicates.  The attachment
may get stripped on its way to the list.
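
For illustration only, here is a minimal sketch of the kind of URL
normalization discussed in this thread. It is not one of the two classes
mentioned above; the class name UrlNormalizer and its methods are
hypothetical. It assumes absolute URLs, treats query parameter order as
insignificant, drops the fragment, and keeps every parameter, since
deciding which parameters are meaningless is site-specific.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: canonicalizes a URL and hashes the result so that
// trivially different spellings of the same URL produce the same key.
public class UrlNormalizer {

    /** Returns a canonical form of an absolute URL. */
    public static String normalize(String url) {
        URI uri;
        try {
            // URI.normalize() resolves "." and ".." path segments.
            uri = new URI(url.trim()).normalize();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Unparseable URL: " + url, e);
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Absolute URL required: " + url);
        }
        // Scheme and host are case-insensitive, so lower-case them.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Omit the port when it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        // An empty path is equivalent to "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (!defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);

        // Assumption: parameter order does not matter, so sort the pairs.
        String query = uri.getRawQuery();
        if (query != null && !query.isEmpty()) {
            String[] params = query.split("&");
            Arrays.sort(params);
            sb.append('?').append(String.join("&", params));
        }
        // The fragment (#...) is never sent to the server, so it is dropped.
        return sb.toString();
    }

    /** Returns the SHA-1 digest of a string as 40 lower-case hex characters. */
    public static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling sha1Hex(normalize(url)) gives a key that can be compared across
pages; the same sha1Hex method could be applied to the page content for
the second hash mentioned in the question. Whether sorting or dropping
parameters is safe depends on the site being crawled.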

Otis


--- Chris Fraschetti <fraschetti@gmail.com> wrote:

> Writing simple code to trim down a URL is trivial, but to actually
> trim it down to its most meaningful state is very hard. In some cases
> the URL parameters actually define the page; in others they are useless
> babble. I'd like to use the hash of a page's URL as well as a hash of
> the content data to help me eliminate duplicates... are there any good
> methods that are commonly used for URL stemming?
> 
> -- 
> ___________________________________________________
> Chris Fraschetti
> e fraschetti@gmail.com