lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Re: LARM Web Crawler: note on normalized URLs
Date Wed, 19 Jun 2002 20:55:05 GMT
> > note: restrictto is a regular expression; the URLs tested against it are
> > normalized beforehand, which means they are made lower case, index.* are
> > removed, and some other corrections (see URLNormalizer.java for details)
>
> Removing index.* may be too bold and incorrect in some situations.

Hm, but I think it's much more likely that http://host/ and
http://host/index.* point to the same document than to different documents.
It's also very unlikely that (UNIX) users have one "abc" and one "Abc" file
in the same directory, although it's possible. That's why URLs are made
lower case.
Therefore, I think the cost of not crawling a document that falls out of
this scheme is higher than crawling a document twice.
Later on we could use, e.g., MD5 hashes of the documents to be sure.
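
A minimal sketch of how such an MD5 check might look, assuming the hash is
taken over the fetched document bytes and kept in an in-memory set. The
class and method names (ContentFingerprints, md5Hex, isNewContent) are
hypothetical and not part of LARM:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the LARM sources.
class ContentFingerprints {
    // Hex MD5 digests of every document body seen so far.
    private final Set<String> digests = new HashSet<>();

    // Returns the MD5 digest of the given bytes as a lower-case hex string.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if no document with the same bytes was registered before.
    boolean isNewContent(byte[] content) {
        return digests.add(md5Hex(content));
    }
}
```

With a check like this, two URLs that were collapsed or kept apart by
normalization could still be compared by content after fetching.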

I must point out that these normalized URLs are only used for comparing the
already crawled URLs with new ones. The actual request sent to the server is
the original URL. Removing index.* before sending the request would indeed
be pretty bold.
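
To illustrate the split between comparison key and request URL, here is a
rough sketch. It is not the code in URLNormalizer.java: the class
VisitedUrls and its methods are hypothetical, and only two of the
corrections (lower-casing and dropping a trailing index.*) are shown:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the real URLNormalizer.java applies more corrections.
class VisitedUrls {
    // Normalized keys of all URLs that have been handed out for fetching.
    private final Set<String> seen = new HashSet<>();

    // Builds the comparison key: lower-case, trailing "index.<ext>" removed.
    static String normalize(String url) {
        String key = url.toLowerCase(Locale.ROOT);
        // Only a final path segment "index.*" is dropped. The lookbehind keeps
        // the "//" before the host intact, e.g. for a host named index.com.
        return key.replaceFirst("(?<![:/])/index\\.[^/?#]*$", "/");
    }

    // Returns the ORIGINAL url if its key is new, or null if already seen.
    String urlToFetch(String originalUrl) {
        return seen.add(normalize(originalUrl)) ? originalUrl : null;
    }
}
```

Here urlToFetch("http://Host/Index.html") would return the URL unchanged
for the request, while a later urlToFetch("http://host/") would return null
because both map to the same key.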

I have a more detailed description of the URLNormalizer, but it is still in
German. I might check it in after I have translated it; I need it for my
master's thesis (see my homepage). I'll probably write that in English
anyway...

By the way, I've run some very promising experiments with MySQL as the URL
repository. It seems to be fast enough. When I first tried this with MS SQL
Server, I was very disappointed. Such a repository is the basis for
incremental crawling!

--Clemens


http://www.cmarschner.net




