lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: LARM Web Crawler: note on normalized URLs
Date Wed, 19 Jun 2002 21:14:10 GMT

--- Clemens Marschner <cmad@lanlab.de> wrote:
> > > Note: restrictto is a regular expression; the URLs tested against
> > > it are normalized beforehand, which means they are made lower
> > > case, index.* is removed, and some other corrections are applied
> > > (see URLNormalizer.java for details).
> >
> > Removing index.* may be too bold and incorrect in some situations.
> 
> Hm, but I think it's much more likely that http://host/ and
> http://host/index.* point to the same document than to different
> documents. It's also very unlikely that (UNIX) users have one "abc"
> and one "Abc" file in the same directory, although it's possible.
> That's why URLs are made lower case. Therefore, I think the cost of
> not crawling a document that falls outside this scheme is higher than
> that of crawling a document twice. Later on we could use, e.g., MD5
> hashes to be sure.

I don't know, maybe.  I haven't done any tests or read anything that
would confirm whether this is correct or wrong.
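
For what it's worth, here is a rough sketch of the normalization rules
as described above. The class and method names are mine, purely for
illustration; URLNormalizer.java in the LARM sources is the
authoritative version:

    import java.util.Locale;

    // Hypothetical sketch of the normalization described above; see
    // URLNormalizer.java in the LARM sources for the real rules.
    public class UrlNormalizerSketch {

        // Normalize a URL for comparison purposes.
        public static String normalize(String url) {
            String result = url.toLowerCase(Locale.ROOT);
            // Strip a trailing "index.*" segment, so that e.g.
            // http://host/dir/index.html becomes http://host/dir/
            result = result.replaceAll("index\\.[a-z0-9]+$", "");
            return result;
        }

        public static void main(String[] args) {
            // Both print "http://host/", so the two URLs would be
            // treated as the same document.
            System.out.println(normalize("http://host/Index.HTML"));
            System.out.println(normalize("http://host/"));
        }
    }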

> I must point out that these normalized URLs are only used for
> comparing the already crawled URLs with new ones. The actual request
> sent to the server is the original URL. Removing index.* before
> sending the request would indeed be pretty bold.

Aha!
I thought you used normalized URLs for the requests, too.
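
So the bookkeeping would look something like this sketch (hypothetical,
reusing the normalize() method from the sketch above): the normalized
form is only the key of the visited set, while the fetcher always sees
the original URL.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: the normalized URL is only a key for
    // duplicate detection; requests always use the original URL.
    public class VisitedUrls {
        private final Set<String> seen = new HashSet<String>();

        // Returns the original URL to fetch, or null if some URL that
        // normalizes to the same key was already scheduled.
        public String schedule(String originalUrl) {
            String key = UrlNormalizerSketch.normalize(originalUrl);
            return seen.add(key) ? originalUrl : null;
        }
    }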

> I have a more detailed description of the URLNormalizer, but it's
> still in German; I might check it in after I have translated it. I
> need it for my master's thesis (see my homepage). Probably I'll write
> it in English anyway...
> 
> By the way, I've made some very promising experiments with MySQL as
> the URL repository. It seems to be fast enough. When I did this with
> MS SQL Server in the first place, I was very disappointed. That's the
> basis for incremental crawling!

People at Senga.org developed something called Webbase (its CVS
repository is at sf.net) that used MySQL for this purpose as well.

It may be even nicer to use a DB implemented in Java, such as HyperSQL
(I think that's the name) or Smyle
(https://sourceforge.net/projects/smyle/) or Berkeley DB
(http://www.sleepycat.com/), although MySQL may be simpler if you want
to create a crawler that can run on a cluster of machines sharing a
central link repository.
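
A central link repository of that kind can be little more than a table
of normalized URLs with a unique key. A hypothetical JDBC sketch (the
table name, schema, and connection details are all assumptions, not
LARM code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Hypothetical sketch of a MySQL-backed URL repository shared by
    // several crawler nodes. Assumed schema:
    //   CREATE TABLE urls (url VARCHAR(255) NOT NULL,
    //                      crawled TINYINT NOT NULL DEFAULT 0,
    //                      UNIQUE KEY (url));
    public class UrlRepository {
        private final Connection conn;

        public UrlRepository(String jdbcUrl, String user, String pass)
                throws SQLException {
            conn = DriverManager.getConnection(jdbcUrl, user, pass);
        }

        // Insert a normalized URL; the UNIQUE key makes duplicate
        // inserts fail, so each URL is scheduled only once even when
        // several crawler nodes discover it concurrently.
        public boolean add(String normalizedUrl) {
            try {
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO urls (url) VALUES (?)");
                ps.setString(1, normalizedUrl);
                ps.executeUpdate();
                ps.close();
                return true;
            } catch (SQLException duplicate) {
                return false; // already known -- skip it
            }
        }
    }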

Otis




