lucene-dev mailing list archives

From: Brian Goetz
Subject: Re: LARM web crawler: use lucene itself for visited URLs
Date: Wed, 30 Oct 2002 22:08:15 GMT

>Yes, I thought of that, but it always felt like a weird idea to me.  I
>can't really explain why....  Clemens, what do you think about this?  I
>was imagining something like skipping the link parts that are the same
>in the previous link....and now I know where I got that :)

This seems dangerous to me, since Lucene is free to take liberties with 
tokens, such as stemming and filtering out stop words.  So a URL like
might get mapped to
if you used a stopword analyzer.

A very common trick for compressing paths is this: give each known URL 
prefix a code, and store each longer prefix as the code of its parent 
prefix plus the final path segment.  Example:

/foo -> 1 = ("foo")
/foo/bar -> 2 = (1, "bar")
/foo/blah -> 3 = (1, "blah")
/foo/bar/moo -> 4 = (2, "moo")

This trick is used often in caching, to reduce the number of lookups 
required to find an element in a hierarchical cache.
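
Here is a rough sketch of what such a prefix table could look like, in 
Python for brevity.  The class name PrefixTable, the encode/decode 
methods, and the use of 0 as the "no parent" code are all made up for 
illustration; they are not taken from LARM or Lucene.

```python
class PrefixTable:
    """Assigns an integer code to every URL path prefix seen so far.

    Hypothetical helper for illustration only; code 0 is assumed to
    mean "no parent" (the root).
    """

    def __init__(self):
        # (parent_code, segment) -> code
        self._codes = {}
        # code -> (parent_code, segment), used when decoding
        self._entries = {}

    def encode(self, path):
        """Return the code for path, adding any prefixes not yet known."""
        code = 0
        for segment in path.strip("/").split("/"):
            if not segment:
                continue  # skip empty segments, e.g. for the path "/"
            key = (code, segment)
            if key not in self._codes:
                new_code = len(self._codes) + 1
                self._codes[key] = new_code
                self._entries[new_code] = key
            code = self._codes[key]
        return code

    def decode(self, code):
        """Rebuild the path by walking parent codes back to the root."""
        segments = []
        while code != 0:
            code, segment = self._entries[code]
            segments.append(segment)
        return "/" + "/".join(reversed(segments))
```

Encoding /foo, /foo/bar, /foo/blah and /foo/bar/moo in that order gives 
the codes 1 through 4 from the example, and /foo/bar/moo is stored as 
the single entry (2, "moo").  A visited-URL set then only needs to hold 
integer codes rather than full strings.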

Brian Goetz
Quiotix Corporation           Tel: 650-843-1300            Fax: 650-324-8032
