lucene-dev mailing list archives

From Brian Goetz <br...@quiotix.com>
Subject Re: LARM web crawler: use lucene itself for visited URLs
Date Wed, 30 Oct 2002 22:08:15 GMT

>Yes, I thought of that, but it always felt like a weird idea to me.  I
>can't really explain why....  Clemens, what do you think about this?  I
>was imagining something like skipping the link parts that are the same
>in the previous link....and now I know where I got that :)

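This seems dangerous to me, since Lucene's analyzers are free to take
liberties with tokens, such as stemming them or filtering out stop words.
So a URL like
  /path/to/foo
might get mapped to
  /path/foo
if you used an analyzer that removes stop words ("to" is a typical one).
Two different URLs could then look identical to the visited-URL check.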
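
To make the collision concrete, here is a toy sketch of the problem. It is
not Lucene's analyzer: the `analyze` function and the `STOP_WORDS` set are
hypothetical stand-ins that only mimic the general behaviour of lowercasing,
tokenizing, and dropping stop words.

```python
import re

# Illustrative stop-word list; a real analyzer's list will differ.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def analyze(text):
    """Toy analyzer: lowercase, split into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def same_after_analysis(url_a, url_b):
    """True if two URLs are indistinguishable once both have been analyzed."""
    return analyze(url_a) == analyze(url_b)
```

With this stand-in, `same_after_analysis("/path/to/foo", "/path/foo")` is
true, so a crawler that looked up visited URLs through analyzed tokens would
treat the second URL as already seen.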

A very common trick for compressing paths is this: give each known URL 
prefix a code.  Example:

/foo -> 1 = ("foo")
/foo/bar -> 2 = (1, "bar")
/foo/blah -> 3 = (1, "blah")
/foo/bar/moo -> 4 = (2, "moo")

This trick is used often in caching, to reduce the number of lookups 
required to find an element in a hierarchical cache.
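
The prefix-code table above could be sketched roughly as follows. This is
only an illustration under a few assumptions of mine: the class name
`PrefixTable`, its method names, and the use of code 0 for the root (no
parent) are all made up for the example, and trailing slashes and query
strings are ignored.

```python
class PrefixTable:
    """Assigns an integer code to every URL path prefix it has seen."""

    ROOT = 0  # assumed code for "no parent prefix"

    def __init__(self):
        self._codes = {}    # (parent_code, segment) -> code
        self._entries = {}  # code -> (parent_code, segment), used for decoding

    def encode(self, path):
        """Return the code for `path`, creating codes for new prefixes."""
        code = self.ROOT
        for segment in path.strip("/").split("/"):
            if not segment:
                continue  # skip empty segments such as those from "//"
            key = (code, segment)
            if key not in self._codes:
                new_code = len(self._codes) + 1
                self._codes[key] = new_code
                self._entries[new_code] = key
            code = self._codes[key]
        return code

    def decode(self, code):
        """Rebuild the path for `code` by following parent codes to the root."""
        segments = []
        while code != self.ROOT:
            code, segment = self._entries[code]
            segments.append(segment)
        return "/" + "/".join(reversed(segments))
```

Encoding /foo, /foo/bar, /foo/blah and /foo/bar/moo in that order on a
fresh table yields the codes 1 to 4 from the example, and each stored entry
is just a (parent code, last segment) pair, so the shared prefix text is
kept only once.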



--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com



