lucene-java-user mailing list archives

From Ype Kingma <>
Subject LARM web crawler: use lucene itself for visited URLs
Date Wed, 30 Oct 2002 21:58:57 GMT

I managed to lose some recent messages on the LARM crawler and the lucene
file formats, so I don't know whom to address.

Anyway, I noticed this on the LARM crawler info page:
Something worth while would be to compress the URLs. A lot of parts of URLs 
are the same between hundreds of URLs (i.e. the host name). And since only a 
limited number of characters are allowed in URLs, Huffman compression will 
lead to a good compression rate. 

and this on the file formats page:
Term text prefixes are shared. The PrefixLength is the number of initial 
characters from the previous term which must be pre-pended to a term's suffix 
in order to form the term's text. Thus, if the previous term's text was 
"bone" and the term is "boy", the PrefixLength is two and the suffix is "y". 

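To make the shared-prefix idea concrete, here is a minimal sketch in plain Java.
It is not lucene's actual on-disk format or API, only the PrefixLength/suffix
scheme quoted above applied to a sorted list of strings. The class name, the
method names, the "prefixLength:suffix" text form and the example.org URLs are
all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class PrefixCodingSketch {

    /** Number of leading characters that two strings have in common. */
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    /**
     * Encodes terms (assumed to be in sorted order) as "prefixLength:suffix"
     * entries, where prefixLength counts the characters reused from the
     * previous term. The first term has nothing to reuse, so its length is 0.
     */
    static List<String> encode(Iterable<String> sortedTerms) {
        List<String> entries = new ArrayList<String>();
        String previous = "";
        for (String term : sortedTerms) {
            int prefixLength = sharedPrefixLength(previous, term);
            entries.add(prefixLength + ":" + term.substring(prefixLength));
            previous = term;
        }
        return entries;
    }

    public static void main(String[] args) {
        // The example from the file formats page.
        System.out.println(encode(Arrays.asList("bone", "boy"))); // [0:bone, 2:y]

        // Hypothetical normalized URLs; a TreeSet keeps them sorted, so
        // neighbours share the scheme, host name and leading path.
        TreeSet<String> visited = new TreeSet<String>();
        visited.add("http://example.org/index.html");
        visited.add("http://example.org/docs/a.html");
        visited.add("http://example.org/docs/b.html");
        for (String entry : encode(visited)) {
            System.out.println(entry);
        }
    }
}
```

With sorted URLs most of each entry is the shared host name and path, so only
the short differing tail has to be written out for each one.
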
Somehow I get the impression that lucene itself would be quite helpful for 
the crawler, by using indexed, non-stored fields for the normalized visited 
URLs.

Have fun,
