lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ype Kingma <ykin...@xs4all.nl>
Subject LARM web crawler: use lucene itself for visited URLs
Date Wed, 30 Oct 2002 21:58:57 GMT

I managed to loose some recent messages on the LARM crawler and the lucene
file formats, so I don't know whom to address.

Anyway, I noticed this on the LARM crawler info page
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
<<<
Something worth while would be to compress the URLs. A lot of parts of URLs 
are the same between hundreds of URLs (i.e. the host name). And since only a 
limited number of characters are allowed in URLs, Huffman compression will 
lead to a good compression rate. 
>>>

and this on the file formats page
http://jakarta.apache.org/lucene/docs/fileformats.html
<<<
Term text prefixes are shared. The PrefixLength is the number of initial 
characters from the previous term which must be pre-pended to a term's suffix 
in order to form the term's text. Thus, if the previous term's text was 
"bone" and the term is "boy", the PrefixLength is two and the suffix is "y". 
>>>

Somehow I get the impression that lucene itself would be quite helpful for 
the crawler by using indexed, non stored fields for the normalized visited 
URLs.

Have fun,
Ype

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message