lucene-dev mailing list archives

From Halácsy Péter <>
Subject RE: Web Crawler
Date Wed, 24 Apr 2002 21:41:46 GMT

> -----Original Message-----
> From: Clemens Marschner []
> Sent: Wednesday, April 24, 2002 11:14 PM
> To: Lucene Developers List;
> Subject: Re: Web Crawler
> Another thing I have in mind is to compress the URLs in memory. First of
> all, the URL can be divided into several parts, some of which occur in a
> lot of URLs (e.g. the host name). And finally, URLs contain only a
> limited number of different characters, so Huffman encoding is probably
> quite efficient.
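A minimal sketch of the Huffman idea: build a code over the character distribution of a URL corpus, and since URLs draw on a small alphabet, the average code is well under 8 bits per character. Class and method names below are hypothetical, not from any crawler code under discussion:

```java
import java.util.*;

// Sketch: Huffman-code the characters of a URL corpus.
// Illustrative only; a real crawler would also pack the bit
// string into bytes instead of keeping it as a String.
public class UrlHuffman {
    static class Node implements Comparable<Node> {
        final int freq; final char ch; final Node left, right;
        Node(char ch, int freq) { this.ch = ch; this.freq = freq; left = null; right = null; }
        Node(Node l, Node r) { ch = 0; freq = l.freq + r.freq; left = l; right = r; }
        boolean leaf() { return left == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build a prefix-free code table from character frequencies.
    static Map<Character, String> buildCodes(String corpus) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : corpus.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue()));
        if (pq.size() == 1) pq.add(new Node('\0', 0)); // degenerate one-symbol corpus
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        Map<Character, String> codes = new HashMap<>();
        walk(pq.poll(), "", codes);
        return codes;
    }

    static void walk(Node n, String prefix, Map<Character, String> codes) {
        if (n.leaf()) { codes.put(n.ch, prefix); return; }
        walk(n.left, prefix + "0", codes);
        walk(n.right, prefix + "1", codes);
    }

    // Encode a URL as a bit string using the code table.
    static String encode(String url, Map<Character, String> codes) {
        StringBuilder sb = new StringBuilder();
        for (char c : url.toCharArray()) sb.append(codes.get(c));
        return sb.toString();
    }

    public static void main(String[] args) {
        String corpus = "http://jakarta.apache.org/lucene/ http://jakarta.apache.org/ant/";
        Map<Character, String> codes = buildCodes(corpus);
        String url = "http://jakarta.apache.org/lucene/";
        String bits = encode(url, codes);
        System.out.println(url.length() * 8 + " bits raw vs " + bits.length() + " bits coded");
    }
}
```

Splitting the URL first (host vs. path, as suggested above) helps further, since the host part can be replaced by a small integer id shared by every URL on that host.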
see this:

"In CS2, each URL is stored in 10 bytes. In CS1, each link requires 8 bytes to store as
both an in-link and out-link; in CS2, an average of only 3.4 bytes are used. Second, CS2 provides
additional functionality in the form of a host database. For example, in CS2, it is easy to
get all the in-links for a given node, or just the in-links from remote hosts. 

Like CS1, CS2 is designed to give high-performance access to all this data on a high-end machine
with enough RAM to store the database in memory.  On a 465 MHz Compaq AlphaServer 4100 with
12GB of RAM, it takes 70-80 ms to convert a URL into an internal id or vice versa, and then
only 0.15 ms/link to retrieve each in-link or out-link.  On a uniprocessor machine, a BFS
that reaches 100M nodes takes about 4 minutes; on a 2-processor machine we were able to
complete a BFS every two minutes." 
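The ~3.4 bytes per link figure is the kind of saving that gap (delta) coding gives: sort each node's neighbor ids, store only the gaps, and emit each gap in variable-length bytes. The sketch below shows that general technique, not the actual CS2 on-disk format; names are made up:

```java
import java.io.*;
import java.util.*;

// Sketch: delta + variable-length-byte encoding of an adjacency list.
// Neighbor ids of a node cluster close together (same host), so the
// gaps between sorted ids are small and usually fit in one byte.
public class DeltaVarint {
    // Append v (>= 0) as a varint: 7 data bits per byte, high bit = "more follows".
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while (v >= 0x80) { out.write((v & 0x7F) | 0x80); v >>>= 7; }
        out.write(v);
    }

    static byte[] encodeLinks(int[] neighbors) {
        int[] sorted = neighbors.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sorted) { writeVarint(out, id - prev); prev = id; }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 8 links that would take 32 bytes as raw 4-byte ints:
        int[] links = {100000, 100007, 100012, 100013, 100020, 100031, 100040, 100055};
        byte[] packed = encodeLinks(links);
        System.out.println(packed.length + " bytes for " + links.length + " links");
    }
}
```

Only the first gap (the absolute id) costs several bytes; every later gap in this example fits in one, which is how an average well below 4 bytes per link becomes possible.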

