lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: LARM web crawler: use lucene itself for visited URLs
Date Wed, 30 Oct 2002 22:00:59 GMT
Redirecting this to lucene-dev, which seems more appropriate.

Clemens is the person to talk to.
Yes, I thought of that, but it always felt like a weird idea to me. I
can't really explain why... Clemens, what do you think about this? I
was imagining something like skipping the parts of a link that are the
same as in the previous link... and now I know where I got that idea :)

Otis



--- Ype Kingma <ykingma@xs4all.nl> wrote:
> 
> I managed to lose some recent messages on the LARM crawler and the
> Lucene file formats, so I don't know whom to address.
> 
> Anyway, I noticed this on the LARM crawler info page
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> <<<
> Something worth while would be to compress the URLs. A lot of parts of
> URLs are the same between hundreds of URLs (i.e. the host name). And
> since only a limited number of characters are allowed in URLs, Huffman
> compression will lead to a good compression rate.
> >>>
> 
> and this on the file formats page
> http://jakarta.apache.org/lucene/docs/fileformats.html
> <<<
> Term text prefixes are shared. The PrefixLength is the number of
> initial characters from the previous term which must be pre-pended to
> a term's suffix in order to form the term's text. Thus, if the
> previous term's text was "bone" and the term is "boy", the
> PrefixLength is two and the suffix is "y".
> >>>
> 
> Somehow I get the impression that Lucene itself would be quite helpful
> for the crawler, by using indexed, non-stored fields for the
> normalized visited URLs.
> 
> Have fun,
> Ype
> 
> 
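For illustration, the prefix sharing described in the quoted file formats
text could be applied to a sorted list of normalized URLs roughly as
follows. This is a minimal Python sketch, not LARM or Lucene code: the
function names and the example.org URLs are made up for the example, and
it only shows the (PrefixLength, suffix) idea, not the actual on-disk
format.

```python
# Sketch of shared-prefix ("front") coding as described on the Lucene
# file formats page, applied to URLs. All names here are hypothetical.

def common_prefix_length(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms):
    """Turn a sorted list of strings into (prefix_length, suffix) pairs.

    Each suffix is what remains after dropping the characters shared
    with the previous term.
    """
    encoded = []
    previous = ""
    for term in sorted_terms:
        k = common_prefix_length(previous, term)
        encoded.append((k, term[k:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original strings from (prefix_length, suffix) pairs."""
    terms = []
    previous = ""
    for k, suffix in encoded:
        term = previous[:k] + suffix
        terms.append(term)
        previous = term
    return terms


# The example from the file formats page: "bone" followed by "boy".
print(front_encode(["bone", "boy"]))  # [(0, 'bone'), (2, 'y')]

# Hypothetical visited URLs; sorting puts URLs of one host side by side.
urls = sorted([
    "http://example.org/index.html",
    "http://example.org/docs/a.html",
    "http://example.org/docs/b.html",
])
for prefix_length, suffix in front_encode(urls):
    print(prefix_length, suffix)
```

Because URLs from the same host sort next to each other, most entries in
such a list would store only a short suffix plus a small integer, which
is the saving the crawler page hopes to get from compressing URLs.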




