lucene-dev mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: LARM web crawler: use lucene itself for visited URLs
Date Thu, 31 Oct 2002 08:10:27 GMT
On Wednesday 30 October 2002 23:30, Clemens Marschner wrote:
> There's a good paper on compressing URLs at
> http://citeseer.nj.nec.com/suel01compressing.html
> It takes advantage of the regular structure of the sorted list of URLs
> and compresses the resulting structure with some Huffman encoding.
> I have already implemented a somewhat simpler algorithm that compresses
> URLs based on their prefixes. I may contribute that a little later.

Compressing is one part; storing the visited URLs on disk (to save RAM)
is another. Once the hashtable being used now grows over a maximum size,
it could be added to a Lucene db, after which a new IndexReader can be
opened and the table can be flushed from RAM.
No analyzer is needed to create the Lucene documents, as the URLs are
already normalized.
Lookup can then be done directly with an IndexReader, in case the lookup
in RAM fails.
The nice thing about it is that Lucene scales up quite a bit this way.
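
An untested sketch of that flow is below. To keep the example self-contained, a sorted in-memory array merely stands in for the on-disk Lucene index + IndexReader tier; all class and method names are invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Sketch of a two-tier visited-URL set: a RAM hashtable that is flushed
// to a sorted lower tier once it exceeds maxRamSize. In the proposal the
// lower tier would be a Lucene db queried through a freshly opened
// IndexReader; a sorted array stands in for it here.
public class VisitedUrls {
    private final int maxRamSize;
    private final Set<String> ram = new HashSet<>();
    private String[] disk = new String[0]; // sorted; stands in for the index

    public VisitedUrls(int maxRamSize) {
        this.maxRamSize = maxRamSize;
    }

    // Record a normalized URL; returns false if it was already visited.
    public boolean add(String url) {
        if (contains(url)) {
            return false;
        }
        ram.add(url);
        if (ram.size() > maxRamSize) {
            flush();
        }
        return true;
    }

    // Check RAM first; only on a miss fall back to the lower tier.
    public boolean contains(String url) {
        return ram.contains(url) || Arrays.binarySearch(disk, url) >= 0;
    }

    // Merge the RAM table into the sorted tier and clear it -- analogous
    // to adding documents to the Lucene db, reopening an IndexReader,
    // and flushing the table from RAM.
    private void flush() {
        TreeSet<String> merged = new TreeSet<>(Arrays.asList(disk));
        merged.addAll(ram);
        disk = merged.toArray(new String[0]);
        ram.clear();
    }
}
```

The point of the layering is that the common case (recently seen URLs) stays a cheap hash lookup, while the unbounded history lives on disk.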

Have fun,
Ype


> ----- Original Message -----
> From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
> To: <lucene-dev@jakarta.apache.org>
> Sent: Wednesday, October 30, 2002 11:00 PM
> Subject: Re: LARM web crawler: use lucene itself for visited URLs
>
> > Redirecting this to lucene-dev, seems more appropriate.
> >
> > Clemens is the person to talk to.
> > Yes, I thought of that, but it always felt like a weird idea to me.  I
> > can't really explain why....  Clemens, what do you think about this?  I
> > was imagining something like skipping the link parts that are the same
> > in the previous link....and now I know where I got that :)
> >
> > Otis
> >
> > --- Ype Kingma <ykingma@xs4all.nl> wrote:
> > > I managed to lose some recent messages on the LARM crawler and the
> > > Lucene file formats, so I don't know whom to address.
> > >
> > > Anyway, I noticed this on the LARM crawler info page:
> > > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > >
> > > <<<
> > > Something worthwhile would be to compress the URLs. A lot of parts
> > > of URLs are the same between hundreds of URLs (i.e. the host name).
> > > And since only a limited number of characters are allowed in URLs,
> > > Huffman compression will lead to a good compression rate.
> > > >>>
> > >
> > > and this on the file formats page
> > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > > <<<
> > > Term text prefixes are shared. The PrefixLength is the number of
> > > initial characters from the previous term which must be pre-pended
> > > to a term's suffix in order to form the term's text. Thus, if the
> > > previous term's text was "bone" and the term is "boy", the
> > > PrefixLength is two and the suffix is "y".
> > > >>>
> > >
> > > Somehow I get the impression that Lucene itself would be quite
> > > helpful for the crawler, by using indexed, non-stored fields for
> > > the normalized visited URLs.
> > >
> > > Have fun,
> > > Ype
> > >
> >
> >

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

