lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: LARM Crawler: Status
Date Tue, 18 Jun 2002 22:11:29 GMT

> I just want to keep you informed on how we plan to integrate the LARM
> crawler with Lucene. I'm working with Mehran Mehr on two major
> topics:

Just curious - is that the Iranian Olympic silver medalist from 1996?

> 1. Lucene storage: We want to see a web document as a bunch of
> name-value pairs, one of which is the URL and another could be the
> document itself.

If you are interested, I can send you a class written as a NekoHTML
Filter, which I use for extracting the title, body, meta keywords, and
description.
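To illustrate the "document as name-value pairs" idea, here is a rough sketch of such an extractor. It uses regular expressions purely for brevity; a real implementation would use a proper parser (such as a NekoHTML filter), and all class and field names here are mine, not from LARM or that class:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pulls a few name-value fields out of an HTML page.
// Regexes stand in for a real HTML parser (e.g. a NekoHTML filter).
public class FieldExtractor {

    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern META_DESC =
        Pattern.compile("<meta\\s+name=\"description\"\\s+content=\"(.*?)\"",
                        Pattern.CASE_INSENSITIVE);

    public static Map<String, String> extract(String url, String html) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);                  // one field is always the URL
        Matcher t = TITLE.matcher(html);
        if (t.find()) doc.put("title", t.group(1).trim());
        Matcher d = META_DESC.matcher(html);
        if (d.find()) doc.put("description", d.group(1).trim());
        doc.put("body", html);                // raw content as another field
        return doc;
    }
}
```

Each entry of the resulting map would become one field of a Lucene Document at store time.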

> From within the storage pipeline, these web documents can be enhanced
> or changed. In the end there is the Lucene storage, which takes a web
> document and stores its contents as fields within a Lucene index. So
> the storage itself is stupid. We can think of a lot of preprocessing
> steps that can occur before the store process itself takes place:
> document conversion, HTML removal, header extraction, lemmatization
> and other linguistic features, and so forth. The storage itself can
> also be only an intermediary step: web documents could also be saved
> in plain files or a JMS topic, allowing for the division of the
> processing steps in a temporal or spatial manner.
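The pipeline described above can be sketched in a few lines. This is only an illustration of the shape, assuming a document is a map of name-value pairs and each step transforms it; the names (Pipeline, addStep, run) are invented, not LARM's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the storage pipeline: each step takes a web
// document (name-value pairs) and returns it enhanced or changed; the
// final store is deliberately "stupid" and just receives the result.
public class Pipeline {
    private final List<UnaryOperator<Map<String, String>>> steps = new ArrayList<>();

    public Pipeline addStep(UnaryOperator<Map<String, String>> step) {
        steps.add(step);
        return this;
    }

    public Map<String, String> run(Map<String, String> doc) {
        for (UnaryOperator<Map<String, String>> step : steps) {
            doc = step.apply(doc);  // e.g. HTML removal, header extraction, ...
        }
        return doc;                 // hand the final document to the store
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.org/");
        doc.put("body", "<b>Hello</b>");

        Map<String, String> out = new Pipeline()
            .addStep(d -> { d.put("body", d.get("body").replaceAll("<[^>]+>", "")); return d; })
            .run(doc);
        System.out.println(out.get("body"));  // prints "Hello"
    }
}
```

A file writer or a JMS publisher would simply be another step at the end, which is what makes the intermediary-storage variants cheap to add.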

Have I mentioned the framework here before?
I read about it in JavaPro a few months ago and chose it for an
application that I was/am writing.  It allows for a very elegant and
simple (in terms of use) producer/consumer pipeline.
I've actually added a bit of functionality to the published version and
sent it to the author, who will, I believe, include it in the new
version.
Also, the framework allows for a distributed consumer pipeline with
different communication protocols (JMS, RMI, BEEP...).  That is
something that is not available yet, but the author told me about it
over a month ago.
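The core hand-off pattern such a framework builds on can be shown with a plain JDK BlockingQueue (a deliberately modern, minimal sketch; the class and method names are mine). The distributed transports (JMS, RMI, BEEP) would replace the in-memory queue with a remote one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal producer/consumer sketch: the producer feeds URLs into a
// bounded queue, a consumer thread drains it, and a sentinel value
// tells the consumer when to stop.
public class ProducerConsumer {
    private static final String POISON = "__END__";  // stop sentinel

    public static List<String> runPipeline(List<String> urls) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> fetched = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String url = queue.take(); !url.equals(POISON); url = queue.take()) {
                    fetched.add("fetched:" + url);   // stand-in for real processing
                }
            } catch (InterruptedException ignored) { }
        });
        consumer.start();

        for (String url : urls) queue.put(url);      // producer side
        queue.put(POISON);
        consumer.join();                              // join makes results visible
        return fetched;
    }
}
```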

> 2. Configuration. The crawler is very modular and mainly consists of
> several producer/consumer pipelines that define where documents come
> from and how they are processed. We want this whole pipeline to be
> configurable (remember, most of it is still done from within the
> source code). That way, we want to be able to provide different
> configurations for different purposes: one could mimic the behavior
> of "wget", for example, another could build a fast one-machine
> crawler for a medium-size intranet, while a third configuration could
> be distributed and crawl a major part of the web.

That framework doesn't have anything that allows for dynamic
configurations, but it may still be good to use, because then you
don't have to worry about developing, maintaining, and fixing yet
another component, which should really be just another piece of your
infrastructure on top of which you can construct your specific
application logic.
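To make the configuration idea concrete, here is one possible shape: pipeline steps are named in a config string instead of being wired up in source code, so a "wget-like" and an "intranet" configuration can share the same step implementations. The step names and the registry are invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a configurable pipeline: a config string lists
// which registered steps to run, in order, instead of hard-coding them.
public class ConfigurablePipeline {
    private static final Map<String, UnaryOperator<String>> REGISTRY = new LinkedHashMap<>();
    static {
        // Each named step transforms the document body.
        REGISTRY.put("html-removal", s -> s.replaceAll("<[^>]+>", ""));
        REGISTRY.put("lowercase",    s -> s.toLowerCase());
    }

    public static String run(String config, String doc) {
        for (String name : config.split(",")) {
            UnaryOperator<String> step = REGISTRY.get(name.trim());
            if (step == null) throw new IllegalArgumentException("unknown step: " + name);
            doc = step.apply(doc);
        }
        return doc;
    }
}
```

Swapping configurations then means swapping a line of text, e.g. `run("html-removal,lowercase", page)` versus `run("html-removal", page)`, with no recompilation.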

> As soon as we have done these two things, I think we can move the
> crawler and Lucene a bit closer together.
> We are still looking for people to help us. If you have resources
> left for further development (design, code, test), please read the
> technical overview document and the TODO.txt files in the
> lucene-sandbox repository, and contact me.

I will try to test it.


Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup
