lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: LARM Crawler: Status // Avalon?
Date Fri, 21 Jun 2002 13:44:31 GMT

--- Clemens Marschner <> wrote:
> > If you are interested, I can send you a class that is written as a
> > NekoHTML Filter, which I use for extracting title, body, meta
> > keywords and description.
> Sure, send it over. But isn't the example packaged with Lucene doing
> the same?

It's attached.  I'm sending it to the list, in case anyone searches the
list archives and needs code like this.
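The attached class uses NekoHTML's filter API; as a rough self-contained illustration of the same extraction (title, meta keywords, meta description), here is a plain-regex sketch. The class and method names are made up, and regexes like these are only good enough for simple, well-formed HTML:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch -- the real class is a NekoHTML filter; this only
// illustrates what gets pulled out of the page.
public class MetaExtractor {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern META =
        Pattern.compile("<meta\\s+name=\"(keywords|description)\"\\s+content=\"(.*?)\"",
                        Pattern.CASE_INSENSITIVE);

    /** Returns the page title, or "" if none is found. */
    public static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Returns the content of the named meta tag ("keywords"/"description"). */
    public static String meta(String html, String name) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(name)) return m.group(2).trim();
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>LARM</title>"
            + "<meta name=\"keywords\" content=\"crawler,lucene\">"
            + "<meta name=\"description\" content=\"A web crawler\">"
            + "</head><body>...</body></html>";
        System.out.println(title(html));               // LARM
        System.out.println(meta(html, "keywords"));    // crawler,lucene
        System.out.println(meta(html, "description")); // A web crawler
    }
}
```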

> > Have I mentioned framework here before?
> > I read about it in JavaPro a few months ago and chose it for an
> > application that I was/am writing.  It allows for a very elegant and
> > simple (in terms of use) producer/consumer pipeline.
> > I've actually added a bit of functionality to the version that's at
> > and sent it to the author who will, I believe, include it in
> > the new version.
> > Also, the framework allows for distributed consumer pipeline with
> > different communication protocols (JMS, RMI, BEEP...).  That is
> > something that is not available yet, but the author told me about it
> > over a month ago.
> Hmm.. I'll have a look at it. But keep in mind that the current
> solution is working already, and we probably only need one very simple
> way to transfer the data.

I know, if it works it may not need fixing, but I thought you may want
to get rid of the infrastructure part of your code if there is
something that does it nicely already.

> > > We want this whole pipeline to be configurable
> > > (remember, most of it is still done from within the source code).
> > ...
> > stuff doesn't have anything that allows for dynamic
> > configurations, but it may be good to use because then you don't have
> > to worry about developing, maintaining, fixing yet another component,
> > which should really be just another piece of your infrastructure on
> > top of which you can construct your specific application logic.
> yep, right. that's what i hate about c++ programs (also called
> 'yet-another-linked-list-implementation's :-)) i'll have a look at it;
> I just think the patterns used in LARM are probably too simple to be
> worth the exchange. But I'll see.

This k2d2 framework is super simple to use: extend a base class,
override a single method that takes an object and returns an object (or
null if it consumes it), register your consumers, and put something in
the front queue.  Pipeline done.
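Omitting the threads and queues the real framework adds between stages, the pattern is roughly this (Stage, consume, setNext, and put are made-up names for illustration, not the actual k2d2 API):

```java
import java.util.ArrayList;
import java.util.List;

// Pattern sketch only: each stage overrides one method that takes an
// object and returns an object, or null to swallow it; stages are chained
// into a pipeline.
public abstract class Stage {
    private Stage next;

    public Stage setNext(Stage next) { this.next = next; return next; }

    /** Return a (possibly transformed) object, or null to consume it. */
    protected abstract Object consume(Object in);

    /** Feed an object in; non-null results flow on to the next stage. */
    public void put(Object in) {
        Object out = consume(in);
        if (out != null && next != null) next.put(out);
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<String>();
        Stage upper = new Stage() {
            protected Object consume(Object in) {
                return ((String) in).toUpperCase();
            }
        };
        Stage collect = new Stage() {
            protected Object consume(Object in) {
                sink.add((String) in);
                return null; // consumed: nothing flows further
            }
        };
        upper.setNext(collect);
        upper.put("fetch");
        upper.put("parse");
        System.out.println(sink); // [FETCH, PARSE]
    }
}
```

The real framework runs each consumer on its own thread with a queue in front of it, which is what makes the producer/consumer decoupling useful for a crawler.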

> By the way, I thought about the "putting all together in config files"
> thing: It's probably sufficient to have a couple of applications (main
> classes) that put the basic stuff together, and whose parts are then
> configurable through property files. At least for now.
> I just have this feeling, but I fear some things could become very
> nasty if we have to invent a declarative configuration language that
> describes the configuration of the pipelines, or at least whose
> components tell the configuring class which other components they need
> to know of... (oh, that looks like we need component based
> development...)...

I don't have a better suggestion right now.

> >> Lots of open questions:
> >> - LARM doesn't have the notion of closing everything down. What
> >> happens if IndexWriter is interrupted?
> I must add that in general I don't have experience with using Lucene
> incrementally, that is, updating the index while others are using it.
> Is that working smoothly?

Yes, in my experience it works without problems.

> > As in what if it encounters an exception (e.g. somebody removes the
> > index directory)?  I guess one of the items that should then maybe
> > get added to the to-do list is checkpointing for starters.
> Hm... what do you mean...?
> From what I understand you mean that then the doc is stored in a
> repository until the index is available again...? [confused]

What I meant was this.
You have MySQL to hold your links.
You have N crawler threads.
You don't want to hit MySQL a lot, so you get links to crawl in batches
(e.g. each crawler thread tells MySQL: give me 1000 links to crawl).
The crawler fetches all pages, and they go through your component
pipeline and get processed.
What happens if the crawler thread dies after fetching 100 links from
this batch of 1000?  Do you keep track of which links in that batch
you've already crawled, so that if the thread dies you don't recrawl
them?
That's roughly what I meant.
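A minimal sketch of that checkpointing idea. In LARM the "done" state would live in MySQL (e.g. a crawled flag per link row); a set stands in for it here so the sketch stays self-contained, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-batch checkpointing for a crawler thread.
public class BatchCheckpoint {
    private final Set<String> done = new LinkedHashSet<String>();

    /** Mark a link crawled -- in practice: UPDATE links SET crawled=1 WHERE url=? */
    public void markDone(String url) { done.add(url); }

    /** After a crash, only links not yet marked done need refetching. */
    public List<String> remaining(List<String> batch) {
        List<String> todo = new ArrayList<String>();
        for (String url : batch) {
            if (!done.contains(url)) todo.add(url);
        }
        return todo;
    }

    public static void main(String[] args) {
        BatchCheckpoint cp = new BatchCheckpoint();
        List<String> batch = java.util.Arrays.asList("a", "b", "c", "d");
        // The thread fetches the first two links of its batch, then dies...
        cp.markDone("a");
        cp.markDone("b");
        // ...so on restart only the unfetched tail of the batch is recrawled.
        System.out.println(cp.remaining(batch)); // [c, d]
    }
}
```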

> One last thought:
> - the crawler should be started as a daemon process (at least
> optionally)
> - it should wake up from time to time to crawl changed pages
> - it should provide a management and status interface to the outside.
> - it internally needs the ability to run service jobs while crawling
> (keeping memory tidy, collecting stats, etc.)
> From what I know, these matters could be addressed by the Apache
> Avalon/Phoenix project. Does anyone know anything about it?

To me Avalon looks relatively complex, but from what I've read it is a
piece of software designed to allow applications like your crawler to
run on top of it.  I'm stating the obvious, for some.

