lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: LARM Web Crawler: LuceneStorage [experimental]
Date Tue, 18 Jun 2002 21:55:24 GMT
I see nice progress here.
I will try it in the near future (time!).

> I have added an experimental version of a LuceneStorage to the LARM
> crawler,
> available from CVS in lucene-sandbox. That means crawled documents
> can now directly be indexed into a lucene index.
> Sorry, no configuration files yet. Config is done in
> ...larm/
> The main class FetcherMain is now configured to store the contents in
> a lucene index called "luceneIndex".
> Lots of open questions:
> - LARM doesn't have the notion of closing everything down. What
> happens if IndexWriter is interrupted?

As in, what if it encounters an exception (e.g. somebody removes the
index directory)?  I guess one of the items that should then be added
to the to-do list, for starters, is checkpointing.
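To sketch what I mean by checkpointing: periodically persist the crawl state (e.g. the pending URL queue) so an interrupted run can resume instead of starting over. The class and file names below are made up for illustration; this is not LARM code.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: persist the pending URL queue so a crawl
// interrupted mid-run (e.g. by an IndexWriter failure) can resume
// from the last checkpoint. Names are invented for illustration.
public class CrawlCheckpoint {

    // Write the queue to a temp file, then rename, so a crash during
    // the write never leaves a half-written checkpoint behind.
    public static void save(List<String> pendingUrls, File checkpoint)
            throws IOException {
        File tmp = new File(checkpoint.getPath() + ".tmp");
        PrintWriter out = new PrintWriter(new FileWriter(tmp));
        for (String url : pendingUrls) {
            out.println(url);
        }
        out.close();
        tmp.renameTo(checkpoint);
    }

    // Reload the queue on restart; an empty list means a fresh crawl.
    public static List<String> load(File checkpoint) throws IOException {
        List<String> urls = new ArrayList<String>();
        if (!checkpoint.exists()) return urls;
        BufferedReader in = new BufferedReader(new FileReader(checkpoint));
        String line;
        while ((line = in.readLine()) != null) {
            urls.add(line);
        }
        in.close();
        return urls;
    }
}
```

The atomic write-then-rename step matters: without it, a crash while saving could corrupt the very file you need for recovery.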

> - I haven't tried to read from the index yet...

Heh, I'm familiar with that situation.

> - How to configure the stuff from a config file
> ... (it's late)

A property file with name=value pairs and some init() method that is
called at the beginning may be sufficient.
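Something along these lines, using java.util.Properties — the FetcherConfig class and the property names are hypothetical, just to show the name=value/init() idea:

```java
import java.io.*;
import java.util.Properties;

// Hypothetical sketch of the name=value config idea: load a property
// file once at startup via init() and hand typed values to components.
// The class name and property names are made up for illustration.
public class FetcherConfig {
    private Properties props = new Properties();

    // Called once at the beginning with the config file's contents.
    public void init(InputStream in) throws IOException {
        props.load(in);
    }

    public String get(String name, String defaultValue) {
        return props.getProperty(name, defaultValue);
    }

    public int getInt(String name, int defaultValue) {
        String v = props.getProperty(name);
        return v == null ? defaultValue : Integer.parseInt(v.trim());
    }
}
```

A file would then just contain lines like `threads=5` and `indexDir=luceneIndex`, replacing the constants currently hard-wired in FetcherMain.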

> Please try it:
> To build and run it,
> - put ANT in your path
> - provide a with the location of the lucene Jar file
> (lucene.jar=)
>   (just like javacc in lucene/build.xml)
> - put HTTPClient.jar from and jakarta-oro
> library
> into libs
> - type:
> ant
> run -Dstart=<starturl> -Drestrictto=<restricttourl>
> -Dthreads=<numThreads>
> ex.:
> ant
> run -Dstart=http://localhost/ -Drestrictto=http://localhost.*
> -Dthreads=5
> note: restrictto is a regular expression; the URLs tested against it
> are
> normalized beforehand, which means
> they are made lower case, index.* are removed, and some other
> corrections
> (see for details)

Removing index.* may be too bold and incorrect in some situations.
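For readers following along, the normalization being discussed looks roughly like this — a simplified stand-in, not LARM's actual normalizer code:

```java
// Rough illustration of the normalization described above: lower-case
// the URL and drop a trailing "index.<ext>" filename before matching
// against the restrictto pattern, so http://host/index.html and
// http://host/ are treated as the same page. Simplified stand-in only;
// LARM's real normalizer applies further corrections.
public class UrlNormalizer {
    public static String normalize(String url) {
        String u = url.toLowerCase();
        int slash = u.lastIndexOf('/');
        if (slash >= 0 && u.substring(slash + 1).startsWith("index.")) {
            u = u.substring(0, slash + 1);
        }
        return u;
    }
}
```

Otis's caveat applies exactly here: on a site where /index.html and / serve different content, collapsing them loses pages.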

> note: LuceneStorage is dumb; it just takes the WebDocument and stores
> it.
> That means with the current config it also stores tags, and only one
> "content" field that contains everything. I plan to write another
> storage
> that uses the HTMLDocument from the demo package to store HTML
> documents.

I found NekoHTML to do a nice job of 'dehtmlization'.
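To make "dehtmlization" concrete: the goal is to index only the text, not the markup. The crude regex sketch below shows the idea; in practice a real parser like NekoHTML handles malformed markup, nesting, and entities far better, so treat this only as an illustration.

```java
// Crude illustration of "dehtmlization": strip tags so only the text
// lands in the "content" field. A real HTML parser (e.g. NekoHTML)
// copes with broken real-world markup far better than this regex sketch.
public class Dehtmlizer {
    public static String stripTags(String html) {
        // Remove script/style elements including their bodies first,
        // then any remaining tags, then collapse whitespace.
        String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        text = text.replaceAll("(?s)<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```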

> Please note that when adding this storage to the storage pipeline,
> the whole
> crawling process becomes
> CPU- instead of I/O bound. We already have plans how to do the
> distribution.
> Feel free to contact me if there are questions.
> Still Looking For Contributors!
> Clemens


