Hi,

I have added an experimental version of a LuceneStorage to the LARM crawler, available from CVS in lucene-sandbox. Crawled documents can now be indexed directly into a Lucene index.

Sorry, no configuration files yet. Configuration is done in ...larm/FetcherMain.java. The main class, FetcherMain, is now configured to store the contents in a Lucene index called "luceneIndex".

Lots of open questions:
- LARM has no notion of shutting everything down cleanly. What happens if IndexWriter is interrupted?
- I haven't tried to read from the index yet...
- How to configure all of this from a config file ... (it's late)

Please try it. To build and run it:
- put ANT in your path
- provide a build.properties with the location of the Lucene jar file (lucene.jar=), just like javacc in lucene/build.xml
- put HTTPClient.jar from http://innovation.ch/java and the jakarta-oro library into libs
- type: ant run -Dstart= -Drestrictto= -Dthreads=
  ex.: ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5

Note: restrictto is a regular expression. The URLs tested against it are normalized beforehand: they are converted to lower case, index.* is removed, and some other corrections are applied (see URLNormalizer.java for details).

Note: LuceneStorage is dumb; it just takes the WebDocument and stores it. With the current config that means it also stores the tags, and it writes only one "content" field that contains everything. I plan to write another storage that uses the HTMLDocument from the demo package to store HTML documents.

Please note that when this storage is added to the storage pipeline, the whole crawling process becomes CPU-bound instead of I/O-bound. We already have plans for how to distribute it.

Feel free to contact me if there are questions.

Still Looking For Contributors!

Clemens

--------------------------------------
http://www.cmarschner.net
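
To illustrate the restrictto note above: a minimal sketch of how a URL might be normalized and then tested against the expression. This is not the code from URLNormalizer.java. The class and method names are made up for illustration, it covers only the two steps named in the note (lower-casing and removing index.*), it uses java.util.regex rather than jakarta-oro, and it assumes the expression has to match the whole URL.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of LARM.
final class RestrictToSketch {

    // Stand-in for the normalization step: lower-case the URL and strip a
    // trailing "index.*" file name. The real normalizer applies more rules.
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        // Replace a final path segment such as "/index.html" with "/".
        return lower.replaceFirst("/index\\.[^/?#]*$", "/");
    }

    // Assumption: the normalized URL must match the expression as a whole.
    static boolean allowed(String url, Pattern restrictTo) {
        return restrictTo.matcher(normalize(url)).matches();
    }

    public static void main(String[] args) {
        // Same expression as in the example command line above.
        Pattern restrictTo = Pattern.compile("http://localhost.*");

        System.out.println(normalize("HTTP://LocalHost/Docs/Index.html"));
        // prints http://localhost/docs/
        System.out.println(allowed("HTTP://LocalHost/Docs/Index.html", restrictTo));
        // prints true
        System.out.println(allowed("http://example.org/", restrictTo));
        // prints false
    }
}
```

Because the URL is lower-cased before the test, a restrictto expression containing upper-case characters would never match in this sketch, so the expression should be written in lower case.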