lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Becker <pbec...@dstc.edu.au>
Subject ANN: Docco 0.2 / contribution offer
Date Tue, 02 Sep 2003 11:51:50 GMT
Hi all,

we finally finished the 0.2 release of our little personal document 
management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source 
contains some infrastructure for document handlers and index management. 
The document handlers are written with a very simple API, which just 
asks the implementation to fill a structure with the information 
retrieved from a URL. It is similar to the Ant task in the Lucene 
sandbox, but it separates the information collection and the actual 
indexing, i.e. all the decisions what should be stored and what shouldn't.

The program comes with implementations for plain text, HTML (based on 
Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We 
wrote plugins for POI, PDFbox and Multivalent. The latter is 
unfortunately a wild hack since Multivalent is the worst Java code I've 
seen. Literally. Bad C written in Java. The tool would be nice to use, 
but catching exceptions in little helper classes to do a System.exit is 
just insane. And that is just one of the problems -- we had to do some 
bad hacks to fix these issues. The other implementations should be fine, 
although they need some more testing.

The source (including all required libs) of the program is available via 
Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called "docco". A current snapshot of only the 
source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)


The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and 
implementations
  org.tockit.docco.filefilter: some code to pick document handlers via 
file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management: 
runnable, framework for handlers

The index management is probably not optimal, I strongly suspect that an 
expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there 
is interest. Or to turn it into a project of its own, we don't think it 
should be hidden in our more specific program. It should be easy to 
merge it with the Ant task and we are happy to give a hand if wanted. 
Adding some documentation would be easy, too -- at the moment the code 
is still more for ourself, but it should be very readable by itself. We 
require JDK 1.4, but this can be reduced by moving some more document 
handlers into plugins.

Anyone interested in joining into maintaining this code? Any feedback is 
welcome.

Cheers,
   Peter


Mime
View raw message