Return-Path: Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 98538 invoked from network); 2 Sep 2003 13:01:04 -0000 Received: from unknown (HELO igdmail.igd.fhg.de) (146.140.28.14) by daedalus.apache.org with SMTP; 2 Sep 2003 13:01:04 -0000 Received: from pcara ([192.168.202.2]) by igdmail.igd.fhg.de (Netscape Messaging Server 3.6) with SMTP id AAA48EA for ; Tue, 2 Sep 2003 15:00:10 +0200 Reply-To: From: "Gregor Heinrich" To: "'Lucene Users List'" Subject: RE: Docco 0.2 / contribution offer Date: Tue, 2 Sep 2003 14:59:50 +0200 Message-ID: <001501c37152$1931d8e0$2501a8c0@pcara> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook CWS, Build 9.0.2416 (9.0.2911.0) X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 In-Reply-To: <3F548456.8040803@dstc.edu.au> Importance: Normal X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N Hi Peter. Docco is a great tool which I have been using since you posted your first announcement (version 1.0, that is). Beside the things you mention in you mail I also generally think it's a great idea to using formal concept analysis with Lucene. I would be interested to explore the idea also for more structured data (maybe include fields and even hierarchies). Apart from this, if I had an idea of the time commitments connected, I would definitely consider to join. Best, Gregor -----Original Message----- From: Peter Becker [mailto:pbecker@dstc.edu.au] Sent: Tuesday, September 02, 2003 1:52 PM To: Lucene Users List Subject: ANN: Docco 0.2 / contribution offer Hi all, we finally finished the 0.2 release of our little personal document management tool based on Lucene: http://tockit.sourceforge.net/docco/index.html This might be interesting for some readers of this list since its source contains some infrastructure for document handlers and index management. The document handlers are written with a very simple API, which just asks the implementation to fill a structure with the information retrieved from a URL. It is similar to the Ant task in the Lucene sandbox, but it separates the information collection and the actual indexing, i.e. all the decisions what should be stored and what shouldn't. The program comes with implementations for plain text, HTML (based on Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We wrote plugins for POI, PDFbox and Multivalent. The latter is unfortunately a wild hack since Multivalent is the worst Java code I've seen. Literally. Bad C written in Java. The tool would be nice to use, but catching exceptions in little helper classes to do a System.exit is just insane. And that is just one of the problems -- we had to do some bad hacks to fix these issues. The other implementations should be fine, although they need some more testing. The source (including all required libs) of the program is available via Sourceforge's CVS: http://sourceforge.net/cvs/?group_id=37081 The module in question is called "docco". A current snapshot of only the source is here: http://tockit.sourceforge.net/docco/source20030902.zip (~100kb) The relevant packages are: org.tockit.docco.documenthandler: the documenthandler interface and implementations org.tockit.docco.filefilter: some code to pick document handlers via file extensions or regexps org.tockit.docco.index: the model/static bits of the index management org.tockit.docco.indexer: the dynamic aspects of the index management: runnable, framework for handlers The index management is probably not optimal, I strongly suspect that an expert could tweak it. But the structure should be ok. We would be happy to contribute this code to the Lucene sandbox if there is interest. Or to turn it into a project of its own, we don't think it should be hidden in our more specific program. It should be easy to merge it with the Ant task and we are happy to give a hand if wanted. Adding some documentation would be easy, too -- at the moment the code is still more for ourself, but it should be very readable by itself. We require JDK 1.4, but this can be reduced by moving some more document handlers into plugins. Anyone interested in joining into maintaining this code? Any feedback is welcome. Cheers, Peter --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-user-help@jakarta.apache.org