Date: Wed, 03 Sep 2003 09:36:27 +1000
From: Peter Becker
To: Lucene Users List
Subject: Re: Docco 0.2 / contribution offer

Hi Gregor,

Gregor Heinrich wrote:

>Hi Peter.
>
>Docco is a great tool which I have been using since you posted your first
>announcement (version 1.0, that is). Besides the things you mention in your
>

No, it must be 0.1 :-) It wasn't too bad for a first hack, though *g*

>mail I also generally think it's a great idea to use formal concept
>analysis with Lucene. I would be interested to explore the idea also for
>more structured data (maybe include fields and even hierarchies).
>

Sounds interesting. We will be busy with other things for a month or two; afterwards we will probably do some experiments with extending Docco towards LSA and similar technologies (main idea is guessing good keywords to refine the query -- which is a bit different if you are happy with overspecified queries). If you have some particular ideas we can discuss them on tockit-general@lists.sf.net.
Combining the current Docco approach with conceptual scaling could be interesting -- if that is what you mean. It would be a bit off-topic here, so let's put this on our list.

The question for this list was more whether someone is interested in turning the indexing and index management bits into a separate project. I'd love to see it used by a broader audience and I think it would make getting started with Lucene a bit easier for some people -- at least if we add a little JavaDoc in the form of package.htmls and class descriptions. And there are not that many classes, so that shouldn't be too much pain.

>
>Apart from this, if I had an idea of the time commitments connected, I would
>definitely consider joining.
>

Hey -- it is Open Source and voluntary. You commit as much time as you want ;-) Not that that couldn't be a problem -- at least I tend to try too many things at once.

If you are interested in more Docco-specific things, just post your ideas on our mailing list. That's usually a good start to get involved. And I am academic enough to always find the time to discuss things -- doing them is the hard bit :-) Once I understand what exactly you want to do I can probably give you a reasonable estimate of the effort involved.

Cheers,
  Peter

>Best,
>
>Gregor
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, September 02, 2003 1:52 PM
>To: Lucene Users List
>Subject: ANN: Docco 0.2 / contribution offer
>
>
>Hi all,
>
>we finally finished the 0.2 release of our little personal document
>management tool based on Lucene:
>
>  http://tockit.sourceforge.net/docco/index.html
>
>This might be interesting for some readers of this list since its source
>contains some infrastructure for document handlers and index management.
>The document handlers are written with a very simple API, which just
>asks the implementation to fill a structure with the information
>retrieved from a URL.
>It is similar to the Ant task in the Lucene
>sandbox, but it separates the information collection from the actual
>indexing, i.e. all the decisions about what should be stored and what
>shouldn't.
>
>The program comes with implementations for plain text, HTML (based on
>Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
>wrote plugins for POI, PDFbox and Multivalent. The latter is
>unfortunately a wild hack since Multivalent is the worst Java code I've
>seen. Literally. Bad C written in Java. The tool would be nice to use,
>but catching exceptions in little helper classes to do a System.exit is
>just insane. And that is just one of the problems -- we had to do some
>bad hacks to fix these issues. The other implementations should be fine,
>although they need some more testing.
>
>The source (including all required libs) of the program is available via
>Sourceforge's CVS:
>
>  http://sourceforge.net/cvs/?group_id=37081
>
>The module in question is called "docco". A current snapshot of only the
>source is here:
>
>  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)
>
>
>The relevant packages are:
>
>  org.tockit.docco.documenthandler: the documenthandler interface and
>implementations
>  org.tockit.docco.filefilter: some code to pick document handlers via
>file extensions or regexps
>  org.tockit.docco.index: the model/static bits of the index management
>  org.tockit.docco.indexer: the dynamic aspects of the index management:
>runnable, framework for handlers
>
>The index management is probably not optimal; I strongly suspect that an
>expert could tweak it. But the structure should be ok.
>
>We would be happy to contribute this code to the Lucene sandbox if there
>is interest. Or to turn it into a project of its own; we don't think it
>should be hidden in our more specific program. It should be easy to
>merge it with the Ant task and we are happy to give a hand if wanted.
>Adding some documentation would be easy, too -- at the moment the code
>is still more for ourselves, but it should be very readable by itself. We
>require JDK 1.4, but this can be reduced by moving some more document
>handlers into plugins.
>
>Anyone interested in joining in maintaining this code? Any feedback is
>welcome.
>
>Cheers,
>  Peter
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
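
The design described in the announcement (a handler fills a structure from a URL, a filter picks the handler by file extension or regexp, and the indexing side decides what to store) could look roughly like the sketch below. This is not the Docco source: every name here (DocumentSummary, DocumentHandler, PlainTextHandler, HandlerRegistry, fieldsToStore) is a hypothetical stand-in, the code uses present-day Java rather than JDK 1.4, and it leaves Lucene out entirely, so a real indexer would build Lucene documents where this sketch returns a plain map.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class DocumentHandlerSketch {

    /** Structure a handler fills in; the fields are illustrative assumptions. */
    static final class DocumentSummary {
        String title;
        String body;
    }

    /** Collects information from a URL and makes no indexing decisions. */
    interface DocumentHandler {
        DocumentSummary parse(URL url) throws IOException;
    }

    /** Hypothetical plain-text handler: first line is the title, all text is the body. */
    static final class PlainTextHandler implements DocumentHandler {
        @Override
        public DocumentSummary parse(URL url) throws IOException {
            StringBuilder body = new StringBuilder();
            String firstLine = null;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (firstLine == null) {
                        firstLine = line;
                    }
                    body.append(line).append('\n');
                }
            }
            DocumentSummary summary = new DocumentSummary();
            summary.title = firstLine == null ? "" : firstLine;
            summary.body = body.toString();
            return summary;
        }
    }

    /** Picks a handler by file name; registration order decides precedence. */
    static final class HandlerRegistry {
        private final Map<Pattern, DocumentHandler> handlers = new LinkedHashMap<>();

        void registerExtension(String extension, DocumentHandler handler) {
            registerPattern(".*\\." + Pattern.quote(extension), handler);
        }

        void registerPattern(String regex, DocumentHandler handler) {
            handlers.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE), handler);
        }

        /** Returns the first handler whose pattern matches, or null if none does. */
        DocumentHandler find(String fileName) {
            for (Map.Entry<Pattern, DocumentHandler> entry : handlers.entrySet()) {
                if (entry.getKey().matcher(fileName).matches()) {
                    return entry.getValue();
                }
            }
            return null;
        }
    }

    /** Indexing side: decides what to keep. Here the title and the body length only. */
    static Map<String, String> fieldsToStore(DocumentSummary summary) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", summary.title);
        fields.put("length", String.valueOf(summary.body.length()));
        return fields;
    }

    public static void main(String[] args) throws IOException {
        HandlerRegistry registry = new HandlerRegistry();
        registry.registerExtension("txt", new PlainTextHandler());

        // Use a temporary local file so the example needs no network access.
        Path file = Files.createTempFile("docco-sketch", ".txt");
        Files.write(file, "Title line\nSome body text\n".getBytes(StandardCharsets.UTF_8));

        DocumentHandler handler = registry.find(file.getFileName().toString());
        DocumentSummary summary = handler.parse(file.toUri().toURL());
        System.out.println(fieldsToStore(summary));
        Files.delete(file);
    }
}
```

Keeping fieldsToStore apart from the handlers mirrors the separation the announcement emphasises: a new file format only needs a new handler, while the choice of what ends up in the index stays in one place.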