lucene-dev mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Parser Question
Date Wed, 16 Jul 2003 08:17:56 GMT
Leo Galambos wrote:

> Peter Becker wrote:
>
>> Hi Tod,
>>
>> as far as I know Lucene itself doesn't offer this (at least we failed 
>> to find it). The closest thing available seems to be the Ant tasks.
>>
>> We are currently working on introducing this notion for our program, 
>> which is open source. Besides the plugin mechanism there will be a 
>> file filter mapping and a thread mechanism to maintain an index as 
>> well as implementations using POI and Multivalent. Give us another 
>> week or two.
>
>
> Unfortunately, I didn't get this. Could you explain the mechanism, 
> please? Thank you 

Not fully yet, since we are still working on it ;-) You can find the 
code in our CVS repository on Sourceforge:

  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/

The idea is that you have to supply different parsers for different 
formats, then turn the results found into Lucene Document objects. At 
the moment we do this using a normal interface similar to the one used 
in the Java Ant tasks (see the "handlers" directory), but we want to 
turn it into a plugin interface. Our tool should in the end do TXT, HTML 
and XML out of the box and have at least three plugin implementations:

  - POI for .doc, .xls
  - PDFBox for .pdf
  - Multivalent for .pdf, .dvi and others
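
To make the idea above a bit more concrete, here is a rough sketch of what a per-format handler plus an extension-to-handler mapping could look like. This is not the actual Docco code: the names DocumentHandler, PlainTextHandler and HandlerRegistry are made up for illustration, the sketch is written against a current JDK, and a plain Map of field names to text stands in for the Lucene Document so that the example has no dependencies. A real handler would copy those fields into a Lucene Document instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical handler interface: one implementation per file format. */
interface DocumentHandler {
    /** Returns field name -> text pairs; a stand-in for building a Lucene Document. */
    Map<String, String> parse(Path file) throws IOException;
}

/** Plain-text handler: the whole file content becomes a single "body" field. */
class PlainTextHandler implements DocumentHandler {
    @Override
    public Map<String, String> parse(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.toString());
        fields.put("body", new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return fields;
    }
}

/** Hypothetical registry mapping file extensions to handlers. */
class HandlerRegistry {
    private final Map<String, DocumentHandler> handlers = new HashMap<>();

    /** Registers a handler for an extension given without the dot, e.g. "txt". */
    void register(String extension, DocumentHandler handler) {
        handlers.put(extension.toLowerCase(Locale.ROOT), handler);
    }

    /** Looks up a handler by the file's extension; empty if none is registered. */
    Optional<DocumentHandler> lookup(Path file) {
        Path fileName = file.getFileName();
        if (fileName == null) {
            return Optional.empty();
        }
        String name = fileName.toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(handlers.get(extension));
    }
}
```

A plugin for one of the formats listed above would then just be another DocumentHandler registered under its extensions, and files without a registered handler are skipped by the indexer.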

The plugin API will be extremely simple and it should fit easily with 
the Ant tasks, so you should be able to wrap our code into an Ant task 
or whatever interface you need.

  Peter




