lucene-java-user mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: Bridge with OpenOffice
Date Mon, 19 Apr 2004 23:38:18 GMT
On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6231
>
> Don't know what you would like to do, but if you simply want to
> extract text, you could try this snippet:

This leads to a question I had been thinking about: it seems this thread was
originally started by someone pointing out that OO can be used as a converter
from other formats... but how about a tokenizer for native OO documents? I
have written full-featured converters from OO to (simplified) DocBook and
HTML, and creating one just for tokenizing, to be used by Lucene, would be
much easier. Even if it tokenized into separate fields (document metadata,
content, maybe bibliography separately, etc.), it would be easy to do.
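
To make the separate-fields idea a bit more concrete, here is a minimal
sketch, not a real tokenizer. It assumes only that a native OO document is a
zip archive holding content.xml (the body) and meta.xml (the metadata). The
class name OoTextExtractor, the method extractFields and the field names
"content" and "metadata" are made up for illustration, and the output is
plain strings rather than anything tied to Lucene's API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: pulls raw text out of an OpenOffice zip, one string per field. */
public class OoTextExtractor {

    /** Collects all character data, adding a space whenever an element closes. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Crude separator so adjacent paragraphs do not run together.
            // (It also splits a word that is styled mid-word; fine for a sketch.)
            text.append(' ');
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Documents may declare an external DTD that is not inside the archive;
            // hand back an empty one instead of trying to load it.
            return new InputSource(new StringReader(""));
        }
    }

    /** Returns the fields found in the archive: "content" and/or "metadata". */
    public static Map<String, String> extractFields(InputStream odf) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String field;
                if (entry.getName().equals("content.xml")) {
                    field = "content";
                } else if (entry.getName().equals("meta.xml")) {
                    field = "metadata";
                } else {
                    continue; // styles, settings, images: ignored in this sketch
                }
                // Buffer the entry first so the XML parser cannot close the zip stream.
                byte[] xml = readEntry(zip);
                TextCollector collector = new TextCollector();
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new ByteArrayInputStream(xml), collector);
                // Collapse runs of whitespace into single spaces.
                fields.put(field, collector.text.toString().replaceAll("\\s+", " ").trim());
            }
        }
        return fields;
    }

    /** Reads the remainder of the current zip entry into memory. */
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

Each returned string could then be indexed as its own field. A real tokenizer
would presumably look at element names too (headings, bibliography entries
and so on) instead of flattening everything into one string per file.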

Would anyone find a full-featured, customizable OpenOffice document tokenizer
useful?

-+ Tatu +-

