Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
From: Tatu Saloranta <tatu@hypermall.net>
Reply-To: tatu@hypermall.net
Organization: Linux-users missalie
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: Bridge with OpenOffice
Date: Mon, 19 Apr 2004 17:38:18 -0600
User-Agent: KMail/1.5
References: <Pine.LNX.4.44.0404161808360.10235-100000@mere.cirano.qc.ca>
 <4084301D.8080006@ops.co.at>
In-Reply-To: <4084301D.8080006@ops.co.at>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200404191738.18396.tatu@hypermall.net>

On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6231
>
> Don't know what you would like to do, but if you simply would like to
> extract text, you could simply try this snippet:

This leads to a question I have been thinking about. It seems this thread 
originally started with someone pointing out that OO can be used as a 
converter from other formats... but how about a tokenizer for native OO 
documents? I have written full-featured converters from OO to (simplified) 
DocBook and HTML, and creating one that just tokenizes for use by Lucene 
would be much easier. Even if it tokenized into separate fields (document 
metadata, content, maybe bibliography separately, etc.), it would be easy to do.

Would anyone find a full-featured, customizable OpenOffice document tokenizer 
useful?

-+ Tatu +-
