From: Tatu Saloranta
Reply-To: tatu@hypermall.net
To: "Lucene Users List"
Subject: Re: Bridge with OpenOffice
Date: Mon, 19 Apr 2004 17:38:18 -0600
Message-Id: <200404191738.18396.tatu@hypermall.net>
In-Reply-To: <4084301D.8080006@ops.co.at>

On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6231
>
> Dont know what you would like to do, but if you simply would like to
> extract text, you could simply try this sniplet:

This leads to a question I had been thinking about. It seems this thread
originally started with someone pointing out that OO can be used as a
converter from other formats... but how about a tokenizer for native OO
documents?

I have written full-featured converters from OO to (simplified) DocBook and
HTML, and creating one just for tokenizing, to be used by Lucene, would be
much easier. Even if it tokenized into separate fields (document metadata,
content, maybe bibliography separately, etc.), it would be easy to do.

Would anyone find a full-featured, customizable OpenOffice document
tokenizer useful?

-+ Tatu +-
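
For illustration only, the field-splitting idea might be sketched roughly as
below. This is not code from the message or from the converters it mentions.
It assumes the document is an OpenOffice.org package, i.e. a zip archive
holding `content.xml` and `meta.xml`, and the function name `extract_fields`
and the field names `metadata` and `content` are hypothetical. It is written
in Python for brevity and only collects plain text per field; the actual
tokenizing would still be left to whatever analyzer the index uses.

```python
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical mapping from index field name to package member.
FIELD_MEMBERS = (("metadata", "meta.xml"), ("content", "content.xml"))


def extract_fields(source):
    """Return a dict of plain text per field from an OpenOffice package.

    `source` is a file path or a binary file-like object.
    Members that are missing from the archive yield an empty string.
    """
    fields = {}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        for field, member in FIELD_MEMBERS:
            if member not in names:
                fields[field] = ""
                continue
            root = ET.fromstring(archive.read(member))
            # itertext() walks all text nodes in document order,
            # so the markup is dropped and only the text remains.
            pieces = (text.strip() for text in root.itertext())
            fields[field] = " ".join(p for p in pieces if p)
    return fields
```

Each value in the returned dict could then be added to the index as its own
field, so that metadata and body text stay searchable separately.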