Message-ID: <4084A622.8020102@peterbecker.de>
Date: Tue, 20 Apr 2004 14:25:06 +1000
From: Peter Becker
To: Lucene Users List
Subject: Re: Bridge with OpenOffice
References: <4084301D.8080006@ops.co.at> <200404191738.18396.tatu@hypermall.net>
In-Reply-To: <200404191738.18396.tatu@hypermall.net>

We did a simple one a while ago. It could probably be a bit more sophisticated, but it seems to do its job in the little bit of testing we did.
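
For anyone who only wants the gist of the approach: the following is a minimal sketch, not the Docco handler itself. It assumes the document is an OpenOffice zip archive with a content.xml entry, that paragraphs and headings use the conventional "text" prefix (text:p, text:h), and that plain text is all you want to hand to Lucene. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls paragraph and heading text out of an OpenOffice file.
public class OpenOfficeTextExtractor {

    // Scans the zip archive for content.xml and returns its text, one line per paragraph.
    public static String extractText(InputStream document) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(document)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return parseContent(zip);
                }
            }
        }
        throw new IOException("No content.xml entry found");
    }

    private static String parseContent(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // > 0 while inside a paragraph or heading

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Older files reference an external DTD; answer with an
                // empty one so the parser does not go looking for it.
                return new InputSource(new StringReader(""));
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (isTextBlock(qName)) {
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (isTextBlock(qName)) {
                    depth--;
                    text.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    text.append(ch, start, length);
                }
            }
        };
        // The default factory is not namespace-aware, so qName carries the prefix.
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return text.toString();
    }

    // Assumption: the "text" prefix is bound to the OpenOffice text namespace.
    private static boolean isTextBlock(String qName) {
        return "text:p".equals(qName) || "text:h".equals(qName);
    }
}
```

The returned string can then be added to a Lucene document as a single content field; splitting metadata into separate fields would mean reading meta.xml as well, which this sketch does not attempt.
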
See http://cvs.sourceforge.net/viewcvs.py/toscanaj/docco/source/org/tockit/docco/documenthandler/OpenOfficeDocumentHandler.java?rev=1.4&view=auto

HTH,
  Peter

PS: sorry for the broken whitespace -- I just noticed that myself.

Tatu Saloranta wrote:

>On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
>
>>Stephane James Vaucher wrote:
>>
>>>Anyone try what Joerg suggested here?
>>>http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6231
>>
>>Don't know what you would like to do, but if you simply would like to
>>extract text, you could simply try this snippet:
>
>This leads to a question I was thinking about; it seems that this thread
>originally started with someone pointing out that OO can be used as a
>converter from other formats... but how about a tokenizer for native OO
>documents? I have written full-featured converters from OO to (simplified)
>DocBook and HTML, and creating one just for tokenizing, to be used by
>Lucene, would be much easier. Even if it tokenized into separate fields
>(document metadata, content, maybe bibliography separately, etc.), it'd be
>easy to do.
>
>Would anyone find a full-featured, customizable OpenOffice document
>tokenizer useful?
>
>-+ Tatu +-
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org