From: Victor Hadianto
Organization: NUIX Pty. Ltd.
To: "Lucene Users List"
Subject: Re: RE : Parsers
Date: Thu, 29 May 2003 09:01:09 +1000

> The www.textmining.org text extractors work very well for Word and pdf
> documents.
> They use both PDFBox and POI.
>
> For Excel, using POI directly is very easy. Tell me if you want to see
> code samples.
>
> I'm looking myself for a Powerpoint text extractor, if you know one...

Another solution is to use Microsoft Office itself. You can set up a server
that handles requests to convert Microsoft Office documents. There are many
ways of doing this. For example, you can use Python to call Office directly
and then put your Python script behind a web server. Or you can set up a .NET
conversion server and call it as a Web Service. There are many other
interesting techniques as well.

victor
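
A rough sketch of the Python route mentioned above, in case it helps. It is
only an illustration under some assumptions: a Windows machine with Word
installed, and the pywin32 package for the COM call. The names
extract_text_with_word, make_handler and serve, the port 8000 and the
temporary file name upload.doc are all made up for this example. The server
accepts a document as the body of an HTTP POST and returns the plain text.

```python
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


def extract_text_with_word(path):
    """Open a document in Microsoft Word via COM and return its plain text.

    Assumption: Windows, an installed copy of Word, and pywin32.
    """
    import win32com.client  # imported lazily so the module loads anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word wants an absolute path.
        doc = word.Documents.Open(str(Path(path).resolve()))
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()


def make_handler(extract=extract_text_with_word):
    """Build a request handler that converts each POSTed file with `extract`."""

    class ConvertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            data = self.rfile.read(length)
            # Word needs a real file on disk, so spool the upload first.
            with tempfile.TemporaryDirectory() as tmp:
                path = Path(tmp) / "upload.doc"
                path.write_bytes(data)
                try:
                    body = extract(path).encode("utf-8")
                    status = 200
                except Exception as exc:
                    body = str(exc).encode("utf-8")
                    status = 500
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return ConvertHandler


def serve(port=8000):
    """Run the conversion server until interrupted (hypothetical default port)."""
    HTTPServer(("", port), make_handler()).serve_forever()
```

Calling serve() on the Windows box starts the server. The indexing side can
then POST each file and feed the returned text to Lucene. HTTPServer handles
one request at a time, which is probably what you want here since every
request starts its own copy of Word. Passing a different function to
make_handler lets you swap in another extractor or test without Office.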