From: "Pete Lewis"
To: "Lucene Users List"
Subject: Re: RE : Parsers
Date: Thu, 29 May 2003 09:05:38 +0100

Hi Victor,

Thanks.

In the past I have used the Inso OutsideIn filters and found them very good; however, I'd like to come up with a pure Java solution, so if there is a Java equivalent to the Inso filters I'd be grateful for any details.

Failing that, I thought I'd go for individual parsers, initially using the file extension to select the correct parser, but in the future adding a file type recogniser for files without extensions. Hence my request to anyone who knows of good parsers, particularly for the most common formats.

That being said, has anyone come across a PowerPoint parser?

Pete

----- Original Message -----
From: "Victor Hadianto"
To: "Lucene Users List"
Sent: Thursday, May 29, 2003 12:01 AM
Subject: Re: RE : Parsers

> > The www.textmining.org text extractors work very well for Word and pdf
> > documents.
> > They use both PDFBox and POI.
> >
> > For Excel, using POI directly is very easy. Tell me if you want to see
> > code samples.
> >
> > I'm looking myself for a Powerpoint text extractor, if you know one...
>
> Another solution is to use Microsoft Office itself. You can setup a server
> that serve request to convert Microsoft Office doc. There are many ways of
> doing this, for example using Python to directly call Office then put your
> python script in a webserver.
>
> Or you can set a .Net conversion server and you can call this .Net service
> using a Web Service, and many other interesting technique.
>
> victor
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
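
A minimal sketch of the extension-based parser selection described above, in case it helps the discussion. The names ParserRegistry, TextExtractor, register and forFile are hypothetical and not part of Lucene, POI or PDFBox; real extractors would wrap whichever library handles each format. A file without an extension simply yields no parser, which is where a file type recogniser could be plugged in later.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry that picks a text extractor from a file name's extension. */
public class ParserRegistry {

    /** Hypothetical extractor interface; a real one might wrap POI or PDFBox. */
    public interface TextExtractor {
        String extract(byte[] content);
    }

    // Extensions are stored lower-cased so that "DOC" and "doc" match.
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased extension, or empty if the name has none. */
    public static Optional<String> extensionOf(String fileName) {
        // Drop any directory part so a dot in a folder name is not mistaken
        // for the start of an extension.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String base = fileName.substring(slash + 1);
        int dot = base.lastIndexOf('.');
        if (dot <= 0 || dot == base.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(base.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /**
     * Looks up the extractor for a file. An empty result means either no
     * extension or no registered parser; a content-based recogniser could
     * be tried at that point.
     */
    public Optional<TextExtractor> forFile(String fileName) {
        return extensionOf(fileName).map(byExtension::get);
    }
}
```

Usage would be along the lines of registering one extractor per format, e.g. registry.register("doc", wordExtractor), and then calling registry.forFile(name) for each file before handing the extracted text to the indexer.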