From: "Ryan Ackley"
To: "Lucene Users List"
Subject: Re: my experiences - Re: Parsing Word Docs
Date: Thu, 6 Mar 2003 07:13:24 -0500
Message-ID: <000b01c2e3d9$c979b440$f9262341@cfl.rr.com>
References: <1046863571.3e65ded335692@mail.lanrx.com> <3E668747.2000100@micromuse.com>
List-Id: "Lucene Users List"

David,

The textmining.org stuff only works on Word 97 and
above. It should work with no exceptions on any Word 97 doc. If you have
any problems, then the file is either from an earlier version (most likely
Word 6.0) or it's not a Word document. If this isn't the case, you need to
email me so I can fix it and make it better for the benefit of everyone.
I plan on adding support for Word 6 in the future.

Ryan Ackley

----- Original Message -----
From: "David Spencer"
To: "Lucene Users List"
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs

> FYI I tried the textmining.org/poi combo and on a collection of 350 word
> docs people have developed here over the years, and it failed on 33% of them
> with exceptions being thrown about the formats being invalid.
>
> I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> *.exe, and it worked great ( well it seemed to process all the files fine).
>
> I've had similar experiences with PDF - I tried the 3 or so
> freeware/java PDF text extractors and they were not as good as the exe,
> pdftotext, from foolabs (http://www.foolabs.com/xpdf/).
>
> Not satisfying to a java developer but these work better than anything
> else I can find.
>
> You get source and I use them on windows & linux, no prob.
>
> Eric Anderson wrote:
>
> >I'm interested in using the textmining/textextraction utilities using Apache
> >POI, that Ryan was discussing. However, I'm having some difficulty determining
> >what the insertion point would be to replace the default parser with the word
> >parser.
> >
> >Any assistance would be appreciated.
> >
> >LanRx Network Solutions, Inc.
> >Providing Enterprise Level Solutions...On A Small Business Budget
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
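
For anyone reading this thread later, here is a minimal sketch of the approach David describes: try a Java extractor first and fall back to a native tool when it throws. The class name, the `TextExtractor` interface, and the `external` helper are hypothetical names made up for illustration; they are not part of Lucene, POI, or textmining.org. The sketch also assumes the native tool prints the extracted text to standard output when given a file path, which should be checked against the tool's own documentation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FallbackExtractor {

    /** Hypothetical interface: anything that turns a file into plain text. */
    interface TextExtractor {
        String extract(File file) throws Exception;
    }

    /**
     * Wraps an external command-line tool. The file path is appended to the
     * given command, and whatever the tool prints to stdout is returned.
     * Assumption: the tool writes the extracted text to stdout.
     */
    static TextExtractor external(String... command) {
        return file -> {
            List<String> cmd = new ArrayList<>(Arrays.asList(command));
            cmd.add(file.getPath());
            Process p = new ProcessBuilder(cmd)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            byte[] out = p.getInputStream().readAllBytes(); // needs Java 9+
            if (p.waitFor() != 0) {
                throw new IOException("extractor exited with an error: " + cmd);
            }
            return new String(out, Charset.defaultCharset());
        };
    }

    /** Tries each extractor in order and returns the first successful result. */
    static String extractWithFallback(File file, List<TextExtractor> extractors)
            throws IOException {
        Exception last = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(file);
            } catch (Exception e) {
                last = e; // remember the failure and try the next extractor
            }
        }
        throw new IOException("no extractor could read " + file, last);
    }
}
```

A call might look like `FallbackExtractor.extractWithFallback(doc, Arrays.asList(javaExtractor, FallbackExtractor.external("antiword")))`, where `javaExtractor` is whatever wrapper you write around the Java library. The returned string is then handed to the code that builds the Lucene document, which is probably the closest thing to the "insertion point" Eric asks about.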