From: Antony Bowesman
Organization: Teamware Group
Date: Mon, 26 Mar 2007 08:59:58 +1000
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )
Message-ID: <4606FEEE.2070704@teamware.com>

I've been using Ryan's textmining in preference to POI, because internally TM
uses both POI and the Word6 extractor and so handles a greater variety of files.

Ryan, thanks for fixing your site. Do you have any plans or ideas on how to
parse the 'fast-saved' files, and any ideas on Word files older than the
Word 6 format?

Regards
Antony

Ryan Ackley wrote:
> As the author of both Word POI and textmining.org, I recommend using
> textmining.org. POI is for general purpose manipulation of Word
> documents. textmining's only purpose is extracting text.
>
> Also, people recommend using POI for text extraction but the only
> place I've seen an actual how-to on this is in the "Lucene in Action"
> book.