From: "Ryan Ackley"
To: java-user@lucene.apache.org
Date: Sun, 25 Mar 2007 16:36:14 -0700
Subject: Re: index word files ( doc )

Yes, I do have plans for adding fast-save support and support for more file formats. The time frame for this is the next couple of months.

I'm playing with the idea of offering a commercial version. I want to continue to support the open source community, so I want to keep it open source or free and add value that people would be willing to pay for. Any comments on this are appreciated.

One thing I thought of would be to continue to offer the text extraction as open source but add HTML conversion with hit highlighting for a variety of file formats as a commercial add-on. Is this something anyone would pay for? What are some other pain points of the Lucene community besides text extraction?

On 3/25/07, Antony Bowesman wrote:
> I've been using Ryan's textmining in preference to POI, as internally TM uses
> POI and the Word6 extractor, so it handles a greater variety of files.
>
> Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse
> the 'fast-saved' files and any ideas on Word files older than the Word 6 format?
>
> Regards
> Antony
>
> Ryan Ackley wrote:
> > As the author of both Word POI and textmining.org, I recommend using
> > textmining.org. POI is for general-purpose manipulation of Word
> > documents. textmining's only purpose is extracting text.
> >
> > Also, people recommend using POI for text extraction but the only
> > place I've seen an actual how-to on this is in the "Lucene in Action"
> > book.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org