From: "Michael Levy"
To: general@lucene.apache.org
Date: Mon, 23 Jul 2007 08:57:25 -0400
Subject: Re: Text Extractor

It might be worthwhile for you to review Nutch, a web search application
based on Lucene that can also search local filesystems. It includes parsers
for several common office type documents.

http://lucene.apache.org/nutch/

On 7/10/07, Jukka Zitting wrote:
>
> Hi,
>
> On 7/10/07, Schuh, Stefan wrote:
> > I am looking for a text extractor (tool set) which could be used, to get
> > text data out of several file formats like office documents and so on.
> > The text data (extract) could then be used to index with lucene. Best
> > would be a java api, but not required. Does any one have knowledge
> > of such a tool set or project?
>
> The Tika project [1] in the Apache Incubator is currently getting
> started at implementing such a generic toolkit. Unfortunately we
> haven't yet released anything.
>
> You may also want to check out the Lius project [2] that is one of the
> source codebases to be used in Tika. Another potential match is the
> Aperture project [3].
>
> [1] http://incubator.apache.org/tika/
> [2] http://sourceforge.net/projects/lius/
> [3] http://aperture.sourceforge.net/
>
> BR,
>
> Jukka Zitting
>