lucene-general mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Text Extractor
Date Tue, 10 Jul 2007 14:15:54 GMT
Hi,

On 7/10/07, Schuh, Stefan <Stefan.Schuh@coi.de> wrote:
> I am looking for a text extractor (tool set) which could be used, to get
> text data out of several file formats like office documents and so on.
> The text data (extract) could then be used to index with lucene.  Best
> would be a java api, but not required. Does any one have knowledge
> of such a tool set or project?

The Tika project [1] in the Apache Incubator is just getting started on
implementing such a generic toolkit. Unfortunately, we haven't released
anything yet.

You may also want to check out the Lius project [2], which is one of the
source codebases to be used in Tika. Another potential match is the
Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/

BR,

Jukka Zitting
