lucene-general mailing list archives

From "Jukka Zitting" <>
Subject Re: Text Extractor
Date Tue, 10 Jul 2007 14:15:54 GMT

On 7/10/07, Schuh, Stefan <> wrote:
> I am looking for a text extractor (tool set) which could be used, to get
> text data out of several file formats like office documents and so on.
> The text data (extract) could then be used to index with lucene.  Best
> would be a java api, but not required. Does any one have knowledge
> of such a tool set or project?

The Tika project [1] in the Apache Incubator is just getting started on
implementing such a generic toolkit. Unfortunately, we haven't released
anything yet.

You may also want to check out the Lius project [2], which is one of the
codebases to be used in Tika. Another potential match is the
Aperture project [3].



Jukka Zitting
