Hi,
On 7/10/07, Schuh, Stefan <Stefan.Schuh@coi.de> wrote:
> I am looking for a text extractor (tool set) which could be used, to get
> text data out of several file formats like office documents and so on.
> The text data (extract) could then be used to index with lucene. Best
> would be a java api, but not required. Does any one have knowledge
> of such a tool set or project?
The Tika project [1] in the Apache Incubator is just getting started on
implementing such a generic toolkit. Unfortunately, we haven't released
anything yet.
You may also want to check out the Lius project [2], which is one of the
existing codebases that will be used in Tika. Another potential match is
the Aperture project [3].
[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/
BR,
Jukka Zitting