lucene-general mailing list archives

From "Michael Levy" <michaelrl...@gmail.com>
Subject Re: Text Extractor
Date Mon, 23 Jul 2007 12:57:25 GMT
It might be worthwhile for you to look at Nutch, a web search application
based on Lucene that can also search local filesystems.  It includes parsers
for several common office document formats.

http://lucene.apache.org/nutch/



On 7/10/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:
>
> Hi,
>
> On 7/10/07, Schuh, Stefan <Stefan.Schuh@coi.de> wrote:
> > I am looking for a text extractor (tool set) that could be used to get
> > text data out of several file formats, such as office documents.
> > The extracted text could then be indexed with Lucene.  A Java API
> > would be best, but is not required. Does anyone know
> > of such a tool set or project?
>
> The Tika project [1] in the Apache Incubator is just getting
> started on implementing such a generic toolkit. Unfortunately we
> haven't released anything yet.
>
> You may also want to check out the Lius project [2] that is one of the
> source codebases to be used in Tika. Another potential match is the
> Aperture project [3].
>
> [1] http://incubator.apache.org/tika/
> [2] http://sourceforge.net/projects/lius/
> [3] http://aperture.sourceforge.net/
>
> BR,
>
> Jukka Zitting
>
