incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Tue, 06 Mar 2007 10:03:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:

The comment on the change is:
Updated the background and rationale sections

  The toolkit is aimed especially at search engines and other content indexing and analysis
tools, but will also be useful for other applications that need to extract meaningful information
from documents that might be presented as nothing more than binary streams.
- Instead of implementing its own document parsers, Tika will use existing parser libraries
like Jakarta POI and PDFBox.
+ Instead of implementing its own document parsers, Tika will use existing parser libraries
like [ Jakarta POI] and [ PDFBox].
  === Background ===
- The need for tools that automatically analyze and index content is increasing as ever
+ The initial idea for the Tika project was voiced in April 2006 by Jérôme Charron and Chris
A. Mattmann on the Nutch mailing list. The Nutch parser framework and other content analysis
features were seen as value-added components that would also benefit other projects. The idea
received positive feedback but lacked momentum.
- ''TODO: Discuss the various related projects and the lack of a common analysis toolkit.
Note how many of the existing tools have grown as ad-hoc solutions to specific needs, and
are often tightly bound to a specific application or a parser library.''
+ The idea was revisited in August 2006 when Jukka Zitting from the Jackrabbit project contacted
the Nutch project about possible cooperation on similar ideas. The original Tika idea gained extra
momentum and a Google Code project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial commits the activity again
- Related discussions and events:
+ In January 2007 the idea started gaining more momentum when Rida Benjelloun offered to contribute
the [ Lius project] to Apache Lucene and when Mark Harwood
also started looking for a generic toolkit like Tika.
+ This proposal is the result of the above efforts and related discussions both in private
and on various public forums.
-  * April 2006 [ nutch-dev: New
Lucene sub-project]
-  * July 2006 [ nutch-dev: Text
extraction library]
-  * August 2006 [ Tika project at Google Code]
-  * August 2006 [ nutch-dev: Tika
-  * January 2007 [ java-dev:
Lius into apache incubator]
-  * ''TODO: What else?''
  === Rationale ===
- ''TODO''
+ There is ever more demand for tools that automatically analyze and index documents in various
formats. Search engines, content repositories, and other tools often need to extract metadata
and text content from documents given as little or nothing more than a simple octet stream.
While there are a number of existing parser libraries for various document types, each of
them comes with a custom API and there are no generic tools for automatically determining
which parser to use for which documents. Currently many projects end up creating their own
custom content analysis and extraction tools.
+ The Tika project attempts to remove this duplication of effort. We believe that by pooling
the efforts of multiple projects we will be able to create a generic toolkit that exceeds
the capabilities and quality of the custom solutions of any single project. A generic toolkit
project will also provide common ground for the developers of parser libraries and content
applications to interact.
  === Initial Goals ===
