incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Thu, 01 Mar 2007 19:57:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:

The comment on the change is:
Added the first draft of the Tika proposal

New page:
= Tika, a content analysis toolkit =

''This is a draft of a potential Tika proposal. Feel free to edit as you see fit; see also
the discussion section at the end of the page. You can "Subscribe" to this page to get notified
whenever changes are made. See the [ proposal
guide] for a description of the expected proposal content.''

== Abstract ==

Tika is a toolkit for detecting and extracting metadata and text content from various documents
using existing parser libraries.

== Proposal ==

The Tika content analysis toolkit will include features for detecting the content types, character
encodings, languages, and other characteristics of existing documents and for extracting
structured text content from the documents.

The toolkit is targeted especially at search engines and other content indexing and analysis
tools, but will also be useful for other applications that need to extract meaningful
information from documents that might be presented as nothing more than binary streams.

Instead of implementing its own document parsers, Tika will use existing parser libraries
like Jakarta POI and PDFBox.
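
As a rough illustration of the delegation idea above, the following Java sketch identifies a
document from its leading bytes and names the existing library that would handle it. The class
`TypeSniffer`, its `Kind` enum, and both methods are hypothetical and not part of any Tika API;
only the PDF and OLE2 file signatures are established facts. No parser library is actually called.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: every name here is an assumption made for illustration.
final class TypeSniffer {

    enum Kind { PDF, OLE2, UNKNOWN }

    // PDF files begin with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound documents (legacy Microsoft Office files) begin with these 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // Returns true if data starts with every byte of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies a binary stream by inspecting its first bytes.
    static Kind detect(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return Kind.PDF;
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        return Kind.UNKNOWN;
    }

    // Names the existing library a toolkit could delegate to for each kind.
    static String parserFor(Kind kind) {
        switch (kind) {
            case PDF:
                return "PDFBox";
            case OLE2:
                return "Jakarta POI";
            default:
                return "none";
        }
    }
}
```

With this sketch, `TypeSniffer.parserFor(TypeSniffer.detect(bytes))` maps the first bytes of a
stream to the name of a library. A real toolkit would need many more signatures and a fallback
for formats that have no fixed header.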

== Background ==

The need for tools that automatically analyze and index content is increasing as ever more
information becomes available.

''TODO: Discuss the various related projects and the lack of a common analysis toolkit. Note
how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a specific application
or a parser library.''

== Rationale ==


== Initial Goals ==


= Current Status =

== Meritocracy ==


== Community ==


== Core Developers ==


== Alignment ==


= Known Risks =

== Orphaned products ==

''TODO: There has been on-and-off interest in something like this for quite a while already.
How can we make sure that the current increase
in interest doesn't fade away?''

== Inexperience with Open Source ==

''TODO: Many of the interested participants have an open source background.''

== Homogeneous Developers ==

''TODO: There is no central company behind the proposal.''

== Reliance on Salaried Developers ==

''TODO: Some of us are salaried for this, others are not.''

== Relationships with Other Apache Products ==

''TODO: Lucene, Nutch, Jackrabbit, Droids, ...''

== An Excessive Fascination with the Apache Brand ==


= Documentation =


= Initial Source =

''TODO: Tika, Lius, Nutch?, ...''

= Source and Intellectual Property Submission Plan =


= External Dependencies =

''TODO: Some of the potential parser libraries will be GPL-licensed or otherwise troublesome
for an ASF project. How to best handle such dependencies?''

= Cryptography =

''TODO: Some of the document formats involve encryption and features like DRM. While Tika
itself will probably not include any
cryptographic code, the parser dependencies will most likely include such code.''

= Required Resources =

Mailing lists


Subversion Directory


Issue Tracking


Other Resources

 * none

= Initial Committers =


= Affiliations =


= Sponsors =


''TODO: Volunteers: Jukka Zitting''

Nominated Mentors

''TODO: Three (or more) mentors is the recommendation. Volunteers: Jukka Zitting, Doug Cutting

Sponsoring Entity

''TODO: Apache Lucene (?)''


= Discussion =

 * Use this area for discussing the contents of the proposal. - Jukka Zitting
