incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Trivial Update of "TikaProposal" by JukkaZitting
Date Sat, 03 Mar 2007 11:06:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:
http://wiki.apache.org/incubator/TikaProposal

The comment on the change is:
Changed heading levels to make the page more readable

------------------------------------------------------------------------------
  
  See also the [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684/focus=9693 earlier
proposal draft] by Chris Mattmann and Jerome Charron.
  
- == Abstract ==
+ === Abstract ===
  
  Tika is a toolkit for detecting and extracting metadata and text content from various documents
using existing parser libraries.
  
- == Proposal ==
+ === Proposal ===
  
  The Tika content analysis toolkit will include features for detecting the content types,
character encodings, languages, and other characteristics of existing documents and for extracting
structured text content from the documents.
  
@@ -18, +18 @@

  
  Instead of implementing its own document parsers, Tika will use existing parser libraries
like Jakarta POI and PDFBox.
  
- == Background ==
+ === Background ===
  
  The need for tools that automatically analyze and index content is increasing as ever
  
@@ -33, +33 @@

   * January 2007 [http://thread.gmane.org/gmane.comp.jakarta.lucene.devel/16888 java-dev:
Lius into apache incubator]
   * ''TODO: What else?''
  
- == Rationale ==
+ === Rationale ===
  
  ''TODO''
  
- == Initial Goals ==
+ === Initial Goals ===
  
  ''TODO''
  
- = Current Status =
+ == Current Status ==
  
- == Meritocracy ==
+ === Meritocracy ===
  
  All the initial committers are familiar with the meritocracy principles of Apache and have
already worked on the various source codebases. We will apply the normal meritocracy rules
to other potential contributors as well.
  
- == Community ==
+ === Community ===
  
  There is not yet a clear Tika community. Instead we have a number of people and related
projects with an understanding that a shared toolkit project would best serve everyone's interests.
The primary goal of the incubating project is to build a self-sustaining community around
this shared vision.
  
- == Core Developers ==
+ === Core Developers ===
  
  The initial set of developers comes from various backgrounds, with different but compatible
needs for the proposed project.
  
- == Alignment ==
+ === Alignment ===
  
  As a generic toolkit, Tika will likely be widely used by various open source and commercial
projects, both together with and independently of other Apache tools like Lucene Java or Jakarta
POI. Other Apache projects like Nutch and Jackrabbit are potential candidates for using Tika
as an embedded component.
  
- = Known Risks =
+ == Known Risks ==
  
- == Orphaned products ==
+ === Orphaned products ===
  
  There are a number of projects at various stages of maturity that implement a subset of
the proposed features in Tika. For many potential users the existing tools are already enough,
which reduces the demand for a more generic toolkit. This can also be seen in the slow progress
of this proposal over the past year.
  
  However, once the project gets started, we can quickly reach the feature level of existing
tools based on seed code from the sources mentioned below. After that, we believe we can
quickly grow the developer and user communities based on the benefits of a generic toolkit
over custom alternatives.
  
- == Inexperience with Open Source ==
+ === Inexperience with Open Source ===
  
  All the initial developers have worked on open source before and many are committers and
PMC members within other Apache projects.
  
- == Homogenous Developers ==
+ === Homogenous Developers ===
  
  The initial developers come from a variety of backgrounds and with a variety of needs for
the proposed toolkit.
  
- == Reliance on Salaried Developers ==
+ === Reliance on Salaried Developers ===
  
  Some of the developers are paid to work on this or related projects, but the proposed project
is not the primary task for anyone.
  
- == Relationships with Other Apache Products ==
+ === Relationships with Other Apache Products ===
  
  Tika is related to at least the following Apache projects. None of the projects is a direct
competitor for Tika, but there are many cases of potential overlap in functionality.
  
@@ -88, +88 @@

   * [http://jackrabbit.apache.org/ Apache Jackrabbit] - The Jackrabbit project contains a
text extraction component that also implements a subset of the proposed Tika features.
   * ''TODO: Other projects? Solr? The Droids lab?''
  
- == An Excessive Fascination with the Apache Brand ==
+ === An Excessive Fascination with the Apache Brand ===
  
  All of us are familiar with Apache and we have participated in Apache projects as contributors,
committers, and PMC members. We feel that the Apache Software Foundation is a natural home
for a project like this.
  
- = Documentation =
+ == Documentation ==
  
  There are bits and pieces of design discussions and other documentation around; see, for
example, the following:
  
@@ -105, +105 @@

  
  See also the potential parser libraries listed below for details on the various document
formats that Tika plans to support.
  
- = Initial Source =
+ == Initial Source ==
  
  Tika will start with a combination of seed code from the efforts listed below:
  
@@ -115, +115 @@

  
  No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
- = Source and Intellectual Property Submission Plan =
+ == Source and Intellectual Property Submission Plan ==
  
  All seed code and other contributions will be handled through the normal Apache contribution
process.
  
  We will also contact other related efforts for possible cooperation and contributions.
  
- = External Dependencies =
+ == External Dependencies ==
  
  Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
  
@@ -137, +137 @@

  
  Mechanisms for best handling LGPL and other legally challenging licenses in potential dependencies
will be discussed and decided during incubation. No such dependencies will be added to the
project before the legal implications have been cleared.
  
- = Cryptography =
+ == Cryptography ==
  
  Tika itself will not use cryptography, but it is possible that some of the external parser
libraries will include cryptographic code to handle features like DRM in various document
formats.
  
- = Required Resources =
+ == Required Resources ==
  
  Mailing lists
  
@@ -161, +161 @@

  
   * none
  
- = Initial Committers =
+ == Initial Committers ==
  
  || '''Name'''        || '''Email'''                              || '''CLA''' ||
  || Rida Benjelloun   || rida dot benjelloun at doculibre dot com || no        ||
@@ -169, +169 @@

  || Chris A. Mattmann || mattmann at apache dot org               || yes       ||
  || Jukka Zitting     || jukka at apache dot org                  || yes       ||
  
- = Affiliations =
+ == Affiliations ==
  
  || '''Name'''        || '''Affiliation'''                       ||
  || Jukka Zitting     || [http://www.day.com/ Day Management AG] ||
  
- = Sponsors =
+ == Sponsors ==
  
  Champion
  
@@ -191, +191 @@

  
  ----
  
- = Discussion =
+ == Discussion ==
  
   * Use this area for discussing the contents of the proposal. - Jukka Zitting
  


