incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Tue, 06 Mar 2007 10:56:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:
http://wiki.apache.org/incubator/TikaProposal

The comment on the change is:
Added initial goals and made other minor changes

------------------------------------------------------------------------------
  
  In January 2007 the idea started gaining more momentum when Rida Benjelloun offered to contribute
the [http://sourceforge.net/projects/lius/ Lius project] to Apache Lucene and when Mark Harwood
also started looking for a generic toolkit like Tika.
  
- This proposal is the result of the above efforts and related discussions both in private
and on various public forums.
+ This proposal is the result of the above efforts and related discussions both in private
and on various public forums. Some alternatives to incubation, like
[http://labs.apache.org/ Apache Labs] or [http://jakarta.apache.org/commons/ Jakarta Commons],
came up during the discussions but we believe that taking the project to the Incubator is the
best way to start growing a viable community to sustain the Tika toolkit.
  
  === Rationale ===
  
@@ -36, +36 @@

  
  === Initial Goals ===
  
- ''TODO''
+ The initial goals of the proposed project are:
+ 
+  * Viable community around the Tika codebase
+  * Active relationships and possible cooperation with related projects and communities
+  * Generic parser API for extracting (structured) text content from various document formats
+  * Flexible metadata detection and extraction API
+  * Java implementations of the metadata standards mentioned below
  
  == Current Status ==
  
@@ -84, +90 @@

   * [http://lucene.apache.org/nutch/ Lucene Nutch] - The Nutch project already contains a
parser framework that does many of the things that Tika is designed to do.
   * [http://jackrabbit.apache.org/ Apache Jackrabbit] - The Jackrabbit project contains a
text extraction component that also implements a subset of the proposed Tika features.
   * [http://incubator.apache.org/uima/ Apache UIMA] - The UIMA project provides a framework
and pluggable tools for analyzing text content and extracting information. Example tools include
language identification, sentence boundary detection and "entity extraction" - finding references
to people, places and organisations. Tika could be used by UIMA to parse text but Tika should
be careful not to duplicate the subsequent text analysis features UIMA offers.
-  * ''TODO: Other projects? Solr? The Droids lab?''
  
  === An Excessive Fascination with the Apache Brand ===
  
@@ -107, +112 @@

  
  Tika will start with a combination of seed code from the efforts listed below:
  
-  * The [http://lucene.apache.org/nutch Apache Nutch] project, that contains another parser
framework and various content analysis tools
+  * The [http://lucene.apache.org/nutch Apache Nutch] project, that contains a parser framework
and various content analysis tools
   * The [http://sourceforge.net/projects/lius/ Lius project], an indexing framework for Apache
Lucene
-  * The [http://code.google.com/p/tika/ Tika project at Google Code], where some initial
draft code has been developed for this proposed project
+  * The [http://jackrabbit.apache.org/ Apache Jackrabbit] project, that contains a text extraction
component
  
  No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
@@ -157, +162 @@

  
  == Initial Committers ==
  
- || '''Name'''        || '''Email'''                              || '''CLA''' ||
+ || '''Name'''        || '''Email'''                              || '''CLA'''        ||
- || Rida Benjelloun   || rida dot benjelloun at doculibre dot com || no        ||
+ || Rida Benjelloun   || rida dot benjelloun at doculibre dot com || no (in progress) ||
- || Mark Harwood (?)  || mharwood at apache dot org               || yes       ||
+ || Mark Harwood      || mharwood at apache dot org               || yes              ||
- || Chris A. Mattmann || mattmann at apache dot org               || yes       ||
+ || Chris A. Mattmann || mattmann at apache dot org               || yes              ||
- || Sami Siren        || siren at apache dot org                  || yes       ||
+ || Sami Siren        || siren at apache dot org                  || yes              ||
- || Jukka Zitting     || jukka at apache dot org                  || yes       ||
+ || Jukka Zitting     || jukka at apache dot org                  || yes              ||
  
  == Affiliations ==
  
@@ -182, +187 @@

  
  Sponsoring Entity
  
-  * ''TODO: Apache Lucene (?, vote needed)''
+  * ''TODO: Apache Lucene (vote needed)''
  
  ----
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

