incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Sat, 03 Mar 2007 07:52:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:
http://wiki.apache.org/incubator/TikaProposal

The comment on the change is:
Draft update

------------------------------------------------------------------------------
  
  == Background ==
  
- The need for tools that automatically analyze and index content is increasing as ever more
information becomes available.
+ The need for tools that automatically analyze and index content is increasing as ever
  
  ''TODO: Discuss the various related projects and the lack of a common analysis toolkit.
Note how many of the existing tools have grown as ad-hoc solutions to specific needs, and
are often tightly bound to a specific application or a parser library.''
  
@@ -29, +29 @@

   * July 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9373 nutch-dev: Text
extraction library]
   * August 2006 [http://code.google.com/p/tika/ Tika project at Google Code]
   * August 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684 nutch-dev: Tika
update]
-  * August 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9685 nutch-dev: Parser
design]
-  * September 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9969 nutch-dev:
Content type detection]
   * January 2007 [http://thread.gmane.org/gmane.comp.jakarta.lucene.devel/16888 java-dev:
Lius into apache incubator]
-  * February 2007 [http://code.google.com/p/tika/wiki/DesignDiscussion Tika wiki: Design
discussion]
   * ''TODO: What else?''
  
  == Rationale ==
@@ -94, +91 @@

  
  = Documentation =
  
- ''TODO''
+ Bits and pieces of design discussion and other documentation are scattered around; see for
example the following:
+ 
+  * August 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9685 nutch-dev: Parser
design]
+  * September 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9969 nutch-dev:
Content type detection]
+  * October 2006 [http://www.doculibre.com/lius/doc-1.0_en.html Lius tutorial]
+  * February 2007 [http://code.google.com/p/tika/wiki/DesignDiscussion Tika wiki: Design
discussion]
+ 
+ Standards and conventions related to Tika include the [http://dublincore.org/ Dublin Core]
metadata set, the [http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec Shared
MIME information] draft specification from [http://freedesktop.org/ freedesktop.org], and
of course RFCs [http://www.ietf.org/rfc/rfc2046.txt 2046] and [http://www.ietf.org/rfc/rfc3066.txt
3066] for identifying media types and languages.
+ 
+ See also the potential parser libraries listed below for details on the various document
formats that Tika plans to support.
  
  = Initial Source =
  
-  * http://code.google.com/p/tika/
-  * http://sourceforge.net/projects/lius/
-  * http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/parse/ (?)
-  * ''TODO: What else?''
+ Tika will start with a combination of seed code from the efforts listed below:
+ 
+  * The [http://code.google.com/p/tika/ Tika project at Google Code], where some initial
draft code has been developed for this proposed project
+  * The [http://sourceforge.net/projects/lius/ Lius project], an indexing framework for Apache
Lucene
+  * The [http://lucene.apache.org/nutch Apache Nutch] project, which contains another parser
framework and various content analysis tools
+ 
+ No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
  = Source and Intellectual Property Submission Plan =
  
- ''TODO''
+ All seed code and other contributions will be handled through the normal Apache contribution
process.
+ 
+ We will also contact other related efforts for possible cooperation and contributions.
  
  = External Dependencies =
  
- ''TODO: Some of the potential parser libraries will be GPL-licensed or otherwise troublesome
for an ASF project. How to best handle such cases?''
+ Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
  
-  * [http://jakarta.apache.org/poi/ Jakarta POI] - ASLv2
-  * [http://www.pdfbox.org/ PDFBox] - BSD
-  * [http://jexcelapi.sourceforge.net/ JExcelApi] - LGPL
+ || '''Library'''                                                       || '''License'''        ||
+ || [http://jakarta.apache.org/poi/ Jakarta POI]                        || ASLv2                ||
+ || [http://www.pdfbox.org/ PDFBox]                                     || BSD                  ||
+ || [http://jexcelapi.sourceforge.net/ JExcelApi]                       ||                      ||
-  * [http://www.artofsolving.com/opensource/jodconverter JODConverter] - LGPL
+ || [http://www.artofsolving.com/opensource/jodconverter JODConverter]  || LGPL                 ||
-  * [http://people.apache.org/~andyc/neko/doc/html/index.html NekoHTML] - CyberNeko license (like ASL)
+ || [http://people.apache.org/~andyc/neko/doc/html/index.html NekoHTML] || CyberNeko (like ASL) ||
-  * [http://jtidy.sourceforge.net/ JTidy]
+ || [http://jtidy.sourceforge.net/ JTidy]                               ||                      ||
-  * [http://javamusictag.sourceforge.net/ Java ID3 Tag Library] - LGPL
+ || [http://javamusictag.sourceforge.net/ Java ID3 Tag Library]         || LGPL                 ||
-  * [http://jid3.blinkenlights.org/ JID3] - LGPL
-  * ''TODO: Many others...''
+ || [http://jid3.blinkenlights.org/ JID3]                               || LGPL                 ||
+ 
+ Mechanisms for best handling LGPL and other legally challenging licenses in potential dependencies
will be discussed and decided during incubation. No such dependencies will be added to the
project before the legal implications have been cleared.
  
  = Cryptography =
  
- ''TODO: Some of the document formats involve encryption and features like DRM. While
Tika itself will probably not include any cryptographic code, the parser dependencies will
most likely include such code.''
+ Tika itself will not use cryptography, but it is possible that some of the external parser
libraries will include cryptographic code to handle features like DRM in various document
formats.
  
  = Required Resources =
  
  Mailing lists
  
   * tika-dev@incubator.apache.org
+  * tika-commits@incubator.apache.org
+  * tika-private@incubator.apache.org
  
  Subversion Directory
  
@@ -137, +152 @@

  
  Issue Tracking
  
-  * JIRA TIKA
+  * JIRA Tika (TIKA)
  
  Other Resources
  
@@ -145, +160 @@

  
  = Initial Committers =
  
-  * Rida Benjelloun 
-  * Mark Harwood (?)
-  * Jukka Zitting
-  * Chris A. Mattmann
-  * ''TODO: Who's interested?''
+ || '''Name'''        || '''Email'''                              || '''CLA''' ||
+ || Rida Benjelloun   || rida dot benjelloun at doculibre dot com || no        ||
+ || Mark Harwood (?)  || mharwood at apache dot org               || yes       ||
+ || Chris A. Mattmann || mattmann at apache dot org               || yes       ||
+ || Jukka Zitting     || jukka at apache dot org                  || yes       ||
  
  = Affiliations =
  
- ''TODO''
+ || '''Name'''        || '''Affiliation'''                       ||
+ || Jukka Zitting     || [http://www.day.com/ Day Management AG] ||
  
  = Sponsors =
  
  Champion
  
- ''TODO: Volunteers: Jukka Zitting''
+  * Jukka Zitting (jukka at apache dot org)
  
  Nominated Mentors
  
- ''TODO: Three (or more) mentors is the recommendation. Volunteers: Jukka Zitting, Doug Cutting
(?)''
+  * Doug Cutting (cutting at apache dot org)
+  * Jukka Zitting (jukka at apache dot org)
  
  Sponsoring Entity
  
- ''TODO: Apache Lucene (?)''
+  * ''TODO: Apache Lucene (?, vote needed)''
  
  ----
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

