incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "OODTProposal" by chrismattmann
Date Thu, 31 Dec 2009 17:48:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "OODTProposal" page has been changed by chrismattmann.
The comment on this change is: update to all sections.
http://wiki.apache.org/incubator/OODTProposal?action=diff&rev1=4&rev2=5

--------------------------------------------------

  Each set of components exists as independently organized Maven2 projects that reference
each other (where appropriate), forming a layered set of components and a framework for grid
computing.
  
  === Background ===
- The initial idea for the Tika project was voiced in April 2006 by Jérôme Charron and Chris
A. Mattman on the Nutch mailing list. The Nutch parser framework and other content analysis
features were seen as value-added components that would benefit also other projects. The idea
received positive feedback, but lacked the momentum.
+ OODT is an established project within NASA JPL and is in use at several NASA centers, as well
as universities, other government organizations, and industrial collaborations. Chris Mattmann,
a JPL employee and ASF PMC member (Lucene) and committer (Nutch, Tika), has been working for the
past 2 years on obtaining the necessary permission from JPL to release OODT into Apache. After
an initial stall, JPL has granted permission to release OODT into Apache.
  
- The idea was revisited in August 2006 when Jukka Zitting from the Jackrabbit project contacted
Nutch for possible cooperation with similar ideas. The original Tika idea gained extra momentum
and a Google Code project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial commits the activity again
declined.
+ Through his academic relationship with Justin Erenkrantz, Apache President, and through their
collective Ph.D. studies, Chris has discussed OODT with Justin on several occasions,
and Justin offered to help champion OODT into the Apache Incubator when JPL was ready to release
OODT. In December 2009, that permission was granted.
  
+ This proposal is the result of the above efforts and related discussions. Some alternatives
to incubation, like [[http://labs.apache.org/|Apache Labs]], came up during the discussions,
but we believe that taking the project to the Incubator is the best way to start growing a
viable Apache-based community to sustain OODT. Furthermore, given its larger code base and
existing sub-projects, the goal would be for OODT to use the Incubator to graduate as
Apache's first top-level grid project, rather than as a sub-project.
- In January 2007 the idea started gaining more momentum when Rida Benjelloun offered to contribute
the [[http://sourceforge.net/projects/lius/|Lius project]] to Apache Lucene and when Mark
Harwood also started looking for a generic toolkit like Tika.
- 
- This proposal is the result of the above efforts and related discussions both in private
and on various public forums. Some alternatives to incubation, like [[http://labs.apache.org/|Apache
Labs]] or [[http://jakarta.apache.org/commons/|Jakarta Commons]], came up during the discussions
but we believe that taking the project to the Incubator is the best way to start growing a
viable community to sustain the Tika toolkit.
  
  === Rationale ===
  There is ever more demand for tools that automatically analyze and index documents in various
formats. Search engines, content repositories, and other tools often need to extract metadata
and text content from documents given as nothing or little else than a simple octet stream.
While there are a number of existing parser libraries for various document types, each of
them comes with a custom API and there are no generic tools for automatically determining
which parser to use for which documents. Currently many projects end up creating their custom
content analysis and extraction tools.
@@ -92, +90 @@

  All of us are familiar with Apache and we have participated in Apache projects as contributors,
committers, and PMC members. We feel that the Apache Software Foundation is a natural home
for a project like this.
  
  == Documentation ==
- There are bits and pieces of design discussions and other documentation around, see for
example the following:
+ There is a wealth of documentation available on OODT. The best starting point is the existing
OODT JPL website, [[http://oodt.jpl.nasa.gov]], which will either be kept in sync with the Apache
website or become a pointer to it.
  
-  * August 2006 [[http://thread.gmane.org/gmane.comp.search.nutch.devel/9685|nutch-dev: Parser
design]]
-  * September 2006 [[http://thread.gmane.org/gmane.comp.search.nutch.devel/9969|nutch-dev:
Content type detection]]
-  * October 2006 [[http://www.doculibre.com/lius/doc-1.0_en.html|Lius tutorial]]
-  * February 2007 [[http://code.google.com/p/tika/wiki/DesignDiscussion|Tika wiki: Design
discussion]]
+  * [[http://oodt.jpl.nasa.gov|OODT website at JPL]]
+  * Mattmann's OODT paper that appeared at the 28th International Conference on Software
Engineering in Shanghai, China.
+  * Crichton's seminal OODT paper appearing at the CODATA conference.
+  * Google Scholar search on OODT
  
+ Standards and conventions related to OODT include the [[http://dublincore.org/|Dublin Core]]
metadata set, [[http://www.iso.org/iso/catalogue_detail.htm?csnumber=1758|ISO/IEC 11179]],
the [[http://www.w3.org/Protocols/rfc2616/rfc2616.html|HTTP 1.1 RFC]], and Grid-based standards
including the [[http://www.globus.org/alliance/publications/papers/ogsa.pdf|Open Grid Services
Architecture (OGSA)]].
- Standards and conventions related to Tika include the [[http://dublincore.org/|Dublin Core]]
metadata set, the [[http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec|Shared
MIME information]] draft specification from [[http://freedesktop.org/|freedesktop.org]], and
of course RFCs [[http://www.ietf.org/rfc/rfc2046.txt|2046]] and [[http://www.ietf.org/rfc/rfc3066.txt|3066]]
for identifying media types and languages.
- 
- See also the potential parser libraries listed below for details on the various document
formats that Tika plans to support.
  
  == Initial Source ==
+ OODT will start with seed code donated by NASA JPL via Mattmann and the rest of the initial
committers.
- Tika will start with a combination of seed code from the efforts listed below:
- 
-  * The [[http://lucene.apache.org/nutch|Apache Nutch]] project that contains a parser framework
and various content analysis tools
-  * The [[http://sourceforge.net/projects/lius/|Lius project]], an indexing framework for
Apache Lucene
-  * The [[http://jackrabbit.apache.org/|Apache Jackrabbit]] project that contains a text
extraction component
- 
- No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
  == Source and Intellectual Property Submission Plan ==
- All seed code and other contributions will be handled through the normal Apache contribution
process.
+ All seed code and other contributions will be handled through the normal Apache contribution
process. Mattmann has been authorized by NASA JPL to lead the contribution of OODT into the
Incubator via his existing Apache CLA.
  
  We will also contact other related efforts for possible cooperation and contributions.
  
  == External Dependencies ==
- Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
- ||'''Library''' ||'''License''' ||
- ||[[http://jakarta.apache.org/poi/|Jakarta POI]] ||ASLv2 ||
- ||[[http://www.pdfbox.org/|PDFBox]] ||BSD ||
- ||[[http://people.apache.org/~andyc/neko/doc/html/index.html|NekoHTML]] ||!CyberNeko (like
ASL) ||
- ||[[http://jtidy.sourceforge.net/|JTidy]] ||W3C ||
+ OODT will depend on a number of external connector libraries with various licensing
conditions. An initial list of such dependencies (taken from one of the OODT sub-components,
the CAS file manager) is shown below.
+ 
+ ||<tableclass="bodyTable"rowclass="b">'''Library'''||'''License'''||
+ ||<rowclass="b">commons-codec||ASL v2||
+ ||<rowclass="a">commons-dbcp||ASL v2||
+ ||<rowclass="b">commons-httpclient||ASL v2||
+ ||<rowclass="a">commons-io||ASL v2||
+ ||<rowclass="b">commons-pool||ASL v2||
+ ||<rowclass="a">cas-metadata||(to be ASL v2)||
+ ||<rowclass="b">edm-commons||(to be ASL v2)||
+ ||<rowclass="a">hsqldb||LGPL v2.1||
+ ||<rowclass="b">jug-asl||ASL v2||
+ ||<rowclass="a">lucene-core||ASL v2||
+ ||<rowclass="b">xmlrpc||ASL v2||
  
  
+ There are also some LGPL parser libraries that would be useful. Whether and how such dependencies
could be handled will be discussed during incubation. No such dependencies will be added to
the project before the legal implications have been cleared. Existing LGPL dependencies, such
as hsqldb above for the CAS file manager, will be removed, and a suitable ASL-friendly alternative
will be investigated and used to replace them.
- 
- 
- There are also some LGPL parser libraries that would be useful. Whether and how such dependencies
could be handled will be discussed during incubation. No such dependencies will be added to
the project before the legal implications have been cleared.
  
  == Cryptography ==
- Tika itself will not use cryptography, but it is possible that some of the external parser
libraries will include cryptographic code to handle features like DRM in various document
formats.
+ OODT itself will not use cryptography, but it is possible that some of the external product
or profile server or CAS libraries will include cryptographic code to handle features like
DRM in various science data formats. The current OODT code base relies on [[http://lucene.apache.org/tika/|Apache
Tika]] which contains an export control statement regarding cryptographic code per Apache
policy. We will follow a similar approach with OODT. Mattmann led this effort in [[http://lucene.apache.org/nutch/|Apache
Nutch]] and saw Jukka Zitting lead this effort in Apache Tika, so he is familiar with this
process.
  
  == Required Resources ==
  Mailing lists
@@ -150, +146 @@

  
  Other Resources
  
-  * OODT Wiki [[http://wiki.apache.org/oodt|http://wiki.apache.org/oodt/]]
+  * OODT Wiki http://cwiki.apache.org/OODT
  
  == Initial Committers ==
- ||'''Name''' ||'''Email''' ||'''CLA''' ||
+ ||'''Name''' ||'''Email''' ||'''Affiliation''' ||'''CLA''' ||
- ||Rida Benjelloun ||rida dot benjelloun at doculibre dot com ||yes ||
- ||Mark Harwood ||mharwood at apache dot org ||yes ||
- ||Chris A. Mattmann ||mattmann at apache dot org ||yes ||
- ||Sami Siren ||siren at apache dot org ||yes ||
- ||Jukka Zitting ||jukka at apache dot org ||yes ||
+ ||Chris A. Mattmann ||mattmann at apache dot org ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||yes ||
+ ||Daniel J. Crichton ||crichton at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA
Jet Propulsion Laboratory]] ||no ||
+ ||Paul Ramirez ||pramirez at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||Sean Kelly ||kelly at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet Propulsion
Laboratory]] ||no ||
+ ||Sean Hardman ||shardman at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||Andrew F. Hart ||ahart at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||Joshua Garcia ||joshua at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||David Woollard ||woollard at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||Brian Foster ||bfoster at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
+ ||Sean McCleese ||smcclees at jpl dot nasa dot gov ||[[http://www.jpl.nasa.gov/|NASA Jet
Propulsion Laboratory]] ||no ||
  
  
- == Affiliations ==
- ||'''Name''' ||'''Affiliation''' ||
- ||Rida Benjelloun ||[[http://www.doculibre.com/index_en.html|Doculibre inc.]] ||
- ||Chris A. Mattmann ||[[http://www.jpl.nasa.gov/|NASA Jet Propulsion Laboratory]] ||
- ||Jukka Zitting ||[[http://www.day.com/|Day Management AG]] ||
  
  
  == Sponsors ==
  Champion
  
-  * Jukka Zitting (jukka at apache dot org)
+  * Justin Erenkrantz (jerenkrantz at apache dot org)
  
  Nominated Mentors
  
+  * Justin Erenkrantz (jerenkrantz at apache dot org)
-  * Doug Cutting (cutting at apache dot org)
-  * Bertrand Delacretaz (bdelacretaz at apache dot org)
-  * Jukka Zitting (jukka at apache dot org)
  
  Sponsoring Entity
  
-  * Apache Lucene
+  * Apache Incubator
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org