incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "OODTProposal" by chrismattmann
Date Sun, 20 Dec 2009 19:19:08 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "OODTProposal" page has been changed by chrismattmann.
The comment on this change is: update required resources.
http://wiki.apache.org/incubator/OODTProposal?action=diff&rev1=3&rev2=4

--------------------------------------------------

  = OODT, a framework for science data processing, information integration, and retrieval. =
- 
  === Abstract ===
- 
  OODT is a middleware framework used on a number of successful projects at [[http://www.jpl.nasa.gov|NASA's
Jet Propulsion Laboratory]] and at many other research institutions and universities, specifically
those that are part of:
  
   * [[http://cancer.gov/edrn|National Cancer Institute's (NCI's) Early Detection Research
Network (EDRN)]] project - more than 40 institutions, all performing research into discovering
biomarkers that are early indicators of disease.
@@ -13, +11 @@

  From the [[http://oodt.jpl.nasa.gov|OODT]] website:
  
  It's middleware for metadata:
+ 
-     * Transparent access to distributed resources
+  * Transparent access to distributed resources
-     * Data discovery and query optimization
+  * Data discovery and query optimization
-     * Distributed processing and virtual archives
+  * Distributed processing and virtual archives
  
  It's a software architecture:
-     * Models for information representation
-     * Solutions to knowledge capture problems
-     * Unification of technology, data, and metadata
  
+  * Models for information representation
+  * Solutions to knowledge capture problems
+  * Unification of technology, data, and metadata
  
  === Proposal ===
- 
  OODT is an established open source project, with 9+ years of existence and deployment at
universities, federal research institutions, other NASA centers, and the NIH (it was runner-up
for NASA software of the year in 2003). It has a strong community of people who operate it and support
its growth. Our proposal is to bring OODT into Apache to strengthen its support and its capabilities
even further by building on Apache's brand and its large, growing community of developers
from all over the world. In short, bringing OODT into Apache will significantly enhance OODT's
widespread use, will likely improve its codebase, and will help the Apache philosophy
and community spread into OODT's already large community base, which reaches across government,
academia and industry.
  
- OODT will be, to the best of our knowledge, the ''first'' grid community project to bear
the Apache brand. By ''grid'' technology, we mean a technology that provides the ability to
create ''virtual organizations'', as originally described by Kesselman and Foster in their
[[http://www.globus.org/alliance/publications/papers/anatomy.pdf|seminal paper on grid computing]].
OODT provides both computational ''and'' data grid support, and is built with a component-philosophy.
OODT includes components that allow for virtual information integration across organizations
(provided by the ''Profile'', ''Product'' and ''Query'' server components), and that allow
for distributed data management and processing across heterogeneous virtual organizations
(provided by the Catalog and Archive Service set of components, including ''File Manager'',
''Workflow Manager'' and ''Resource Manager''). 
+ OODT will be, to the best of our knowledge, the ''first'' grid community project to bear
the Apache brand. By ''grid'' technology, we mean a technology that provides the ability to
create ''virtual organizations'', as originally described by Kesselman and Foster in their
[[http://www.globus.org/alliance/publications/papers/anatomy.pdf|seminal paper on grid computing]].
OODT provides both computational ''and'' data grid support, and is built with a component-philosophy.
OODT includes components that allow for virtual information integration across organizations
(provided by the ''Profile'', ''Product'' and ''Query'' server components), and that allow
for distributed data management and processing across heterogeneous virtual organizations
(provided by the Catalog and Archive Service set of components, including ''File Manager'',
''Workflow Manager'' and ''Resource Manager'').
  
  Each set of components exists as an independently organized Maven2 project that references
the others (where appropriate), forming a layered set of components and a framework for grid
computing.
  
  === Background ===
- 
  The initial idea for the Tika project was voiced in April 2006 by Jérôme Charron and Chris
A. Mattmann on the Nutch mailing list. The Nutch parser framework and other content analysis
features were seen as value-added components that would also benefit other projects. The idea
received positive feedback, but lacked momentum.
  
  The idea was revisited in August 2006 when Jukka Zitting from the Jackrabbit project contacted
Nutch for possible cooperation with similar ideas. The original Tika idea gained extra momentum
and a Google Code project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial commits the activity again
declined.
@@ -42, +39 @@

  This proposal is the result of the above efforts and related discussions both in private
and on various public forums. Some alternatives to incubation, like [[http://labs.apache.org/|Apache
Labs]] or [[http://jakarta.apache.org/commons/|Jakarta Commons]], came up during the discussions
but we believe that taking the project to the Incubator is the best way to start growing a
viable community to sustain the Tika toolkit.
  
  === Rationale ===
- 
  There is ever more demand for tools that automatically analyze and index documents in various
formats. Search engines, content repositories, and other tools often need to extract metadata
and text content from documents given as little more than a simple octet stream.
While there are a number of existing parser libraries for various document types, each of
them comes with a custom API and there are no generic tools for automatically determining
which parser to use for which documents. Currently many projects end up creating their custom
content analysis and extraction tools.
  
  The Tika project attempts to remove this duplication of efforts. We believe that by pooling
the efforts of multiple projects we will be able to create a generic toolkit that exceeds
the capabilities and quality of the custom solutions of any single project. A generic toolkit
project will also provide common ground for the developers of parser libraries and content
applications to interact.
  
  === Initial Goals ===
- 
  The initial goals of the proposed project are:
  
   * Viable community around the Tika codebase
@@ -58, +53 @@

   * Java implementations of the metadata standards mentioned below
  
  == Current Status ==
- 
  === Meritocracy ===
- 
  All the initial committers are familiar with the meritocracy principles of Apache, and have
already worked on the various source codebases. We will follow the normal meritocracy rules
also with other potential contributors.
  
  === Community ===
- 
  There is not yet a clear Tika community. Instead we have a number of people and related
projects with an understanding that a shared toolkit project would best serve everyone's interests.
The primary goal of the incubating project is to build a self-sustaining community around
this shared vision.
  
  === Core Developers ===
- 
  The initial set of developers comes from various backgrounds, with different but compatible
needs for the proposed project.
  
  === Alignment ===
- 
  As a generic toolkit, Tika will likely be widely used by various open source and commercial
projects, both together with and independently of other Apache tools like Lucene Java or Jakarta
POI. Other Apache projects like Nutch and Jackrabbit are potential candidates for using Tika
as an embedded component.
  
  == Known Risks ==
- 
  === Orphaned products ===
- 
  There are a number of projects at various stages of maturity that implement a subset of
the proposed features in Tika. For many potential users the existing tools are already enough,
which reduces the demand for a more generic toolkit. This can also be seen in the slow progress
of this proposal over the past year.
  
  However, once the project gets started we can quickly reach the feature level of existing
tools based on seed code from the sources mentioned below. After that, we believe we can
quickly grow the developer and user communities based on the benefits of a generic toolkit
over custom alternatives.
  
  === Inexperience with Open Source ===
- 
  All the initial developers have worked on open source before and many are committers and
PMC members within other Apache projects.
  
  === Homogeneous Developers ===
- 
  The initial developers come from a variety of backgrounds and with a variety of needs for
the proposed toolkit.
  
  === Reliance on Salaried Developers ===
- 
  Some of the developers are paid to work on this or related projects, but the proposed project
is not the primary task for anyone.
  
  === Relationships with Other Apache Products ===
- 
  Tika is related to at least the following Apache projects. None of the projects is a direct
competitor for Tika, but there are many cases of potential overlap in functionality.
  
   * [[http://lucene.apache.org/java/|Apache Lucene]] - The analysis part of Lucene contains
code that might overlap with some of the potential Tika functionality. There might also be
some overlap regarding the Document model in Lucene.
   * [[http://lucene.apache.org/nutch/|Lucene Nutch]] - The Nutch project already contains
a parser framework that does many of the things that Tika is designed to do.
   * [[http://jackrabbit.apache.org/|Apache Jackrabbit]] - The Jackrabbit project contains
a text extraction component that also implements a subset of the proposed Tika features.
-  *  [[http://incubator.apache.org/uima/|Apache UIMA]] - The UIMA project provides a framework
and pluggable tools for analyzing text content and extracting information. Example tools include
language identification, sentence boundary detection and "entity extraction" - finding references
to people, places and organisations. Tika could be used by UIMA to parse text but Tika should
be careful not to duplicate the subsequent text analysis features UIMA offers.
+  * [[http://incubator.apache.org/uima/|Apache UIMA]] - The UIMA project provides a framework
and pluggable tools for analyzing text content and extracting information. Example tools include
language identification, sentence boundary detection and "entity extraction" - finding references
to people, places and organisations. Tika could be used by UIMA to parse text but Tika should
be careful not to duplicate the subsequent text analysis features UIMA offers.
  
  === An Excessive Fascination with the Apache Brand ===
- 
  All of us are familiar with Apache and we have participated in Apache projects as contributors,
committers, and PMC members. We feel that the Apache Software Foundation is a natural home
for a project like this.
  
  == Documentation ==
- 
  There are bits and pieces of design discussions and other documentation around, see for
example the following:
  
   * August 2006 [[http://thread.gmane.org/gmane.comp.search.nutch.devel/9685|nutch-dev: Parser
design]]
@@ -122, +104 @@

  See also the potential parser libraries listed below for details on the various document
formats that Tika plans to support.
  
  == Initial Source ==
- 
  Tika will start with a combination of seed code from the efforts listed below:
  
   * The [[http://lucene.apache.org/nutch|Apache Nutch]] project that contains a parser framework
and various content analysis tools
@@ -132, +113 @@

  No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
  == Source and Intellectual Property Submission Plan ==
- 
  All seed code and other contributions will be handled through the normal Apache contribution
process.
  
  We will also contact other related efforts for possible cooperation and contributions.
  
  == External Dependencies ==
+ Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
+ ||'''Library''' ||'''License''' ||
+ ||[[http://jakarta.apache.org/poi/|Jakarta POI]] ||ASLv2 ||
+ ||[[http://www.pdfbox.org/|PDFBox]] ||BSD ||
+ ||[[http://people.apache.org/~andyc/neko/doc/html/index.html|NekoHTML]] ||!CyberNeko (like ASL) ||
+ ||[[http://jtidy.sourceforge.net/|JTidy]] ||W3C ||
  
- Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
  
+ 
- || '''Library'''                                                       || '''License'''        ||
- || [[http://jakarta.apache.org/poi/|Jakarta POI]]                        || ASLv2                ||
- || [[http://www.pdfbox.org/|PDFBox]]                                     || BSD                  ||
- || [[http://people.apache.org/~andyc/neko/doc/html/index.html|NekoHTML]] || !CyberNeko (like ASL) ||
- || [[http://jtidy.sourceforge.net/|JTidy]]                               || W3C                  ||
  
  There are also some LGPL parser libraries that would be useful. Whether and how such dependencies
could be handled will be discussed during incubation. No such dependencies will be added to
the project before the legal implications have been cleared.
  
  == Cryptography ==
- 
  Tika itself will not use cryptography, but it is possible that some of the external parser
libraries will include cryptographic code to handle features like DRM in various document
formats.
  
  == Required Resources ==
- 
  Mailing lists
  
-  * tika-dev@incubator.apache.org
+  * oodt-dev@incubator.apache.org
-  * tika-commits@incubator.apache.org
+  * oodt-commits@incubator.apache.org
-  * tika-private@incubator.apache.org
+  * oodt-private@incubator.apache.org
  
  Subversion Directory
  
-  * https://svn.apache.org/repos/asf/incubator/tika
+  * https://svn.apache.org/repos/asf/incubator/oodt
  
  Issue Tracking
  
-  * JIRA Tika (TIKA)
+  * JIRA OODT (OODT)
  
  Other Resources
  
-  * none
+  * OODT Wiki [[http://wiki.apache.org/oodt|http://wiki.apache.org/oodt/]]
  
  == Initial Committers ==
+ ||'''Name''' ||'''Email''' ||'''CLA''' ||
+ ||Rida Benjelloun ||rida dot benjelloun at doculibre dot com ||yes ||
+ ||Mark Harwood ||mharwood at apache dot org ||yes ||
+ ||Chris A. Mattmann ||mattmann at apache dot org ||yes ||
+ ||Sami Siren ||siren at apache dot org ||yes ||
+ ||Jukka Zitting ||jukka at apache dot org ||yes ||
  
- || '''Name'''        || '''Email'''                              || '''CLA'''        ||
- || Rida Benjelloun   || rida dot benjelloun at doculibre dot com || yes              ||
- || Mark Harwood      || mharwood at apache dot org               || yes              ||
- || Chris A. Mattmann || mattmann at apache dot org               || yes              ||
- || Sami Siren        || siren at apache dot org                  || yes              ||
- || Jukka Zitting     || jukka at apache dot org                  || yes              ||
  
  == Affiliations ==
+ ||'''Name''' ||'''Affiliation''' ||
+ ||Rida Benjelloun ||[[http://www.doculibre.com/index_en.html|Doculibre inc.]] ||
+ ||Chris A. Mattmann ||[[http://www.jpl.nasa.gov/|NASA Jet Propulsion Laboratory]] ||
+ ||Jukka Zitting ||[[http://www.day.com/|Day Management AG]] ||
  
- || '''Name'''        || '''Affiliation'''                                         ||
- || Rida Benjelloun   || [[http://www.doculibre.com/index_en.html|Doculibre inc.]]   ||
- || Chris A. Mattmann || [[http://www.jpl.nasa.gov/|NASA Jet Propulsion Laboratory]] ||
- || Jukka Zitting     || [[http://www.day.com/|Day Management AG]]                   ||
  
  == Sponsors ==
- 
  Champion
  
   * Jukka Zitting (jukka at apache dot org)

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

