incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "OODTProposal" by chrismattmann
Date Thu, 31 Dec 2009 22:16:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "OODTProposal" page has been changed by chrismattmann.
The comment on this change is: first full draft, complete..
http://wiki.apache.org/incubator/OODTProposal?action=diff&rev1=7&rev2=8

--------------------------------------------------

  = OODT, a framework for science data processing, information integration, and retrieval. =
  === Abstract ===
- OODT is a grid middleware framework used on a number of successful projects at [[http://www.jpl.nasa.gov|NASA's
Jet Propulsion Laboratory]], and many other research institutions and universities, specifically
those part of the:
+ OODT is a grid middleware framework used on a number of successful projects at [[http://jpl.nasa.gov|NASA's
Jet Propulsion Laboratory/California Institute of Technology]], and many other research institutions
and universities, specifically those part of the:
  
   * [[http://cancer.gov/edrn|National Cancer Institute's (NCI's) Early Detection Research
Network (EDRN)]] project - more than 40 institutions all performing research into discovering
biomarkers which are early indicators of disease.
   * [[http://pds.nasa.gov|NASA's Planetary Data System (PDS)]] - NASA's planetary data archive,
a repository and registry for all planetary data collected over the past 30+ years.
@@ -30, +30 @@

  Each set of components exists as independently organized Maven2 projects that reference
each other (where appropriate), forming a layered set of components and a framework for grid
computing.
  
  === Background ===
- OODT is an established project within NASA JPL and in use at several NASA centers, as well
as universities, and other government organizations and industrial collaborations. Chris Mattmann,
a JPL employee, and ASF PMC (Lucene) and Committer (Nutch, Tika), has been working for the
past 2 years on obtaining the necessary permission from JPL to release OODT into Apache. After
initially being stalled, JPL has granted permission to allow OODT into Apache.
+ OODT is an established project within JPL and in use at several NASA centers, as well as
universities, and other government organizations and industrial collaborations. Chris Mattmann,
a JPL employee, and ASF PMC (Lucene) and Committer (Nutch, Tika), has been working for the
past 2 years on obtaining the necessary permission from JPL to release OODT into Apache. After
initially being stalled, JPL has granted permission to allow OODT into Apache.
  
  Through his academic relationship with Justin Erenkrantz, Apache President, and through their
collective Ph.D. studies, OODT has been discussed between Chris and Justin on several occasions,
and Justin offered to help champion OODT into the Apache Incubator when JPL was ready to release
OODT. In December 2009, that permission was granted.
  
- This proposal is the result of the above efforts and related discussions. Some alternatives
to incubation, like [[http://labs.apache.org/|Apache Labs]] came up during the discussions
but we believe that taking the project to the Incubator is the best way to start growing a
viable Apache-based community to sustain OODT. Furthermore, given its larger code base and
existing sub-projects, the goal would be for OODT to leverage the incubator to graduate into
Apache's first top-level grid project, rather than graduate into a sub-project.
+ This proposal is the result of the above efforts and related discussions. Some alternatives
to incubation, like [[http://labs.apache.org/|Apache Labs]] came up during the discussions
but we believe that taking the project to the Incubator is the best way to start growing a
viable Apache-based community to sustain OODT. Furthermore, given its larger code base and
existing sub-projects, the goal would be for OODT to leverage the incubator to graduate into
Apache's first top-level grid project, rather than graduate into a sub-project of an existing
TLP.
  
  === Rationale ===
- There is ever more demand for tools that automatically analyze and index documents in various
formats. Search engines, content repositories, and other tools often need to extract metadata
and text content from documents given as nothing or little else than a simple octet stream.
While there are a number of existing parser libraries for various document types, each of
them comes with a custom API and there are no generic tools for automatically determining
which parser to use for which documents. Currently many projects end up creating their custom
content analysis and extraction tools.
+ Grid computing has been around for the past 10 years and has gained widespread recognition
and attention in industry and academia. Scientific collaborations are increasingly virtual
and require the capabilities (data and compute) of thousands of computers and resources that
span organizations. There are a number of existing grid technologies ([[http://globus.org|Globus]]
being the most popular, [[http://www.dspace.org/|DSpace]], [[http://irods.org|iRODS/Storage
Resource Broker]], [[http://wwwp.dnsalias.org/w/images/3/3f/AnatomyPhysiologyGridRevisited66.pdf|see
this paper]] for a full study), however Apache has '''no current grid technology''' under
its umbrella and world-renowned think tank. Moreover, efforts are few and far between in terms
of standing up Apache-based software that is applicable to the scientific community and grid
community outside of use of fine-grained components in these systems. Other open source organizations
(e.g., the [[http://go-essp.gfdl.noaa.gov/|Global Organization for Earth System Science Portals,
GO-ESSP]]) have embraced the construction of such technology and there is a lot of work going
on, e.g., at NOAA. This proposal aims to remedy this fact and to bring scientific data management/grid
software into the Apache family and its worldwide community.
  
- The Tika project attempts to remove this duplication of efforts. We believe that by pooling
the efforts of multiple projects we will be able to create a generic toolkit that exceeds
the capabilities and quality of the custom solutions of any single project. A generic toolkit
project will also provide common ground for the developers of parser libraries and content
applications to interact.
+ OODT is a widely successful grid project with applicability and existing deployments across
broad-reaching domains (planetary and earth sciences, cancer research/biomedicine, climate
modeling and atmospheric science, etc.). The marriage of OODT and Apache will engender OODT's
widespread, global use via the Apache brand, and will make Apache a player in the grid/scientific
data community.
  
  === Initial Goals ===
  The initial goals of the proposed project are:
  
-  * Viable community around the Tika codebase
+  * Stand up a sustaining Apache-based community around the OODT codebase.
-  * Active relationships and possible cooperation with related projects and communities
+  * Active relationships and possible cooperation with related projects and communities.
-  * Generic parser API for extracting structured text content from various document formats
-  * Flexible metadata detection and extraction API
-  * Java implementations of the metadata standards mentioned below
+  * Refactor and bring up-to-date the OODT profile and product server components.
+  * Explore various underlying communication substrates. OODT currently uses REST (via its
[[http://oodt.jpl.nasa.gov/web-grid/|Web-Grid]] component).
+  * Create configuration-based OODT deployments. Currently the deployments are primarily
code-based, or the configuration is strewn about the various sub-components. The goal would
be to bring this configuration under a single umbrella project. The idea would be to create
science data pipelines from configuration.
+  * Explore Python-based client and server implementations of OODT and implementations in
other languages (Ruby).
  
  == Current Status ==
  === Meritocracy ===
- All the initial committers are familiar with the meritocracy principles of Apache, and have
already worked on the various source codebases. We will follow the normal meritocracy rules
also with other potential contributors.
+ Many of the proposed initial committers are familiar with the meritocracy principles of
Apache, and have already worked on the various source codebases (contributing via patches,
emails, JIRA issues, and, in Mattmann's case, as a Nutch and Tika committer and Lucene PMC
member). We will follow the normal meritocracy rules also with other potential contributors.
  
  === Community ===
- There is not yet a clear Tika community. Instead we have a number of people and related
projects with an understanding that a shared toolkit project would best serve everyone's interests.
The primary goal of the incubating project is to build a self-sustaining community around
this shared vision.
+ There is an existing, established community of developers and users of OODT across more than
40 centers at NASA, NIH, DOE and academia; however, there is no Apache OODT community as of
yet. The principal goal of this effort is to leverage the Apache Incubator to grow an Apache
community base (in addition to OODT's existing community), to build a self-sustaining
community around this shared vision, and eventually to reach Apache TLP status for OODT. With many
sub-projects (CAS, Product/Profile servers, Query Server, Web-grid, commons, etc.), OODT should
attract a broad audience of developers with various interests.
  
  === Core Developers ===
- The initial set of developers comes from various backgrounds, with different but compatible
needs for the proposed project.
+ The initial set of developers comes from NASA JPL, with various backgrounds and different
but compatible needs for the proposed project. JPL is home to data management and grid projects
spanning the domains of cancer research/bioinformatics, earth science, planetary science,
astrophysics, and climate modeling.
  
  === Alignment ===
- As Apache's first grid-based framework will likely be widely used by various open source
and commercial projects both together with and independent of other Apache tools like Lucene
Java or Jakarta POI. Other Apache projects like Nutch and Jackrabbit are potential candidates
for using Tika as an embedded component.
+ As Apache's first grid-based framework, OODT will likely be widely used by various open source,
scientific and commercial projects, both together with and independent of other Apache tools.
With OODT's existing community we will also bring developers and organizations outside of
Apache into the Apache ecosystem.
  
  == Known Risks ==
  === Orphaned products ===
- OODT has supported its self through successful deployments at NASA, at the U.S. National
Institutes of Health (NIH), and recently at DOE-based laboratories and at academic centers.
Further, OODT has been an active participant in IEEE/ACM-based conferences and meetings/journal
publications over the past 9 years. There is active support on several existing NASA earth
science missions, and the team at JPL is experienced and will continue to champion and develop
OODT in the Apache area.
+ OODT has supported itself through successful deployments at NASA, at the U.S. National Institutes
of Health (NIH), and recently at DOE-based laboratories and at academic centers. Further,
OODT has been an active participant in IEEE/ACM-based conferences and meetings/journal publications
over the past 9 years. There is active support on several existing NASA earth science missions,
and the team at JPL is experienced and will continue to champion and develop OODT in the Apache
area.
  
- Our goal is to take OODT from the early stage of Incubation into a thriving Apache top-level
project, and leverage it in the existing manner at NASA, the NIH, at DOE, and in academia
and industry. Since OODT is a grid framework, it depends on many external services and projects,
no one of which controls OODT's code-base.
+ Our goal is to take OODT from the early stage of Apache Incubation into a thriving Apache
top-level project, and leverage it in the existing manner at NASA, the NIH, at DOE, and in
academia and industry. Since OODT is a grid framework, it depends on many external services
and projects, no one of which controls OODT's code-base.
  
- We feel that the time is ripe to bring OODT into Apache and to grow the community of developers
who maintain OODT. We feel that Incubation will bring a slew of industry-based developers
(and even those in academia and government) who have no prior experience with OODT, but who
could use OODT at their jobs. We want to attract such developers to become part of the core
OODT development team and its project management.
+ We feel that the time is ripe to bring OODT into Apache and to grow the community of developers
who maintain OODT. We feel that Incubation will bring a slew of industry-based developers
(and even those in academia and government) who have no prior experience with OODT, but who
could use OODT at their jobs and who are attracted to the brand name and community that Apache
brings. We want to attract such developers to become part of the core OODT development team
and its project management.
  
  === Inexperience with Open Source ===
  All the initial developers have worked on open source before and at least one (Mattmann)
is a committer and PMC member in the Apache Lucene ecosystem. Sean Kelly is a well-respected
Plone committer and has made several open source contributions over the years to FreeBSD and
other software. Foster, McCleese and Woollard have all contributed to Apache projects by way
of email, mailing lists, issue reporting and testing.

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

