incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "OODTProposal" by chrismattmann
Date Sun, 20 Dec 2009 19:14:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "OODTProposal" page has been changed by chrismattmann.
The comment on this change is: Update to the proposal section.


  === Proposal ===
- The Tika content analysis toolkit will include features for detecting the content types,
character encodings, languages, and other characteristics of existing documents and for extracting
structured text content from the documents.
+ OODT is an established open source project, with 9+ years of existence and deployments at
universities, federal research institutions, other NASA centers, and the NIH (it was runner-up for
NASA Software of the Year in 2003). It has a strong community of people who operate it and support
its growth. Our proposal is to bring OODT into Apache to further strengthen its support and its
capabilities by building on Apache's brand and its large, growing community of developers
from all over the world. In short, bringing OODT into Apache will significantly broaden OODT's
already widespread use, will likely improve its codebase, and will help the Apache philosophy
and community spread into OODT's already large community base, which reaches across government,
academia, and industry.
- The toolkit is targeted especially for search engines and other content indexing and analysis
tools, but will be useful also for other applications that need to extract meaningful information
from documents that might be presented as nothing else than binary streams.
+ OODT will be, to the best of our knowledge, the ''first'' grid community project to bear
the Apache brand. By ''grid'' technology, we mean a technology that provides the ability to
create ''virtual organizations'', as originally described by Kesselman and Foster in their
[[|seminal paper on grid computing]].
OODT provides both computational ''and'' data grid support, and is built around a component-based philosophy.
OODT includes components that allow for virtual information integration across organizations
(provided by the ''Profile'', ''Product'' and ''Query'' server components), and that allow
for distributed data management and processing across heterogeneous virtual organizations
(provided by the Catalog and Archive Service set of components, including ''File Manager'',
''Workflow Manager'' and ''Resource Manager''). 
- Instead of implementing its own document parsers, Tika will use existing parser libraries
like [[|Jakarta POI]] and [[|PDFBox]].
+ Each set of components exists as an independently organized Maven2 project; the projects reference
each other (where appropriate), forming a layered set of components and a framework for grid
  === Background ===
