oodt-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Approaching OODT as a new user
Date Wed, 12 Jan 2011 15:42:21 GMT
Hi Scott,

Thanks for your detailed and informative email, giving us the user perspective! 

My comments inline below:

On Jan 11, 2011, at 4:43 PM, Scott Konzem wrote:

> First of all, I'd like to congratulate OODT on becoming a top level project and NASA
for making this project available. Thank you!

No problemo! We're very happy to be working on OODT in open source, with the rest of the community!

> 
> From all the nasa.gov email addresses around here, I get the impression that in the early
days of this project, most of the developers and users have been in direct contact or even
within the same organization, so I'd like to share my experience as a complete outsider. 
I am familiar with the challenges of managing research data at a large organization with many
research groups, so I've been trying to figure out what OODT does and what it could do for
me.  So far most of what I've found has been written either at a very abstract level for managers
(the TLP press release and the OODT main page) or a very detailed level for developers (the
javadocs). I haven't seen much so far for the "data people" in the middle -- the people who
need enough technical detail to put the system into practice because they're tired of coding
their own.  This is my experience trying to get that information.

Sorry that you've had that experience so far. The guide for the file manager that you stumbled
upon below is an effort to start addressing some of those concerns. I agree that much of the
documentation as it stands is Javadoc-type documentation or high-level architecture, but
I'd also point you to the other guides like that one. In fact, many of the OODT
components have a few such guides that can help out at least in getting started. I'll say
more about these in my reply to your next paragraph, because they are more applicable there.

> 
> The website has a lot of stub pages for the individual components, so I thought that
I might be able to get some more information by downloading and running the software.  This
started as a NASA project, so there have to be stacks of documentation somewhere, right? 
I downloaded the trunk and built it using the instructions I eventually found on the File
Manager page (http://oodt.apache.org/components/maven/filemgr/user/basic.html), but now I
have a directory with a bunch of folders in it, and I have no idea what to do with them. 
The only tutorial I can find is for the File Manager -- which I very much appreciate, even
though it doesn't completely work for me -- and there are only two files named README.txt
in the entire project.

Thanks. Can you elaborate on what part of the guide doesn't completely work? 

The filemgr, workflow, and resource components are 3 sort of canonical services that help
you implement data processing and management. File Manager tracks file locations, their metadata,
handles data transfers, and provides the ability to transform that captured metadata in a
variety of ways (e.g., output it as RSS or RDF via the cas-product webapp), and to deliver
those files and metadata to folks who ask for them. The workflow manager is a light-weight
wrapper where you can cook up control flow and data flow (sets of Tasks chained together)
in XML files. You can execute those Tasks locally on a single machine, or you can plug the
workflow manager into a resource manager and have those tasks distributed out onto a cluster,
a cloud, a grid or whatever type of hardware you have to execute processes and jobs on. These
components are useful independently of one another. In fact, they don't have
any direct dependencies on one another unless you configure them that way. What that means is that you
can use the filemgr as an independent component simply to programmatically capture information
about files and metadata, and never do anything with them that involves a workflow manager
or resource manager. You can simply use the workflow system if you want, independent of the
filemgr or resource manager; you can use the resource manager similarly.
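
To make the "Tasks chained together in XML" idea a bit more concrete, here is a rough sketch
in Python. It is NOT the OODT API or the real workflow XML schema; the element names, the
task names and the run_workflow helper are all made up for illustration. It only shows the
general shape: an XML file lists tasks in order (control flow), and each task receives the
metadata produced by the one before it (data flow).

```python
# Illustrative sketch only: this is NOT the OODT API or its XML schema.
import xml.etree.ElementTree as ET

# Hypothetical workflow description: an ordered list of task names.
WORKFLOW_XML = """
<workflow name="demo">
  <task name="extract"/>
  <task name="calibrate"/>
</workflow>
"""

def extract(metadata):
    # Pretend to pull a raw value out of the input file.
    return {**metadata, "raw": 10}

def calibrate(metadata):
    # Uses the value produced by the previous task (data flow).
    return {**metadata, "calibrated": metadata["raw"] * 2}

# Registry mapping task names used in the XML to Python callables.
TASKS = {"extract": extract, "calibrate": calibrate}

def run_workflow(xml_text, metadata):
    """Run each task named in the XML, in document order (control flow)."""
    root = ET.fromstring(xml_text)
    for task in root.findall("task"):
        metadata = TASKS[task.get("name")](metadata)
    return metadata

result = run_workflow(WORKFLOW_XML, {"file": "granule.dat"})
```

In this toy version everything runs in one process; handing each task to a separate
scheduler instead of calling it directly is roughly where a resource manager would come in.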

However, when you put these 3 services together, you start to have a really powerful substrate
to perform data management system functions on. For example, the crawler framework combines
the power of automatic file identification, and ingestion, with the file manager, to rapidly
build up your file manager based archive and catalog; it also provides the ability to notify
the workflow manager when files are ingested to kick off tasks and processes (algorithms)
associated with the ingestion of those files. The pushpull framework is a remote content acquisition
system that can go get you ancillary files and metadata, pull them down locally, and feed
them to the crawler for ingestion and management in your data management system. Finally, the
PGE component is a specialized workflow task jar library that, when dropped into
the workflow manager's lib directory, gives you a high-powered workflow task that can easily
communicate with the filemgr, workflow manager or resource manager, and feed information to
your algorithm that otherwise you'd have to write lots of specialized data management code
for.
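
As a rough picture of the crawl, ingest and notify pattern described above, here is a small
Python sketch. Again, none of this is OODT code: the crawl_and_ingest function, the dict
used as a "catalog" and the callback are hypothetical stand-ins for the crawler, the file
manager and the workflow manager notification.

```python
# Illustrative sketch only: hypothetical names, not the OODT crawler API.
import os
import tempfile

def crawl_and_ingest(root_dir, catalog, on_ingest):
    """Walk root_dir, record each file in catalog, then notify a listener."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # "Ingest": remember where the file lives plus some basic metadata.
            catalog[name] = {"location": path, "size": os.path.getsize(path)}
            # "Notify": let a downstream system react to the new file.
            on_ingest(name)

catalog = {}
notified = []

# Build a throwaway staging area with two small files, then crawl it.
with tempfile.TemporaryDirectory() as staging:
    for name, body in [("a.dat", b"12345"), ("b.dat", b"1")]:
        with open(os.path.join(staging, name), "wb") as fh:
            fh.write(body)
    crawl_and_ingest(staging, catalog, notified.append)
```

Here the listener just appends to a list; in a real deployment that hook is the point
where ingestion would kick off processing of the new file.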

The above is a description of what *one set* of OODT components (the CAS family) does; there's
a whole other set of those components that handle information integration. The use case here
is that you have a bunch of existing databases or data systems that you'd like to link together,
but you don't control their population, schema, or business processes associated with them.
In this case, we have the profile (metadata) and product (data) server components, which expose
the underlying metadata and data from these systems and make it easily available for query,
representation and dissemination. Profile and product servers run on top of the web-grid WAR
file, a Tomcat webapp that turns them into RESTful services. The best place to get started
here is to look at:

http://oodt.apache.org/components/maven/grid/slides.pdf

NOTE: those slides were made pre-Apache OODT, so some of them will contain old properties
and paths for Web Grid, but should still give you an idea of what's going on. The Apache OODT
web-grid is basically the same component that you see in those slides.

Once you are familiar with web-grid, there are a few custom, extensible profile and product
server handlers that we have been working on. xmlps (available as a top-level OODT module)
is an XML-configurable product/profile server that can easily connect to JDBC-accessible databases
and dump out the bits and metadata from them. OPeNDAPPs is an XML-configurable profile server
that can connect to OPeNDAP-accessible data servers and extract metadata and data from them.
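
For a feel of what "exposing metadata from a database you don't control" can look like,
here is a minimal Python sketch. It is not xmlps and not its configuration format: the
in-memory SQLite table, the FIELD_MAP dictionary (standing in for a config file) and the
query_profiles function are all assumptions made for the example.

```python
# Illustrative sketch only: not xmlps, and not its configuration format.
import sqlite3

# Stand-in for an existing database whose schema we do not control.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (spec_id INTEGER, org TEXT)")
conn.executemany("INSERT INTO specimens VALUES (?, ?)",
                 [(1, "lung"), (2, "liver")])

# Hypothetical mapping from exposed field names to native column names.
FIELD_MAP = {"id": "spec_id", "organ": "org"}

def query_profiles(conn, field, value):
    """Return matching rows as dicts keyed by the exposed field names."""
    column = FIELD_MAP[field]  # only configured fields may be queried
    select = ", ".join(f"{col} AS {name}" for name, col in FIELD_MAP.items())
    rows = conn.execute(
        f"SELECT {select} FROM specimens WHERE {column} = ?", (value,))
    names = [d[0] for d in rows.description]
    return [dict(zip(names, row)) for row in rows]

profiles = query_profiles(conn, "organ", "liver")
```

The point of the mapping is that callers query by the exposed names ("id", "organ") and
never need to know the native column names, which stay under the database owner's control.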

> 
> As a result, I still have a lot of very basic questions:  What do I do with all of these
components? What do they all do?  Which ones do I need, and which are optional? Are they standalone
executables?  Web services that require some sort of container?  Do I interact with them using
the command line, or do they have web or web services interfaces?  What are the configuration
options?  What kinds of data and metadata can I manage? What kinds of roles do I need to have
within my organization (administrator, content owner, metadata maintainer), and how does the
software handle these? What do I want to do that this project can't? (In this type of software,
there's always something that's just a little too specific to the original purpose or organization.)

Hopefully what I mentioned above will give you a basic idea of what's going on. Apache OODT
is a framework that by itself doesn't build your data system for you; it needs some TLC from
a person like you who knows your data system requirements and can map them onto
the specific OODT components, and the resulting architecture, that fit your application.

Check out what I mentioned above, and then if you need more help just jump on the list and let
us know. At that point it would be nice if you could give us some more detail about what you
are actually trying to do in terms of data management, etc., as that would give us a better
idea of how to help you configure and use OODT for your specific case.

> 
> OODT claims to have a large user community apart from the original developers.  How did
it come to be that these organizations and individuals knew how to use the software?  What
sort of documentation and support did the developers need to provide in order to get them
up and running?  How can I get some of that? :)

Like Dave Kale mentioned in his email, a lot of the work to date has come from collaborative
research grants and shared effort on projects with folks working in the organizations that
have used OODT. Now that it's here at Apache many of those folks are lurking on these lists,
and available to help out and discuss issues with the software, etc., also in the hopes that
it will help out their specific deployments.

> 
> Again, I'm very grateful that this product exists and am excited to find out more about
it.  Thanks for making it available for me to puzzle over!

Thanks for your email and welcome!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

