oodt-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: NCT3 Data ingest - revisit aggregation of IPs and eliminate unwanted data types?
Date Thu, 05 May 2011 02:11:01 GMT
[replying to dev@oodt.apache.org, since I think this conversation could help users who are
thinking about similar things]

> Ok, since you jumped in, maybe you can elaborate.
> 
> How would we implement a process in the crawler to perform a 200-1 down
> sample of push-pull downloaded files to aggregated, ingested products,
> without involving other PCS components, e.g., filemgr, workflow, etc.?
> 
> The production rule would be to gather and wait for all (or maybe just
> select the optimal set of) temporally coincident files (in this case 16 ~30
> sec files spanning 8 min), simultaneously corresponding to ~12 different
> file types, using some rule-based modulo-time boundary.

What would the down-select involve? Throwing out the files that don't meet the criteria? Or
still archiving them, just not treating them together as a whole?
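
For the modulo-time boundary piece, the grouping rule I'd picture is something like the sketch
below (rough and untested, just for illustration -- the class name and the 8-minute window are
mine, pulled from your description, not from any existing pipeline code): bucket each file's
start time into its window and aggregate whatever lands in the same bucket.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ModuloTimeBoundary {

  // 8-minute aggregation window (16 x ~30 sec files), per your description.
  private static final long WINDOW_MILLIS = 8L * 60L * 1000L;

  /**
   * Maps a file's start time to the UTC start of its 8-minute window.
   * Files whose start times share a key belong to the same aggregate.
   */
  public static String windowKeyFor(Date fileStartTime) {
    long bucketStart = (fileStartTime.getTime() / WINDOW_MILLIS) * WINDOW_MILLIS;
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return fmt.format(new Date(bucketStart));
  }
}

An aggregate for a given key would then be "ready" once each of the ~12 file types has its 16
files for that window, or once your timeout expires.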

> 
> Perhaps one simplification to this problem would be to trigger the
> processing (and even better, derive the time boundaries) based on crawling a
> separate file type that we expect would be delivered at the desired temporal
> resolution.

Yep, that's one way to do it. That's how we created the FTS pipeline in OCO: we had a separate
("mock") product, called FTSSavesetDir, that we ingested (and on ingest, notified the WM that
processing should occur). We controlled how and when these FTSSavesetDirs were made, and when
they got moved into the appropriate staging area, with the appropriate FTSSavesetDirCrawler
watching for them.
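
The crawler side of that kind of setup can be a small post-ingest action that pokes the WM.
A rough, untested sketch of such an action is below (the class, property names, and URLs here
are illustrative and from memory -- double-check them against your OODT version):

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.metadata.Metadata;
import org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManagerClient;

/**
 * Post-ingest crawler action that fires a workflow event when the
 * "mock" aggregation product (an FTSSavesetDir-style directory) lands.
 * Wire it into the crawler's action beans and bind it to the
 * postIngestSuccess phase.
 */
public class NotifyWorkflowOnIngest extends CrawlerAction {

  private String workflowUrl; // e.g. http://localhost:9001 (set via Spring)
  private String eventName;   // workflow event to fire, e.g. "StartAggregation"

  public boolean performAction(File product, Metadata productMetadata)
      throws CrawlerActionException {
    try {
      XmlRpcWorkflowManagerClient wm =
          new XmlRpcWorkflowManagerClient(new URL(workflowUrl));
      // Hand the crawled metadata to the WM so downstream processing can
      // derive its time boundaries from it.
      return wm.sendEvent(eventName, productMetadata);
    } catch (Exception e) {
      throw new CrawlerActionException(e.getMessage());
    }
  }

  // No-op; some crawler versions declare an abstract validate().
  public void validate() throws CrawlerActionException {
  }

  public void setWorkflowUrl(String workflowUrl) { this.workflowUrl = workflowUrl; }
  public void setEventName(String eventName) { this.eventName = eventName; }
}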

>  After the aggregate product is generated, the executing process
> would need to move all its 200 input files out of the push pull staging area
> to a separate disk area for storage. But we would still want this process to
> wait on executing until it got all its expected input files (or reached some
> appropriate time out) before creating its product.

Wouldn't one way to do this just be to use Versioning? It sounds like you have a set of files
that you'd like archived to the "nominal" archive (i.e., the one defined by a versioner, maybe
your NPP PEATE std one) -- and then a set that you still want archived, but to a separate disk
area for storage, correct?

One way to do this would be simply to create a ShadowProduct (i.e., one you don't care about
from an ops perspective), and then archive the 200 input files as this "ShadowProduct", with a
versioner that dumps them onto the separate disk for storage, outside of your std product
archive.
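
The versioner for that ShadowProduct could be pretty dumb. Something along these lines (a
rough, untested sketch -- the class name and the shadow-root path are just placeholders) would
route every ref onto the separate disk:

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.structs.exceptions.VersioningException;
import org.apache.oodt.cas.filemgr.versioning.Versioner;
import org.apache.oodt.cas.metadata.Metadata;

/**
 * Dumps ShadowProduct refs onto a separate disk area, outside the
 * std product archive that the nominal versioner manages.
 */
public class ShadowProductVersioner implements Versioner {

  // Placeholder for the separate storage disk.
  private static final String SHADOW_ROOT = "file:///shadow-archive";

  public void createDataStoreReferences(Product product, Metadata metadata)
      throws VersioningException {
    for (Object o : product.getProductReferences()) {
      Reference ref = (Reference) o; // cast keeps older, non-generified APIs happy
      String orig = ref.getOrigReference();
      String fileName = orig.substring(orig.lastIndexOf('/') + 1);
      ref.setDataStoreReference(
          SHADOW_ROOT + "/" + product.getProductName() + "/" + fileName);
    }
  }
}

Point the ShadowProduct's product type at a versioner like that in your product-types.xml
policy, and the 200 inputs land on the separate disk while your nominal products keep going
through the std versioner.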

> 
> Of course at this point it seems to me we are basically buying into
> duplicating most of the basic filemgr and workflow capabilities, without
> using either.

Yeah -- it's been a careful tradeoff. I fought long and hard to keep the crawler from evolving
into its own WM. As part of that, I think its simple phase model was the right tradeoff. You
can do some phase-based actions and customize behavior, but it's not full-out control or data
flow, which is a win in my mind.

> 
> ps. A separate concept that we kicked around with Brian at one time was to
> have the PCS track not single files but directories (aggregations) of files
> that could be continually ingested into (along with appropriate metadata
> updates), each time another matching file arrived.  But we never fleshed out
> the details of how this would be implemented.

It's pretty much there with the hierarchical product concept; the only catch is that all the
refs slow things down sometimes. But I have some ideas (e.g., lazy loading of refs on demand)
that may help there.
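
To make that concrete, ingesting a directory aggregate as a hierarchical product looks roughly
like the sketch below (untested; the product/type names and URLs are made up, and the client
calls are from memory, so verify against your filemgr version):

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
import org.apache.oodt.cas.metadata.Metadata;

public class IngestAggregateDir {
  public static void main(String[] args) throws Exception {
    XmlRpcFileManagerClient fm =
        new XmlRpcFileManagerClient(new URL("http://localhost:9000"));

    Product agg = new Product();
    agg.setProductName("Aggregate-20110505T0200");                // hypothetical name
    agg.setProductStructure(Product.STRUCTURE_HIERARCHICAL);
    agg.setProductType(fm.getProductTypeByName("AggregateDir"));  // hypothetical type

    // A single ref to the whole directory; the filemgr expands it into
    // per-file refs on ingest, which is where the slowdown comes from.
    List<Reference> refs = new ArrayList<Reference>();
    refs.add(new Reference("file:///data/staging/20110505T0200/", null, 0L));
    agg.setProductReferences(refs);

    Metadata met = new Metadata();
    met.addMetadata("StartDateTime", "2011-05-05T02:00:00Z");

    String id = fm.ingestProduct(agg, met, true); // true = client-side transfer
    System.out.println("Ingested product: " + id);
  }
}

The per-file ref expansion there is the part that lazy loading would help with.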

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

