oodt-dev mailing list archives

From Thomas Bennett <lmzxq....@gmail.com>
Subject Re: Data transfer questions
Date Mon, 19 Mar 2012 19:18:12 GMT
Hey Chris,

Thanks for your reply, much appreciated. You've cleared up a few issues
in my understanding.

I've gone through your reply and just added a few notes for completeness.

*Crawler data transfer, i.e. not using the File Manager as a client*

> there are 2 ways to configure data transfer. If you are using a Crawler,
> the crawler is going to
> handle client side transfer to the FM server. You can configure Local,
> Remote, or InPlace transfer at the moment,
> or roll your own client side transfer and then pass it via the crawler
> command line or config.
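
If I follow, passing it on the crawler command line would look roughly like
this (flag names are from my reading of the crawler_launcher help; the URL,
paths and extractor config are placeholders):

./crawler_launcher --operation --launchMetCrawler
--filemgrUrl http://localhost:9000
--productPath /data/staging
--metExtractor org.apache.oodt.cas.metadata.extractors.MetReaderExtractor
--metExtractorConfig /tmp/metextractor.conf
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory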

1) Local data transfer

> Local means that the
> source and dest file paths need to be visible from the crawler's machine
> (or at least "appear" that way. A common pattern
> here is to use a Distributed File System like HDFS or GlusterFS to
> virtualize local disk, and mount it at a global virtual
> root. That way even though the data itself is distributed, to the Crawler
> and thus to LocalDataTransfer, it looks like
> it's on the same path).

2) Remote data transfer

> Remote means that the dest path can live on a different host, and that the
> client will work
> with the file manager server to chunk and transfer (via XML-RPC) that data
> from the client to the server.
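
Presumably the chunk size is tunable in filemgr.properties as well; the
property below is my assumption from a quick look at the shipped config:

org.apache.oodt.cas.filemgr.datatransfer.remote.chunkSize=1024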

3) In place data transfer

> InPlace means
> that no data transfer will occur at all.
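
So the three stock factories to choose between, if I have the class names
right, would be:

org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory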

(Great explanations - thanks!)

*Versioner schemes*

> The Data Transferers have an acute coupling with the Versioner scheme,
> case in point: if you are doing InPlaceTransfer,
> you need a versioner that will handle file paths that don't change from
> src to dest.

The Versioner is used to describe how a target directory is created for a
file to be archived, i.e. the directory structure where the data will be
placed. So if I have an archive root at /var/kat/archive/data/ and I use a
basic versioner, it will archive a file called 1234567890.h5 at
/var/kat/archive/data/1234567890.h5/1234567890.h5. This describes the
destination for a local data transfer.

I have the following versioner set in my policy/product-types.xml.

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
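
For comparison, I see there are other stock versioners that could be dropped
in here, e.g. the date/time based one (class name as I found it in the
filemgr source):

<versioner class="org.apache.oodt.cas.filemgr.versioning.DateTimeVersioner"/>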

Just out of curiosity... why is this called a versioner?

*Using the File Manager as the client*

> Configuring a data transfer in filemgr.properties, and then not using the
> crawler directly, but e.g., using the XmlRpcFileManagerClient directly,
> you can tell the server (on the ingest(...) method) to handle all the file
> transfers for you. In that case, the server needs a
> Data Transferer configured, and the above properties apply, with the
> caveat that the FM server is now the "client" that is transferring
> the data to itself :)

If I set the following property in the etc/filemgr.properties file,
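for example pointing it at the local transfer factory (the stock default, if
I read the shipped config correctly):

filemgr.datatransfer.factory = org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory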


I gave this a quick try today, running an ingest on my localhost (to avoid
any sticky network issues), and I was able to perform an ingest.

I see you can specify the data transfer factory to use, so I assume then
that the filemgr.datatransfer.factory setting is just the default if none
is specified on the command line. Is this true?

I ran my own version of the command line client (filemgr-client with absolute
paths to the configuration files):

cas-filemgr-client.sh --url
--ingestProduct --refs /Users/thomas/1331871808.h5
--productStructure Flat --productTypeName KatFile
--metadataFile /Users/thomas/1331871808.h5.met --productName 1331871808.h5
--clientTransfer --dataTransfer

With the data transfer factory type also spec'ed as:
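
org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory

(assuming here the same local factory as in filemgr.properties above)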


And the versioner set as:

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

And it ingested the file. +1 for OODT!

*Local and remote transfers to the same filemgr*

> One way to do this is to write a Facade java class, e.g., MultiTransferer,
> that can e.g., on a per-product type basis,
> decide whether to call and delegate to LocalDataTransfer or
> RemoteDataTransfer. If written in a configurable way, that would be
> an awesome addition to the OODT code base. We could call it
> ProductTypeDelegatingDataTransfer.
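
To check that I follow, here is a minimal sketch of what I think you mean,
assuming the DataTransfer interface only requires setFileManagerUrl and
transferProduct (the per-type map wiring and class name are made up):

import java.io.IOException;
import java.net.URL;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

  // product type name -> concrete transferer (e.g. Local or Remote)
  private final Map<String, DataTransfer> transferersByType;
  // used when a product type has no dedicated transferer
  private final DataTransfer defaultTransferer;

  public ProductTypeDelegatingDataTransfer(
      Map<String, DataTransfer> transferersByType, DataTransfer defaultTransferer) {
    this.transferersByType = transferersByType;
    this.defaultTransferer = defaultTransferer;
  }

  public void setFileManagerUrl(URL url) {
    // every delegate needs to know where the FM server lives
    for (DataTransfer dt : transferersByType.values()) {
      dt.setFileManagerUrl(url);
    }
    defaultTransferer.setFileManagerUrl(url);
  }

  public void transferProduct(Product product)
      throws DataTransferException, IOException {
    // pick the transferer registered for this product's type, else fall back
    DataTransfer delegate =
        transferersByType.get(product.getProductType().getName());
    (delegate != null ? delegate : defaultTransferer).transferProduct(product);
  }
}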

I'm thinking I would prefer to have some crawlers specify how files should be
transferred. Is there any particular reason why this would not be a good
idea, as long as the client specifies the transfer method to use?

*Getting the product to a second archive*

> One way to do it is to simply stand up a file manager at the remote site
> and catalog, and then do remote data transfer (and met transfer) to take
> care of that.
> Then as long as your XML-RPC ports are open both the data and metadata can
> be backed up by simply doing the same ingestion mechanisms. You could
> wire that up as a Workflow task to run periodically, or as part of your
> std ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs
> up to the remote
> site by ingesting into the remote backup file manager).

Okay. Got it! I'll see if I can wire up both options!
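
For the first option, I'm picturing roughly the same client invocation as
before, just pointed at the backup File Manager and using remote transfer
(the host and port below are placeholders):

cas-filemgr-client.sh --url http://backup-host:9000
--ingestProduct --refs /Users/thomas/1331871808.h5
--productStructure Flat --productTypeName KatFile
--metadataFile /Users/thomas/1331871808.h5.met --productName 1331871808.h5
--clientTransfer --dataTransfer
org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory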

> I'd be happy to help you down either path.

Thanks! Much appreciated.

> I was thinking, perhaps using the functionality described in OODT-84
> (Ability for File Manager to stage an ingested Product to one of its
> clients) and then have a second crawler on the backup archive which will
> then update its own catalogue.

> +1, that would work too!

Once again, thanks for the input and advice - always informative ;)

