airavata-dev mailing list archives

From Supun Nakandala <supun.nakand...@gmail.com>
Subject [GSoC] Integrating DataCat System with Apache Airavata & production GridChem
Date Thu, 26 Mar 2015 09:02:36 GMT
Hi All,

I have submitted a proposal for the Google Summer of Code program to integrate
the DataCat system with Apache Airavata and production GridChem. My proposal
can be found at [1], and I have also attached it to the Airavata wiki [2].

The high-level architecture for this integration is shown in the following
diagram.


[Inline image: high-level architecture diagram]

The flow of execution will be as follows.

   1. A scientist uses a web-based reference gateway to submit a job to a
   computational resource through Airavata.
   2. Airavata executes the application on the remote resource.
   3. After successful completion of the application execution, Airavata
   will call the DataCat handler (a new component being added).
   4. The DataCat handler will then copy the generated data products from
   the remote locations to a data archive for long-term preservation. This
   is important because in the current version of Airavata data products
   are generated in the /tmp folder and are not persistent.
   5. After copying the data, the DataCat handler will publish a message
   to a RabbitMQ message broker about the generation of the data product,
   along with related provenance information such as the application name,
   experiment name, inputs, etc. (a rough sketch of this publish step is
   shown after this list).
   6. The DataCat agent will be subscribed to the message broker and will
   receive this message. The agent will then access the data product and
   index it in the DataCat server.
   7. The web-based reference gateway will incorporate search features
   that use the DataCat service methods behind the scenes.
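
To make step 5 a bit more concrete, the publish step in the DataCat handler
could look roughly like the following sketch using the RabbitMQ Java client.
The queue name, broker host, and message fields here are only placeholders I
made up for illustration, not actual Airavata or DataCat code.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class DataCatHandlerPublishSketch {

    // Hypothetical queue name; the actual queue/exchange layout is still to be decided.
    private static final String QUEUE_NAME = "datacat.data-products";

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare(QUEUE_NAME, true, false, false, null);

        // Provenance message describing the archived data product (fields are illustrative).
        String message = "{"
                + "\"experimentId\":\"EXP_123\","
                + "\"applicationName\":\"Gaussian\","
                + "\"experimentName\":\"test-run\","
                + "\"dataProductUri\":\"file:///archive/EXP_123/output.log\""
                + "}";

        channel.basicPublish("", QUEUE_NAME, null, message.getBytes("UTF-8"));

        channel.close();
        connection.close();
    }
}

The DataCat agent (step 6) would simply do the corresponding basicConsume on
the same queue.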

In the proposed solution the coupling between the two systems is minimized,
since all communication happens via a message queue. If required, Airavata can
be run independently without running the DataCat system.

However, I have the following concern with respect to the above architecture.
From Airavata's point of view, the experiment ID is used to uniquely identify a
single experiment execution, and all other data in the registry relating to an
experiment is indexed under that experiment ID. In the DataCat system, after
indexing the metadata for a particular data product, DataCat generates a
document ID for the metadata document. Somehow we need to map this document ID
to the experiment ID in the Airavata registry.

One way to do this is to run a message queue listener on the Airavata side
which gets notified of (exp_id, metadata_doc_id) pairs and updates the registry
to include the corresponding metadata document ID. At the DataCat end, after
successfully indexing a metadata document, DataCat will publish the (exp_id,
metadata_doc_id) pair to a message queue. A rough sketch of such a listener is
given below.
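
Just a sketch, assuming a simple "expId,metadataDocId" payload and leaving the
actual registry update as a placeholder (the real Airavata registry API would
be used there):

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;

public class MetadataDocIdListenerSketch {

    // Hypothetical queue that DataCat publishes to after indexing completes.
    private static final String QUEUE_NAME = "datacat.metadata-doc-ids";

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare(QUEUE_NAME, true, false, false, null);

        channel.basicConsume(QUEUE_NAME, true, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body)
                    throws IOException {
                // Assumed payload format: "expId,metadataDocId"
                String[] pair = new String(body, "UTF-8").split(",");
                String expId = pair[0];
                String metadataDocId = pair[1];

                // Placeholder for the real registry update, i.e. storing the
                // metadata document ID against the experiment in the Airavata registry.
                System.out.println("Mapping experiment " + expId
                        + " -> metadata doc " + metadataDocId);
            }
        });
    }
}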

WDYT about this approach?

-Supun

[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/scnakandala/5751725713522688
[2] https://cwiki.apache.org/confluence/display/AIRAVATA/Integrating+DataCat+System+to+Apache+Airavata
