airavata-dev mailing list archives

From "Suresh Marru (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRAVATA-1646) [GSoC] Brainstorm Airavata Data Management Needs
Date Wed, 25 Mar 2015 21:37:54 GMT

    [ https://issues.apache.org/jira/browse/AIRAVATA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380830#comment-14380830 ]

Suresh Marru commented on AIRAVATA-1646:
----------------------------------------

Hi Doug, please see the responses embedded below:

Do we have access to the Apache Thrift data model currently in use by Airavata? If so, can we modify this model?
-- I consider this project exploratory, so yes, we could branch master and have you modify the Thrift data models. You can look at them here - https://github.com/apache/airavata/tree/master/airavata-api/thrift-interface-descriptions
What other object store technologies are you interested in (Cassandra and MongoDB)?
--It would be premature to state a preference. The key thing here is to understand the problem well enough to make a recommendation: whether relational databases are a good fit, or whether key-value, column, document, or graph databases can better address Airavata's metadata needs.
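To make the relational-versus-document trade-off concrete, here is a small Python sketch (all field names invented for illustration, not Airavata's actual schema) showing the same experiment metadata held as one nested document, as a document store would keep it, versus "shredded" into the flat rows a relational schema would require:

```python
# Hypothetical experiment metadata as a single nested document.
# The field names are illustrative only, not Airavata's real model.
experiment_doc = {
    "experiment_id": "exp-001",
    "user": "doug",
    "application": "radar-assimilation",
    "tasks": [
        {"task_id": "t1", "type": "simulation", "state": "COMPLETED"},
        {"task_id": "t2", "type": "analysis", "state": "EXECUTING"},
    ],
}

def shred(doc):
    """Flatten the nested document into two relational-style tables."""
    # Scalar fields become the parent "experiment" row.
    experiment_row = {k: v for k, v in doc.items() if k != "tasks"}
    # Each nested task becomes its own row, keyed back to the experiment.
    task_rows = [
        {"experiment_id": doc["experiment_id"], **task}
        for task in doc["tasks"]
    ]
    return experiment_row, task_rows

exp_row, task_rows = shred(experiment_doc)
print(exp_row)
print(task_rows)
```

A document store keeps the nested form and reads it back in one fetch; the relational form needs a join to reassemble the experiment, but makes per-task queries straightforward. Understanding which access pattern dominates is exactly the requirements question above.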
How will the metadata be used? The usage patterns can affect which technologies, and which features of a specific technology, we should enable.
--This is a very open-ended question. I hope you can propose a project keeping in mind that you will need to explore this answer through interactions with the Airavata community.
What are some examples of the metadata being stored? Is the data structured or unstructured?
--Currently all the metadata is very structured. A good example is the experiment model: a user requests an experiment, which gets executed on remote resources, transforming data in the process. The metadata captured also includes the states of simulation or data analysis tasks. Once you run sample experiments, this will become clearer.
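The kind of per-task state tracking described above can be sketched minimally in Python; the state names and fields here are invented for illustration, not Airavata's actual lifecycle:

```python
# Minimal sketch of structured task-state metadata; names are hypothetical.
from dataclasses import dataclass, field

STATES = ["CREATED", "LAUNCHED", "EXECUTING", "COMPLETED", "FAILED"]

@dataclass
class TaskStatus:
    task_id: str
    state: str = "CREATED"
    history: list = field(default_factory=list)  # prior states, in order

    def transition(self, new_state):
        """Record the old state and move to the new one."""
        if new_state not in STATES:
            raise ValueError(f"unknown state: {new_state}")
        self.history.append(self.state)
        self.state = new_state

status = TaskStatus("t1")
status.transition("LAUNCHED")
status.transition("EXECUTING")
status.transition("COMPLETED")
print(status.state, status.history)
```

The point is that every record has a fixed shape and a known vocabulary of values, which is what makes the current metadata "very structured" and amenable to either a relational schema or a typed document model.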
What kind of provenance data is being stored?
--Currently very minimal to none. Basic information such as user-provided metadata, the resources used to compute, and job dimensions. A big missing piece is to collate the provenance of input data and to augment the provenance of generated data with application details and simulation/analysis configurations.
What kind of queries would you expect to be run on the provenance data?
--This will be very specific to the data domain. An example could be: query for all radar assimilation data with a quality score of 5. We can find more concrete pointers.
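As a sketch of what that example query might look like over document-style provenance records (record fields and values are hypothetical, chosen only to mirror the radar-assimilation example above):

```python
# Hypothetical provenance records; the fields are illustrative only.
provenance = [
    {"dataset": "radar-2015-03-01", "kind": "radar-assimilation", "quality_score": 5},
    {"dataset": "radar-2015-03-02", "kind": "radar-assimilation", "quality_score": 3},
    {"dataset": "seismic-2015-02-28", "kind": "seismic", "quality_score": 5},
]

def query(records, kind, min_score):
    """Return records of the given kind whose quality score meets min_score."""
    return [
        r for r in records
        if r["kind"] == kind and r["quality_score"] >= min_score
    ]

hits = query(provenance, "radar-assimilation", 5)
print([r["dataset"] for r in hits])
```

In a real catalog the filter would be pushed into the store (SQL `WHERE`, a document-store query, or a graph traversal), and which of those expresses domain queries most naturally is part of the evaluation this project would do.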
Do we need to look into Apache Storm for querying streaming data?
-- Not right away, but I could foresee some usage. For instance, if we have to run metadata extraction over all the archived data, I could see Storm helping to run such a topology. We could also employ a Storm cluster to shred deep data from all input requests. Again, we need to adapt to the use cases a bit here.
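Storm itself is a Java framework, so as a language-neutral sketch of the shape of such a topology (a spout emitting archived items, a bolt extracting metadata from each), here is a tiny pure-Python analogue. No Storm APIs are used, and the file names and payload format are invented:

```python
# Pure-Python analogue of a spout -> bolt metadata-extraction topology.
# This only illustrates the dataflow shape; it does not use Storm.

def archive_spout():
    """Stand-in for a spout: emits archived data items one at a time."""
    archived = [
        ("run-001.out", "model=WRF steps=500"),
        ("run-002.out", "model=AMBER steps=1200"),
    ]
    yield from archived

def extract_bolt(name, payload):
    """Stand-in for a bolt: pulls simple key=value metadata from one item."""
    fields = dict(pair.split("=") for pair in payload.split())
    return {"file": name, **fields}

# In Storm the bolt instances would run in parallel across a cluster;
# here we simply map the bolt over the spout's stream.
metadata = [extract_bolt(name, payload) for name, payload in archive_spout()]
print(metadata)
```

The value Storm would add is running many such bolt instances concurrently over a large archive, with the framework handling partitioning and fault tolerance.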
Will we receive accounts on NSF XSEDE clusters for this project?
--Yes, we could get you access to various clusters, including XSEDE, if absolutely needed by the project.



> [GSoC] Brainstorm Airavata Data Management Needs
> ------------------------------------------------
>
>                 Key: AIRAVATA-1646
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-1646
>             Project: Airavata
>          Issue Type: Brainstorming
>            Reporter: Suresh Marru
>              Labels: gsoc, gsoc2015, mentor
>
> Currently Airavata focuses on Execution Management, and the Registry Sub-System (with app, resource, and experiment catalogs) captures metadata about applications and executions. There were a few efforts (primarily from student projects) to explore this void. It would be good to concretely propose data management solutions for input data registration, input and generated data retrieval, data transfers, and replication management.
> Metadata Catalog: Current metadata management is based on shredding Thrift data models into a MySQL/Derby schema, as described in [1]. We have discussed extensively using object store databases, with the conclusion that the requirements should be understood more systematically. A good standalone task would be to understand current metadata management and propose alternative solutions with proof-of-concept implementations. Once the community is convinced, we can then plan on implementing them in production.
> Provenance: Airavata could be enhanced to capture provenance to organize data for reuse, discovery, comparison, and sharing. This is a well-explored field, and there may be compelling third-party solutions. In particular, it would be good to explore the big data space and identify leverages (either concepts or, even better, implementations).
> Auditing and Traceability: As Airavata mediates executions on behalf of gateways, it has to strike a balance between abstracting the compute resource interactions and providing a transparent execution trace. This will bloat the amount of data to be catalogued. A good effort would be to understand the current extent of Airavata audits and provide suggestions.

> BigData Leverage: Airavata needs to leverage the influx of tools in this space. Any suggestions on relevant tools that will enhance the Airavata experience would be a good fit.
> [1] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Data+Models+0.12
> [2] - http://markmail.org/thread/4lguliiktjohjmsd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
