incubator-general mailing list archives

From Stian Soiland-Reyes <soiland-re...@cs.manchester.ac.uk>
Subject [Proposal] Taverna workflow
Date Tue, 23 Sep 2014 12:43:21 GMT
I hereby present the Apache Incubator proposal for the project Taverna.


Also available in rich text in the Taverna wiki (with more hyperlinks!):

http://dev.mygrid.org.uk/wiki/display/developer/Taverna+incubator+proposal

(Could someone grant me access to edit the Incubator wiki pages? My
wiki username is soilandreyes)




# Abstract

Taverna is an open source and domain-independent suite of tools used
to design and execute data-driven workflows.


# Proposal

The Taverna suite includes:

* Taverna Workbench, a Java-based desktop application for graphically
composing, editing and executing workflows of distributed web services
and local tools
* Taverna Command Line Tool, which allows repeated execution of
parameterized workflow definitions
* Taverna Server, which provides a REST and SOAP API for executing
workflows
* Taverna Player, a Ruby-based web interface to the Server, which
provides a high-level view of workflow executions and their results
and allows further integration with Ruby on Rails applications.

Taverna can browse and combine different service types, allowing
workflows to integrate steps of arbitrary REST and SOAP web services
with command line tools (local and SSH), scripts (Beanshell, R,
Jython) and finally visualize the results.

The goal of the Taverna suite is to help researchers to access
distributed datasets and processing capabilities by the construction
of pipelines, and also to simplify the execution of these pipelines
in various environments.

The Taverna suite of products is already successful and in wide use
across different domains. The software is currently licensed as LGPL
2.1, with copyright owned by University of Manchester. External
contributors have all signed Apache-like CLAs.


# Background

Taverna workflows coordinate inputs and outputs between computational
processes and Web Services. The workflow is designed in a graphical
interface which shows the workflow as a series of boxes and arrows,
representing processes and their data connections. The different
processes in a workflow can be command line tools or REST and WSDL Web
Services, which are combined to perform steps such as data acquisition,
filtering, cleaning, integration, analysis and visualization. Taverna
calls these processes "services", as they are generally provided by
remote (third-party) servers.

This kind of computational workflow, also known as a pipeline or
dataflow, focuses on the movement of data rather than the execution
order of the underlying processes. Features such as implicit
iterations (where an input list of values causes multiple process
executions) and parallel invocations (independent processes are
executed as soon as their data is available) are intrinsic to a
dataflow system, not requiring any particular constructs by the
workflow designer.
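
The two features above can be illustrated with a minimal sketch. This
is not Taverna code: the function names `invoke` and `run_independent`
and the two toy services are hypothetical, and a thread pool stands in
for whatever scheduling a real dataflow engine performs.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke(service, value):
    """Call `service` on one value, or once per element if given a list."""
    if isinstance(value, list):
        # Implicit iteration: the list is unpacked by the engine, not by
        # the workflow designer. Element calls are independent, so they
        # are submitted to a thread pool; map() keeps the input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda v: invoke(service, v), value))
    return service(value)


def run_independent(services, value):
    """Run services that share an input but not each other's outputs."""
    with ThreadPoolExecutor() as pool:
        # Parallel invocation: every service starts once `value` exists.
        futures = {name: pool.submit(invoke, fn, value)
                   for name, fn in services.items()}
        return {name: f.result() for name, f in futures.items()}


def to_upper(sequence):  # hypothetical single-value service
    return sequence.upper()


def length(sequence):  # hypothetical single-value service
    return len(sequence)


results = run_independent({"upper": to_upper, "length": length},
                          ["acgt", "ttga"])
```

Neither toy service contains a loop or any threading code; the
iteration over the list and the concurrent start of both services come
entirely from the surrounding "engine" functions.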

As a visual programming environment, Taverna aids collaboration and
the reuse of workflows. At the highest level, a workflow represents the
conceptual level of an analysis, allowing understanding, discussion
and communication of the overall analysis protocol. More detail can be
revealed and modified for individual steps. At the individual process
level, the workflow defines execution specifics such as operations,
parameters and command line tools.

Sharing of the workflow definitions allows re-use and re-purposing of
the computational analysis. During workflow execution, provenance can
be collected from every step, allowing deep inspection of intermediate
values for the purpose of debugging and validation.


# Rationale

There is a strong need to lower the barrier of entry to datasets and
computational resources widely available on the Internet, to increase
their use by researchers who understand the computational steps needed
to produce their results, but who are not necessarily expert
programmers. Taverna has already shown its success and popularity in a
wide range of scientific disciplines.


# Initial Goals

* Transition mailing lists to Apache (keep existing subscribers, but
invite more)
* Taverna developer workshop (2014-10-30)
* Prepare git repositories for move:
  * Update headers/metadata to indicate Apache License 2.0
  * Restructure git repositories
  * Rename Maven groupIds to org.apache.taverna.*
  * Rename packages to org.apache.taverna.*

* Move Github repositories to Apache git
* Automated builds in Apache's Jenkins
* Update to latest releases of Apache dependencies
* Propose updated release & testing procedure under Apache
* Move website and documentation

We intend to release only the current development version, Taverna 3.x
(http://www.taverna.org.uk/developers/work-in-progress/taverna-3/),
under the Apache umbrella. 3.0 is not yet officially released; however,
the Taverna 3.0 Command Line can be released almost "as-is" after
migration. The Taverna 3.0 Server is at beta quality, while the
Taverna 3.0 Workbench is at alpha stage and would need to be
stabilized to an initial beta release.

* Before first release: Maven Central releases of Taverna support
libraries (e.g. taverna-scufl2 and taverna-databundle)
* First release: Apache Taverna Command Line 3.0 (OSGi-based)
* Release: Apache Taverna Server 3.0
* Release: Apache Taverna Workbench 3.0 beta
* Provenance exchange with relevant Apache products (e.g. Apache
CXF->Taverna->CouchDB)
* Release: Apache Taverna Workbench 3.0

It is not yet decided if the current Workbench Editions
http://www.taverna.org.uk/download/workbench/2-5/ will be carried over
to Taverna 3, or if this can be solved by having an "Install extra
plugin" step on first start-up of Apache Taverna. In any case, we
imagine that some of these specializing editions will be maintained
outside (but in collaboration with) the Apache project. This is
particularly the case for the Astronomy edition as it depends on
several LGPL/GPL libraries and is maintained by the AstroTaverna team.


# Current Status

## Meritocracy

Taverna was initially created by the myGrid consortium in 2003. Since
2006, the majority of contributions to Taverna's core code-base, its
architecture and direction have been led by staff at The University of
Manchester and The European Bioinformatics Institute (EMBL-EBI).

The project has benefited from a high degree of extension and
integration by other developers, but mainly in the form of plugins
(http://www.taverna.org.uk/documentation/taverna-2-x/taverna-2-x-plugins/)
and integrations
(http://www.taverna.org.uk/developers/work-in-progress/taverna-online/
http://www.taverna.org.uk/download/associated-tools/).

Taverna's developer community has unfortunately not had a culture of
submitting patches that would warrant later commit access - perhaps
due to its background in the science community. However, contributors
have been added as committers when their plugin becomes a part of the
core distribution (e.g. External Tool plugin by Möller and Krabbenhöft
and AstroTaverna by Garrido), or when their development has required
patches to the existing code base.


## Community

Taverna has an active community of plug-in developers and users. The
developer mailing list (taverna-hackers@lists.sourceforge.net) has 248
members, the user mailing list (taverna-users@lists.sourceforge.net)
has 370 members.

1500 users have registered as of 19 August 2014. Total downloads of
all products since version 2.1 (released December 2009) is 35000.

A Taverna Developer workshop is being arranged for 30 October 2014 to
bring together developers and integrators of Taverna. We want to
encourage plug-in developers to participate further also in the core
development of Taverna, by introducing them to the code base and how
to contribute. http://dev.mygrid.org.uk/wiki/display/developer/Taverna+Open+Development+Workshop

Active steps are being taken to grow the communities of users and
developers by targeting specific research domains. Kevin Benson is
working on Taverna's use in the Heliophysics and Astrophysics
community. Susheel Varma is increasing usage of Taverna within the
Biomedical domain. Julián Garrido, through his work on AstroTaverna,
is promoting Taverna within the IVOA Virtual Astronomy community.
Sonja Holl and Björn Hagemeier are targeting high performance
computing.


## Core Developers

What we currently consider to be the core Taverna Team is (in
alphabetical order):

Christian Brenninkmeijer (University of Manchester)
Donal Fellows (University of Manchester)
Robert Haines (University of Manchester)
Aleksandra Nenadic (University of Manchester)
Dmitry Repchevsky (Barcelona Supercomputing Center)
Stian Soiland-Reyes (University of Manchester)
Shoaib Sufi  (University of Manchester)
Vadim Surpin (Institute for Information Transmission Problems in Moscow)
Alan Williams (University of Manchester)

The team consists of experienced developers who have worked on a
multitude of projects, particularly in writing software for supporting
scientists. The committers list (see below) additionally includes
plugin developers whose contributions have become part of Taverna.
Part of our desire to join the Apache Foundation is to recognise their
effort and promote them into also being "core developers".


## Alignment

Taverna dependencies include Apache Commons, Axis, Abdera, Batik, CXF,
Derby, Felix, HttpComponents, Jena, log4j, Maven, POI, Velocity,
Xerces, XMLBeans and Xalan. We use Tomcat for testing and deployment
of the Taverna Server.

As part of moving to Apache-compatible dependencies, Taverna will
probably adopt OpenJPA to replace (LGPL) Hibernate.



# Known Risks

## Orphaned products

Most of the core developers are from the myGrid team at University of
Manchester, but are funded through a series of projects - see
http://www.mygrid.org.uk/projects/. Many of these projects incorporate
Taverna, so the effort from Manchester is partially based on direct
project requirements, but also partially a volunteer effort for
project maintenance and general development. The myGrid team has
guaranteed funding until 2017.

The developers that are outside Manchester are generally funded for
other activities, and so their effort to Taverna is to a greater
extent a volunteer effort - although again project-specific
requirements steer their effort (e.g. for a new Taverna plugin).

One of the reasons for our desire to move to the Apache Foundation is
to formalise this volunteering/contribution effort so that it becomes
obvious that it is not just University of Manchester that is
contributing to the core code base - and therefore reducing the
impression that Taverna is vulnerable to Manchester’s future funding
and projects.


## Inexperience with Open Source

Taverna has been an open-source project since its first release in
2003. Most of the contributors also have experience with working with
and contributing to other open source projects (e.g. TCL, CXF, Jena),
particularly as Taverna strongly relies on other open source tools.
Most of the research projects in which the myGrid members have
participated produce open-source software.


## Homogeneous Developers

The committers list includes many people from myGrid, University of
Manchester in the United Kingdom - but these developers have been working
on a range of distributed and European projects in the field of
scientific software - see http://www.mygrid.org.uk/projects/

The other developers on the committers list come from many different
projects and institutions across the world, from Russia, Canada,
Germany and Spain.


## Reliance on Salaried Developers

Development for Taverna is mainly performed as part of the developers'
salaried work, but funded through many different projects at several
institutions (see above). These projects don't generally have
"contribute to Taverna" as their main goals - so therefore in many
ways the effort is still volunteer-based - contributing to Taverna as
a way to support one's own work.

From our experience of running Taverna over the last 10 years, new
contributors will continue to join as Taverna becomes an ingredient in
new projects, while existing contributors more slowly fade out of
their involvement. Often existing contributors and users provide the
personal link to the new contributors.


## Relationships with Other Apache Products

Apache already contains projects that seem relevant to Taverna.

Apache Pig https://pig.apache.org/ is a high-level language for
creating Map-Reduce programs for Apache Hadoop. There already exist
third-party efforts to convert Taverna Workflows to Hadoop and Pig -
https://github.com/umaqsud/taverna-to-pig
https://github.com/schenck/taverna-to-hadoop (thus making a graphical
interface for building Apache Pig workflows) - and part of the Apache
Taverna effort would be to invite these to join the project.

Apache Airavata http://airavata.apache.org/ is a software framework
for executing and managing computational jobs and workflows on
distributed computing resources. Taverna's concern is less job
coordination and more the data flow between services. Airavata's
XBaya Workflow Suite can export workflows in the Taverna 1 format,
SCUFL, but could be updated to work with Taverna 3's SCUFL2 format.

Apache ODE https://ode.apache.org/ is a WS-BPEL workflow engine. BPEL
as a workflow language is quite verbose compared to dataflow languages
like Taverna, and is additionally bound to a particular protocol
(SOAP). Nevertheless, a subset of Taverna workflows could in
theory run on the Apache ODE engine - and the Taverna 3 Platform API
has facilities for plugging in alternative workflow engines. We have
previously considered Apache Hadoop as one such alternate engine for
executing a different subset of workflows with local command line
tools.

Apache Storm http://storm.incubator.apache.org/ is a distributed
realtime computation framework. Experiments are under development to
use Taverna as a front-end for creating Apache Storm workflows -
http://markmail.org/message/zg5ylo2aucpwfc5j

Apache has several popular frameworks for building REST/SOAP web
services (Apache CXF, Apache Clerezza), data services (Apache Jena,
Apache Hive, Apache CouchDB) and specific workflow engines (Apache
Oozie for Hadoop, Apache ODE for WS-BPEL). Taverna as a general REST
and SOAP service client can be used for combining, testing and
demonstrating such services.


## An Excessive Fascination with the Apache Brand

Taverna is a long-running project (since 2003) with an existing user-
and developer base across the academic world. Our main motivation for
moving to Apache is to further encourage an open development process
and engage existing and new developers to contribute to the core code
base. We also want to ensure long-term continuity of the Taverna
products, and for its future directions to be decided by the whole
Taverna community rather than one of the parties involved.



# Documentation

Taverna's documentation is available from
http://www.taverna.org.uk/documentation/taverna-2-x/, including an
extensive user manual at
http://www.mygrid.org.uk/dev/wiki/display/taverna/User+Manual and
tutorials http://www.taverna.org.uk/documentation/taverna-2-x/tutorials/
and videos http://www.taverna.org.uk/documentation/taverna-2-x/videos/.

The developer documentation
http://dev.mygrid.org.uk/wiki/display/developer/Developers+Guide
includes tutorials
http://dev.mygrid.org.uk/wiki/display/developer/Tutorials for working
with Taverna's source code and creating plugins.


# Initial Source

Taverna's source code is available from the 'taverna' github team
account: https://github.com/taverna/. These 85 git repositories
reflect the current modules of Taverna's plugin system after recently
transitioning from Google Code SVN at
http://taverna.googlecode.com/svn/taverna/. The history of Taverna's
code base goes back to being hosted in CVS at SourceForge
http://taverna.cvs.sourceforge.net/, transitioned as of
http://taverna.googlecode.com/svn/archived/cvs2svn-2008-09-25/. Note
that although reasonable steps have been taken to preserve commit
history when moving between version control systems, this has not
always been achieved when moving between modules and refactoring
larger Java packages. Some source files might therefore in git have
initial commits like "Moved from /taverna/utils/trunk" referring to
SVN paths.

One of the reasons for the many repositories is that we rely on Apache
Maven and a plugin system (since Taverna 3 OSGi-based) where different
modules have different version numbers and release cycles (e.g.
tags/branches). This is essential for the plug-in support of Taverna
as the plug-ins depend on the semantic versioning of the APIs and
required implementations.
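
As a rough illustration of why plug-ins care about semantic versioning,
the sketch below checks whether an installed API version satisfies the
version a plug-in was built against. It assumes plain MAJOR.MINOR.PATCH
version strings and the usual semver rule (same major version, equal or
newer minor/patch), which is roughly what an OSGi import range such as
[1.2.0,2.0.0) expresses. The function names are hypothetical and this
is not how Taverna's plugin system is implemented.

```python
def parse(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(required, provided):
    """True if `provided` can serve a plug-in built against `required`.

    Assumed rule: a different major version signals a breaking API
    change, while a higher minor or patch version only adds features
    or fixes.
    """
    req, prov = parse(required), parse(provided)
    return prov[0] == req[0] and prov >= req


# A plug-in built against API 1.2.0:
checks = {
    "1.3.1": is_compatible("1.2.0", "1.3.1"),  # newer minor release
    "2.0.0": is_compatible("1.2.0", "2.0.0"),  # breaking major release
    "1.1.9": is_compatible("1.2.0", "1.1.9"),  # too old
}
```

Under this rule, modules can be released on independent cycles as long
as each release bumps its version according to what changed.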

It is however in our current plans to merge repositories that have
similar release cycles and greatly reduce the number of repositories.

Taverna source code uses the package names (and children packages):

net.sf.taverna - since Taverna 2
uk.org.taverna  - new from Taverna 3
org.taverna (sic) - Taverna Server

Some contributed code uses package names depending on their
originating projects:

org.purl.wf4ever.provtaverna
org.biomart.martservice

We intend to release only the upcoming Taverna 3.0 version under the
Apache umbrella (not 2.x) - therefore, according to semantic
versioning rules http://semver.org/, the transition period of the
Apache Incubator would be the best (and possibly only) chance to
rename Java packages and Maven groupIDs to org.apache.taverna.*. Under
OSGi, packages and JARs go hand-in-hand (several JARs don't normally
provide the same package), and therefore any package rename would be
done together with the repository restructuring.
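
A minimal sketch of such a rename follows, assuming a straight prefix
swap from the three Taverna package roots listed above to
org.apache.taverna. The actual target layout of sub-packages has not
been decided in this proposal, and the helper name `rename_packages`
and the example package names are hypothetical. Contributed packages
such as org.purl.wf4ever.provtaverna are deliberately left untouched.

```python
import re

# Package roots named in the proposal; the new root is the stated goal.
OLD_PREFIXES = ("net.sf.taverna", "uk.org.taverna", "org.taverna")
NEW_PREFIX = "org.apache.taverna"

# Match an old root only when it starts a dotted name (not preceded by
# a word character or a dot) and ends on a name boundary.
_PATTERN = re.compile(
    r"(?<![\w.])(?:"
    + "|".join(re.escape(prefix) for prefix in OLD_PREFIXES)
    + r")(?!\w)"
)


def rename_packages(source_text):
    """Replace old Taverna package roots in Java source or POM text."""
    return _PATTERN.sub(NEW_PREFIX, source_text)


before = "import net.sf.taverna.example.Widget;"  # hypothetical class
after = rename_packages(before)
```

In practice a rename like this would also have to move source
directories and update OSGi bundle manifests, which is why it is tied
to the repository restructuring.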


# Source and Intellectual Property Submission Plan

Taverna source code from http://github.com/taverna/

(c) University of Manchester.
Signed Apache-like CLAs for all external contributors.
Current license is LGPL 2.1 (and GPL3 for one domain-specific
download). As copyright holder, Manchester can change this to Apache
License 2.0.

taverna.org.uk domain - registrant University of Manchester
http://www.taverna.org.uk/  content (c) University of Manchester
http://dev.mygrid.org.uk/wiki/display/tav250/ Confluence wiki content
(c) University of Manchester
http://dev.mygrid.org.uk/wiki/display/developer Confluence wiki
content (c) University of Manchester

The details of intellectual property submission will be worked out
together with myGrid project manager Shoaib Sufi and the University of
Manchester's Contracts Office.


# External Dependencies

Taverna, as an integrating workflow system, has a fairly large number
of dependencies - the latest 2.5.0 Core Workbench distribution has 517
JARs (although many of those are duplicates in different versions).

We are intending for our first Apache-based release to be Taverna 3,
which has already reduced this dependency list.

We have performed an analysis of our dependencies of Taverna 3 at
http://dev.mygrid.org.uk/wiki/display/developer/Taverna+Dependencies -
but this is not yet a complete list.

A second analysis looks at the license of those dependencies at
http://dev.mygrid.org.uk/wiki/display/developer/Third-party+licenses -
where we have some incompatible (LGPL) dependencies. Most of these are
resolvable as they are part of optional plugins to Taverna (e.g. R
support, BioMart). Replacing the dependency on Hibernate with either
Apache OpenJPA or a "No-SQL" solution will require some developer
effort.


# Cryptography

Taverna uses these cryptography dependencies:

BouncyCastle
OpenJDK builds with the default JCE full encryption policy (bundled in
installer)

Taverna uses these to form an encrypted keystore (storing
username/password and client certificates for third-party services
accessed by the designed workflow) with corresponding user interface,
and additionally binds to Java's SSL support to provide UI and command
line options for security interactions, e.g. accepting new server
certificates, or asking for username/passwords for HTTP Basic
authentication (which can then be stored in the keystore).


# Required Resources

Taverna currently relies on a mixture of infrastructure hosted for
free by third-parties (e.g. Github, SourceForge, GoogleCode,
Launchpad, Bitbucket) and infrastructure hosted by myGrid at
University of Manchester (Jenkins, Jira, Confluence, Wordpress).

## Mailing lists

Existing mailing lists for Taverna are hosted at Sourceforge with
archives at markmail. See http://www.taverna.org.uk/about/

commits@taverna.incubator.apache.org  (replacing
taverna-cvs@lists.sourceforge.net)
private@taverna.incubator.apache.org (replacing support@mygrid.org.uk
- to a lesser degree as we would want to encourage openness)
dev@taverna.incubator.apache.org (replacing
taverna-hackers@lists.sourceforge.net, 240 members)
users@taverna.incubator.apache.org (replacing
taverna-users@lists.sourceforge.net, 370 members)


## Git repositories

The Taverna community would prefer to keep using git and Github, and
we would request experimental writable git repositories
http://www.apache.org/dev/writable-git with mirroring to Github.

The repositories would be named taverna-*, as the current repositories
on the github team: https://github.com/taverna/. This repository
organization is styled equivalent to the git repositories of cordova-*
and couchdb-*.

Exactly how repositories are split/merged is open for discussion - it
is part of our current plan to reduce the number of repositories by
merging common modules with a similar release cycle - this could be
done at an early phase of the incubation period.


## Issue Tracking

JIRA Taverna (TAV)

Existing issues in Taverna 3's current JIRA -
http://dev.mygrid.org.uk/issues/browse/T3 - should be imported - but
its current list of Modules should be further agreed.


## Other Resources

Wiki spaces in Confluence https://cwiki.apache.org/confluence -
importing the most recent Taverna-related spaces and documentation
from http://dev.mygrid.org.uk/wiki/spacedirectory/view.action?startIndex=24
Jenkins - replacing myGrid Jenkins at http://build.mygrid.org.uk/ci/
Maven repository at https://repository.apache.org/ - replacing myGrid
artifactory http://repository.mygrid.org.uk/
File-based web space for Plugin Update Site - replacing
http://updates.taverna.org.uk/ and
http://www.mygrid.org.uk/taverna/updates/
Home pages - to be transitioned from http://www.taverna.org.uk/ (Wordpress)
Binary distribution download hosting, about 8 GB per release,
replacing http://www.taverna.org.uk/download/ (currently downloads are
hosted by http://launchpad.net/ and https://bitbucket.org/)


# Initial Committers

The initial list of committers reflects the current list of active
developers at the Github team: https://github.com/orgs/taverna/people
(Note that not all of these have made their membership public on
Github)


Alan R Williams <alan.r.williams@manchester.ac.uk>
Aleksandra Nenadic <a.nenadic@manchester.ac.uk>
Christian Y. Brenninkmeijer <brenninc@cs.man.ac.uk>
David Withers <david.withers@gmail.com>
Dmitriy Repchevsky <dmitry.repchevski@bsc.es>
Donal K. Fellows <donal.k.fellows@manchester.ac.uk>
Finn Bacall <finn.bacall@manchester.ac.uk>
Hajo Nils Krabbenhöft <hajo@krabbenhoeft.de>
Ian Dunlop <ian.dunlop@manchester.ac.uk>
Ingo Wassink <I.H.C.Wassink@ewi.utwente.nl>
Julián Garrido <jgarrido@iaa.es>
Mark Wilkinson <markw@illuminae.com>
Luke McCarthy <elmccarthy@gmail.com>
Robert Haines <rhaines@manchester.ac.uk>
Shoaib Sufi <shoaib.sufi@manchester.ac.uk>
Steffen Möller <moeller@inb.uni-luebeck.de>
Stian Soiland-Reyes <stian@soiland-reyes.com>   (Apache CLA Signed)
Stuart Owen <sowen@cs.manchester.ac.uk>

In addition to the Core Team (mentioned earlier), this list also
reflects Taverna's existing meritocracy as it includes plugin
developers whose contributions have been merged into the main code
base. We acknowledge that not all of these are likely to continue as
"Core" developers, but would like to encourage that during the
Incubating process.


# Affiliations

The majority of the initial committers are employed by University of
Manchester as part of the myGrid team, including responsibilities for
contributing to and supporting Taverna.
http://www.mygrid.org.uk/about-us/people/core-mygrid-team/.

Dmitriy Repchevsky is employed by the Barcelona Supercomputing Center,
including responsibilities for contributing to Taverna. Steffen Möller
is employed by University of Lübeck. Julián Garrido is employed by
Instituto de Astrofísica de Andalucía.


# Sponsor Champion

Andy Seaborne


# Nominated Mentors

* Andy Seaborne


# Sponsoring Entity

The Incubator.





Your feedback is very much welcome!


-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718


