airavata-dev mailing list archives

From Lahiru Gunathilake <>
Subject Re: Airavata/Hadoop Integration
Date Wed, 27 Feb 2013 01:11:07 GMT
Hi Danushka,

On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <> wrote:

> Hi Devs,
> I am looking into extending Big Data capabilities of Airavata as my M.Sc.
> research work. I have identified certain possibilities and am going to
> start with integrating Apache Hadoop (and Hadoop-like frameworks) with
> Airavata.
> According to what I have understood, the best approach would be to have a
> new GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
> have a new parameter in the ApplicationContext (say TargetApplication) to
> define the target application type and resolve the correct provider in the
> GFac Scheduler based on that. I see that having this capability in the
> Scheduler class is already a TODO. I have been able to make these changes
> locally and invoke a simple Hadoop job using GFac. Thus, I can confirm that
> this approach is viable, barring any implications that I may be missing.
> I think we can store Hadoop job definitions in the Airavata Registry, where
> each definition would essentially include a unique identifier and other
> attributes like the mapper, reducer, sorter, formatters, etc., that can be
> defined using XBaya. Information about these building blocks could be loaded
> from XML metadata files (of a known format) included in jar files. It should
> also be possible to compose Hadoop job "chains" using XBaya. So, what we
> would specify in the application context is the target application type
> (say Hadoop), the job/chain id, the input file location, and the output
> file location. In addition, I am thinking of having job monitoring support
> based on constructs provided by the Hadoop API (which I have already looked
> into) and data querying based on Apache Hive/Pig.
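The registry idea sketched above could look roughly like the following. Note that HadoopJobDefinition, JobRegistry, and all field names here are illustrative stand-ins, not the actual Airavata Registry API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry entry for a Hadoop job definition: a unique id
// plus the building blocks (mapper, reducer, input format) chosen in XBaya.
class HadoopJobDefinition {
    final String id;
    final String mapperClass;
    final String reducerClass;
    final String inputFormatClass;

    HadoopJobDefinition(String id, String mapperClass,
                        String reducerClass, String inputFormatClass) {
        this.id = id;
        this.mapperClass = mapperClass;
        this.reducerClass = reducerClass;
        this.inputFormatClass = inputFormatClass;
    }
}

// A chain is an ordered list of job ids; the output location of one
// stage would become the input location of the next.
class JobRegistry {
    private final Map<String, HadoopJobDefinition> jobs = new LinkedHashMap<>();
    private final Map<String, List<String>> chains = new LinkedHashMap<>();

    void addJob(HadoopJobDefinition def) {
        jobs.put(def.id, def);
    }

    void addChain(String chainId, List<String> jobIds) {
        for (String id : jobIds) {
            if (!jobs.containsKey(id)) {
                throw new IllegalArgumentException("Unknown job id: " + id);
            }
        }
        chains.put(chainId, new ArrayList<>(jobIds));
    }

    // Resolve a chain id (as it would appear in the application context)
    // into the ordered list of job definitions to submit.
    List<HadoopJobDefinition> resolveChain(String chainId) {
        List<HadoopJobDefinition> out = new ArrayList<>();
        for (String id : chains.get(chainId)) {
            out.add(jobs.get(id));
        }
        return out;
    }
}
```

With something like this, the application context only needs to carry the chain id plus input/output locations, as proposed above.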
I think we already have pretty much this functionality, done in a similar way
to what you are describing. I have added the code to trunk; I will provide
some test classes and update the scheduler to return the HadoopProvider.
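As a rough illustration of the provider-resolution idea discussed above: the scheduler keeps a mapping from the TargetApplication value to a provider instance. The GFacProvider, HadoopProvider, and Scheduler names here are simplified stand-ins, not the actual Airavata classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Airavata's GFacProvider interface.
interface GFacProvider {
    void execute(String jobSpec);
}

class HadoopProvider implements GFacProvider {
    public void execute(String jobSpec) {
        // A real provider would configure and submit a MapReduce job here.
        System.out.println("Submitting Hadoop job: " + jobSpec);
    }
}

// Minimal scheduler that resolves a provider from the
// TargetApplication value carried in the application context.
class Scheduler {
    private final Map<String, GFacProvider> providers = new HashMap<>();

    void register(String targetApplication, GFacProvider provider) {
        providers.put(targetApplication, provider);
    }

    GFacProvider resolve(String targetApplication) {
        GFacProvider p = providers.get(targetApplication);
        if (p == null) {
            throw new IllegalArgumentException(
                "No provider registered for: " + targetApplication);
        }
        return p;
    }
}

public class SchedulerSketch {
    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        scheduler.register("Hadoop", new HadoopProvider());
        // TargetApplication read from the ApplicationContext would drive this:
        scheduler.resolve("Hadoop").execute("wordcount");
    }
}
```

Registering providers by name like this keeps the scheduler open to other backends (Sector/Sphere, Hyracks) without changing its resolution logic.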

> Furthermore, apart from Hadoop there are two other similar frameworks that
> look quite promising.
> 1. Sector/Sphere
> Sector/Sphere [1] is an open source software framework for high-performance
> distributed data storage and processing. It is comparable with Apache
> HDFS/Hadoop. Sector is a distributed file system and Sphere is the
> programming framework that supports massive in-storage parallel data
> processing on data stored in Sector. The key motivation is that
> Sector/Sphere is claimed to be about 2 to 4 times faster than Hadoop.
> 2. Hyracks
> Hyracks [2] is another framework for data-intensive computing that is
> roughly in the same space as Apache Hadoop. It has support for composing
> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
I am +1 on enabling these APIs so that other components can be used, but do
you think actual users would have a concern about the underlying library we
use for MapReduce jobs? I am not quite confident about how widely people are
using these, but anyhow it is nice to have support for them.


> I am yet to look into the APIs of these two frameworks, but they should
> ideally work with the same GFac implementation that I have proposed for
> Hadoop.
> I would greatly appreciate your feedback on this approach, as well as the
> pros and cons of using Sector/Sphere or Hyracks if you already have
> experience with them.
> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
> flexible and extensible foundation for data-intensive computing,” in Data
> Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp.
> 1151–1162.
> [3]
> Thanks,
> Danushka

System Analyst Programmer
Indiana University
