Date: Tue, 26 Feb 2013 11:47:38 -0500
Subject: Re: Airavata/Hadoop Integration
From: Lahiru Gunathilake
To: dev@airavata.apache.org

Hi Danushka,

I'm on it right now, will finish in a couple of hours.

Lahiru

On Tue, Feb 26, 2013 at 10:23 AM, Suresh Marru wrote:

> On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake wrote:
>
> > Hi Danushka,
> >
> > I think we already have a provider to handle Hadoop jobs which uses Apache
> > Whirr to set up the Hadoop cluster and submit the job.
>
> I think Lahiru is referring to the GSOC projects -
> https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/
>
> Suresh
>
> > We still haven't ported this code to Airavata; once I do, I will send an
> > email to the list.
> >
> > Regards,
> > Lahiru
> >
> > On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
> > danushka.menikkumbura@gmail.com> wrote:
> >
> >> Hi Devs,
> >>
> >> I am looking into extending the Big Data capabilities of Airavata as my
> >> M.Sc. research work. I have identified certain possibilities and am going
> >> to start with integrating Apache Hadoop (and Hadoop-like frameworks) with
> >> Airavata.
> >>
> >> From what I have understood, the best approach would be to have a new
> >> GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
> >> have a new parameter in the ApplicationContext (say TargetApplication) to
> >> define the target application type and resolve the correct provider in
> >> the GFac Scheduler based on that.
> >> I see that having this capability in the Scheduler class is already a
> >> TODO. I have been able to make these changes locally and invoke a simple
> >> Hadoop job using GFac, so I am confident that this approach is viable,
> >> barring any implications I may be missing.
> >>
> >> I think we can store Hadoop job definitions in the Airavata Registry,
> >> where each definition would essentially include a unique identifier and
> >> other attributes like mapper, reducer, sorter, formatters, etc. that can
> >> be defined using XBaya. Information about these building blocks could be
> >> loaded from XML metadata files (of a known format) included in jar files.
> >> It should also be possible to compose Hadoop job "chains" using XBaya.
> >> So, what we would specify in the application context is the target
> >> application type (say Hadoop), the job/chain id, the input file location,
> >> and the output file location. In addition, I am thinking of adding job
> >> monitoring support based on constructs provided by the Hadoop API (which
> >> I have already looked into) and data querying based on Apache Hive/Pig.
> >>
> >> Furthermore, apart from Hadoop, there are two other similar frameworks
> >> that look quite promising.
> >>
> >> 1. Sector/Sphere
> >>
> >> Sector/Sphere [1] is an open-source software framework for
> >> high-performance distributed data storage and processing, comparable to
> >> Apache HDFS/Hadoop. Sector is a distributed file system, and Sphere is
> >> the programming framework that supports massive in-storage parallel data
> >> processing on data stored in Sector. The key motivation is that
> >> Sector/Sphere is claimed to be about 2-4 times faster than Hadoop.
> >>
> >> 2. Hyracks
> >>
> >> Hyracks [2] is another framework for data-intensive computing that is
> >> roughly in the same space as Apache Hadoop. It has support for composing
> >> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
> >> runtime. Furthermore, it powers the popular parallel DBMS ASTERIX [3].
> >>
> >> I am yet to look into the APIs of these two frameworks, but they should
> >> ideally work with the same GFac implementation that I have proposed for
> >> Hadoop.
> >>
> >> I would greatly appreciate your feedback on this approach, as well as the
> >> pros and cons of using Sector/Sphere or Hyracks if you have any
> >> experience with them already.
> >>
> >> [1] Y. Gu and R. L. Grossman, "Lessons learned from a year's worth of
> >> benchmarks of large data clouds," in Proceedings of the 2nd Workshop on
> >> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
> >>
> >> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, "Hyracks:
> >> A flexible and extensible foundation for data-intensive computing," in
> >> Data Engineering (ICDE), 2011 IEEE 27th International Conference on,
> >> 2011, pp. 1151-1162.
> >>
> >> [3] http://asterix.ics.uci.edu/
> >>
> >> Thanks,
> >> Danushka
> >
> > --
> > System Analyst Programmer
> > PTI Lab
> > Indiana University

--
System Analyst Programmer
PTI Lab
Indiana University
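Danushka's idea of resolving a provider from a TargetApplication-style parameter could be sketched roughly as follows. This is a minimal illustration only, assuming a simple map from application type to provider; the class and method names (ProviderResolver, GFacProvider, the string keys) are hypothetical and do not reflect the actual Airavata GFac API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: all names here are illustrative, not the real
// Airavata API. The point is only the lookup-by-target-application idea.
public class ProviderResolver {

    interface GFacProvider {
        String name();
    }

    static class HadoopProvider implements GFacProvider {
        public String name() { return "hadoop"; }
    }

    static class GramProvider implements GFacProvider {
        public String name() { return "gram"; }
    }

    private final Map<String, GFacProvider> providers = new HashMap<>();

    ProviderResolver() {
        // One provider registered per target application type.
        providers.put("HADOOP", new HadoopProvider());
        providers.put("GRAM", new GramProvider());
    }

    // Resolve the provider from a TargetApplication-style parameter,
    // as the Scheduler might do when dispatching a job.
    GFacProvider resolve(String targetApplication) {
        GFacProvider p = providers.get(targetApplication.toUpperCase());
        if (p == null) {
            throw new IllegalArgumentException(
                    "No provider registered for " + targetApplication);
        }
        return p;
    }

    public static void main(String[] args) {
        ProviderResolver scheduler = new ProviderResolver();
        // Case-insensitive lookup: "Hadoop" resolves to the Hadoop provider.
        System.out.println(scheduler.resolve("Hadoop").name());
    }
}
```

A registry-backed implementation would presumably replace the hard-coded map with provider metadata loaded at startup, but the dispatch step would look much the same.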
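Similarly, the job definitions proposed for the Airavata Registry, each with a unique identifier plus mapper, reducer, and formatter attributes, composable into chains, might look like the following sketch. The HadoopJobDefinition class and all field names are illustrative assumptions, not actual Airavata or Hadoop types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a registry-stored Hadoop job definition; in the
// proposal these attributes would come from XML metadata files in the jars.
public class HadoopJobDefinition {

    private final String jobId;                   // unique identifier in the registry
    private final String mapperClass;             // fully qualified mapper class name
    private final String reducerClass;            // fully qualified reducer class name
    private final List<String> formatterClasses;  // input/output format class names

    public HadoopJobDefinition(String jobId, String mapperClass,
                               String reducerClass, List<String> formatterClasses) {
        this.jobId = jobId;
        this.mapperClass = mapperClass;
        this.reducerClass = reducerClass;
        this.formatterClasses = formatterClasses;
    }

    public String getJobId() { return jobId; }
    public String getMapperClass() { return mapperClass; }
    public String getReducerClass() { return reducerClass; }
    public List<String> getFormatterClasses() { return formatterClasses; }

    // A "chain" is simply an ordered list of definitions executed in sequence;
    // this helper just renders the execution order for illustration.
    public static String describeChain(List<HadoopJobDefinition> chain) {
        StringBuilder sb = new StringBuilder();
        for (HadoopJobDefinition def : chain) {
            if (sb.length() > 0) sb.append(" -> ");
            sb.append(def.getJobId());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        HadoopJobDefinition wordCount = new HadoopJobDefinition(
                "word-count", "example.WordCountMapper", "example.WordCountReducer",
                Arrays.asList("TextInputFormat", "TextOutputFormat"));
        HadoopJobDefinition sort = new HadoopJobDefinition(
                "sort", "example.SortMapper", "example.SortReducer",
                Arrays.asList("TextInputFormat", "TextOutputFormat"));
        System.out.println(describeChain(Arrays.asList(wordCount, sort)));
    }
}
```

In the application context, a user would then reference only the job or chain id plus input/output locations, leaving the mapper/reducer wiring to the registry entry.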