From: Danushka Menikkumbura
Date: Tue, 26 Feb 2013 09:21:44 +0530
Subject: Re: Airavata/Hadoop Integration
To: dev@airavata.apache.org

Sounds great! Thanks Amila.

On Tue, Feb 26, 2013 at 8:46 AM, Amila Jayasekara wrote:

> On Mon, Feb 25, 2013 at 9:59 PM, Danushka Menikkumbura wrote:
> > Also, I suggest we have a simple plug-in architecture for providers
> > that would make having custom providers possible.
>
> Hi Danushka,
>
> I guess the plugin mechanism for providers is already in place with the
> new GFac architecture. Lahiru will be able to give more information
> about this.
>
> Thanks
> Amila
>
> > Thanks,
> > Danushka
> >
> > On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <
> > danushka.menikkumbura@gmail.com> wrote:
> >
> >> Hi Devs,
> >>
> >> I am looking into extending the Big Data capabilities of Airavata as
> >> my M.Sc. research work. I have identified certain possibilities and
> >> am going to start with integrating Apache Hadoop (and Hadoop-like
> >> frameworks) with Airavata.
> >>
> >> From what I have understood, the best approach would be to have a
> >> new GFacProvider for Hadoop that takes care of handling Hadoop jobs.
> >> We can add a new parameter to the ApplicationContext (say
> >> TargetApplication) to define the target application type, and
> >> resolve the correct provider in the GFac Scheduler based on it. I
> >> see that having this capability in the Scheduler class is already a
> >> TODO. I have made these changes locally and invoked a simple Hadoop
> >> job through GFac, so I can say the approach is viable, barring any
> >> implication I am missing.
> >>
> >> I think we can store Hadoop job definitions in the Airavata
> >> Registry, where each definition would essentially include a unique
> >> identifier and other attributes such as the mapper, reducer, sorter,
> >> formatters, etc., all of which could be defined using XBaya.
> >> Information about these building blocks could be loaded from XML
> >> metadata files (of a known format) included in the jar files. It
> >> should also be possible to compose Hadoop job "chains" using XBaya.
> >> So, what we specify in the application context would be the target
> >> application type (say Hadoop), the job/chain id, the input file
> >> location, and the output file location. In addition, I am thinking
> >> of adding job monitoring support based on constructs provided by the
> >> Hadoop API (which I have already looked into) and data querying
> >> based on Apache Hive/Pig.
> >>
> >> Furthermore, apart from Hadoop, there are two other similar
> >> frameworks that look quite promising.
> >>
> >> 1. Sector/Sphere
> >>
> >> Sector/Sphere [1] is an open-source software framework for
> >> high-performance distributed data storage and processing, comparable
> >> with Apache HDFS/Hadoop. Sector is a distributed file system, and
> >> Sphere is the programming framework that supports massive in-storage
> >> parallel processing of data stored in Sector. The key motive is that
> >> Sector/Sphere is claimed to be about 2-4 times faster than Hadoop.
> >>
> >> 2. Hyracks
> >>
> >> Hyracks [2] is another framework for data-intensive computing that
> >> is roughly in the same space as Apache Hadoop. It supports composing
> >> and executing native Hyracks jobs as well as running Hadoop jobs in
> >> the Hyracks runtime. Furthermore, it powers the ASTERIX parallel
> >> DBMS [3].
> >>
> >> I am yet to look into the APIs of these two frameworks, but they
> >> should ideally work with the same GFac implementation that I have
> >> proposed for Hadoop.
> >>
> >> I would greatly appreciate your feedback on this approach, as well
> >> as the pros and cons of Sector/Sphere and Hyracks if you already
> >> have experience with them.
> >>
> >> [1] Y. Gu and R. L. Grossman, "Lessons learned from a year's worth
> >> of benchmarks of large data clouds," in Proceedings of the 2nd
> >> Workshop on Many-Task Computing on Grids and Supercomputers, 2009,
> >> p. 3.
> >>
> >> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica,
> >> "Hyracks: A flexible and extensible foundation for data-intensive
> >> computing," in Data Engineering (ICDE), 2011 IEEE 27th International
> >> Conference on, 2011, pp. 1151-1162.
> >>
> >> [3] http://asterix.ics.uci.edu/
> >>
> >> Thanks,
> >> Danushka
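
To make the TargetApplication/provider idea above concrete, this is roughly
the shape of my local change. The interface and class names below are
simplified placeholders, not the actual GFac API:

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the GFac provider contract.
interface GFacProvider {
    void execute(Map<String, String> applicationContext) throws Exception;
}

// Provider that knows how to run Hadoop jobs (submission and
// monitoring are sketched further below).
class HadoopProvider implements GFacProvider {
    public void execute(Map<String, String> ctx) throws Exception {
        System.out.println("Submitting Hadoop job " + ctx.get("JobId"));
    }
}

// The TODO in the Scheduler class: pick a provider from the
// TargetApplication parameter instead of hard-coding one.
class Scheduler {
    private final Map<String, GFacProvider> providers =
            new HashMap<String, GFacProvider>();

    void register(String targetApplication, GFacProvider provider) {
        providers.put(targetApplication, provider); // the plug-in point
    }

    GFacProvider resolve(Map<String, String> applicationContext) {
        String target = applicationContext.containsKey("TargetApplication")
                ? applicationContext.get("TargetApplication")
                : "default"; // fall back to current behaviour
        GFacProvider provider = providers.get(target);
        if (provider == null) {
            throw new IllegalArgumentException(
                    "No provider registered for " + target);
        }
        return provider;
    }
}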
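
On the Registry side, I imagine a job definition carrying something like the
following fields (the names are my assumptions, not an existing schema):

import java.util.List;

// Illustrative shape of a Hadoop job definition stored in the Registry.
class HadoopJobDefinition {
    String id;                     // unique identifier, referenced from the ApplicationContext
    String jarLocation;            // jar carrying the classes and the XML metadata file
    String mapperClass;
    String reducerClass;
    String sorterClass;            // optional custom sort comparator
    List<String> formatterClasses; // input/output format classes
    List<String> chainedJobIds;    // for composing job "chains" in XBaya
}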
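
And for job monitoring, a minimal sketch against the plain Hadoop mapreduce
API (Hadoop 1.x era); the library mapper/reducer here are just stand-ins for
whatever classes a Registry definition would name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class HadoopJobRunner {
    // Submit a job and poll its progress, as the HadoopProvider would.
    public static void run(String inputPath, String outputPath)
            throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "airavata-hadoop-job");
        // In the provider these would be loaded reflectively from the
        // Registry definition (mapperClass, reducerClass, ...).
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.submit(); // non-blocking, so we can report progress
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}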