Date: Tue, 26 Feb 2013 11:47:38 -0500
Subject: Re: Airavata/Hadoop Integration
From: Lahiru Gunathilake
To: dev@airavata.apache.org

Hi Danushka,

I'm on it right now, will finish in a couple of hours.

Lahiru

On Tue, Feb 26, 2013 at 10:23 AM, Suresh Marru wrote:

> On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake wrote:
>
> > Hi Danushka,
> >
> > I think we already have a provider to handle Hadoop jobs which uses Apache
> > Whirr to set up the Hadoop cluster and submit the job.
>
> I think Lahiru is referring to the GSOC projects -
> https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/
>
> Suresh
>
> > We still haven't ported this code to Airavata; once I do, I will send an
> > email to the list.
> >
> > Regards,
> > Lahiru
> >
> > On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
> > danushka.menikkumbura@gmail.com> wrote:
> >
> >> Hi Devs,
> >>
> >> I am looking into extending the Big Data capabilities of Airavata as my
> >> M.Sc. research work. I have identified certain possibilities and am going
> >> to start with integrating Apache Hadoop (and Hadoop-like frameworks) with
> >> Airavata.
> >>
> >> From what I have understood, the best approach would be to have a new
> >> GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
> >> have a new parameter in the ApplicationContext (say TargetApplication) to
> >> define the target application type and resolve the correct provider in
> >> the GFac Scheduler based on that.
> >> I see that having this capability in the Scheduler class is already a
> >> TODO. I have been able to make these changes locally and invoke a simple
> >> Hadoop job using GFac, so I am confident that this approach is viable,
> >> barring any implications I may be missing.
> >>
> >> I think we can store Hadoop job definitions in the Airavata Registry,
> >> where each definition would essentially include a unique identifier and
> >> other attributes like mapper, reducer, sorter, formatters, etc. that can
> >> be defined using XBaya. Information about these building blocks could be
> >> loaded from XML metadata files (of a known format) included in jar files.
> >> It should also be possible to compose Hadoop job "chains" using XBaya.
> >> So, what we would specify in the application context is the target
> >> application type (say Hadoop), the job/chain id, the input file location,
> >> and the output file location. In addition, I am thinking of adding job
> >> monitoring support based on constructs provided by the Hadoop API (which
> >> I have already looked into) and data querying based on Apache Hive/Pig.
> >>
> >> Furthermore, apart from Hadoop, there are two other similar frameworks
> >> that look quite promising.
> >>
> >> 1. Sector/Sphere
> >>
> >> Sector/Sphere [1] is an open-source software framework for
> >> high-performance distributed data storage and processing, comparable to
> >> Apache HDFS/Hadoop. Sector is a distributed file system, and Sphere is
> >> the programming framework that supports massive in-storage parallel data
> >> processing on data stored in Sector. The key motivation is that
> >> Sector/Sphere is claimed to be about 2-4 times faster than Hadoop.
> >>
> >> 2. Hyracks
> >>
> >> Hyracks [2] is another framework for data-intensive computing that is
> >> roughly in the same space as Apache Hadoop. It has support for composing
> >> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
> >> runtime. Furthermore, it powers the popular parallel DBMS ASTERIX [3].
> >>
> >> I am yet to look into the APIs of these two frameworks, but they should
> >> ideally work with the same GFac implementation that I have proposed for
> >> Hadoop.
> >>
> >> I would greatly appreciate your feedback on this approach, as well as the
> >> pros and cons of using Sector/Sphere or Hyracks if you have any
> >> experience with them already.
> >>
> >> [1] Y. Gu and R. L. Grossman, "Lessons learned from a year's worth of
> >> benchmarks of large data clouds," in Proceedings of the 2nd Workshop on
> >> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
> >>
> >> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, "Hyracks:
> >> A flexible and extensible foundation for data-intensive computing," in
> >> Data Engineering (ICDE), 2011 IEEE 27th International Conference on,
> >> 2011, pp. 1151-1162.
> >>
> >> [3] http://asterix.ics.uci.edu/
> >>
> >> Thanks,
> >> Danushka
> >
> > --
> > System Analyst Programmer
> > PTI Lab
> > Indiana University

--
System Analyst Programmer
PTI Lab
Indiana University
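Danushka's idea of resolving a provider from a TargetApplication-style parameter could be sketched roughly as follows. This is a minimal illustration only, assuming a simple map from application type to provider; the class and method names (ProviderResolver, GFacProvider, the string keys) are hypothetical and do not reflect the actual Airavata GFac API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: all names here are illustrative, not the real
// Airavata API. The point is only the lookup-by-target-application idea.
public class ProviderResolver {

    interface GFacProvider {
        String name();
    }

    static class HadoopProvider implements GFacProvider {
        public String name() { return "hadoop"; }
    }

    static class GramProvider implements GFacProvider {
        public String name() { return "gram"; }
    }

    private final Map<String, GFacProvider> providers = new HashMap<>();

    ProviderResolver() {
        // One provider registered per target application type.
        providers.put("HADOOP", new HadoopProvider());
        providers.put("GRAM", new GramProvider());
    }

    // Resolve the provider from a TargetApplication-style parameter,
    // as the Scheduler might do when dispatching a job.
    GFacProvider resolve(String targetApplication) {
        GFacProvider p = providers.get(targetApplication.toUpperCase());
        if (p == null) {
            throw new IllegalArgumentException(
                    "No provider registered for " + targetApplication);
        }
        return p;
    }

    public static void main(String[] args) {
        ProviderResolver scheduler = new ProviderResolver();
        // Case-insensitive lookup: "Hadoop" resolves to the Hadoop provider.
        System.out.println(scheduler.resolve("Hadoop").name());
    }
}
```

A registry-backed implementation would presumably replace the hard-coded map with provider metadata loaded at startup, but the dispatch step would look much the same.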
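Similarly, the job definitions proposed for the Airavata Registry, each with a unique identifier plus mapper, reducer, and formatter attributes, composable into chains, might look like the following sketch. The HadoopJobDefinition class and all field names are illustrative assumptions, not actual Airavata or Hadoop types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a registry-stored Hadoop job definition; in the
// proposal these attributes would come from XML metadata files in the jars.
public class HadoopJobDefinition {

    private final String jobId;                   // unique identifier in the registry
    private final String mapperClass;             // fully qualified mapper class name
    private final String reducerClass;            // fully qualified reducer class name
    private final List<String> formatterClasses;  // input/output format class names

    public HadoopJobDefinition(String jobId, String mapperClass,
                               String reducerClass, List<String> formatterClasses) {
        this.jobId = jobId;
        this.mapperClass = mapperClass;
        this.reducerClass = reducerClass;
        this.formatterClasses = formatterClasses;
    }

    public String getJobId() { return jobId; }
    public String getMapperClass() { return mapperClass; }
    public String getReducerClass() { return reducerClass; }
    public List<String> getFormatterClasses() { return formatterClasses; }

    // A "chain" is simply an ordered list of definitions executed in sequence;
    // this helper just renders the execution order for illustration.
    public static String describeChain(List<HadoopJobDefinition> chain) {
        StringBuilder sb = new StringBuilder();
        for (HadoopJobDefinition def : chain) {
            if (sb.length() > 0) sb.append(" -> ");
            sb.append(def.getJobId());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        HadoopJobDefinition wordCount = new HadoopJobDefinition(
                "word-count", "example.WordCountMapper", "example.WordCountReducer",
                Arrays.asList("TextInputFormat", "TextOutputFormat"));
        HadoopJobDefinition sort = new HadoopJobDefinition(
                "sort", "example.SortMapper", "example.SortReducer",
                Arrays.asList("TextInputFormat", "TextOutputFormat"));
        System.out.println(describeChain(Arrays.asList(wordCount, sort)));
    }
}
```

In the application context, a user would then reference only the job or chain id plus input/output locations, leaving the mapper/reducer wiring to the registry entry.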