From: "Edward J. Yoon"
To: user@hama.apache.org
Subject: RE: Hama vs Spark
Date: Wed, 05 Aug 2015 10:43:58 +0900

Hi,

I don't fully understand how GraphLab works, but I'm sure that there are pros
and cons either way. At the moment I have no plan. :-)

However, I noticed that the region barrier synchronization feature within a
single BSP job (the default is global barrier synchronization) is quite useful.
This can be used for performing asynchronous mini-batches.

--
Best Regards, Edward J. Yoon

-----Original Message-----
From: Behroz Sikander [mailto:behroz89@gmail.com]
Sent: Monday, August 03, 2015 7:38 PM
To: user@hama.apache.org
Subject: Re: Hama vs Spark

I think I wrote it wrong. It should be Asynchronous Iterations. I found the
following a few months back. It was a thesis description:

*SUPPORT FOR ASYNCHRONOUS ITERATIONS IN FLINK (IN COLLABORATION WITH KTH ROYAL
INSTITUTE FOR TECHNOLOGY, SWE)*

*Context:* Currently, most of the large scale graph processing systems adopt
the bulk synchronous parallel (BSP) model. According to this model, iterative
computations happen in well-defined supersteps, which are marked by a global
barrier. BSP simplifies application development and ensures determinism.
However, it has been shown that asynchronous execution often leads to faster
convergence, for several algorithms [LBG+12]. The main goal of this thesis is
to add support for asynchronous iterative execution, in Apache Flink, a
general-purpose, distributed data processing system.

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

On Mon, Aug 3, 2015 at 3:16 AM, Edward J. Yoon wrote:

> I'm not sure how it can be possible. However, I think the user can find
> the slowest machine in each superstep and re-balance the loads. This
> can be handled from the client (user) side.
>
> On Sat, Aug 1, 2015 at 4:17 AM, Behroz Sikander wrote:
> > +1. This is great.
> >
> > Btw, our current implementation of Hama is Synchronous BSP, i.e., we have
> > to wait for the slowest machine to sync in order to move to the next
> > superstep. Is there anything like Asynchronous BSP out yet? If yes, do you
> > have plans to add it to this framework?
> >
> > Regards,
> > Behroz
> >
> > On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon wrote:
> >
> >> I found a research paper somewhat related to this topic.
> >>
> >> "Both the disk based method, i.e., MR, and the memory based method,
> >> i.e., BSP and Spark, need to load the data into main memory and
> >> conduct the expensive computation. However, when processing topk
> >> joins, BSP is clearly the best method as it is the only one that is
> >> able to perform top-k joins on large datasets. This is because BSP
> >> supports the frequent synchronizations between workers when performing
> >> the joining procedure, which quickly lowers the joining threshold for
> >> a given k. The winner between the MR and the Spark algorithms change
> >> from datasets to datasets: Spark is beaten by MR on A and B while
> >> beats MR on C." -
> >> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
> >>
> >> On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander wrote:
> >> > Hi all,
> >> > *>>Apache Spark is definitely more suited for ML (iterative algorithms)
> >> > than legacy Hadoop due to its preservation of state and optimized
> >> > execution strategy (RDDs). However, their approaches are still in
> >> > synchronous iterative communication pattern.*
> >> > So, Hama has a better communication model. That is a good point.
> >> >
> >> > *>>Moreover, BSP can have virtual shared memory and many more benefits.*
> >> > I read somewhere that Spark has shared variables. Is BSP virtual shared
> >> > memory something else, or is it like shared variables in Spark?
> >> >
> >> > *>>In addition, another one convincing point I think can be a
> >> > utilization ability of modern acceleration accessories such as
> >> > InfiniBand and GPUs*
> >> > Yes, it is a good point, but I found the following link. Apparently,
> >> > Spark is also capable of doing processing on GPUs.
> >> >
> >> > https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
> >> >
> >> > *>>I'm sure that this feature will bring a completely new wave of big
> >> > data. The problem we faced is only a lack of interest to BSP
> >> > programming model. :-)*
> >> > My knowledge is quite limited, but I think you are right. With the rise
> >> > of IoT and stream processing, GPUs will become vital. Yes, I do not
> >> > understand why BSP is not the programming model of choice nowadays.
> >> > It has a strong theoretical background which was proposed decades back,
> >> > and still the MapReduce/Spark models are used.
> >> >
> >> > *>>Just FYI, one of my friends said after reading this thread, "if
> >> > Amazon EC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
> >> > I have not written a Spark job before, but I have seen the code. BSP
> >> > looks more intuitive to me somehow.
> >> >
> >> > *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
> >> > The graph module of Hama is definitely equal to GraphX of Spark.
> >> >
> >> > Regards,
> >> > Behroz
> >> >
> >> > On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <edward.yoon@samsung.com>
> >> > wrote:
> >> >
> >> >> Hi, here are a few thoughts.
> >> >>
> >> >> Apache Spark is definitely more suited for ML (iterative algorithms) than
> >> >> legacy Hadoop due to its preservation of state and optimized execution
> >> >> strategy (RDDs). However, their approaches are still in synchronous
> >> >> iterative communication pattern.
> >> >>
> >> >> In Apache Hama's case, it's a general-purpose pure BSP framework. While I
> >> >> admit that synchronization costs are high, the communication can be more
> >> >> efficiently realized with the message-passing BSP model. Moreover, BSP
> >> >> can have virtual shared memory and many more benefits.
> >> >> In addition, another one convincing point I think can be a utilization
> >> >> ability of modern acceleration accessories such as InfiniBand and GPUs.
> >> >> I'm sure that this feature will bring a completely new wave of big data.
> >> >> The problem we faced is only a lack of interest to BSP programming
> >> >> model. :-)
> >> >>
> >> >> > 2) Do we have any recent benchmarks between the 2 systems?
> >> >>
> >> >> It's on my todo list.
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >>
> >> >> -----Original Message-----
> >> >> From: Behroz Sikander [mailto:behroz89@gmail.com]
> >> >> Sent: Thursday, June 25, 2015 12:57 AM
> >> >> To: user@hama.apache.org
> >> >> Subject: Hama vs Spark
> >> >>
> >> >> Hi,
> >> >> A few days back, I started reading about Apache Spark. It is a pretty
> >> >> good big data platform. But a question comes to mind: where does Hama
> >> >> stand in comparison with Spark if we have to implement an iterative
> >> >> algorithm which is compute-intensive (machine learning or optimization)?
> >> >>
> >> >> I found some resources online, but none answers my questions.
> >> >>
> >> >> 1) BSP vs MapReduce paper
> >> >> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
> >> >> 3) I actually found the following benchmark, but it is quite old.
> >> >> http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
> >> >>
> >> >> Questions:
> >> >> 1) Is there any specific advantage when we choose the BSP model instead
> >> >> of the Spark paradigm?
> >> >> 2) Do we have any recent benchmarks between the 2 systems?
> >> >> 3) What is the main convincing point to use Hama over Spark?
> >> >> 4) Any scientific paper that compares both systems?
> >> >> (I was not able to find any)
> >> >>
> >> >> Regards,
> >> >> Behroz Sikander
> >>
> >> --
> >> Best Regards, Edward J. Yoon
>
> --
> Best Regards, Edward J. Yoon