Mailing-List: contact user-help@hama.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hama.apache.org
From: "Edward J. Yoon" <edward.yoon@samsung.com>
To: user@hama.apache.org
References: 
 <CAAp_xXFo-xmBfvu+W+3QJHeYSX_RdRW89dXDRnuxJ1Bt_2CabA@mail.gmail.com>
 <001c01d0aed7$fcf6f8b0$f6e4ea10$@samsung.com>
 <CAAp_xXENOc5X9OjJrw1q9XMt_+1g6n7M_6eV9Q1T1VH2NJqioQ@mail.gmail.com>
 <CAGQgZQRFRgzk-9a8EhoPiyV4thpwWvMZauTa9=B4s6CbvMC16Q@mail.gmail.com>
 <CAAp_xXFbzuU6fT2ykuT0+_cRZoM09dmX2LmmRpumyDNteeMjoA@mail.gmail.com>
 <CAGQgZQRwGJB+wDvDsiOt2r5iPBTLWG7m3QaeMcpJKR9hsbqcOA@mail.gmail.com>
 <CAAp_xXE5sczOho4jP+HBPu8tNnD0O47d4v5EVrXRxgrYQmag8g@mail.gmail.com>
In-reply-to: 
 <CAAp_xXE5sczOho4jP+HBPu8tNnD0O47d4v5EVrXRxgrYQmag8g@mail.gmail.com>
Subject: RE: Hama vs Spark
Date: Wed, 05 Aug 2015 10:43:58 +0900
Message-id: <001901d0cf20$325c0960$97141c20$@samsung.com>
Thread-index: 
 AQHKOMMwIZxOJIgy1oq1swqsJTkHJgJ3IJRuAWAjwQwBk1dl0AGebdlUAeCKToUB7cRj9J2zEkoA
Content-language: ko

Hi,

I don't fully understand how GraphLab works, but I'm sure there are pros
and cons either way. At the moment I have no plan. :-)

However, I noticed that the region barrier synchronization feature within a
single BSP job (the default is global barrier synchronization) is quite useful.
It can be used to perform asynchronous mini-batches.
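
To illustrate the idea, here is a minimal sketch in plain Python threads. It
is not Hama's API: run_regions, region_sizes and steps_per_region are
hypothetical names chosen for this example. Each region has its own barrier,
so workers wait only for peers in the same region and a fast region can finish
more mini-batches than a slow one.

```python
import threading


def run_regions(region_sizes, steps_per_region):
    """Run one thread per worker; workers synchronize only within their region.

    region_sizes[r] is the number of workers in region r (hypothetical layout).
    steps_per_region[r] is how many mini-batches region r processes.
    """
    # One barrier per region instead of a single global barrier.
    barriers = [threading.Barrier(n) for n in region_sizes]
    completed = [[0] * n for n in region_sizes]

    def worker(region, idx):
        for _ in range(steps_per_region[region]):
            # ... process one mini-batch here ...
            completed[region][idx] += 1
            # Wait only for the other workers of the same region.
            barriers[region].wait()

    threads = [
        threading.Thread(target=worker, args=(r, i))
        for r, n in enumerate(region_sizes)
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed


if __name__ == "__main__":
    # Region 0 (2 workers) runs 5 mini-batches, region 1 (3 workers) runs 2.
    print(run_regions([2, 3], [5, 2]))
```

With a single global barrier, the two regions above could not run a different
number of steps, because every worker would have to reach every barrier.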

--
Best Regards, Edward J. Yoon

-----Original Message-----
From: Behroz Sikander [mailto:behroz89@gmail.com]
Sent: Monday, August 03, 2015 7:38 PM
To: user@hama.apache.org
Subject: Re: Hama vs Spark

I think I wrote it wrong. It should be Asynchronous Iterations. I found the
following a few months back. It was a thesis description:

*SUPPORT FOR ASYNCHRONOUS ITERATIONS IN FLINK (IN COLLABORATION WITH KTH
ROYAL INSTITUTE OF TECHNOLOGY, SWE)*

*Context:* Currently, most of the large scale graph processing systems
adopt the bulk synchronous parallel (BSP) model. According to this model,
iterative computations happen in well-defined supersteps, which are marked
by a global barrier. BSP simplifies application development and ensures
determinism. However, it has been shown that asynchronous execution often
leads to faster convergence, for several algorithms [LBG+12]. The main goal
of this thesis is to add support for asynchronous iterative execution, in
Apache Flink, a general-purpose, distributed data processing system.

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
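
The convergence claim can be illustrated with a small sequential sketch. It is
only an analogy, not Flink or Hama code, and the function names are made up
for this example. Both functions propagate the minimum label through a graph.
The first reads only values from the previous superstep (BSP style); the
second updates in place, so a vertex sees the newest values immediately.

```python
def min_label_sync(neighbors, labels):
    """BSP style: every vertex reads values from the previous superstep."""
    labels = list(labels)
    supersteps = 0
    while True:
        new = [
            min([labels[v]] + [labels[u] for u in neighbors[v]])
            for v in range(len(labels))
        ]
        if new == labels:
            return labels, supersteps
        labels = new  # global barrier: all updates become visible together
        supersteps += 1


def min_label_async(neighbors, labels):
    """Async style: every vertex reads the newest values available."""
    labels = list(labels)
    sweeps = 0
    while True:
        changed = False
        for v in range(len(labels)):
            best = min([labels[v]] + [labels[u] for u in neighbors[v]])
            if best < labels[v]:
                labels[v] = best  # visible to later vertices right away
                changed = True
        if not changed:
            return labels, sweeps
        sweeps += 1


if __name__ == "__main__":
    # A chain of 5 vertices: 0 - 1 - 2 - 3 - 4
    chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
    print(min_label_sync(chain, [0, 1, 2, 3, 4]))
    print(min_label_async(chain, [0, 1, 2, 3, 4]))
```

On this chain the synchronous version needs 4 supersteps, while the in-place
version converges in 1 sweep. The gap depends on the vertex order and the
graph, so treat it as an illustration rather than a benchmark.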

On Mon, Aug 3, 2015 at 3:16 AM, Edward J. Yoon <edwardyoon@apache.org>
wrote:

> I'm not sure how that would be possible. However, I think the user can find
> the slowest machine in each superstep and rebalance the load. This
> can be handled on the client (user) side.
>
> On Sat, Aug 1, 2015 at 4:17 AM, Behroz Sikander <behroz89@gmail.com>
> wrote:
> > +1. This is great.
> >
> > Btw, our current implementation of Hama is synchronous BSP, i.e., we have
> > to wait for the slowest machine to sync before moving to the next
> > superstep. Is there anything like asynchronous BSP out yet? If so, do you
> > have plans to add it to this framework?
> >
> > Regards,
> > Behroz
> >
> > On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon <edwardyoon@apache.org>
> > wrote:
> >
> >> I found a research paper somewhat related to this topic.
> >>
> >> "Both the disk based method, i.e., MR, and the memory based method,
> >> i.e., BSP and Spark, need to load the data into main memory and
> >> conduct the expensive computation. However, when processing topk
> >> joins, BSP is clearly the best method as it is the only one that is
> >> able to perform top-k joins on large datasets. This is because BSP
> >> supports the frequent synchronizations between workers when performing
> >> the joining procedure, which quickly lowers the joining threshold for
> >> a given k. The winner between the MR and the Spark algorithms change
> >> from datasets to datasets: Spark is beaten by MR on A and B while
> >> beats MR on C." -
> >> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
> >>
> >> On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander <behroz89@gmail.com>
> >> wrote:
> >> > Hi all,
> >> > *>>Apache Spark is definitely more suited for ML (iterative algorithms)
> >> > than legacy Hadoop due to its preservation of state and optimized
> >> > execution strategy (RDDs). However, their approaches are still in
> >> > synchronous iterative communication pattern.*
> >> > So, Hama has a better communication model. That is a good point.
> >> >
> >> > *>>Moreover, BSP can have virtual shared memory and many more benefits.*
> >> > I read somewhere that Spark has shared variables. Is BSP virtual shared
> >> > memory something else, or is it like shared variables in Spark?
> >> >
> >> > *>>In addition, another one convincing point I think can be a utilization
> >> > ability of modern acceleration accessories such as InfiniBand and GPUs*
> >> > Yes, it is a good point, but I found the following link. Apparently,
> >> > Spark is also capable of processing on GPUs.
> >> >
> >> > https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
> >> >
> >> > *>>I'm sure that this feature will bring a completely new wave of big
> >> > data. The problem we faced is only a lack of interest to BSP programming
> >> > model. :-)*
> >> > My knowledge is quite limited, but I think you are right. With the rise
> >> > of IoT and stream processing, GPUs will become vital. Yes, I do not
> >> > understand why BSP is not the programming model of choice nowadays.
> >> > It has a strong theoretical background that was proposed decades ago, and
> >> > still MapReduce/Spark models are used.
> >> >
> >> >
> >> > *>>Just FYI, one of my friends said after reading this thread, "if
> >> > Amazon EC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
> >> > I have not written a Spark job before, but I have seen the code. BSP
> >> > looks more intuitive to me somehow.
> >> >
> >> > *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
> >> > The graph module of Hama is definitely equal to GraphX of Spark.
> >> >
> >> > Regards,
> >> > Behroz
> >> >
> >> > On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <edward.yoon@samsung.com>
> >> > wrote:
> >> >
> >> >> Hi, here are a few thoughts.
> >> >>
> >> >> Apache Spark is definitely more suited for ML (iterative algorithms) than
> >> >> legacy Hadoop due to its preservation of state and optimized execution
> >> >> strategy (RDDs). However, their approaches are still in synchronous
> >> >> iterative communication pattern.
> >> >>
> >> >> In Apache Hama case, it's a general-purpose pure BSP framework. While I
> >> >> admit that synchronization costs are high, the communication can be more
> >> >> efficiently realized with the message-passing BSP model. Moreover, BSP
> >> >> can have virtual shared memory and many more benefits. In addition,
> >> >> another one convincing point I think can be a utilization ability of
> >> >> modern acceleration accessories such as InfiniBand and GPUs. I'm sure
> >> >> that this feature will bring a completely new wave of big data. The
> >> >> problem we faced is only a lack of interest to BSP programming model. :-)
> >> >>
> >> >> > 2) Do we have any recent benchmarks between the 2 systems ?
> >> >>
> >> >> It's on my to-do list.
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >>
> >> >> -----Original Message-----
> >> >> From: Behroz Sikander [mailto:behroz89@gmail.com]
> >> >> Sent: Thursday, June 25, 2015 12:57 AM
> >> >> To: user@hama.apache.org
> >> >> Subject: Hama vs Spark
> >> >>
> >> >> Hi,
> >> >> A few days back, I started reading about Apache Spark. It is a pretty
> >> >> good big data platform. But a question comes to mind: where does Hama
> >> >> stand in comparison with Spark if we have to implement an iterative
> >> >> algorithm that is compute-intensive (machine learning or optimization)?
> >> >>
> >> >> I found some resources online but none answers my questions.
> >> >>
> >> >> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
> >> >> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
> >> >> 3) I actually found the following benchmark, but it is quite old:
> >> >> http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
> >> >>
> >> >> Questions:
> >> >> 1) Is there any specific advantage to choosing the BSP model over the
> >> >> Spark paradigm?
> >> >> 2) Do we have any recent benchmarks between the 2 systems?
> >> >> 3) What is the main convincing point for using Hama over Spark?
> >> >> 4) Is there any scientific paper that compares both systems? (I was not
> >> >> able to find any.)
> >> >>
> >> >> Regards,
> >> >> Behroz Sikander
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
>