hama-user mailing list archives

From Behroz Sikander <behro...@gmail.com>
Subject Re: Hama vs Spark
Date Fri, 31 Jul 2015 19:17:37 GMT
+1. This is great.

Btw, the current implementation of Hama is synchronous BSP, i.e. we have to
wait for the slowest machine to sync before moving on to the next superstep.
Is there anything like asynchronous BSP out yet? If yes, do you have
plans to add it to this framework?
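
To make the "wait for the slowest machine" point concrete, here is a minimal
sketch, assuming plain Python threads stand in for BSP peers. It is not Hama
code: the names run_sync_bsp and barrier_respected and the delay values are
made up for illustration. Each worker computes locally and then blocks on a
barrier, so nobody enters the next superstep until the straggler arrives.

```python
import threading
import time


def run_sync_bsp(delays, supersteps):
    """Run len(delays) workers for `supersteps` rounds with a barrier between rounds.

    delays[i] is a hypothetical per-superstep compute time of worker i, in seconds.
    Returns the ordered event log as (event, superstep, worker) tuples.
    """
    barrier = threading.Barrier(len(delays))
    log = []
    lock = threading.Lock()

    def worker(rank):
        for step in range(supersteps):
            with lock:
                log.append(("start", step, rank))
            time.sleep(delays[rank])  # local computation phase
            with lock:
                log.append(("end", step, rank))
            barrier.wait()  # sync: blocks until every worker has arrived

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(delays))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def barrier_respected(log, supersteps):
    """True if no worker started superstep s+1 before every worker ended superstep s."""
    for step in range(supersteps - 1):
        last_end = max(i for i, e in enumerate(log) if e[0] == "end" and e[1] == step)
        first_next = min(
            i for i, e in enumerate(log) if e[0] == "start" and e[1] == step + 1
        )
        if first_next < last_end:
            return False
    return True


if __name__ == "__main__":
    # Worker 2 is the straggler; the two fast workers idle at the barrier.
    events = run_sync_bsp(delays=[0.01, 0.01, 0.05], supersteps=3)
    print(barrier_respected(events, 3))
```

In this sketch every superstep takes roughly as long as the slowest worker,
which is the cost an asynchronous variant would try to avoid.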

Regards,
Behroz

On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon <edwardyoon@apache.org>
wrote:

> I found a research paper somewhat related to this topic.
>
> "Both the disk based method, i.e., MR, and the memory based method,
> i.e., BSP and Spark, need to load the data into main memory and
> conduct the expensive computation. However, when processing topk
> joins, BSP is clearly the best method as it is the only one that is
> able to perform top-k joins on large datasets. This is because BSP
> supports the frequent synchronizations between workers when performing
> the joining procedure, which quickly lowers the joining threshold for
> a given k. The winner between the MR and the Spark algorithms change
> from datasets to datasets: Spark is beaten by MR on A and B while
> beats MR on C." -
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
>
> On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander <behroz89@gmail.com>
> wrote:
> > Hi all,
> > *>>Apache Spark is definitely more suited for ML (iterative algorithms)
> > than legacy Hadoop due to its preservation of state and optimized
> > execution strategy (RDDs). However, their approaches are still in
> > synchronous iterative communication pattern.*
> > So, Hama has a better communication model. That is a good point.
> >
> > *>>Moreover, BSP can have virtual shared memory and many more benefits.*
> > I read somewhere that Spark has shared variables. Is BSP virtual shared
> > memory something else, or is it like shared variables in Spark?
> >
> > *>>In addition, another one convincing point I think can be a utilization
> > ability of modern acceleration accessories such as InfiniBand and GPUs*
> > Yes, it is a good point, but I found the following link. Apparently, Spark
> > is also capable of processing on GPUs.
> > https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
> >
> > *>>I'm sure that this feature will bring a completely new wave of big data.
> > The problem we faced is only a lack of interest to BSP programming model. :-)*
> > My knowledge is quite limited but I think you are right. With the rise of
> > IoT and stream processing, GPUs will become vital. Yes, I do not
> > understand why BSP is not the programming model of choice nowadays.
> > It has a strong theoretical background and was proposed decades ago, yet
> > the MapReduce/Spark models are still the ones used.
> >
> >
> > *>>Just FYI, one of my friends said after reading this thread, "if
> > AmazonEC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
> > I have not written a Spark job before, but I have seen the code. BSP looks
> > more intuitive to me somehow.
> >
> > *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
> > The graph module of Hama is definitely equal to GraphX of Spark.
> >
> > Regards,
> > Behroz
> >
> > On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <edward.yoon@samsung.com>
> > wrote:
> >
> >> Hi, here are a few of my thoughts.
> >>
> >> Apache Spark is definitely more suited for ML (iterative algorithms) than
> >> legacy Hadoop due to its preservation of state and optimized execution
> >> strategy (RDDs). However, their approaches are still in synchronous
> >> iterative communication pattern.
> >>
> >> In Apache Hama case, it's a general-purpose pure BSP framework. While I
> >> admit that synchronization costs are high, the communication can be more
> >> efficiently realized with the message-passing BSP model. Moreover, BSP can
> >> have virtual shared memory and many more benefits. In addition, another one
> >> convincing point I think can be a utilization ability of modern
> >> acceleration accessories such as InfiniBand and GPUs. I'm sure that this
> >> feature will bring a completely new wave of big data. The problem we faced
> >> is only a lack of interest to BSP programming model. :-)
> >>
> >> > 2) Do we have any recent benchmarks between the 2 systems ?
> >>
> >> It's in my todo list.
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >>
> >> -----Original Message-----
> >> From: Behroz Sikander [mailto:behroz89@gmail.com]
> >> Sent: Thursday, June 25, 2015 12:57 AM
> >> To: user@hama.apache.org
> >> Subject: Hama vs Spark
> >>
> >> Hi,
> >> A few days back, I started reading about Apache Spark. It is a pretty good
> >> Big Data platform. But a question came to my mind: where does Hama stand in
> >> comparison with Spark if we have to implement an iterative algorithm that
> >> is compute-intensive (machine learning or optimization)?
> >>
> >> I found some resources online but none answers my questions.
> >>
> >> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
> >> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
> >> 3) I actually found the following benchmark but it is quite old.
> >> http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
> >>
> >> Questions:
> >> 1) Is there any specific advantage when we choose the BSP model instead of
> >> the Spark paradigm?
> >> 2) Do we have any recent benchmarks between the 2 systems ?
> >> 3) What is the main convincing point to use Hama over Spark ?
> >> 4) Any scientific paper that compares both systems ? (I was not able to
> >> find any)
> >>
> >> Regards,
> >> Behroz Sikander
> >>
> >>
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
>
