hama-user mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Hama vs Spark
Date Mon, 03 Aug 2015 01:16:16 GMT
I'm not sure how that would be possible. However, I think a user can find
the slowest machine in each superstep and rebalance the load. This
can be handled on the client (user) side.
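
A minimal sketch of that client-side idea, in plain Python rather than Hama
API code. The names `slowest_peer` and `rebalance`, and the assumption that
each peer reports its item count and compute time after every superstep, are
hypothetical and only serve the illustration:

```python
def slowest_peer(times):
    """Return the index of the peer with the longest compute time."""
    return max(range(len(times)), key=times.__getitem__)


def rebalance(loads, times):
    """Split the same total work in proportion to measured throughput.

    loads: items each peer processed in the last superstep
    times: seconds each peer spent computing in that superstep (all > 0)
    Returns the item counts to assign for the next superstep.
    """
    if not loads or len(loads) != len(times):
        raise ValueError("loads and times must be non-empty and equal length")
    if any(t <= 0 for t in times):
        raise ValueError("times must be positive")

    total = sum(loads)
    # Throughput in items per second. A peer that processed nothing has
    # rate 0 and receives no work here; a real policy might probe it again.
    rates = [load / t for load, t in zip(loads, times)]
    rate_sum = sum(rates)
    if rate_sum == 0:
        return list(loads)

    # Ideal fractional shares, then round down and hand the leftover items
    # to the peers with the largest fractional remainders.
    ideal = [total * r / rate_sum for r in rates]
    new_loads = [int(x) for x in ideal]
    leftover = total - sum(new_loads)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - new_loads[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        new_loads[i] += 1
    return new_loads


if __name__ == "__main__":
    loads = [100, 100, 100]
    times = [1.0, 1.0, 2.0]
    print("slowest peer:", slowest_peer(times))
    print("next loads:", rebalance(loads, times))
```

With three peers that each processed 100 items in 1.0, 1.0 and 2.0 seconds,
the third peer is the slowest and the sketch proposes loads of 120, 120 and
60, so all three would be expected to reach the barrier at about the same
time if their speeds stay constant.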

On Sat, Aug 1, 2015 at 4:17 AM, Behroz Sikander <behroz89@gmail.com> wrote:
> +1. This is great.
>
> Btw, our current implementation of Hama is synchronous BSP, i.e., we have
> to wait for the slowest machine to sync in order to move to the next
> superstep. Is there anything like asynchronous BSP out yet? If yes, do you
> have plans to add it to this framework?
>
> Regards,
> Behroz
>
> On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon <edwardyoon@apache.org>
> wrote:
>
>> I found a research paper somewhat related to this topic.
>>
>> "Both the disk based method, i.e., MR, and the memory based method,
>> i.e., BSP and Spark, need to load the data into main memory and
>> conduct the expensive computation. However, when processing topk
>> joins, BSP is clearly the best method as it is the only one that is
>> able to perform top-k joins on large datasets. This is because BSP
>> supports the frequent synchronizations between workers when performing
>> the joining procedure, which quickly lowers the joining threshold for
>> a given k. The winner between the MR and the Spark algorithms change
>> from datasets to datasets: Spark is beaten by MR on A and B while
>> beats MR on C." -
>> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
>>
>> On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander <behroz89@gmail.com>
>> wrote:
>> > Hi all,
>> > *>>Apache Spark is definitely more suited for ML (iterative algorithms)
>> > than legacy Hadoop due to its preservation of state and optimized
>> > execution strategy (RDDs). However, their approaches are still in
>> > synchronous iterative communication pattern.*
>> > So, Hama has a better communication model. That is a good point.
>> >
>> > *>>Moreover, BSP can have virtual shared memory and many more benefits.*
>> > I read somewhere that Spark has shared variables. Is BSP virtual shared
>> > memory something else, or is it like shared variables in Spark?
>> >
>> > *>>In addition, another one convincing point I think can be a
>> > utilization ability of modern acceleration accessories such as
>> > InfiniBand and GPUs*
>> > Yes, it is a good point but I found the following link. Apparently, Spark
>> > is also capable of doing processing on GPUs.
>> >
>> > https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
>> >
>> > *>>I'm sure that this feature will bring a completely new wave of big
>> > data. The problem we faced is only a lack of interest to BSP programming
>> > model. :-)*
>> > My knowledge is quite limited but I think you are right. With the rise of
>> > IoT and stream processing, GPUs will become vital. Yes, I do not
>> > understand why BSP is not the programming model of choice nowadays.
>> > It has a strong theoretical background and was proposed decades ago,
>> > yet the MapReduce/Spark models are still the ones in use.
>> >
>> >
>> > *>>Just FYI, one of my friends said after reading this thread, "if
>> > Amazon EC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
>> > I have not written a Spark job before, but I have seen the code. BSP
>> > looks more intuitive to me somehow.
>> >
>> > *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
>> > The graph module of Hama is definitely equal to GraphX of Spark.
>> >
>> > Regards,
>> > Behroz
>> >
>> > On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <edward.yoon@samsung.com>
>> > wrote:
>> >
>> >> Hi, here are a few thoughts.
>> >>
>> >> Apache Spark is definitely more suited for ML (iterative algorithms)
>> >> than legacy Hadoop due to its preservation of state and optimized
>> >> execution strategy (RDDs). However, their approaches are still in
>> >> synchronous iterative communication pattern.
>> >>
>> >> In Apache Hama case, it's a general-purpose pure BSP framework. While I
>> >> admit that synchronization costs are high, the communication can be more
>> >> efficiently realized with the message-passing BSP model. Moreover, BSP
>> >> can have virtual shared memory and many more benefits. In addition,
>> >> another one convincing point I think can be a utilization ability of
>> >> modern acceleration accessories such as InfiniBand and GPUs. I'm sure
>> >> that this feature will bring a completely new wave of big data. The
>> >> problem we faced is only a lack of interest to BSP programming model. :-)
>> >>
>> >> > 2) Do we have any recent benchmarks between the 2 systems?
>> >>
>> >> It's on my to-do list.
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >>
>> >> -----Original Message-----
>> >> From: Behroz Sikander [mailto:behroz89@gmail.com]
>> >> Sent: Thursday, June 25, 2015 12:57 AM
>> >> To: user@hama.apache.org
>> >> Subject: Hama vs Spark
>> >>
>> >> Hi,
>> >> A few days back, I started reading about Apache Spark. It is a pretty
>> >> good big data platform. But a question came to my mind: where does Hama
>> >> stand in comparison with Spark if we have to implement an iterative
>> >> algorithm that is compute-intensive (machine learning or optimization)?
>> >>
>> >> I found some resources online but none answers my questions.
>> >>
>> >> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
>> >> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
>> >> 3) I actually found the following benchmark but it is quite old.
>> >> http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
>> >>
>> >> Questions:
>> >> 1) Is there any specific advantage when we choose the BSP model instead
>> >> of the Spark paradigm?
>> >> 2) Do we have any recent benchmarks between the 2 systems?
>> >> 3) What is the main convincing point to use Hama over Spark?
>> >> 4) Any scientific paper that compares both systems? (I was not able to
>> >> find any)
>> >>
>> >> Regards,
>> >> Behroz Sikander
>> >>
>> >>
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>>



-- 
Best Regards, Edward J. Yoon
