hadoop-user mailing list archives

From "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefi...@hotmail.com>
Subject Big Data tech stack (was Spark vs. Storm)
Date Wed, 02 Jul 2014 20:25:05 GMT
You know what I’m really trying to do? I’m trying to come up with a best-practice technology
stack. There are so many freaking projects it is overwhelming. If I were to walk into an organization
that had no Big Data capability, what mix of projects would be best to implement based on
performance, scalability and ease of use/implementation? So far I’ve got:
Ubuntu
Hadoop
Cassandra (Seems to be the highest performing NoSQL database out there.)
Storm (maybe?)
Python (Easier than Java. Maybe that shouldn’t be a concern.)
Hive (For people to leverage their existing SQL skillset.)

That would seem to cover transaction processing and warehouse storage and the capability to
do batch and real-time analysis. What am I leaving out, or which of my assumptions are incorrect?

B.



From: Stephen Boesch 
Sent: Wednesday, July 02, 2014 3:07 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs. Storm

Spark Streaming discretizes the stream by configurable intervals of no less than 500 milliseconds.
Therefore it is not appropriate for true real-time processing. So if you need to capture events
in the low hundreds of milliseconds range or less, then stick with Storm (at least for now).

If you can afford one second or more of latency, then Spark provides the advantage of interoperability
with the other Spark components and capabilities.
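
To make the latency point concrete, here is a minimal sketch in plain Python (no Spark or Storm APIs involved). It assumes a simplified model in which an event is held until the batch window containing it closes; the function name `micro_batch_latency` and the sample timestamps are made up for illustration, and real systems add scheduling and processing time on top of this wait.

```python
def micro_batch_latency(event_times_ms, interval_ms):
    """Return (event_time, batch_close_time, wait) for each event.

    Simplified model: an event arriving at time t is held until its
    batch window closes, i.e. the next multiple of interval_ms after t.
    """
    results = []
    for t in event_times_ms:
        batch_close = (t // interval_ms + 1) * interval_ms
        results.append((t, batch_close, batch_close - t))
    return results


# Hypothetical arrival times in milliseconds, with a 500 ms batch interval.
events = [20, 130, 499, 510, 990]
waits = [wait for _, _, wait in micro_batch_latency(events, 500)]
print(waits)       # wait added by batching, per event
print(max(waits))  # worst case approaches the full interval
```

Under this model the wait depends on where in the window an event lands, so the worst case is close to one full interval regardless of how fast the processing itself is.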



2014-07-02 12:59 GMT-07:00 Shahab Yunus <shahab.yunus@gmail.com>:

  Not exactly. There are of course major implementation differences, and then some subtle
and high-level ones too.

  My 2-cents:


  Spark is in-memory M/R, and it simulates streaming or real-time distributed processing for large
datasets by micro-batching. The gain in speed and performance over the batch paradigm
comes from in-memory buffering or batching (and I am being a bit naive/crude in this explanation).

  Storm, on the other hand, supports stream processing even at the single-record level (a record is
known as a tuple in its lingo). You can do micro-batching on top of it as well (using the Trident API,
which is good for state maintenance too, if your business logic requires that). This is more applicable
where you want control down to the single-record level rather than over a set, collection or batch of records.
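
The difference between the two models can be sketched in plain Python generators. This is only an illustration of the idea, not Storm's or Spark's API: the names `per_record` and `micro_batched` are hypothetical, and batches here are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
def per_record(stream, handle):
    """Emit one result per incoming record, as soon as it arrives."""
    for record in stream:
        yield handle(record)


def micro_batched(stream, batch_size, handle_batch):
    """Buffer records and emit one result per full batch.

    Count-based batching is an assumption made for brevity.
    """
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        yield handle_batch(batch)


readings = [3, 5, 8, 1, 9]  # made-up sample records
per_record_out = list(per_record(readings, lambda r: r * 2))
batched_out = list(micro_batched(readings, 2, sum))
print(per_record_out)
print(batched_out)
```

In the per-record version the handler sees each record on its own, so it can react immediately; in the micro-batched version nothing is emitted until a batch is complete, which is the source of the extra latency.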

  Having said that, Spark Streaming is trying to simulate Storm's extremely granular approach,
but as far as I recall, it is still built on top of core Spark (basically another level of
abstraction over core Spark constructs).

  So given this, you can pick the framework which is more attuned to your needs.



  On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
wrote:

    Do these two projects do essentially the same thing? Is one better than the other?

