hadoop-common-user mailing list archives

From Stephen Boesch <java...@gmail.com>
Subject Re: Big Data tech stack (was Spark vs. Storm)
Date Wed, 02 Jul 2014 21:23:47 GMT
You will not arrive at a generic stack without oversimplifying to the
point of serious deficiencies. There are, as you say, a multitude of
options. You are attempting to boil them down to "A vs. B" as opposed to
"A may work better under the following conditions ..."

2014-07-02 13:25 GMT-07:00 Adaryl "Bob" Wakefield, MBA <

>   You know what I’m really trying to do? I’m trying to come up with a
> best practice technology stack. There are so many freaking projects it is
> overwhelming. If I were to walk into an organization that had no Big Data
> capability, what mix of projects would be best to implement based on
> performance, scalability and ease of use/implementation? So far I’ve got:
> Ubuntu
> Hadoop
> Cassandra (Seems to be the highest performing NoSQL database out there.)
> Storm (maybe?)
> Python (Easier than Java. Maybe that shouldn’t be a concern.)
> Hive (For people to leverage their existing SQL skillset.)
> That would seem to cover transaction processing and warehouse storage and
> the capability to do batch and real time analysis. What am I leaving out or
> what do I have incorrect in my assumptions?
> B.
>  *From:* Stephen Boesch <javadba@gmail.com>
> *Sent:* Wednesday, July 02, 2014 3:07 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs. Storm
>  Spark Streaming discretizes the stream by configurable intervals of no
> less than 500 milliseconds. Therefore it is not appropriate for true real
> time processing. So if you need to capture events in the low 100's of
> milliseconds range or less, then stick with Storm (at least for now).
> If you can afford one second+ of latency then Spark provides advantages of
> interoperability with the other Spark components and capabilities.
> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <shahab.yunus@gmail.com>:
>> Not exactly. There are of course major implementation differences and
>> then some subtle and high level ones too.
>> My 2-cents:
>> Spark is in-memory M/R and it simulates streaming or real-time
>> distributed processing for large datasets by micro-batching. The gain in
>> speed and performance as opposed to the batch paradigm comes from
>> in-memory buffering or batching (and I am being a bit naive/crude in
>> explanation here.)
>> Storm, on the other hand, supports stream processing even at a single
>> record level (known as a tuple in its lingo.) You can do micro-batching on
>> top of it as well (using the Trident API, which is good for state
>> maintenance too, if your business logic requires that). This is more
>> applicable where you want control at the single record level rather than
>> over a set, collection or batch of records.
>> Having said that, Spark Streaming is trying to simulate Storm's extremely
>> granular approach but as far as I recall, it still is built on top of core
>> Spark (basically another level of abstraction over core Spark constructs.)
>> So given this, you can pick the framework which is more attuned to your
>> needs.
>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>>   Do these two projects do essentially the same thing? Is one better
>>> than the other?
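
The difference the thread draws between per-tuple processing and
micro-batching can be illustrated with a toy latency model. This is only a
sketch under stated assumptions: it uses neither the Storm nor the Spark
Streaming API, the function names are hypothetical, and it counts only the
time an event waits for its batch interval to close (no processing or
network time).

```python
from typing import List


def per_record_wait_ms(event_times_ms: List[int]) -> List[int]:
    """Per-tuple model: each record is handled on arrival, so no batching wait."""
    return [0 for _ in event_times_ms]


def micro_batch_wait_ms(event_times_ms: List[int], batch_interval_ms: int) -> List[int]:
    """Micro-batch model: each record waits until its interval closes.

    An event arriving at time t falls into the interval
    [k * batch_interval_ms, (k + 1) * batch_interval_ms) and is only
    processed at the end of that interval.
    """
    waits = []
    for t in event_times_ms:
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        waits.append(batch_end - t)
    return waits


if __name__ == "__main__":
    # Hypothetical arrival times in milliseconds.
    events = [0, 120, 499, 500, 730]
    print("per-record waits:", per_record_wait_ms(events))
    print("micro-batch waits (500 ms):", micro_batch_wait_ms(events, 500))
```

In this model the added wait under micro-batching is bounded by the batch
interval, which is why an interval floor of about 500 ms rules out use cases
that need responses in the low hundreds of milliseconds.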
