Subject: Re: Big Data tech stack (was Spark vs. Storm)
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Thu, 3 Jul 2014 07:48:08 +0200

I will second Stephen. At best you will arrive at a point where you can
tell "I don't care about your problems; here is the solution". Even though
that sounds attractive if you are paid to set up the solution, it is really
not the position a client would want you to hold.

Bertrand Dechoux


On Thu, Jul 3, 2014 at 1:10 AM, Gavin Yue wrote:

> Isn't this what YARN and Mesos are trying to do? Separate resource
> management from the applications, and run whatever is suitable on top of
> them. Spark can also run on YARN or Mesos. Spark was designed for
> iteration-intensive computing such as machine learning algorithms.
>
> Storm is quite different. It is not designed for big data stored on
> disk; it was inspired by streaming data such as tweets. On the other
> side, MapReduce/HDFS was initially designed to handle stored web pages
> to build an index.
>
> Hadoop is on its way to becoming a generic Big Data analysis framework.
> Hortonworks and Cloudera are trying to make it much easier to manage and
> deploy.
>
>
> On Wed, Jul 2, 2014 at 4:25 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>> You know what I'm really trying to do? I'm trying to come up with a
>> best-practice technology stack. There are so many freaking projects it
>> is overwhelming. If I were to walk into an organization that had no Big
>> Data capability, what mix of projects would be best to implement based
>> on performance, scalability, and ease of use/implementation? So far
>> I've got:
>> Ubuntu
>> Hadoop
>> Cassandra (seems to be the highest-performing NoSQL database out there)
>> Storm (maybe?)
>> Python (easier than Java; maybe that shouldn't be a concern)
>> Hive (for people to leverage their existing SQL skill set)
>>
>> That would seem to cover transaction processing and warehouse storage,
>> and the capability to do batch and real-time analysis. What am I
>> leaving out, or what do I have incorrect in my assumptions?
>>
>> B.
>>
>>
>> *From:* Stephen Boesch
>> *Sent:* Wednesday, July 02, 2014 3:07 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Spark vs. Storm
>>
>> Spark Streaming discretizes the stream into configurable intervals of
>> no less than 500 milliseconds, so it is not appropriate for true
>> real-time processing. If you need to capture events in the low hundreds
>> of milliseconds or less, stick with Storm (at least for now).
>>
>> If you can afford a second or more of latency, then Spark provides the
>> advantage of interoperability with the other Spark components and
>> capabilities.
>>
>>
>> 2014-07-02 12:59 GMT-07:00 Shahab Yunus:
>>
>>> Not exactly. There are of course major implementation differences, and
>>> then some subtle and high-level ones too.
>>>
>>> My 2 cents:
>>>
>>> Spark is in-memory MapReduce, and it simulates streaming or real-time
>>> distributed processing of large datasets by micro-batching. The gain
>>> in speed and performance over the batch paradigm comes from in-memory
>>> buffering or batching (and I am being a bit naive/crude in this
>>> explanation).
>>>
>>> Storm, on the other hand, supports stream processing down to the
>>> single-record level (a record is known as a tuple in its lingo). You
>>> can do micro-batching on top of it as well (using the Trident API,
>>> which is also good for state maintenance, if your business logic
>>> requires that). This is more applicable where you want control at the
>>> single-record level rather than over a set, collection, or batch of
>>> records.
>>>
>>> Having said that, Spark Streaming is trying to approximate Storm's
>>> extremely granular approach, but as far as I recall, it is still built
>>> on top of core Spark (basically another level of abstraction over core
>>> Spark constructs).
>>>
>>> So given this, you can pick the framework that is more attuned to your
>>> needs.
>>>
>>>
>>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>> Do these two projects do essentially the same thing? Is one better
>>>> than the other?
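The micro-batch vs per-tuple distinction discussed in this thread can be
sketched in a few lines of plain Python. This is a toy illustration, not
Spark or Storm code; all names (`micro_batch`, `per_record`, the batch size)
are made up for the sketch. The point is only the shape of the two models:
Spark Streaming groups records into fixed intervals and processes each group
as one unit, while Storm hands each record (tuple) to a handler as it arrives.

```python
def micro_batch(stream, batch_size):
    """Group incoming records into fixed-size batches before processing,
    trading latency for throughput -- the Spark Streaming model (which
    batches by time interval rather than by count)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield list(batch)   # one batch = one unit of work
            batch.clear()
    if batch:                   # flush the final partial batch
        yield list(batch)


def per_record(stream, handler):
    """Hand each record (a 'tuple' in Storm lingo) to the handler as soon
    as it arrives -- lowest latency, no batching floor."""
    for record in stream:
        handler(record)


events = ["tweet-%d" % i for i in range(5)]

# Micro-batching: 5 events become 3 units of work (2 + 2 + 1).
batches = list(micro_batch(events, batch_size=2))

# Per-record: 5 events become 5 units of work.
handled = []
per_record(events, handled.append)
```

The same contrast explains why Trident (micro-batching layered on Storm) and
Spark Streaming end up looking similar: both trade per-record control for
batch-level throughput and easier state handling.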
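Stephen's latency argument has a simple back-of-the-envelope form: with a
micro-batch interval of 500 ms, a record that arrives just after a window
opens waits almost the full interval before its batch is even handed off, so
end-to-end latency can never drop below the interval. The function below is a
toy model of that floor (the names and the offset arithmetic are assumptions
for the sketch, not Spark internals).

```python
def worst_case_wait_ms(batch_interval_ms, arrival_offset_ms):
    """Milliseconds a record waits until its batch window closes, given
    its arrival offset within the current window."""
    return batch_interval_ms - (arrival_offset_ms % batch_interval_ms)


# A record arriving 1 ms into a 500 ms window waits 499 ms before its batch
# can even be scheduled; a per-tuple system like Storm has no such floor.
wait = worst_case_wait_ms(500, 1)
```

This is why the thread's rule of thumb holds: if you can tolerate a second or
so of latency, the batch interval disappears into the noise; if you need low
hundreds of milliseconds or less, the interval itself is the bottleneck.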