Date: Sat, 12 Mar 2016 17:58:26 +0700
Subject: Re: Correct way to use spark streaming with apache zeppelin
From: trung kien
To: Chris Miller
Cc: user@spark.apache.org
Thanks Chris and Mich for replying.

Sorry for not explaining my problem clearly. Yes, I am talking about a flexible dashboard when I mention Zeppelin.

Here is the problem I am having:

I am running a commercial website where we sell many products and we have branches in many places. We have a lot of realtime transactions and want to analyze them in realtime.

We don't want to aggregate every single transaction each time we do analytics (each transaction has BranchID, ProductID, Qty, Price). So we maintain intermediate data which contains: BranchID, ProductID, totalQty, totalDollar.

Ideally, we have 2 tables:
   Transaction (BranchID, ProductID, Qty, Price, Timestamp)

And an intermediate table, Stats, which is just the sum of every transaction grouped by BranchID and ProductID (I am using Spark Streaming to calculate this table in realtime).

My thinking is that doing statistics (a realtime dashboard) on the Stats table is much easier, and this table is also small enough to maintain.

I'm just wondering, what's the best way to store the Stats table (a database or a Parquet file?)


What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean "realtime analytics" -- do you mean build a report or dashboard that automatically updates as new data comes in?

--
Chris Miller

On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:
> Hi all,
>
> I've just viewed some of Zeppelin's videos. The integration between
> Zeppelin and Spark is really amazing and I want to use it for my
> application.
>
> In my app, I will have a Spark Streaming app to do some basic realtime
> aggregation (intermediate data). Then I want to use Zeppelin to do some
> realtime analytics on the intermediate data.
>
> My question is: what's the most efficient storage engine to store realtime
> intermediate data? Is a Parquet file somewhere suitable?
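For illustration, the running Stats aggregation described above (sum of Qty and dollars per BranchID/ProductID) can be sketched in plain Python, standing in for what a stateful Spark Streaming job would compute per micro-batch. All names here are illustrative, not part of any Spark API:

```python
from collections import defaultdict

# Running Stats table: (BranchID, ProductID) -> [totalQty, totalDollar]
stats = defaultdict(lambda: [0, 0.0])

def update_stats(transactions):
    """Fold one micro-batch of transactions into the running Stats table.

    Each transaction is a (branch_id, product_id, qty, price) tuple,
    mirroring the Transaction table described above.
    """
    for branch_id, product_id, qty, price in transactions:
        key = (branch_id, product_id)
        stats[key][0] += qty
        stats[key][1] += qty * price

# One example micro-batch
batch = [
    ("B1", "P1", 2, 10.0),
    ("B1", "P1", 1, 10.0),
    ("B2", "P1", 5, 9.5),
]
update_stats(batch)

print(stats[("B1", "P1")])  # [3, 30.0]
print(stats[("B2", "P1")])  # [5, 47.5]
```

The point is that the dashboard only ever reads the small keyed Stats table, never the raw transaction stream.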
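On the "database" option for the Stats table: since the table is keyed by (BranchID, ProductID), each micro-batch can be merged in with an upsert. A minimal sketch using SQLite (chosen only because it needs no server; a real deployment would use whatever database you pick, and the table/column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stats (
        branch_id    TEXT,
        product_id   TEXT,
        total_qty    INTEGER,
        total_dollar REAL,
        PRIMARY KEY (branch_id, product_id)
    )
""")

def upsert_stats(conn, branch_id, product_id, qty, dollar):
    # Add this micro-batch's partial sums into the running totals for the key.
    conn.execute("""
        INSERT INTO stats (branch_id, product_id, total_qty, total_dollar)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (branch_id, product_id) DO UPDATE SET
            total_qty    = total_qty    + excluded.total_qty,
            total_dollar = total_dollar + excluded.total_dollar
    """, (branch_id, product_id, qty, dollar))

upsert_stats(conn, "B1", "P1", 3, 30.0)
upsert_stats(conn, "B1", "P1", 2, 19.0)
row = conn.execute(
    "SELECT total_qty, total_dollar FROM stats"
    " WHERE branch_id = ? AND product_id = ?",
    ("B1", "P1"),
).fetchone()
print(row)  # (5, 49.0)
```

Note that ON CONFLICT upserts require SQLite 3.24 or newer. A Parquet file would instead be rewritten (or appended as new partitions) per batch, which is why a keyed store tends to be the simpler fit for a continuously updated dashboard table.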