Date: Sat, 12 Mar 2016 17:58:26 +0700
Subject: Re: Correct way to use spark streaming with apache zeppelin
From: trung kien
To: Chris Miller
Cc: user@spark.apache.org
Thanks Chris and Mich for replying.

Sorry for not explaining my problem clearly. Yes, I am talking about a flexible dashboard when I mention Zeppelin.

Here is the problem I am having:

I am running a commercial website where we sell many products and we have branches in many places. We have a lot of realtime transactions and want to analyze them in realtime.

We don't want to aggregate every single transaction each time we do analytics (each transaction has BranchID, ProductID, Qty, Price). So we maintain intermediate data which contains: BranchID, ProductID, totalQty, totalDollar.

Ideally, we have 2 tables:
   Transaction (BranchID, ProductID, Qty, Price, Timestamp)

And an intermediate table, Stats, which is just the sum of every transaction grouped by BranchID and ProductID (I am using Spark Streaming to calculate this table in realtime).

My thinking is that doing statistics (a realtime dashboard) on the Stats table is much easier, and this table is also small enough to maintain.

I'm just wondering, what's the best way to store the Stats table (a database or a Parquet file?)


What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean "realtime analytics" -- do you mean build a report or dashboard that automatically updates as new data comes in?

--
Chris Miller

On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:
> Hi all,
>
> I've just viewed some of Zeppelin's videos. The integration between
> Zeppelin and Spark is really amazing and I want to use it for my
> application.
>
> In my app, I will have a Spark Streaming app to do some basic realtime
> aggregation (intermediate data). Then I want to use Zeppelin to do some
> realtime analytics on the intermediate data.
>
> My question is: what's the most efficient storage engine to store realtime
> intermediate data? Is a Parquet file somewhere suitable?
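For illustration, the running Stats aggregation described above (sum of Qty and dollars per BranchID/ProductID) can be sketched in plain Python, standing in for what a stateful Spark Streaming job would compute per micro-batch. All names here are illustrative, not part of any Spark API:

```python
from collections import defaultdict

# Running Stats table: (BranchID, ProductID) -> [totalQty, totalDollar]
stats = defaultdict(lambda: [0, 0.0])

def update_stats(transactions):
    """Fold one micro-batch of transactions into the running Stats table.

    Each transaction is a (branch_id, product_id, qty, price) tuple,
    mirroring the Transaction table described above.
    """
    for branch_id, product_id, qty, price in transactions:
        key = (branch_id, product_id)
        stats[key][0] += qty
        stats[key][1] += qty * price

# One example micro-batch
batch = [
    ("B1", "P1", 2, 10.0),
    ("B1", "P1", 1, 10.0),
    ("B2", "P1", 5, 9.5),
]
update_stats(batch)

print(stats[("B1", "P1")])  # [3, 30.0]
print(stats[("B2", "P1")])  # [5, 47.5]
```

The point is that the dashboard only ever reads the small keyed Stats table, never the raw transaction stream.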
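On the "database" option for the Stats table: since the table is keyed by (BranchID, ProductID), each micro-batch can be merged in with an upsert. A minimal sketch using SQLite (chosen only because it needs no server; a real deployment would use whatever database you pick, and the table/column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stats (
        branch_id    TEXT,
        product_id   TEXT,
        total_qty    INTEGER,
        total_dollar REAL,
        PRIMARY KEY (branch_id, product_id)
    )
""")

def upsert_stats(conn, branch_id, product_id, qty, dollar):
    # Add this micro-batch's partial sums into the running totals for the key.
    conn.execute("""
        INSERT INTO stats (branch_id, product_id, total_qty, total_dollar)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (branch_id, product_id) DO UPDATE SET
            total_qty    = total_qty    + excluded.total_qty,
            total_dollar = total_dollar + excluded.total_dollar
    """, (branch_id, product_id, qty, dollar))

upsert_stats(conn, "B1", "P1", 3, 30.0)
upsert_stats(conn, "B1", "P1", 2, 19.0)
row = conn.execute(
    "SELECT total_qty, total_dollar FROM stats"
    " WHERE branch_id = ? AND product_id = ?",
    ("B1", "P1"),
).fetchone()
print(row)  # (5, 49.0)
```

Note that ON CONFLICT upserts require SQLite 3.24 or newer. A Parquet file would instead be rewritten (or appended as new partitions) per batch, which is why a keyed store tends to be the simpler fit for a continuously updated dashboard table.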