spark-user mailing list archives

From Paolo Platter <>
Subject R: Re: Spark (SQL) as OLAP engine
Date Wed, 04 Feb 2015 09:27:23 GMT

I have the same doubt about Spark SQL + Cassandra. Can the Thrift JDBC server handle multiple
queries from different users coming through Cognos?

In Cognos, users can perform free-hand queries, so basically you can run an aggregation
over any field in the record. How does that match Cassandra's row-key-driven queries? Do
you need to put a secondary index on each field?

Could Parquet be a better storage solution for OLAP queries? Can Spark query Parquet in a
columnar way without scanning whole rows?
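(For context, this is roughly what the Parquet case looks like in the Spark SQL API of that era. Parquet is a columnar format, so a query that touches two columns only reads those two columns from disk. The file path and column names below are made up for illustration.)

```scala
// Minimal sketch: columnar aggregation over Parquet with Spark SQL (1.2-era API).
// "events.parquet", "country" and "revenue" are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-olap"))
val sqlContext = new SQLContext(sc)

val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// Only the "country" and "revenue" columns are read from disk here,
// not whole rows -- that is the columnar advantage for OLAP scans.
val revenueByCountry = sqlContext.sql(
  "SELECT country, SUM(revenue) FROM events GROUP BY country")
revenueByCountry.collect().foreach(println)
```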



Sent from my Windows Phone
From: Kevin (Sangwoo) Kim<>
Sent: 04/02/2015 08:27
To:<>; Kevin (Sangwoo) Kim<>;
McNerlin, Andrew (Agoda)<>; Sean McNamara<>;
Adamantios Corais<>
Cc: user<>;<>
Subject: Re: Re: Spark (SQL) as OLAP engine

Hi Sun,

Sure, I'm a contributor of Zeppelin project and will gladly share some use cases.

Apache Zeppelin is a web-based notebook tool, like the well-known IPython notebook. Zeppelin
used to be Hive-driven, but recently, inspired by Databricks Cloud, it switched its engine to
Spark and Spark SQL. It just works really well and is easy to use.

So I use Zeppelin for:
1. Daily work environment.
I'm running 3 Zeppelin instances, and one is for my data exploration.
I can easily manage Spark code and re-run it.
I used to store my code in Evernote and paste it into the REPL; there's no need to do that anymore.
I can also easily use Zeppelin's visualization features like charts and tables; it's much better
than Evernote + the Spark shell.

2. Running batch jobs
I guess many data people package a Spark application and run it through cron or something similar,
but that way you will have a hard time when something goes wrong. Zeppelin has a scheduler
built in (controlled by a cron expression) that will automatically run a note's content, so it's
easy to run a batch job (no need to deploy) and easy to maintain the code.

3. Data sharing across the team
That's the screenshot I've already shared:
I've made a dashboard and update it periodically using the scheduler.
You can also inject HTML into a Zeppelin table, so you can have an image or link like this:

4. Creating instant reports
As part of 1 (daily work environment), it's convenient to create instant data reports and
review them with the team, compared to the Spark shell + Google Docs or PowerPoint.

And I believe Zeppelin + Spark has far more possibilities; I think it's worth a try!


On Wed Feb 04 2015 at 3:54:12 PM<> <<>>
Hey, kevin
That tool seems quite interesting. Could you share more use cases about that?



From: Kevin (Sangwoo) Kim<>
Date: 2015-02-04 14:13
To: McNerlin, Andrew (Agoda)<>; Sean McNamara<>;
Adamantios Corais<>
Subject: Re: Spark (SQL) as OLAP engine
Hi, experts,

I had a similar use case and resolved the problem with a notebook tool called Zeppelin (

Zeppelin draws charts from Spark SQL results and caches the results.
Zeppelin also has a scheduler inside,
so I persist aggregated results into S3 and create a lot of queries in Zeppelin that run
once a day.
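(A minimal sketch of that daily aggregate-then-serve pattern, using the Spark SQL API of the time; the S3 paths, table names, and columns are invented for illustration, and an existing SparkContext `sc` is assumed:)

```scala
// Sketch: aggregate raw data once a day, persist the small result to
// S3, then serve dashboard queries from the cached aggregate.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Daily scheduled job: aggregate the raw data for the day
val raw = sqlContext.parquetFile("s3n://my-bucket/raw/2015-02-04/")
raw.registerTempTable("raw_events")
val daily = sqlContext.sql(
  "SELECT page, COUNT(*) AS views FROM raw_events GROUP BY page")
daily.saveAsParquetFile("s3n://my-bucket/agg/2015-02-04/")

// Dashboard queries hit only the small aggregated table, kept in memory
val agg = sqlContext.parquetFile("s3n://my-bucket/agg/2015-02-04/")
agg.registerTempTable("daily_views")
agg.cache()
sqlContext.sql("SELECT page, views FROM daily_views ORDER BY views DESC LIMIT 10")
```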

I'm attaching a screenshot of my dashboard (with numbers and some labels cut out)

in view-only mode:
in editable mode:

+1) There is a limitation when you have to run various queries with various filters each time,
because in Zeppelin a session is shared across all users.


On Wed Feb 04 2015 at 2:37:38 PM McNerlin, Andrew (Agoda) <<>>
Hi Sean,

I'm interested in trying something similar.  How was your performance when you had many concurrent
queries running against Spark?  I know this will work well where you have a low volume of
queries against a large dataset, but I am concerned about having a high volume of queries against
the same large dataset. (I know I've not defined "large", but hopefully you get the gist :))

I'm using Cassandra to handle workloads where we have large volumes of low-complexity queries,
but want to move to an architecture which supports a similar(ish) large volume of higher-complexity
queries.  I'd like to use Spark as the query-serving layer, but have concerns about how many
concurrent queries it could handle.

I'd be interested in anyone's thoughts or experience with this.


From: Sean McNamara <<<>>>
Date: Wednesday, February 4, 2015 at 1:01
To: Adamantios Corais <<><<>>>
Cc: "<><<>>"
Subject: Re: Spark (SQL) as OLAP engine

We have gone down a similar path at Webtrends; Spark has worked amazingly well for us in this
use case.  Our solution goes from REST directly into Spark, and back out to the UI instantly.

Here is the resulting product in case you are curious (and please pardon the self promotion):

> How can I automatically cache the data once a day...

If you are not memory-bound you could easily cache the daily results for some span of time
and re-union them each time you add new data.  You would service queries off the
unioned RDD.
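(A rough sketch of that cache-and-re-union pattern; the `Event` schema and the `loadDay` loader are hypothetical, and an existing SparkContext `sc` is assumed:)

```scala
// Keep one cached RDD per day; rebuild the serving RDD as a union
// of all cached days whenever a new day of data arrives.
import org.apache.spark.rdd.RDD

case class Event(day: String, user: String, revenue: Double)

// loadDay is a placeholder for however you load one day's data
def loadDay(day: String): RDD[Event] = sc.textFile(s"/data/$day").map { line =>
  val Array(d, u, r) = line.split(",")
  Event(d, u, r.toDouble)
}

var dailyRdds: Seq[RDD[Event]] = Seq.empty
var serving: RDD[Event] = sc.emptyRDD[Event]

def addDay(day: String): Unit = {
  dailyRdds = dailyRdds :+ loadDay(day).cache() // each day stays in memory
  serving = sc.union(dailyRdds)                 // queries run against this
}
```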

> ... and make them available on a web service

From the unioned RDD you could always step into Spark SQL at that point.  Or you could use
a simple scatter/gather pattern for this.  As with all things Spark, this is super easy to
do: just use aggregate()()!
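(For instance, here is a minimal scatter/gather over an RDD of some metric with `aggregate()()`; the RDD contents are invented, and an existing SparkContext `sc` is assumed:)

```scala
// Scatter: each partition folds its elements into a partial (count, sum).
// Gather: the driver combines the per-partition partials.
val metrics = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

val (count, sum) = metrics.aggregate((0L, 0.0))(
  (acc, v) => (acc._1 + 1, acc._2 + v),  // fold one element into an accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2) // merge two partition accumulators
)
val mean = if (count == 0) 0.0 else sum / count
```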



On Feb 3, 2015, at 9:59 AM, Adamantios Corais <<><<>>>


After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine.
My goal is to push aggregated data (to Cassandra or other low-latency data storage) and then
be able to project the results on a web page (web service). New data will be added (aggregated)
once a day, only. On the other hand, the web service must be able to run some fixed(?) queries
(either on Spark or Spark SQL) at any time and plot the results with D3.js. Note that I can
already achieve similar speeds while in REPL mode by caching the data. Therefore, I believe
that my problem must be re-phrased as follows: "How can I automatically cache the data once
a day and make them available on a web service that is capable of running any Spark or Spark
(SQL) statement in order to plot the results with D3.js?"

Note that I already have some experience with Spark (+ Spark SQL) as well as D3.js, but none at
all with OLAP engines (at least in their traditional form).

Any ideas or suggestions?

// Adamantios



2 attachments:

Screen Shot 2015-02-04 at 3.07.04 PM.png(98K)

Screen Shot 2015-02-04 at 3.11.11 PM.png(84K)