spark-user mailing list archives

From Mohammed Guller <moham...@glassbeam.com>
Subject RE: Spark SQL queries hive table, real time ?
Date Tue, 07 Jul 2015 02:08:33 GMT
Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the data stored
(HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?

In theory, it is possible to use Spark SQL for real-time queries, but cost increases as the
data size grows. If you can store all of your data in memory, then you should be able to query
it in real-time ☺ At the other extreme, if Spark SQL has to read a terabyte of data from
spinning disk, there is no way it can respond in real-time. To be fair, no software can read
a terabyte of data from an HDD in real-time. Simple laws of physics. Either you will have to
spread out the reads over a large number of disks and read them in parallel, or you will
have to index the data so that your queries don’t have to read a terabyte of data from disk.
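
To put rough numbers on the point above, here is a minimal back-of-the-envelope sketch in Python. The 150 MB/s sequential throughput per HDD is an assumed, illustrative figure (not a measurement), and the function name scan_seconds is hypothetical. It computes only a lower bound on read time and ignores seek, network, and query-processing overhead.

```python
def scan_seconds(data_bytes, mb_per_sec_per_disk, num_disks=1):
    """Lower bound on the time to read data_bytes sequentially,
    assuming the data is split evenly across num_disks read in parallel."""
    bytes_per_sec = mb_per_sec_per_disk * 1_000_000 * num_disks
    return data_bytes / bytes_per_sec


ONE_TB = 10**12          # one terabyte, in bytes (decimal)
HDD_MB_PER_SEC = 150     # ASSUMPTION: illustrative sequential throughput of one HDD

# Compare a single disk with reads spread over more disks.
for disks in (1, 10, 100):
    seconds = scan_seconds(ONE_TB, HDD_MB_PER_SEC, disks)
    print(f"{disks:>3} disk(s): about {seconds:,.0f} s")
```

Under that assumption, a single disk needs nearly two hours for a full scan, and even 100 disks in parallel need about a minute. That is why the options are more parallelism, faster storage, or reading less data through indexing.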

Hope that helps.

Mohammed

From: Denny Lee [mailto:denny.g.lee@gmail.com]
Sent: Monday, July 6, 2015 4:21 AM
To: spierki; user@spark.apache.org
Subject: Re: Spark SQL queries hive table, real time ?

Within the context of your question, Spark SQL utilizing the Hive context is primarily about
very fast queries. If you want to use real-time queries, I would utilize Spark Streaming.
A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford
CME 323: Distributed Algorithms and Optimization<http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford>
and Recipes for Running Spark Streaming Applications in Production<https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/>
(from the recent Spark Summit 2015).

HTH!


On Mon, Jul 6, 2015 at 3:23 PM spierki <florian.spierckel@crisalid.com> wrote:
Hello,

I'm currently wondering about the performance of using Spark SQL with Hive
to do real-time analytics.
I know that Hive was created for batch processing, and that Spark is used to
do fast queries.

But will using Spark SQL with Hive allow me to do real-time queries? Or will it
just make queries faster, but not real-time?
Should I use another data store, like HBase?

Thanks in advance for your time and consideration,
Florian



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org