cassandra-user mailing list archives

From Tobias Eriksson <>
Subject Re: Cassandra & Spark
Date Thu, 08 Jun 2017 13:37:00 GMT
Something to consider before moving to Apache Spark and Cassandra
I have a background where we have tons of data in Cassandra, and we wanted to use Apache Spark
to run various jobs.
We loved what we could do with Spark, BUT….
We soon realized that we wanted to run multiple jobs in parallel.
Some jobs would take 30 minutes and some 45 seconds.
By default, Spark is arranged so that it will take up all the resources there are; this can
be tweaked by using Mesos or YARN.
But even with Mesos and YARN we found it complicated to run multiple jobs in parallel.
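For what it's worth, capping a single job's footprint on YARN can be sketched roughly as below; the queue name, sizes, and script name are illustrative, not from this thread:

```shell
# Hypothetical spark-submit invocation: pin a job to a fixed footprint on YARN
# instead of letting it grab the whole cluster. Queue and sizes are made up.
spark-submit \
  --master yarn \
  --queue analytics \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.dynamicAllocation.enabled=false \
  my_job.py
```

On a standalone Spark cluster, `spark.cores.max` plays a similar role; either way, several such capped jobs can then run side by side.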
So eventually we ended up throwing out Spark.
Instead we transferred the data to Apache Kudu and ran our analysis on Kudu, and
what a difference!
My two cents!

From: 한 승호 <>
Date: Thursday, 8 June 2017 at 10:25
To: "" <>
Subject: Cassandra & Spark


I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.

My company is currently considering replacing an RDBMS-based system with Cassandra and Hadoop.
The purpose of this system is to analyze Cassandra and HDFS data with Spark.

It seems many use cases put emphasis on data locality; for instance, Cassandra and the Spark
executors should be on the same node.

The thing is, my company's data analyst team wants to analyze heterogeneous data sources, Cassandra
and HDFS, using Spark.
So I wonder what the best practice would be for using Cassandra and Hadoop in such a case.
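On the analyst side, Spark can read both sources in one job through the DataFrame API. A minimal sketch, assuming PySpark with the spark-cassandra-connector package on the classpath, and hypothetical keyspace, table, host, and path names:

```python
# Sketch only: assumes pyspark plus the spark-cassandra-connector package.
# Keyspace, table, contact point, and HDFS path below are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-plus-hdfs")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read a table from Cassandra via the connector's data source
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="my_keyspace", table="users")
         .load())

# Read a dataset from HDFS (Parquet here, but any supported format works)
events = spark.read.parquet("hdfs:///data/events")

# Join the heterogeneous sources in a single query
joined = users.join(events, on="user_id")
```

Note that data locality then only matters for the Cassandra read path (executor co-located with the replica) and the HDFS read path (executor co-located with the DataNode) independently, which is essentially what distinguishes the two plans below.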

Plan A: Both HDFS and Cassandra, with the NodeManager (Spark executor), on the same node

Plan B: Cassandra + NodeManager on some nodes and HDFS + NodeManager on other nodes, separately, but in the same
cluster
Which would be better or more correct, or is there a better way?

I appreciate your advice in advance :)

Best Regards,
Seung-Ho Han

Sent from Mail<> for Windows 10
