cassandra-user mailing list archives

From "Durity, Sean R" <SEAN_R_DUR...@homedepot.com>
Subject RE: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra
Date Thu, 10 Jan 2019 20:11:11 GMT
RF in the Analytics DC can be 2 (or even 1) if storage cost is more important than availability.
There is a storage (and CPU and network latency) cost for a separate Spark cluster. So, the
variables of your specific use case may swing the decision in different directions.
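
For what it's worth, the per-DC RF is just a property of the keyspace (NetworkTopologyStrategy), so the analytics DC can carry a lower RF than the realtime DC, and keyspaces that don't list it are not replicated there at all. A rough sketch using the Python driver; the contact point, keyspace name, and DC names ("DC1", "Analytics") are placeholders, not from this thread:

    from cassandra.cluster import Cluster

    # Connect through any reachable node (placeholder address).
    cluster = Cluster(contact_points=["10.0.0.1"])
    session = cluster.connect()

    # Keep RF=3 in the realtime DC but only RF=2 (or 1) in the analytics DC,
    # trading availability there for storage.
    session.execute("""
        ALTER KEYSPACE my_keyspace
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': 3,
            'Analytics': 2
        }
    """)

    cluster.shutdown()

After adding the DC to a keyspace, the existing data still has to be streamed to it (e.g. nodetool rebuild on the new DC's nodes).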


Sean Durity
From: Dor Laor <dor@scylladb.com>
Sent: Wednesday, January 09, 2019 11:23 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R <SEAN_R_DURITY@homedepot.com> wrote:
I think you could consider option C: create a (new) analytics DC in Cassandra and run your
Spark nodes there. Then you can address the scaling just on that DC. You can also use fewer
vnodes, only replicate certain keyspaces, etc. in order to perform the analytics more efficiently.
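
To keep that analytics traffic inside the new DC, the Spark jobs can pin the connector to it. A sketch with PySpark and the Spark Cassandra Connector; the hosts, DC name, keyspace/table, and connector version are illustrative assumptions:

    from pyspark.sql import SparkSession

    # Assumes the connector is on the classpath, e.g. submitted with
    #   --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
    spark = (SparkSession.builder
             .appName("analytics-dc-job")
             # Contact points inside the analytics DC (placeholders)
             .config("spark.cassandra.connection.host", "10.0.1.1,10.0.1.2")
             # Ignore nodes outside this DC so Spark never reads from the realtime DC
             .config("spark.cassandra.connection.localDC", "Analytics")
             # LOCAL_* consistency keeps requests from crossing DCs
             .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="events")
          .load())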

But this way you duplicate the entire dataset RF times over, which is very expensive.
It is common practice to run Spark on a separate Cassandra (virtual) datacenter, but that is
done to isolate the analytics workload from the realtime workload and preserve its low-latency
guarantees.
We have addressed this problem elsewhere; it is beyond the scope of this thread.



Sean Durity

From: Dor Laor <dor@scylladb.com>
Sent: Friday, January 04, 2019 4:21 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

I strongly recommend option B, separate clusters (a minimal connection sketch follows the list below). Reasons:
 - Node-to-node networking overhead is negligible compared to the communication that stays within a node.
 - Different scaling considerations.
   Your workload may require 10 Spark nodes and 20 database nodes, so why bundle them?
   This ratio may also change over time as your application evolves and the amount of data changes.
 - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't want it to affect
Cassandra, and vice versa.
   If you instead isolate the two with cgroups, you may end up with too much idle capacity when
no spike is occurring.
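
For reference, a sketch of what option B looks like from the Spark side; the master URL, host names, keyspace/table/column names, and connector version are placeholders:

    from pyspark.sql import SparkSession

    # Submitted to an independent Spark cluster, e.g.:
    #   spark-submit --master spark://spark-master:7077 \
    #       --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 job.py
    # All the Spark side needs is network access to the Cassandra contact points.
    spark = (SparkSession.builder
             .appName("heavy-lifting")
             .config("spark.cassandra.connection.host",
                     "cassandra-1,cassandra-2,cassandra-3")
             .getOrCreate())

    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="my_keyspace", table="events")
              .load())

    # The scan is distributed across Spark executors; rows stream from
    # Cassandra over the network, which is the node-to-node cost noted above.
    events.groupBy("event_type").count().show()

Each side then scales on its own node count, which is the point about the 10:20 ratio above.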


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy <goutham.chirutha@gmail.com>
wrote:
Hi,
We have a requirement for heavy data lifting and analytics and have decided to go with
Apache Spark. In the process we have come up with two patterns:
a. Apache Spark and Apache Cassandra co-located and sharing the same nodes.
b. Apache Spark as one independent cluster and Apache Cassandra as another independent cluster.

We need a good pattern for how to use the analytics engine with Cassandra. Thanks in advance.

Regards
Goutham.

