cassandra-user mailing list archives

From karthik prasad <karthik.prasad4...@gmail.com>
Subject Re: Spark and intermediate results
Date Fri, 09 Oct 2015 18:56:00 GMT
Spark uses this connector to read data from Cassandra and create RDDs or
DataFrames in its workspace (in memory or on disk, depending on the Spark
configuration). Transformations or queries are then applied to the RDDs or
DataFrames, and the end results are written back to Cassandra through the
connector.
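
That read/transform/write-back cycle can be sketched in Scala with the
DataStax connector roughly like this (a minimal, untested sketch; the
keyspace `test`, the tables `words_in`/`words_out`, their columns, and the
contact point are hypothetical names for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra

// Point Spark at the Cassandra cluster (hypothetical contact point).
val conf = new SparkConf()
  .setAppName("cassandra-roundtrip")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read CF A into an RDD of CassandraRow, apply a transformation,
// and write the result to CF B -- no shared filesystem involved.
val rows = sc.cassandraTable("test", "words_in")
val doubled = rows.map(r => (r.getString("word"), r.getInt("count") * 2))
doubled.saveToCassandra("test", "words_out", SomeColumns("word", "count"))
```

The connector partitions the reads by Cassandra token range, so each Spark
task can be scheduled near the replicas that own its data.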

Note: If you just want to read/write Cassandra data from Spark, you can
try Kundera's Spark-Cassandra module
<https://github.com/impetus-opensource/Kundera/wiki/Spark-Cassandra-Module>.
Kundera exposes the operations through a JPA-style API and helps with quick
development.

-Karthik

On Fri, Oct 9, 2015 at 8:09 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
mvallemilita@bloomberg.net> wrote:

> I know the connector, but having the connector only means it will take
> *input* data from Cassandra, right? What about intermediate results?
> If it stores intermediate results in Cassandra, could you please clarify
> how data locality is handled? Will it store them in another keyspace?
> I could not find any doc about it...
>
> From: user@cassandra.apache.org
> Subject: Re: Spark and intermediate results
>
> You can run spark against your Cassandra data directly without using a
> shared filesystem.
>
> https://github.com/datastax/spark-cassandra-connector
>
>
> On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemilita@bloomberg.net> wrote:
>
>> Hello,
>>
>> I saw this nice link from an event:
>>
>>
>> http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D
>>
>> I would like to test using Spark to perform some operations on a column
>> family; my objective is to read from CF A and write the output of my M/R
>> job to CF B.
>>
>> That said, I've read this from Spark's FAQ (
>> http://spark.apache.org/faq.html):
>>
>> "Do I need Hadoop to run Spark?
>> No, but if you run on a cluster, you will need some form of shared file
>> system (for example, NFS mounted at the same path on each node). If you
>> have this type of filesystem, you can just deploy Spark in standalone mode.
>> "
>>
>> The question I ask is: if I don't want to have an HDFS installation just
>> to run Spark on Cassandra, is my only option to have this NFS mounted over
>> the network?
>> It doesn't seem smart to me to use something like NFS to store Spark
>> files, as it would probably hurt performance, and at the same time I
>> wouldn't like to run an additional HDFS cluster just to run jobs on
>> Cassandra.
>> Is there a way of using Cassandra itself as this "some form of shared
>> file system"?
>>
>> -Marcelo
>>
>>
>> << ideas don't deserve respect >>
>>
>
>
>
> << ideas don't deserve respect >>
>
