Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of gsterg@gmail.com designates
 209.85.216.171 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABNXB2CmCWpejjN07dxY9A8fKnHN_=_i9grz3EVpnpnMSkNdwA@mail.gmail.com>
References: 
 <CALKyeu-bX5Hfh-A3FRHQsxEkQ1TM7DpCZTixb-gn78ABg8W42Q@mail.gmail.com>
	<CAEDUwd3J4RvhE=gAqwEzAbJSaveJPWs20iDp9H=BjfMmCWTL6A@mail.gmail.com>
	<CALKyeu_NXYw7rxyDapT3xd2Y7XQ7v82r4N0xzfTMMTCj9JE=Aw@mail.gmail.com>
	<CABNXB2CmCWpejjN07dxY9A8fKnHN_=_i9grz3EVpnpnMSkNdwA@mail.gmail.com>
Date: Tue, 16 Sep 2014 08:32:50 -0400
Message-ID: 
 <CAO-Q4e4_AA44jap2bPXAMRrWeQoDJba_5O1_wh_yDBOt9qUyTQ@mail.gmail.com>
Subject: Re: Cassandra, vnodes, and spark
From: George Stergiou <gsterg@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a11c0b36a21e10205032df3ea

--001a11c0b36a21e10205032df3ea
Content-Type: text/plain; charset=UTF-8

I ran into this performance report:

https://github.com/datastax/spark-cassandra-connector/issues/200

Does the Spark connector, in its current state, issue one CQL query per
vnode, or one task per vnode?
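
To make the question concrete, here is a minimal sketch of the two
granularities, assuming token ranges are plain (start, end, endpoint)
tuples. Every name in it is hypothetical and it is not the connector's
actual code or API; it only contrasts "one unit of work per token range"
with the per-endpoint grouping DuyHai describes in the quoted reply below.

```python
from collections import defaultdict

# Hypothetical token ranges: (start_token, end_token, owning_endpoint).
# With vnodes, each endpoint owns many small, non-contiguous ranges.
RANGES = [
    (0, 100, "10.0.0.1"),
    (100, 200, "10.0.0.2"),
    (200, 300, "10.0.0.1"),
    (300, 400, "10.0.0.2"),
]


def group_ranges_by_endpoint(token_ranges):
    """Bucket token ranges by the endpoint that owns them."""
    by_endpoint = defaultdict(list)
    for start, end, endpoint in token_ranges:
        by_endpoint[endpoint].append((start, end))
    return dict(by_endpoint)


def one_task_per_range(token_ranges):
    """One unit of work per token range: the count grows with vnodes."""
    return [(endpoint, [(start, end)]) for start, end, endpoint in token_ranges]


def one_task_per_endpoint(token_ranges):
    """One unit of work per endpoint, covering all ranges it owns."""
    return list(group_ranges_by_endpoint(token_ranges).items())


if __name__ == "__main__":
    print(len(one_task_per_range(RANGES)), "tasks when split per range")
    print(len(one_task_per_endpoint(RANGES)), "tasks when grouped per endpoint")
```

In the grouped variant the number of tasks follows the number of nodes
rather than the number of vnodes, which is the difference I am asking about.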

Regards.

On Tue, Sep 16, 2014 at 2:05 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:

> Look into the source code of the Spark connector. CassandraRDD tries to
> find all token ranges (even when using vnodes) for each node (endpoint) and
> creates RDD partitions to match this distribution of token ranges. Thus data
> locality is guaranteed.
>
> On Tue, Sep 16, 2014 at 4:39 AM, Eric Plowe <eric.plowe@gmail.com> wrote:
>
>> Interesting. The way I understand the Spark connector is that it's
>> basically a client executing a CQL query and filling a Spark RDD. Spark
>> will then handle the partitioning of data. Again, this is my understanding,
>> and it may be incorrect.
>>
>>
>> On Monday, September 15, 2014, Robert Coli <rcoli@eventbrite.com> wrote:
>>
>>> On Mon, Sep 15, 2014 at 4:57 PM, Eric Plowe <eric.plowe@gmail.com>
>>> wrote:
>>>
>>>> Based on this Stack Overflow question, vnodes affect the number of
>>>> mappers Hadoop needs to spawn, which in turn affects performance.
>>>>
>>>> With the Spark connector for Cassandra, would the same situation happen?
>>>> Would vnodes affect performance in a similar way as with Hadoop?
>>>>
>>>
>>> I don't know what specifically Spark does here, but if it has the same
>>> locality expectations as Hadoop generally, my belief would be : "yes."
>>>
>>> =Rob
>>>
>>>
>


--001a11c0b36a21e10205032df3ea--