Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of pauloricardomg@gmail.com
 designates 209.85.160.43 as permitted sender)
MIME-Version: 1.0
From: Paulo Motta <pauloricardomg@gmail.com>
Date: Thu, 17 Oct 2013 17:49:32 -0300
Message-ID: 
 <CAKaZCX7pp6cTX488yYptQU8zzxvVt_cXjbZwcbnyNDP4upCzKQ@mail.gmail.com>
Subject: Virtual node support for Hadoop workloads
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Content-Type: multipart/alternative; boundary=047d7b5d660c9c39b804e8f5f50f

--047d7b5d660c9c39b804e8f5f50f
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello,

According to the DSE 3.1 documentation [1], "DataStax recommends using
virtual nodes only on data centers running purely Cassandra workloads. You
should disable virtual nodes on data centers running either Hadoop or Solr
workloads by setting num_tokens to 1."

A thread on this mailing list earlier this year [2] suggested a workaround
for the problem of needing at least one map task per token, which is
infeasible with vnodes. The suggestion was to implement a new Hadoop
InputSplitFormat that could combine many tokens from a single node, thus
reducing the overhead of having too many tasks per node.
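
As a rough illustration of that idea (my own sketch, not code from the
thread, Cassandra, or Hadoop; the function name, node names, and token
values below are all hypothetical), grouping the token ranges owned by
each node into a few combined splits could look something like this:

```python
from collections import defaultdict


def combine_token_ranges(ranges_by_owner, max_ranges_per_split):
    """Group token ranges owned by the same node into combined splits.

    ranges_by_owner: iterable of (node, (start_token, end_token)) pairs.
    Returns a list of (node, [token_ranges]) pairs, one per combined split.
    """
    if max_ranges_per_split < 1:
        raise ValueError("max_ranges_per_split must be at least 1")

    # Collect every range under the node that owns it.
    per_node = defaultdict(list)
    for node, token_range in ranges_by_owner:
        per_node[node].append(token_range)

    # Chunk each node's ranges so one split (one map task) covers several.
    splits = []
    for node, ranges in per_node.items():
        for i in range(0, len(ranges), max_ranges_per_split):
            splits.append((node, ranges[i:i + max_ranges_per_split]))
    return splits


# Hypothetical cluster: two nodes owning four small vnode ranges each.
owned = [
    ("node-a", (0, 10)), ("node-b", (10, 20)),
    ("node-a", (20, 30)), ("node-b", (30, 40)),
    ("node-a", (40, 50)), ("node-b", (50, 60)),
    ("node-a", (60, 70)), ("node-b", (70, 80)),
]

combined = combine_token_ranges(owned, max_ranges_per_split=4)
```

In this toy example the eight ranges would otherwise mean eight map
tasks; combined per node they become two. A real implementation would
presumably also have to account for replicas and split sizes, which this
sketch ignores.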

Is there a JIRA ticket for this issue yet, or any work in progress to
support vnodes for Hadoop workloads? Or does the recommendation remain to
avoid vnodes for analytics workloads (Hadoop, Solr)?

Thanks,

--=20
Paulo

[1]
http://www.datastax.com/docs/datastax_enterprise3.1/deploy/configuring_repl=
ication
[2]
http://mail-archives.apache.org/mod_mbox/cassandra-user/201302.mbox/%3CCAJV=
_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=3DQY=3D2zGYDMA@mail.gmail.com%3E

--047d7b5d660c9c39b804e8f5f50f--