From: Jeremiah D Jordan
To: user@cassandra.apache.org
Subject: Re: Virtual node support for Hadoop workloads
Date: Fri, 18 Oct 2013 10:36:15 -0500
Message-Id: <89CC2836-0738-4F31-9D91-5B93EFA3BD5D@gmail.com>

Paulo,

If you have large data sizes then the vnodes with hadoop issue is moot.
You will get that many splits with or without vnodes. The issues come
when you don't have a lot of data, so all the extra splits slow
everything down to a crawl because there are 256 times as many tasks
created as you actually needed for your job.

So for large data sets, there is no issue. For small data sets, you can
run jobs, they will just be slower than if you didn't have vnodes.

-Jeremiah

On Oct 17, 2013, at 3:49 PM, Paulo Motta <pauloricardomg@gmail.com> wrote:

> Hello,
>
> According to DSE3.1 documentation [1], "DataStax recommends using
> virtual nodes only on data centers running purely Cassandra workloads.
> You should disable virtual nodes on data centers running either Hadoop
> or Solr workloads by setting num_tokens to 1.".
>
> There was a thread in this mailing list earlier this year [2], where
> it was suggested a workaround to the problem of having a minimum of one
> map task per token (unfeasible with vnodes). This suggestion involved
> implementing a new Hadoop InputSplitFormat that could combine many
> tokens from a single node, thus reducing the overhead of having too many
> tasks per node.
>
> Is there any JIRA ticket around this issue yet, or something being
> worked on to support VNodes for Hadoop workloads, or the suggestion
> remains to avoid VNodes for analytics workloads (hadoop, solr)?
>
> Thanks,
>
> --
> Paulo
>
> [1] http://www.datastax.com/docs/datastax_enterprise3.1/deploy/configuring_replication
> [2] http://mail-archives.apache.org/mod_mbox/cassandra-user/201302.mbox/%3CCAJV_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=QY=2zGYDMA@mail.gmtokenail.com%3E
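
To make the task-count arithmetic in this thread concrete, here is a minimal sketch of the workaround described in the quoted message: grouping the token ranges owned by one node into fewer splits, so a small data set does not produce one map task per vnode. This is not Cassandra's or Hadoop's actual input format code. `TokenRange`, `one_split_per_range`, `combine_by_node`, and the 3-node, 256-vnode cluster are all hypothetical names and numbers chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class TokenRange(NamedTuple):
    """Hypothetical stand-in for one vnode's token range."""
    node: str
    start: int
    end: int


def one_split_per_range(ranges):
    """Baseline: every token range becomes its own split (one map task each)."""
    return [[r] for r in ranges]


def combine_by_node(ranges, max_ranges_per_split):
    """Group ranges by owning node, then chunk each node's ranges into splits.

    Every resulting split contains ranges from a single node only, so a
    task could still be scheduled close to its data.
    """
    by_node = defaultdict(list)
    for r in ranges:
        by_node[r.node].append(r)

    splits = []
    for node in sorted(by_node):
        owned = by_node[node]
        for i in range(0, len(owned), max_ranges_per_split):
            splits.append(owned[i:i + max_ranges_per_split])
    return splits


# Assumed example cluster: 3 nodes with 256 vnodes each.
ranges = [
    TokenRange(f"node{n}", v, v + 1)
    for n in range(3)
    for v in range(256)
]

print(len(one_split_per_range(ranges)))    # 768 splits, one per vnode
print(len(combine_by_node(ranges, 256)))   # 3 splits, one per node
```

With small data sets the combined form brings the task count back to roughly what a cluster without vnodes would produce. With large data sets each range would be divided into many splits anyway, which is why the reply above calls the issue moot in that case.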