incubator-cassandra-user mailing list archives

From Jeremiah D Jordan <jeremiah.jor...@gmail.com>
Subject Re: Virtual node support for Hadoop workloads
Date Fri, 18 Oct 2013 15:36:15 GMT
Paulo,
If you have a large amount of data, then the vnodes-with-Hadoop issue is moot.  You will get
that many splits with or without vnodes.  The issues come when you don't have a lot of data:
all the extra splits slow everything down to a crawl, because 256 times as many tasks are
created as you actually need for your job.

So for large data sets, there is no issue.  For small data sets, you can still run jobs; they
will just be slower than if you didn't have vnodes.
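
A rough back-of-the-envelope sketch of the point above, in Python. It assumes rows are spread
evenly over token ranges, that every token range yields at least one split, and that a range
is subdivided once it holds more than split_size rows (65536 is used as an assumed default).
The function name and the example numbers are illustrative only, not taken from Cassandra.

```python
def estimate_splits(total_rows, nodes, tokens_per_node, split_size=65536):
    """Estimate the number of map tasks for a full scan (illustrative model only)."""
    ranges = nodes * tokens_per_node
    # Ceiling division with integers: rows held by each token range.
    rows_per_range = -(-total_rows // ranges)
    # Each range produces at least one split, more if it exceeds split_size.
    splits_per_range = max(1, -(-rows_per_range // split_size))
    return ranges * splits_per_range


# Small data set on a hypothetical 4-node cluster: vnodes multiply the task count.
small_single = estimate_splits(1_000_000, nodes=4, tokens_per_node=1)    # 16
small_vnodes = estimate_splits(1_000_000, nodes=4, tokens_per_node=256)  # 1024

# Large data set: the split size dominates, so the two counts are nearly equal.
large_single = estimate_splits(10_000_000_000, nodes=4, tokens_per_node=1)
large_vnodes = estimate_splits(10_000_000_000, nodes=4, tokens_per_node=256)
```

Under this model the small job goes from 16 tasks to 1024, while the large job changes by
less than one percent.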

-Jeremiah

On Oct 17, 2013, at 3:49 PM, Paulo Motta <pauloricardomg@gmail.com> wrote:

> Hello,
> 
> According to the DSE 3.1 documentation [1], "DataStax recommends using virtual nodes only
> on data centers running purely Cassandra workloads. You should disable virtual nodes on data
> centers running either Hadoop or Solr workloads by setting num_tokens to 1."
> 
> There was a thread on this mailing list earlier this year [2] in which a workaround was
> suggested for the problem of having a minimum of one map task per token (unfeasible with
> vnodes). The suggestion involved implementing a new Hadoop InputSplitFormat that could combine
> many tokens from a single node, thus reducing the overhead of having too many tasks per node.
> 
> Is there a JIRA ticket for this issue yet, or is something being worked on to support
> vnodes for Hadoop workloads? Or does the suggestion remain to avoid vnodes for analytics
> workloads (Hadoop, Solr)?
> 
> Thanks, 
> 
> -- 
> Paulo
> 
> [1] http://www.datastax.com/docs/datastax_enterprise3.1/deploy/configuring_replication
> [2] http://mail-archives.apache.org/mod_mbox/cassandra-user/201302.mbox/%3CCAJV_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=QY=2zGYDMA@mail.gmail.com%3E
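
The workaround mentioned in the quoted message (combining many token ranges owned by one node
into fewer splits) can be sketched roughly as follows. This is a language-neutral illustration
in Python, not the Hadoop or Cassandra API: the function name, the (node, start, end) tuple
layout, and the max_splits_per_node parameter are all assumptions made for the example.

```python
from collections import defaultdict


def combine_ranges(ranges, max_splits_per_node):
    """Group token ranges by owning node into at most max_splits_per_node splits each.

    ranges: iterable of (node, start_token, end_token) tuples (hypothetical layout).
    Returns a list of (node, [(start_token, end_token), ...]) combined splits.
    """
    by_node = defaultdict(list)
    for node, start, end in ranges:
        by_node[node].append((start, end))

    combined = []
    for node, node_ranges in by_node.items():
        n = min(max_splits_per_node, len(node_ranges))
        buckets = [[] for _ in range(n)]
        # Round-robin assignment keeps the combined splits roughly equal in range count.
        for i, token_range in enumerate(node_ranges):
            buckets[i % n].append(token_range)
        combined.extend((node, bucket) for bucket in buckets)
    return combined


# Two hypothetical nodes with 256 vnode ranges each, capped at 4 map tasks per node.
example = [(f"node{n}", t, t + 1) for n in range(2) for t in range(256)]
splits = combine_ranges(example, max_splits_per_node=4)
```

Here 512 token ranges collapse into 8 splits, each still local to a single node, so a small
job would schedule 8 map tasks instead of 512.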

