Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAENxBwwPqmPBZHi0uY1uxG0SvK4kSmgfruq6=oBhH3vXCRhHcQ@mail.gmail.com>
References: 
 <CAJV_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=QY=2zGYDMA@mail.gmail.com>
 <CAENxBwwPqmPBZHi0uY1uxG0SvK4kSmgfruq6=oBhH3vXCRhHcQ@mail.gmail.com>
From: Eric Evans <eevans@acunu.com>
Date: Fri, 15 Feb 2013 22:10:03 -0600
Message-ID: 
 <CAL35Oi3WTWeJ5GoUZE9Mh4EpFCpC3YhT1dEGC9+tQgCzzv95RA@mail.gmail.com>
Subject: Re: virtual nodes + map reduce = too many mappers
To: user@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1

On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> Seems like the Hadoop input format should combine the splits that are
> on the same node into the same map task, like Hadoop's
> CombineFileInputFormat can. I am not sure who recommends vnodes as the
> default, because this is now the second problem (that I know of) of
> this class where vnodes have extra overhead; see
> https://issues.apache.org/jira/browse/CASSANDRA-5161
>
> This seems to be the standard operating practice in C* now: enable
> things in the default configuration, like new partitioners and newer
> features like vnodes, even though they are not heavily tested in the
> wild or well understood, then deal with the fallout.

Except that it is not in fact enabled by default; the default remains
1-token-per-node.

That said, the only way that a feature like this will ever be heavily
tested in the wild, and well understood, is if it is actually put to
use.  Speaking only for myself, I am grateful to users like Cem who
test new features and report the issues they find.
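
As a rough illustration of the split combining Edward describes, here
is a small Python sketch. It is not the Hadoop or Cassandra input
format code; the function name, host names, and token ranges are all
made up for the example, and it assumes each split is simply a
(host, token_range) pair.

```python
from collections import defaultdict


def combine_splits_by_host(splits, max_per_group=None):
    """Group (host, token_range) splits so each host gets few map tasks.

    Hypothetical helper: with max_per_group=None, every range owned by a
    host lands in a single combined split; otherwise a host's ranges are
    chunked into groups of at most max_per_group ranges.
    """
    by_host = defaultdict(list)
    for host, token_range in splits:
        by_host[host].append(token_range)

    combined = []
    for host, ranges in by_host.items():
        step = max_per_group or len(ranges)
        for i in range(0, len(ranges), step):
            combined.append((host, ranges[i:i + step]))
    return combined


# Made-up cluster: 3 nodes with 256 vnode ranges each, one split per range.
splits = [
    (f"node{n}", (n * 256 + v, n * 256 + v + 1))
    for n in range(3)
    for v in range(256)
]
combined = combine_splits_by_host(splits)
print(len(splits), "splits before,", len(combined), "after combining")
```

With one combined split per host, the mapper count tracks the number
of nodes rather than nodes * 256; passing max_per_group keeps some
parallelism per node if a single mapper per host is too coarse.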

> On Fri, Feb 15, 2013 at 11:52 AM, cem <cayiroglu@gmail.com> wrote:
>> Hi All,
>>
>> I have just started to use virtual nodes. I set the number of virtual
>> nodes to 256, as recommended.
>>
>> The problem that I have is that when I run a MapReduce job, it creates
>> (number of nodes) * 256 splits, and therefore that many mappers. This
>> affects performance, since the range queries have a lot of overhead.
>>
>> Any suggestions to improve the performance? It seems like I need to
>> lower the number of virtual nodes.
>>
>> Best Regards,
>> Cem
>>
>>


-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu