cassandra-user mailing list archives

From Eric Evans
Subject Re: virtual nodes + map reduce = too many mappers
Date Sat, 16 Feb 2013 16:05:29 GMT
On Sat, Feb 16, 2013 at 9:13 AM, Edward Capriolo wrote:
> No one had ever tried vnodes with hadoop until the OP did, or they
> would have noticed this. No one extensively used it with secondary
> indexes either, judging from the last ticket I mentioned.
> My mistake, they are not a default.
> I do think vnodes are awesome, and it's great that c* has the longer
> release cycle. Just saying I do not know what .0 and .1 releases are.
> They just seem like extended betas to me.

We should definitely aspire to better/more thorough QA, but at the
risk of making what sounds like an excuse, I would argue that this is
the nature of open source software development.  You "Release Early,
Release Often", and iterate with your early adopters to shake out the
missed bugs.

What's important, I think, is to minimize the impact on existing
users, and properly set expectations.  I don't see where we've failed
here, but I'm definitely open to hearing that I'm wrong (or how we
could have done better).

> On Fri, Feb 15, 2013 at 11:10 PM, Eric Evans wrote:
>> On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo wrote:
>>> Seems like the hadoop Input format should combine the splits that are
>>> on the same node into the same map task, like Hadoop's
>>> CombineFileInputFormat can. I am not sure who recommends vnodes as the
>>> default, because this is now the second problem (that I know of) of
>>> this class where vnodes have extra overhead.
>>> This seems to be the standard operating practice in c* now: enable
>>> things in the default configuration like new partitioners and newer
>>> features like vnodes, even though they are not heavily tested in the
>>> wild or well understood, then deal with the fallout.
>> Except that it is not in fact enabled by default; the default remains
>> 1-token-per-node.
>> That said, the only way that a feature like this will ever be heavily
>> tested in the wild, and well understood, is if it is actually put to
>> use.  Speaking only for myself, I am grateful to users like Cem who
>> test new features and report the issues they find.
>>> On Fri, Feb 15, 2013 at 11:52 AM, cem wrote:
>>>> Hi All,
>>>> I have just started to use virtual nodes. I set the number of virtual
>>>> nodes to 256 as recommended.
>>>> The problem that I have is that when I run a mapreduce job it creates
>>>> node * 256 splits, and so node * 256 mappers. This affects the
>>>> performance since the range queries have a lot of overhead.
>>>> Any suggestion to improve the performance? It seems like I need to lower
>>>> the number of virtual nodes.
>>>> Best Regards,
>>>> Cem
>> --
>> Eric Evans
>> Acunu | | @acunu
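
To make the suggestion quoted above concrete (combining the splits that live on the same node into one map task), here is a rough sketch in Python. It is not Cassandra's or Hadoop's actual input format code; `combine_splits_by_host`, `max_ranges_per_task`, and the `node0`/`node1`/`node2` host names are made up for illustration, and it assumes each split has a single owning host, which ignores replication.

```python
from collections import defaultdict


def combine_splits_by_host(splits, max_ranges_per_task=None):
    """Group (host, token_range) splits so each map task covers many ranges.

    Hypothetical helper: a real input format would also have to deal with
    replicas and with estimating split sizes.
    """
    by_host = defaultdict(list)
    for host, token_range in splits:
        by_host[host].append(token_range)

    tasks = []
    for host in sorted(by_host):
        ranges = by_host[host]
        # With no cap, every range owned by a host goes into one task.
        step = max_ranges_per_task or len(ranges)
        for i in range(0, len(ranges), step):
            tasks.append((host, ranges[i:i + step]))
    return tasks


# 3 nodes with 256 vnodes each: one split per vnode, as in the report above.
splits = [
    (f"node{n}", (n * 256 + v, n * 256 + v + 1))
    for n in range(3)
    for v in range(256)
]
tasks = combine_splits_by_host(splits)
print(len(splits), "splits ->", len(tasks), "map tasks")
```

With the cap left unset this gives one map task per node; passing something like `max_ranges_per_task=64` keeps some parallelism per node while still avoiding one mapper per vnode.
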

Eric Evans
Acunu | | @acunu
