incubator-cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: virtual nodes + map reduce = too many mappers
Date Sat, 16 Feb 2013 15:13:48 GMT
No one had tried vnodes with Hadoop until the OP did, or they would
have noticed this. Judging by the last ticket I mentioned, no one had
used them extensively with secondary indexes either.

My mistake, they are not a default.

I do think vnodes are awesome, and it's great that c* has the longer
release cycle. I'm just saying I don't know what the .0 and .1 releases
are; they just seem like extended betas to me.

Edward


On Fri, Feb 15, 2013 at 11:10 PM, Eric Evans <eevans@acunu.com> wrote:
> On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> Seems like the Hadoop input format should combine the splits that are
>> on the same node into the same map task, the way Hadoop's
>> CombineFileInputFormat does (a rough sketch follows after the quoted
>> thread). I am not sure who recommends vnodes as the default, because
>> this is now the second problem (that I know of) of this class where
>> vnodes add extra overhead:
>> https://issues.apache.org/jira/browse/CASSANDRA-5161
>>
>> This seems to be standard operating practice in c* now: enable things
>> in the default configuration, like new partitioners and newer features
>> such as vnodes, even though they are not heavily tested in the wild or
>> well understood, then deal with the fallout.
>
> Except that it is not, in fact, enabled by default; the default remains
> one token per node.
>
> That said, the only way that a feature like this will ever be heavily
> tested in the wild, and well understood, is if it is actually put to
> use.  Speaking only for myself, I am grateful to users like Cem who
> test new features and report the issues they find.
>
>> On Fri, Feb 15, 2013 at 11:52 AM, cem <cayiroglu@gmail.com> wrote:
>>> Hi All,
>>>
>>> I have just started to use virtual nodes. I set the number of virtual
>>> nodes to 256, as recommended.
>>>
>>> The problem I have is that when I run a MapReduce job, it creates
>>> nodes * 256 splits, and therefore nodes * 256 mappers. This affects
>>> performance, since the range queries have a lot of overhead.
>>>
>>> Any suggestions to improve the performance? It seems like I need to
>>> lower the number of virtual nodes.
>>>
>>> Best Regards,
>>> Cem
>>>
>>>
>
>
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu
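
A rough sketch of the split-combining idea from the quoted thread, not
anything that exists in Cassandra's Hadoop support today: it assumes the
splits returned by the input format report their replica hosts through
Hadoop's InputSplit.getLocations(), and the SplitCombiner class and the
"unknown" bucket are made up for illustration. A real input format would
wrap each per-host group in one composite split, the way
CombineFileInputFormat does for file blocks, so a cluster running 256
tokens per node ends up with roughly one map task per node instead of
nodes * 256.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical helper: bucket the per-token-range splits produced by the
// Cassandra input format by their first preferred host, so one map task
// can cover every range local to that node instead of one task per range.
public final class SplitCombiner
{
    public static Map<String, List<InputSplit>> groupByHost(List<InputSplit> splits)
            throws IOException, InterruptedException
    {
        Map<String, List<InputSplit>> byHost = new HashMap<String, List<InputSplit>>();
        for (InputSplit split : splits)
        {
            String[] locations = split.getLocations();
            // Splits that report no location land in a catch-all bucket.
            String host = (locations != null && locations.length > 0) ? locations[0] : "unknown";
            List<InputSplit> bucket = byHost.get(host);
            if (bucket == null)
            {
                bucket = new ArrayList<InputSplit>();
                byHost.put(host, bucket);
            }
            bucket.add(split);
        }
        return byHost;
    }
}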
