cassandra-user mailing list archives

From Filippo Diotalevi
Subject cassandra.input.split.size and number of mappers
Date Mon, 23 Apr 2012 17:39:31 GMT
I'm finding it very difficult to understand how Hadoop and Cassandra (CDH3u3 and 1.0.8
respectively) split the work between mappers.

The thing that confuses me is that, whatever value I set for cassandra.input.split.size, I always
get 1 (at most 2) mappers per node.

I'm debugging the Cassandra code against a 3-node cluster, and I've noticed the
following things:

** ColumnFamilyInputFormat.getRangeMap returns (correctly, I assume) 3 ranges:
[TokenRange(start_token:0, end_token:56713727820156410577229101238628035242, ….
TokenRange(start_token:56713727820156410577229101238628035242, end_token:113427455640312814857969558651062452224,
TokenRange(start_token:113427455640312814857969558651062452224, end_token:0, …….]
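As a sanity check on those ranges, here is a minimal sketch that computes how much of the ring each one covers. It assumes RandomPartitioner (token space of 2**127), which the post does not state explicitly; the helper name range_widths is purely illustrative and not part of Cassandra.

```python
# Assumption: RandomPartitioner, whose token space has 2**127 positions.
RING_SIZE = 2 ** 127

# Start tokens copied from the getRangeMap output quoted above.
TOKENS = [
    0,
    56713727820156410577229101238628035242,
    113427455640312814857969558651062452224,
]


def range_widths(tokens, ring_size=RING_SIZE):
    """Return the width of each (start, end] range; the last range wraps around to the first token."""
    widths = []
    for i, start in enumerate(tokens):
        end = tokens[(i + 1) % len(tokens)]
        # The modulo handles the wrap-around range whose end_token is 0.
        widths.append((end - start) % ring_size)
    return widths


widths = range_widths(TOKENS)
fractions = [w / RING_SIZE for w in widths]
for w, f in zip(widths, fractions):
    print(f"{w} tokens, {f:.4%} of the ring")
```

Under that assumption each range covers roughly a third of the ring, so the three ranges themselves look balanced.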

** Inside the SplitCallable object, the getSubsplits method always returns 1 split.
Regardless of the splitSize, the call to client.describe_splits(..) always returns 1 split
(which is the original range).

I should also mention that the CF I'm trying to map/reduce contains around 1500 rows,
and I've tried split sizes ranging from 1000 down to 10 without any change, except for a
"sweet spot" split size of 120, which creates exactly 2 mappers per node. However, decreasing
the split size below 120 makes Hadoop go back to creating 1 mapper per node.
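For comparison, this is a rough sketch of the split counts one might naively expect from the row count and split sizes mentioned above. It assumes the ~1500 rows are spread evenly over the 3 ranges and that each range is simply cut into chunks of split_size rows. That is only a back-of-the-envelope model, not how describe_splits actually computes its result, and expected_splits is a hypothetical helper.

```python
import math


def expected_splits(total_rows, nodes, split_size):
    """Naive estimate: rows per range divided by the split size, rounded up, with at least one split."""
    rows_per_range = total_rows / nodes  # assumes an even distribution of rows
    return max(1, math.ceil(rows_per_range / split_size))


for split_size in (1000, 120, 10):
    print(split_size, expected_splits(1500, 3, split_size))
```

Under this model a split size of 120 would give several splits per range and a split size of 10 would give dozens, which is why a constant 1 or 2 mappers per node looks surprising.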

It seems to me that, with my current Cassandra configuration, the describe_splits RPC call
always returns 1 or 2 splits, regardless of the keys_per_split value passed.

Could it be a Cassandra configuration issue? Or is it a bug in the code?

Filippo Diotalevi
