cassandra-user mailing list archives

From Joost Ouwerkerk
Subject Load Balancing Mapper Tasks
Date Wed, 12 May 2010 17:32:04 GMT
I've been trying to reduce the time it takes to map 30 million rows using a
Hadoop/Cassandra cluster with 30 nodes.  I discovered that because
CassandraInputFormat returns an ordered list of splits, the load on Cassandra
is badly unbalanced when there are many splits (hundreds or more).  For
example, if I have 30 tasks processing 600 splits, then the first 30 splits
are all located on the same one or two nodes.

I added Collections.shuffle(splits) before returning the splits in
getSplits().  As a result, the load is much better distributed, throughput
increased (about 3X in my case), and TimedOutExceptions were all but

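To illustrate the effect, here is a minimal standalone sketch (not the actual input format code). It assumes a hypothetical layout in which the 600 splits are spread evenly over the 30 nodes, 20 per node, and listed in token order; the class and method names are made up for the example. It counts how many distinct nodes the first wave of 30 tasks touches before and after Collections.shuffle().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SplitShuffleDemo {

    // Hypothetical model: entry i is the node that owns split i. Splits are
    // listed in token order, so consecutive splits sit on the same node.
    public static List<Integer> orderedSplitOwners(int nodes, int splitsPerNode) {
        List<Integer> owners = new ArrayList<>();
        for (int node = 0; node < nodes; node++) {
            for (int i = 0; i < splitsPerNode; i++) {
                owners.add(node);
            }
        }
        return owners;
    }

    // Number of distinct nodes hit by the first `tasks` splits, i.e. the
    // splits that the first wave of concurrently running mappers picks up.
    public static int distinctNodesInFirstWave(List<Integer> owners, int tasks) {
        int wave = Math.min(tasks, owners.size());
        Set<Integer> seen = new HashSet<>(owners.subList(0, wave));
        return seen.size();
    }

    public static void main(String[] args) {
        List<Integer> owners = orderedSplitOwners(30, 20); // 600 splits

        // Ordered list: the first 30 splits cover node 0 (20 splits) and
        // node 1 (10 splits), so this prints 2.
        System.out.println("ordered:  " + distinctNodesInFirstWave(owners, 30));

        // Same call as in the patched getSplits(); the seed is fixed only
        // to make the demo repeatable.
        Collections.shuffle(owners, new Random(42));
        System.out.println("shuffled: " + distinctNodesInFirstWave(owners, 30));
    }
}
```

In the real getSplits() the shuffle would be applied to the list of split objects just before it is returned, without a fixed seed.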
