cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Ellis <>
Subject Re: Load Balancing Mapper Tasks
Date Sat, 15 May 2010 14:55:46 GMT
Oh, very interesting.  I assumed Hadoop would be smart enough to
load-balance the jobs it sends out.  Guess not.

Can you submit a patch?

On Wed, May 12, 2010 at 12:32 PM, Joost Ouwerkerk <> wrote:
> I've been trying to improve the time it takes to map 30 million rows using a
> hadoop / cassandra cluster with 30 nodes.  I discovered that since
> CassandraInputFormat returns an ordered list of splits, when there are many
> splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced.
>  e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are
> all located on the same one or two nodes.
> I added Collections.shuffle(splits) before returning the splits in
> getSplits().  As a result, the load is much better distributed, throughput
>  was increased (about 3X in my case) and TimedOutExceptions were all but
> eliminated.
> Joost.

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support

View raw message