cassandra-commits mailing list archives

From "Joost Ouwerkerk (JIRA)" <>
Subject [jira] Updated: (CASSANDRA-1096) Sequential splits causing unbalanced MapReduce load
Date Sun, 16 May 2010 19:52:42 GMT


Joost Ouwerkerk updated CASSANDRA-1096:

    Attachment: CASSANDRA-1096.patch

I added Collections.shuffle(splits) before returning the splits in getSplits(). As a result,
the load is much better distributed, throughput increased (about 3x in my case), and
TimedOutExceptions were all but eliminated.
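
To illustrate why shuffling helps, here is a minimal sketch, not the attached patch. It assumes a hypothetical cluster of 20 nodes where splits in token order sit contiguously on the same node; the class name, the Split type, and the helper methods are all invented for this example. It counts how many distinct nodes the first wave of 30 tasks would hit with ordered versus shuffled splits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class ShuffledSplitsSketch {

    /** Hypothetical stand-in for an input split: its position and the node owning its rows. */
    static final class Split {
        final int index;
        final int node;

        Split(int index, int node) {
            this.index = index;
            this.node = node;
        }
    }

    /** Builds splits in token order; consecutive splits land on the same node (assumed layout). */
    static List<Split> orderedSplits(int splitCount, int nodeCount) {
        List<Split> splits = new ArrayList<>();
        int perNode = Math.max(1, splitCount / nodeCount);
        for (int i = 0; i < splitCount; i++) {
            splits.add(new Split(i, Math.min(i / perNode, nodeCount - 1)));
        }
        return splits;
    }

    /** Returns a shuffled copy, mirroring the idea of calling Collections.shuffle before returning. */
    static List<Split> shuffledCopy(List<Split> splits, Random rng) {
        List<Split> copy = new ArrayList<>(splits);
        Collections.shuffle(copy, rng);
        return copy;
    }

    /** Counts the distinct nodes touched by the first `tasks` splits in the list. */
    static int nodesHitByFirstWave(List<Split> splits, int tasks) {
        Set<Integer> nodes = new HashSet<>();
        for (Split s : splits.subList(0, Math.min(tasks, splits.size()))) {
            nodes.add(s.node);
        }
        return nodes.size();
    }

    public static void main(String[] args) {
        List<Split> ordered = orderedSplits(600, 20);
        List<Split> shuffled = shuffledCopy(ordered, new Random(42));
        System.out.println("nodes hit, ordered:  " + nodesHitByFirstWave(ordered, 30));
        System.out.println("nodes hit, shuffled: " + nodesHitByFirstWave(shuffled, 30));
    }
}
```

Under these assumptions the ordered list sends all 30 first-wave tasks to a single node, while the shuffled list spreads them across many nodes. The fixed Random seed is only there to make the sketch repeatable.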

> Sequential splits causing unbalanced MapReduce load
> ---------------------------------------------------
>                 Key: CASSANDRA-1096
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>         Attachments: CASSANDRA-1096.patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
> Since CassandraInputFormat returns an ordered list of splits, when there are many splits
> (e.g. hundreds or more) the load on Cassandra is horribly unbalanced. For example, if I have
> 30 tasks processing 600 splits, then the rows for the first 30 splits are all located on the
> same one or two nodes.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
