cassandra-user mailing list archives

From: Patrik Modesto <>
Subject: Re: How Cassandra determines the splits
Date: Wed, 02 May 2012 07:35:06 GMT

I had a similar problem with Cassandra 0.8.x; it occurred when Cassandra was
configured with rpc_address: and the Hadoop job was started from outside the
Cassandra cluster. With version 1.0.x the problem is gone.

You can debug the splits via Thrift. Here is a copy&pasted part of my
split-testing Python utility:

        print "describe_ring"
        res = client.describe_ring(argv[1])
        for t in res:
            print "%s - %s [%s] [%s]" % (t.start_token, t.end_token,
",".join(t.endpoints), ",".join(t.rpc_endpoints),)

        for r in res:
            res2 = client.describe_splits('PageData',
                    r.start_token, r.end_token,

It asks Cassandra for the list of token ranges and the nodes that own them,
then asks for the splits within each range. You should adjust the 24*1024
keys-per-split value to your data.
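
For reference, here is a minimal sketch of how such a client can be set up
with the thrift-generated Cassandra bindings. This part is not in the utility
above; the host, port and set_keyspace call are just illustrative assumptions:

    # Sketch (assumption, not part of the original utility): build a thrift
    # Cassandra.Client like the 'client' used above. Host/port are placeholders.
    import sys
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra   # thrift-generated Cassandra bindings

    socket = TSocket.TSocket("localhost", 9160)       # default thrift rpc_port
    transport = TTransport.TFramedTransport(socket)   # Cassandra uses framed transport
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Cassandra.Client(protocol)
    transport.open()
    client.set_keyspace(sys.argv[1])   # keyspace name from the command line

As far as I remember, describe_splits only takes the column family name, so
the keyspace has to be set on the connection first.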


On Tue, May 1, 2012 at 5:58 PM, Filippo Diotalevi <> wrote:
> Hi,
> I'm having problems in my Cassandra/Hadoop (1.0.8 + cdh3u3) cluster related
> to how cassandra splits the data to be processed by Hadoop.
> I'm currently testing a map reduce job, starting from a CF of roughly 1500
> rows, with
> cassandra.input.split.size 10
> cassandra.range.batch.size 1
> but what I consistently see is that, while most of the tasks have 1-20 rows
> assigned each, one of them is assigned 400+ rows, which gives me all sorts of
> problems in terms of timeouts and memory consumption (not to mention seeing
> the mapper progress bar going to 4000% and more).
> Do you have any suggestions to solve/troubleshoot this issue?
> --
> Filippo Diotalevi
