cassandra-user mailing list archives

From Anton Brazhnyk <>
Subject RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Date Fri, 16 May 2014 23:41:44 GMT
Hi Paulo,

I'm using C* 1.2.15 and have no easy option to upgrade (at least not to the 2.0.* branch).
I've started looking into implementing my own variant of InputFormat.
Thanks a lot for the hint; I'll be sure to check how it's done in 2.0.6 and whether it's
possible to backport it to the 1.2.* branch.


From: Paulo Ricardo Motta Gomes []
Sent: Thursday, May 15, 2014 3:21 AM
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

Hello Anton,

What version of Cassandra are you using? If it's between 1.2.6 and 2.0.6, setInputRange(startToken,
endToken) does not work.

This was fixed in 2.0.7:

If you can't upgrade, you can copy AbstractCFIF and CFIF into your project and apply the patch.
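Once the patched input format is in place, wiring the token range into the job configuration would look roughly like the fragment below. The ConfigHelper calls are the ones discussed in this thread; the contact address, keyspace "ks", column family "cf", and the concrete token values (covering roughly the first 5% of the Murmur3 token space) are assumptions for illustration only.

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");      // assumed contact point
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
ConfigHelper.setInputColumnFamily(conf, "ks", "cf");         // assumed keyspace/CF
// Restrict the scan to a token sub-range; tokens are passed as strings.
ConfigHelper.setInputRange(conf, "-9223372036854775808", "-8301034833169298228");
```

This fragment needs the Cassandra and Hadoop client jars on the classpath, so it is a configuration sketch rather than a standalone program.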



On Wed, May 14, 2014 at 10:29 PM, Anton Brazhnyk <> wrote:

I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read just
part of it, something like Spark's sample() function.
Cassandra's API seems to allow this with its ConfigHelper.setInputRange(jobConfiguration,
startToken, endToken) method, but it doesn't work.
The range is just ignored and the entire column family is scanned. It seems this kind of feature
is simply not supported,
and the source of AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
1. Am I right that there is no way to read data limited by a token range with ColumnFamilyInputFormat?
2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat,
so that this amount is predictable (like 5% of the entire dataset)?
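Regarding question 2: assuming the cluster uses the Murmur3Partitioner (the 1.2 default, with tokens spanning -2^63 to 2^63-1), a predictable slice like 5% of the dataset can be expressed as a token sub-range whose width is 5% of the full token space. The TokenSlice class below is a hypothetical helper sketching that arithmetic; it is not part of Cassandra.

```java
import java.math.BigInteger;

// Computes token boundaries for a percentage slice of the Murmur3
// token space (-2^63 .. 2^63-1, the C* 1.2 default partitioner).
// Hypothetical helper, not part of the Cassandra API.
public class TokenSlice {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger SPAN =
        BigInteger.valueOf(Long.MAX_VALUE).subtract(MIN); // 2^64 - 1

    // Token sitting `percent` of the way through the full token range.
    static BigInteger tokenAt(int percent) {
        return MIN.add(SPAN.multiply(BigInteger.valueOf(percent))
                           .divide(BigInteger.valueOf(100)));
    }

    public static void main(String[] args) {
        // A 5% slice starting at the minimum token; these strings would
        // be the startToken/endToken arguments to ConfigHelper.setInputRange.
        System.out.println("start = " + tokenAt(0));
        System.out.println("end   = " + tokenAt(5));
    }
}
```

Note that a token range maps to a data fraction only approximately: it relies on the partitioner's hash spreading rows uniformly across the token space, so the slice is predictable in expectation rather than exact.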


Paulo Motta

Chaordic | Platform<>
+55 48 3232.3200