incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Date Mon, 19 May 2014 08:57:49 GMT
> The limit is just ignored and the entire column family is scanned.
Which limit?

> 1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?
From what I understand, the input range you set is used when calculating the splits. The token
ranges in the cluster are iterated, and where one intersects the supplied range, only the
overlapping portion is used to calculate the split, rather than the full token range.
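
To illustrate the intersection described above, here is a minimal sketch in Python. It is not
the actual getSplits code: the function name, the (start, end] tuples and the example numbers
are all made up for illustration, and wrap-around ranges are deliberately ignored.

```python
def intersect_ranges(cluster_ranges, job_range):
    """Return the part of each cluster range that overlaps job_range.

    Ranges are (start, end] pairs with start < end. Wrap-around ranges,
    which a real token ring has, are not handled in this simplified sketch.
    """
    job_start, job_end = job_range
    overlaps = []
    for start, end in cluster_ranges:
        lo = max(start, job_start)
        hi = min(end, job_end)
        if lo < hi:
            # Only the overlapping portion would be turned into splits.
            overlaps.append((lo, hi))
    return overlaps


# Hypothetical three-node ring and a job range covering part of two nodes.
cluster = [(0, 100), (100, 200), (200, 300)]
overlaps = intersect_ranges(cluster, (50, 150))
```

With these made-up values, the third node's range does not overlap the job range at all, so
it would contribute no splits.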

> 2. Is there other way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat,
> so that this amount is predictable (like 5% of entire dataset)?
If you supply a token range that covers 5% of the possible range of token values, that
should be close to a random 5% sample.
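
As a rough sketch of how such a range could be computed (in Python, just for the arithmetic):
this assumes the cluster uses Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1. With
RandomPartitioner the bounds would be different. The helper name and its parameters are
hypothetical.

```python
# Assumption: Murmur3Partitioner token bounds. Adjust for other partitioners.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def sample_token_range(fraction, offset=0.0):
    """Return (start, end) tokens covering roughly `fraction` of the ring.

    `offset` (0.0 <= offset < 1.0) shifts where the sampled slice begins,
    so different runs can read different slices.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not 0.0 <= offset < 1.0:
        raise ValueError("offset must be in [0, 1)")
    span = MAX_TOKEN - MIN_TOKEN
    start = MIN_TOKEN + int(span * offset)
    # Clamp so float rounding can never push the end past the ring maximum.
    end = min(start + int(span * fraction), MAX_TOKEN)
    return start, end


start_token, end_token = sample_token_range(0.05)
```

The two values would then be passed, presumably as strings, as the startToken and endToken
arguments of ConfigHelper.setInputRange(jobConfiguration, startToken, endToken). Because the
partitioner hashes keys roughly uniformly across the ring, a slice of this size should hold
about 5% of the rows, though not exactly 5%.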


Hope that helps. 
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 10:46 am, Anton Brazhnyk <anton.brazhnyk@genesys.com> wrote:

> Greetings,
> 
> I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read
> just part of it - something like Spark's sample() function.
> Cassandra's API seems allow to do it with its ConfigHelper.setInputRange(jobConfiguration,
> startToken, endToken) method, but it doesn't work.
> The limit is just ignored and the entire column family is scanned. It seems this kind
> of feature is just not supported
> and sources of AbstractColumnFamilyInputFormat.getSplits confirm that (IMO).
> Questions:
> 1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?
> 2. Is there other way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat,
> so that this amount is predictable (like 5% of entire dataset)?
> 
> 
> WBR,
> Anton
> 
> 

