incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Cassandra input paging for Hadoop
Date Mon, 16 Sep 2013 22:25:05 GMT
> Or CqlPagingRecordReader supports paging through the entire result set?
Supports paging through the entire result set. 
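
To illustrate what "paging through the entire result set" means in practice, here is a minimal, self-contained sketch. It is a simplified model and not the actual CqlPagingRecordReader code: the class `PagingSketch`, the methods `fetchPage` and `readAll`, and the use of integer keys 1..totalRows are all hypothetical stand-ins. The page size plays the role of the value set via CqlConfigHelper.setInputCQLPageRowSize: it bounds each individual fetch, not the total number of rows the job sees.

```java
import java.util.ArrayList;
import java.util.List;

public class PagingSketch {

    // Hypothetical stand-in for one "SELECT ... LIMIT pageSize" round trip:
    // returns at most pageSize keys strictly greater than lastKey.
    // The "table" is modelled as the integer keys 1..totalRows.
    public static List<Integer> fetchPage(int totalRows, int lastKey, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int key = lastKey + 1; key <= totalRows && page.size() < pageSize; key++) {
            page.add(key);
        }
        return page;
    }

    // Requests page after page, resuming after the last key seen,
    // and stops once a page comes back with fewer rows than pageSize.
    public static List<Integer> readAll(int totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Integer> all = new ArrayList<>();
        int lastKey = 0;
        while (true) {
            List<Integer> page = fetchPage(totalRows, lastKey, pageSize);
            all.addAll(page);
            if (page.size() < pageSize) {
                break; // short (or empty) page: nothing left to read
            }
            lastKey = page.get(page.size() - 1);
        }
        return all;
    }

    public static void main(String[] args) {
        // 2500 rows with a page size of 1000 take three fetches
        // (1000 + 1000 + 500), and every row is still returned.
        System.out.println(readAll(2500, 1000).size());
    }
}
```

In this model a smaller page size only means more round trips, each returning less data, which is the behaviour you want when large single responses are causing timeouts.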

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/09/2013, at 5:58 PM, Renat Gilfanov <grennat@mail.ru> wrote:

> Hello,
> 
> So it means that job will process only first "cassandra.input.page.row.size" rows, and
> ignore the rest? Or CqlPagingRecordReader supports paging through the entire result set?
> 
> 
>   Aaron Morton <aaron@thelastpickle.com>:
>>> 
>>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>>> CqlConfigHelper.setInputCQLPageRowSize() methods, but a bit confused if
>>> that's what I need and if yes, which one should I use for those purposes.
> If you are using CQL 3 via Hadoop, CqlConfigHelper.setInputCQLPageRowSize is the one you
> want.
> 
> It maps to the LIMIT clause of the select statement the input reader will generate; the
> default is 1,000.
> 
> A
>  
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 12/09/2013, at 9:04 AM, Jiaan Zeng <l.allen09@gmail.com> wrote:
> 
>> Speaking of thrift client, i.e. ColumnFamilyInputFormat, yes,
>> ConfigHelper.setRangeBatchSize() can reduce the number of rows sent to
>> Cassandra.
>> 
>> Depending on how big your columns are, you may also want to increase thrift
>> message length through setThriftMaxMessageLengthInMb().
>> 
>> Hope that helps.
>> 
>> On Tue, Sep 10, 2013 at 8:18 PM, Renat Gilfanov <grennat@mail.ru> wrote:
>>> Hi,
>>> 
>>> We have Hadoop jobs that read data from our Cassandra column families and
>>> write some data back to other column families.
>>> The input column families are pretty simple CQL3 tables without wide rows.
>>> In Hadoop jobs we set up corresponding WHERE clause in
>>> CqlConfigHelper.setInputWhereClauses(...), so we don't process the whole table
>>> at once.
>>> Nevertheless, sometimes the amount of data returned by input query is big
>>> enough to cause TimedOutExceptions.
>>> 
>>> To mitigate this, I'd like to configure the Hadoop job in such a way that it
>>> sequentially fetches input rows in smaller portions.
>>> 
>>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>>> CqlConfigHelper.setInputCQLPageRowSize() methods, but a bit confused if
>>> that's what I need and if yes, which one should I use for those purposes.
>>> 
>>> Any help is appreciated.
>>> 
>>> Hadoop version is 1.1.2, Cassandra version is 1.2.8.
>> 
>> 
>> 
>> -- 
>> Regards,
>> Jiaan
> 

