From: Aaron Morton
Subject: Re: Cassandra input paging for Hadoop
Date: Thu, 12 Sep 2013 14:23:01 +1200
To: user@cassandra.apache.org

>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>> CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused whether
>> that's what I need and, if yes, which one I should use for those purposes.

If you are using CQL 3 via Hadoop, CqlConfigHelper.setInputCQLPageRowSize is the one you want.
It maps to the LIMIT clause of the SELECT statement the input reader will generate; the default is 1,000.
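
For reference, a rough, untested sketch of how that is typically wired up in the job driver. It assumes Cassandra 1.2.x (org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat) and Hadoop 1.x; the host, keyspace, table and page size values are placeholders, and exact method signatures may differ slightly between versions:

import java.io.IOException;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CqlPagingJobSetup {
    public static void main(String[] args) throws IOException {
        Job job = new Job(new Configuration(), "cassandra-cql3-input-paging");
        job.setInputFormatClass(CqlPagingInputFormat.class);

        Configuration conf = job.getConfiguration();
        // Connection settings; address, port, partitioner, keyspace and table
        // are illustrative placeholders.
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");

        // Rows fetched per CQL page; this becomes the LIMIT on the SELECT the
        // input reader generates. The default is 1000, so a smaller value here
        // makes each request return a smaller portion of the input.
        CqlConfigHelper.setInputCQLPageRowSize(conf, "500");
    }
}

With a smaller page size each task issues more, cheaper queries instead of a few large ones, which is usually enough to avoid the timeouts you are seeing.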
A

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/09/2013, at 9:04 AM, Jiaan Zeng <l.allen09@gmail.com> wrote:

> Speaking of the thrift client, i.e. ColumnFamilyInputFormat, yes,
> ConfigHelper.setRangeBatchSize() can reduce the number of rows fetched from
> Cassandra in each request.
>
> Depending on how big your columns are, you may also want to increase the thrift
> message length through setThriftMaxMessageLengthInMb().
>
> Hope that helps.
>
> On Tue, Sep 10, 2013 at 8:18 PM, Renat Gilfanov <grennat@mail.ru> wrote:
>> Hi,
>>
>> We have Hadoop jobs that read data from our Cassandra column families and
>> write some data back to other column families.
>> The input column families are pretty simple CQL3 tables without wide rows.
>> In the Hadoop jobs we set up a corresponding WHERE clause in
>> ConfigHelper.setInputWhereClauses(...), so we don't process the whole table
>> at once.
>> Nevertheless, sometimes the amount of data returned by the input query is big
>> enough to cause TimedOutExceptions.
>>
>> To mitigate this, I'd like to configure the Hadoop job in such a way that it
>> sequentially fetches input rows in smaller portions.
>>
>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>> CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused whether
>> that's what I need and, if yes, which one I should use for those purposes.
>>
>> Any help is appreciated.
>>
>> Hadoop version is 1.1.2, Cassandra version is 1.2.8.
>
> --
> Regards,
> Jiaan
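
For the thrift-based ColumnFamilyInputFormat path Jiaan describes above, the equivalent knobs would look roughly like this (again an untested sketch; the values are placeholders and exact signatures may differ across Cassandra versions):

import java.io.IOException;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ThriftInputBatchSetup {
    public static void main(String[] args) throws IOException {
        Job job = new Job(new Configuration(), "cassandra-thrift-input");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        Configuration conf = job.getConfiguration();
        // Connection settings (initial address, rpc port, partitioner) as in
        // the CQL sketch above; a slice predicate is also required for
        // ColumnFamilyInputFormat but is omitted here.
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");

        // Fewer rows per get_range_slices request keeps each thrift response small.
        ConfigHelper.setRangeBatchSize(conf, 1024);
        // Raise the thrift frame size if individual rows or columns are large.
        ConfigHelper.setThriftMaxMessageLengthInMb(conf, 64);
    }
}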