From: Aaron Morton
To: user@cassandra.apache.org
Subject: Re: Cassandra input paging for Hadoop
Date: Thu, 12 Sep 2013 14:23:01 +1200
Message-Id: <7D096795-F595-46FD-8166-50AEA43DC5CA@thelastpickle.com>

>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>> CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused whether
>> that's what I need and, if yes, which one I should use for this purpose.

If you are using CQL 3 via Hadoop, CqlConfigHelper.setInputCQLPageRowSize() is the one you want.

It maps to the LIMIT clause of the SELECT statement that the input reader generates; the default is 1,000.
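
To picture what a page row size does, here is a minimal, self-contained Java sketch. It does not call Cassandra or Hadoop: the `PagedReadSketch` class, its in-memory `TreeMap` "table", and the `fetchPage` and `countQueries` helpers are all hypothetical names made up for illustration. It only assumes that the reader repeatedly issues something like a `SELECT ... LIMIT pageSize` query that resumes after the last row it saw, and stops when a page comes back short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PagedReadSketch {

    // Hypothetical stand-in for one "SELECT ... WHERE key > lastKey LIMIT pageSize" query.
    public static List<Map.Entry<Integer, String>> fetchPage(
            NavigableMap<Integer, String> table, Integer lastKey, int pageSize) {
        // Resume strictly after the last key seen, or from the start on the first query.
        NavigableMap<Integer, String> rest =
                (lastKey == null) ? table : table.tailMap(lastKey, false);
        List<Map.Entry<Integer, String>> page = new ArrayList<>();
        for (Map.Entry<Integer, String> row : rest.entrySet()) {
            if (page.size() == pageSize) {
                break; // LIMIT reached
            }
            page.add(row);
        }
        return page;
    }

    // Reads the whole table page by page and returns the number of queries issued.
    public static int countQueries(NavigableMap<Integer, String> table, int pageSize) {
        int queries = 0;
        Integer lastKey = null;
        while (true) {
            List<Map.Entry<Integer, String>> page = fetchPage(table, lastKey, pageSize);
            queries++;
            if (page.size() < pageSize) {
                return queries; // a short page means the range is exhausted
            }
            lastKey = page.get(page.size() - 1).getKey();
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> table = new TreeMap<>();
        for (int i = 0; i < 2500; i++) {
            table.put(i, "row-" + i);
        }
        System.out.println(countQueries(table, 1000)); // page size equal to the default mentioned above
        System.out.println(countQueries(table, 100));  // smaller pages, more queries
    }
}
```

With 2,500 rows the sketch issues 3 queries at a page size of 1,000 and 26 at a page size of 100 (the last one returns an empty page). The trade-off it shows is the general one: smaller pages mean more round trips, but each individual query returns less data and so is less likely to run into a timeout.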

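A

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com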

On 12/09/2013, at 9:04 AM, Jiaan Zeng <l.allen09@gmail.com> wrote:

> Speaking of the Thrift client, i.e. ColumnFamilyInputFormat, yes,
> ConfigHelper.setRangeBatchSize() can reduce the number of rows requested
> from Cassandra in each call.
>
> Depending on how big your columns are, you may also want to increase the
> Thrift message length through setThriftMaxMessageLengthInMb().
>
> Hope that helps.

> On Tue, Sep 10, 2013 at 8:18 PM, Renat Gilfanov <grennat@mail.ru> wrote:
>> Hi,
>>
>> We have Hadoop jobs that read data from our Cassandra column families and
>> write some data back to other column families.
>> The input column families are pretty simple CQL3 tables without wide rows.
>> In the Hadoop jobs we set up the corresponding WHERE clause in
>> ConfigHelper.setInputWhereClauses(...), so we don't process the whole table
>> at once.
>> Nevertheless, sometimes the amount of data returned by the input query is big
>> enough to cause TimedOutExceptions.
>>
>> To mitigate this, I'd like to configure the Hadoop job in such a way that it
>> sequentially fetches input rows in smaller portions.
>>
>> I'm looking at the ConfigHelper.setRangeBatchSize() and
>> CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused whether
>> that's what I need and, if yes, which one I should use for this purpose.
>>
>> Any help is appreciated.
>>
>> Hadoop version is 1.1.2, Cassandra version is 1.2.8.



> --
> Regards,
> Jiaan
