From: Alexandru Sicoe <adsicoe@gmail.com>
To: user@cassandra.apache.org
Date: Fri, 24 Feb 2012 14:29:44 +0100
Subject: Re: Querying all keys in a column family

Hi Aaron and Martin,

Sorry about my previous reply; I thought you wanted to process only the row
keys in the CF.

I have a similar issue to Martin's, because I see myself being forced to hit
more than a million rows with a query (I only get a few columns from every
row). Aaron, we've talked about this in another thread: basically, I am
constrained to ship a window of data out of my online cluster to an offline
cluster. For this I need to read, for example, a 5-minute window of all the
data I have. This simply accesses too many rows, and I am hitting the I/O
limit on the nodes. As I understand it, every row read costs about two
random disk seeks (one in the index file and one in the data file), since I
have no key cache or row cache.

My question is: what can I do to improve the performance of shipping entire
windows of data out?
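For reference, my extraction loop currently looks roughly like the sketch
below (pycassa assumed; the keyspace, CF, host, and the ship() helper are
placeholders, and it assumes column names are millisecond timestamps):

# Simplified sketch (assuming pycassa and a CF whose comparator is
# LongType, with column names as millisecond timestamps).
# 'MyKeyspace', 'Measurements', the host list and ship() are placeholders.
import time
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'Measurements')

window_end = int(time.time() * 1000)
window_start = window_end - 5 * 60 * 1000   # 5-minute window

# get_range streams all rows in token order through buffered
# get_range_slices calls; the column slice keeps only the window.
for key, cols in cf.get_range(column_start=window_start,
                              column_finish=window_end,
                              buffer_size=1024):
    ship(key, cols)   # stand-in for the write to the offline cluster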
Martin, did you use Hadoop as Aaron suggested? How did that work with
Cassandra? I don't understand how accessing a million rows through MapReduce
jobs would be any faster.
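Is the idea simply that each map task scans its own contiguous, node-local
slice of the token ring, so the job becomes many sequential local scans
instead of one client doing random reads across the whole ring? Something
like this rough sketch (again pycassa, all names placeholders; a real Hadoop
job would get one input split per token range rather than my naive even
split):

# Rough sketch of parallel token-range extraction, which is essentially
# what a Cassandra MapReduce job does with its input splits. Assumes
# pycassa and the RandomPartitioner's 0..2**127 token space; keyspace,
# CF and host are placeholders.
from multiprocessing import Pool
import pycassa

RING_MAX = 2 ** 127
NUM_SLICES = 8

def scan_slice(bounds):
    start_token, finish_token = bounds
    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
    cf = pycassa.ColumnFamily(pool, 'Measurements')
    rows = 0
    # Each worker walks only its own contiguous token range, the way a
    # Hadoop mapper walks one input split.
    for key, cols in cf.get_range(start_token=str(start_token),
                                  finish_token=str(finish_token),
                                  buffer_size=1024):
        rows += 1   # a real job would process/ship the row here
    return rows

if __name__ == '__main__':
    step = RING_MAX // NUM_SLICES
    slices = [(i * step, (i + 1) * step) for i in range(NUM_SLICES)]
    counts = Pool(NUM_SLICES).map(scan_slice, slices)
    print('rows scanned: %d' % sum(counts))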

Cheers,
Alexandru

On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com>
wrote:

> If you want to process 1 million rows use Hadoop with Hive or Pig. If you
> use Hadoop you are not doing things in real time.
>
> You may need to rephrase the problem.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>
> Hi Experts,
>
> My program is such that it queries all keys on Cassandra. I want to do
> this as quickly as possible, in order to get as close to real-time as
> possible.
>
> One solution I heard was to use the sstables2json tool, and read the data
> in as JSON. I understand that reading each row from Cassandra directly
> might take longer.
>
> Are there any other ideas for doing this? Or can you confirm that
> sstables2json is the way to go?
>
> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to
> query a million rows, do some calculations on them, and spit out the
> result like it's real time.
>
> Thanks for any help you can give,
>
> Martin
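PS Martin: as far as I understand it, the tool (shipped as bin/sstable2json)
dumps SSTable files straight off disk and so bypasses the read path, but it
only sees flushed data, and you have to merge row fragments across SSTables
and across nodes yourself. A rough sketch of consuming its output, assuming
the output shape I have seen (a JSON object mapping each row key to a list
of [name, value, timestamp, ...] columns; the path and process() are
placeholders):

# Rough sketch: dump one SSTable with sstable2json and iterate the rows.
# Output shape assumed: {row_key: [[name, value, timestamp, ...], ...]};
# deleted/expiring columns may carry extra flag elements, hence the
# indexing. The data file path and process() are placeholders.
import json
import subprocess

SSTABLE = '/var/lib/cassandra/data/MyKeyspace/Measurements-hc-42-Data.db'

out = subprocess.check_output(['bin/sstable2json', SSTABLE])
for row_key, columns in json.loads(out).items():
    for col in columns:
        name, value = col[0], col[1]
        process(row_key, name, value)   # stand-in for the real calculation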

