Subject: Re: Querying all keys in a column family
From: Martin Arrowsmith <arrowsmith.martin@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 25 Feb 2012 17:21:45 -0800

Hi Alexandru,

Things got hectic and I put off the project until this weekend. I'm actually learning about Hadoop right now and how to implement it. I can respond to this thread when I have something running.

In the meantime, I'd like to bump this email up and see if there are others who can provide some feedback:

1) Will Hadoop speed up the time to read all the rows?
2) Are there other options?

My guess was that Hadoop could split up the job so that each node handles a portion of the query; for instance, having 2 nodes would do the job twice as fast. That is my naive guess, though, and it could be far from the truth.
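For illustration, here is a rough, untested sketch of what such a job might look like, based on the word_count example that ships with Cassandra's Hadoop support (ColumnFamilyInputFormat). The keyspace, column family, and column names below are placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanAllRows {

    // Each map() call receives one Cassandra row: its key plus the columns selected by the slice predicate.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException {
            // Per-row calculation goes here; this placeholder just emits the row key and its column count.
            context.write(new Text(ByteBufferUtil.bytesToHex(key)), new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "scan-all-rows");
        job.setJarByClass(ScanAllRows.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/scan-all-rows-output"));

        Configuration conf = job.getConfiguration();
        // Point the input format at the cluster; input splits then line up with each node's token ranges.
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");   // placeholder names
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("my_column"))); // placeholder column
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As Aaron points out in the quoted thread, this is still batch processing rather than real time; the gain is only that each map task reads the rows held locally by its node instead of one client pulling every row over Thrift.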
Best wishes,

Martin

On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe wrote:

> Hi Aaron and Martin,
>
> Sorry about my previous reply, I thought you wanted to process only all the row keys in the CF.
>
> I have a similar issue as Martin because I see myself being forced to hit more than a million rows with a query (I only get a few columns from every row). Aaron, we've talked about this in another thread: basically I am constrained to ship a window of data out of my online cluster to an offline cluster. For this I need to read, for example, a 5 min window of all the data I have. This simply accesses too many rows and I am hitting the I/O limit on the nodes. As I understand it, every row will take 2 random disk seeks (I have no caches).
>
> My question is, what can I do to improve the performance of shipping windows of data entirely out?
>
> Martin, did you use Hadoop as Aaron suggested? How did that work with Cassandra? I don't understand how accessing a million rows through map reduce jobs would be any faster.
>
> Cheers,
> Alexandru
>
>
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton wrote:
>
>> If you want to process 1 million rows, use Hadoop with Hive or Pig. If you use Hadoop you are not doing things in real time.
>>
>> You may need to rephrase the problem.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>>
>> Hi Experts,
>>
>> My program is such that it queries all keys in Cassandra. I want to do this as quickly as possible, in order to get as close to real time as possible.
>>
>> One solution I heard was to use the sstable2json tool and read the data in as JSON. I understand that reading each row from Cassandra might take longer.
>>
>> Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?
>>
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.
>>
>> Thanks for any help you can give,
>>
>> Martin
