From: Alexandru Sicoe <adsicoe@gmail.com>
To: user@cassandra.apache.org
Date: Fri, 24 Feb 2012 14:29:44 +0100
Subject: Re: Querying all keys in a column family

Hi Aaron and Martin,

Sorry about my previous reply; I thought you wanted to process only the row
keys in the CF.

I have a similar issue to Martin's, because I see myself being forced to hit
more than a million rows with a query (I only get a few columns from every
row). Aaron, we've talked about this in another thread: basically, I am
constrained to ship a window of data out of my online cluster to an offline
cluster. For this I need to read, for example, a 5-minute window of all the
data I have. This simply accesses too many rows, and I am hitting the I/O
limit on the nodes. As I understand it, every row read costs about two
random disk seeks (one in the index file and one in the data file), since I
have no key cache or row cache.

My question is: what can I do to improve the performance of shipping entire
windows of data out?
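For reference, my extraction loop currently looks roughly like the sketch
below (pycassa assumed; the keyspace, CF, host, and the ship() helper are
placeholders, and it assumes column names are millisecond timestamps):

# Simplified sketch (assuming pycassa and a CF whose comparator is
# LongType, with column names as millisecond timestamps).
# 'MyKeyspace', 'Measurements', the host list and ship() are placeholders.
import time
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'Measurements')

window_end = int(time.time() * 1000)
window_start = window_end - 5 * 60 * 1000   # 5-minute window

# get_range streams all rows in token order through buffered
# get_range_slices calls; the column slice keeps only the window.
for key, cols in cf.get_range(column_start=window_start,
                              column_finish=window_end,
                              buffer_size=1024):
    ship(key, cols)   # stand-in for the write to the offline cluster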
Martin, did you use Hadoop as Aaron suggested? How did that work with
Cassandra? I don't understand how accessing a million rows through MapReduce
jobs would be any faster.
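Is the idea simply that each map task scans its own contiguous, node-local
slice of the token ring, so the job becomes many sequential local scans
instead of one client doing random reads across the whole ring? Something
like this rough sketch (again pycassa, all names placeholders; a real Hadoop
job would get one input split per token range rather than my naive even
split):

# Rough sketch of parallel token-range extraction, which is essentially
# what a Cassandra MapReduce job does with its input splits. Assumes
# pycassa and the RandomPartitioner's 0..2**127 token space; keyspace,
# CF and host are placeholders.
from multiprocessing import Pool
import pycassa

RING_MAX = 2 ** 127
NUM_SLICES = 8

def scan_slice(bounds):
    start_token, finish_token = bounds
    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
    cf = pycassa.ColumnFamily(pool, 'Measurements')
    rows = 0
    # Each worker walks only its own contiguous token range, the way a
    # Hadoop mapper walks one input split.
    for key, cols in cf.get_range(start_token=str(start_token),
                                  finish_token=str(finish_token),
                                  buffer_size=1024):
        rows += 1   # a real job would process/ship the row here
    return rows

if __name__ == '__main__':
    step = RING_MAX // NUM_SLICES
    slices = [(i * step, (i + 1) * step) for i in range(NUM_SLICES)]
    counts = Pool(NUM_SLICES).map(scan_slice, slices)
    print('rows scanned: %d' % sum(counts))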

Cheers,
Alexandru

On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com>
wrote:

> If you want to process 1 million rows use Hadoop with Hive or Pig. If you
> use Hadoop you are not doing things in real time.
>
> You may need to rephrase the problem.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>
> Hi Experts,
>
> My program is such that it queries all keys on Cassandra. I want to do
> this as quickly as possible, in order to get as close to real-time as
> possible.
>
> One solution I heard was to use the sstables2json tool, and read the data
> in as JSON. I understand that reading each row from Cassandra directly
> might take longer.
>
> Are there any other ideas for doing this? Or can you confirm that
> sstables2json is the way to go?
>
> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to
> query a million rows, do some calculations on them, and spit out the
> result like it's real time.
>
> Thanks for any help you can give,
>
> Martin
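PS Martin: as far as I understand it, the tool (shipped as bin/sstable2json)
dumps SSTable files straight off disk and so bypasses the read path, but it
only sees flushed data, and you have to merge row fragments across SSTables
and across nodes yourself. A rough sketch of consuming its output, assuming
the output shape I have seen (a JSON object mapping each row key to a list
of [name, value, timestamp, ...] columns; the path and process() are
placeholders):

# Rough sketch: dump one SSTable with sstable2json and iterate the rows.
# Output shape assumed: {row_key: [[name, value, timestamp, ...], ...]};
# deleted/expiring columns may carry extra flag elements, hence the
# indexing. The data file path and process() are placeholders.
import json
import subprocess

SSTABLE = '/var/lib/cassandra/data/MyKeyspace/Measurements-hc-42-Data.db'

out = subprocess.check_output(['bin/sstable2json', SSTABLE])
for row_key, columns in json.loads(out).items():
    for col in columns:
        name, value = col[0], col[1]
        process(row_key, name, value)   # stand-in for the real calculation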

