Subject: Re: Querying all keys in a column family
From: Martin Arrowsmith <arrowsmith.martin@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 25 Feb 2012 17:21:45 -0800

Hi Alexandru,

Things got hectic and I put off the project until this weekend. I'm actually learning about Hadoop right now and how to implement it. I can respond to this thread when I have something running.

In the meantime, I'd like to bump this email up and see if there are others who can provide some feedback:

1) Will Hadoop speed up the time to read all the rows?
2) Are there other options?

My guess was that Hadoop could split up the job so that each node handles a portion of the query; for instance, having 2 nodes would do the job twice as fast. That is my naive guess, though, and it could be far from the truth.
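For illustration, here is a rough, untested sketch of what such a job might look like, based on the word_count example that ships with Cassandra's Hadoop support (ColumnFamilyInputFormat). The keyspace, column family, and column names below are placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanAllRows {

    // Each map() call receives one Cassandra row: its key plus the columns selected by the slice predicate.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException {
            // Per-row calculation goes here; this placeholder just emits the row key and its column count.
            context.write(new Text(ByteBufferUtil.bytesToHex(key)), new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "scan-all-rows");
        job.setJarByClass(ScanAllRows.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/scan-all-rows-output"));

        Configuration conf = job.getConfiguration();
        // Point the input format at the cluster; input splits then line up with each node's token ranges.
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");   // placeholder names
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("my_column"))); // placeholder column
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As Aaron points out in the quoted thread, this is still batch processing rather than real time; the gain is only that each map task reads the rows held locally by its node instead of one client pulling every row over Thrift.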
Best wishes,

Martin

On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe wrote:

> Hi Aaron and Martin,
>
> Sorry about my previous reply, I thought you wanted to process only all the row keys in the CF.
>
> I have a similar issue as Martin because I see myself being forced to hit more than a million rows with a query (I only get a few columns from every row). Aaron, we've talked about this in another thread: basically I am constrained to ship a window of data out of my online cluster to an offline cluster. For this I need to read, for example, a 5 min window of all the data I have. This simply accesses too many rows and I am hitting the I/O limit on the nodes. As I understand it, every row will take 2 random disk seeks (I have no caches).
>
> My question is, what can I do to improve the performance of shipping windows of data entirely out?
>
> Martin, did you use Hadoop as Aaron suggested? How did that work with Cassandra? I don't understand how accessing a million rows through map reduce jobs would be any faster.
>
> Cheers,
> Alexandru
>
>
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton wrote:
>
>> If you want to process 1 million rows, use Hadoop with Hive or Pig. If you use Hadoop you are not doing things in real time.
>>
>> You may need to rephrase the problem.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>>
>> Hi Experts,
>>
>> My program is such that it queries all keys in Cassandra. I want to do this as quickly as possible, in order to get as close to real time as possible.
>>
>> One solution I heard was to use the sstable2json tool and read the data in as JSON. I understand that reading each row from Cassandra might take longer.
>>
>> Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?
>>
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.
>>
>> Thanks for any help you can give,
>>
>> Martin
