From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Unable to fetch large amount of rows
Date: Fri, 22 Mar 2013 06:41:13 +1300
To: user@cassandra.apache.org

> + Did run cfhistograms, the results are interesting (Note: row cache is
> disabled):

SSTables in cfhistograms is a friend here. It tells you how many sstables were read from per read; if it's above 3 I take a look at the data model. In your case I would be wondering how long that row with the timestamp is written to. Is it spread over many sstables?

> + 75% time is spent on disk latency

Do you mean 75% of the latency reported by proxyhistograms is also reported by cfhistograms?

> +++ When query made on node on which all the records are not present

Do you mean the co-ordinator for the request was not a replica for the row?

> + If my query is
>
>        - select * from schema where timestamp = '..' ORDER BY MacAddress,
> would that be faster than, say
>
>        - select * from schema where timestamp = '..'

As usual in a DB, it's faster to not re-order things. I'd have to check if the ORDER BY will no-op when it's the same as the clustering columns; for now let's just keep it out.

> 2) Why does response time suffer when query is made on a node on which
> records to be returned are not present?
> In order to be able to get better
> response when queried from a different node, can something be done?

During a read, one node is asked to return the data and the others to return a digest of their data. When the read runs on a node that is a replica, the data read is done locally and the others are asked for a digest; this can lead to better performance. If you are asking for a large row this will have a larger impact.

Astyanax can direct reads to nodes which are replicas.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/03/2013, at 4:48 PM, Pushkar Prasad <pushkar.prasad@airtightnetworks.net> wrote:

> Yes, I'm reading from a single partition.
>
> -----Original Message-----
> From: Hiller, Dean [mailto:Dean.Hiller@nrel.gov]
> Sent: 21 March 2013 01:38
> To: user@cassandra.apache.org
> Subject: Re: Unable to fetch large amount of rows
>
> Is your use case reading from a single partition? If so, you may want to
> switch to something like PlayOrm, which does virtual partitions so you still
> get the performance of multiple disks when reading from a single partition.
> My understanding is that a single Cassandra partition exists on a single node.
> Anyway, just an option if that is your use case.
>
> Later,
> Dean
>
> From: Pushkar Prasad <pushkar.prasad@airtightnetworks.net>
> Reply-To: user@cassandra.apache.org
> Date: Wednesday, March 20, 2013 11:41 AM
> To: user@cassandra.apache.org
> Subject: RE: Unable to fetch large amount of rows
>
> Hi Aaron,
>
> I added pagination, and things seem to have started performing much better.
> With a 1000 page size, I'm now able to fetch 500K records in 25-30 seconds.
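The paging Pushkar added can be sketched as follows. This is a minimal Python stand-in for the idea, not his actual Astyanax code; the slice function, tuple shape, and page size are assumptions for illustration:

```python
# Sketch of client-side paging over one wide partition. An Astyanax-style
# slice query is replaced by a plain function; everything here is an
# illustrative stand-in, not real driver code.

def fetch_partition_paged(fetch_page, page_size=1000):
    """Read one wide partition in fixed-size pages.

    fetch_page(start, limit) is assumed to behave like a Cassandra
    column-slice query: it returns up to `limit` (clustering_key, value)
    pairs with clustering_key > start, in comparator order.
    """
    results = []
    start = None  # None = from the beginning of the row
    while True:
        page = fetch_page(start, page_size)
        results.extend(page)
        if len(page) < page_size:
            break  # short page: the partition is exhausted
        start = page[-1][0]  # resume after the last key we saw
    return results

# Stand-in for the server side: one partition with 2500 columns.
partition = [(mac, "data-%d" % mac) for mac in range(2500)]

def fake_slice(start, limit):
    rows = [c for c in partition if start is None or c[0] > start]
    return rows[:limit]

all_columns = fetch_partition_paged(fake_slice, page_size=1000)
```

Many small, constant-size requests like this bound the memory and latency of each round trip, which is why paging avoids the timeouts seen with a single 500K-column read.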
> However, I'd like to point you to some interesting observations:
>
> + Did run cfhistograms, the results are interesting (Note: row cache is
> disabled):
> +++ When query made on node on which all the records are present
>        + 75% time is spent on disk latency
>        + Example: When 50K entries were fetched, it took 2.65 seconds, out
> of which 1.92 seconds were spent in disk latency
> +++ When query made on node on which all the records are not present
>        + Considerable amount of time is spent on things other than disk
> latency (probably deserialization/serialization, network, etc.)
>        + Example: When 50K entries were fetched, it took 5.74 seconds, out
> of which 2.21 seconds were spent in disk latency.
>
> I've used Astyanax to run the above queries. The results were the same when run
> with different data points. Compaction has not been done after data
> population yet.
>
> I have a few questions:
> 1) Is it necessary to fetch the records in the natural order of the comparator
> column in order to get high throughput? I'm trying to fetch all the
> records for a particular partition ID without any ordering on the comparator
> column. Would that slow down the response? Consider that timestamp is the
> partition ID, and MacAddress is the natural comparator column.
>    + If my query is
>        - select * from schema where timestamp = '..' ORDER BY MacAddress,
> would that be faster than, say
>        - select * from schema where timestamp = '..'
> 2) Why does response time suffer when query is made on a node on which
> records to be returned are not present? In order to be able to get better
> response when queried from a different node, can something be done?
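On question 1, the principle Aaron describes at the top of the thread can be sketched like this. It is an illustration of why an ORDER BY matching the clustering order can be free while any other order forces a sort, not a model of Cassandra's actual read path:

```python
# Columns in a partition are stored in comparator (clustering) order, so
# returning them in that order costs nothing extra; asking for any other
# order forces a re-sort somewhere. Illustrative sketch only.

def read_partition(rows, clustering_key, order_by=None):
    """rows are assumed pre-sorted by clustering_key, as on disk."""
    if order_by is None or order_by == clustering_key:
        return rows  # already in comparator order: nothing to do
    return sorted(rows, key=lambda r: r[order_by])

# Stand-in partition, stored sorted by "mac" (the clustering column).
stored = [{"mac": m, "rate": 100 - m} for m in range(5)]

same_order = read_partition(stored, "mac", order_by="mac")   # no-op
resorted = read_partition(stored, "mac", order_by="rate")    # needs a sort
```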
>
> Thanks
> Pushkar
> ________________________________
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: 20 March 2013 15:02
> To: user@cassandra.apache.org
> Subject: Re: Unable to fetch large amount of rows
>
> The query returns fine if I request a smaller number of entries (takes 15
> seconds for returning 20K records).
> That feels a little slow, but it depends on the data model, the query type,
> the server, and a bunch of other things.
>
> However, as I increase the limit on
> number of entries, the response begins to slow down. It results in
> TimedOutException.
> Make many smaller requests.
> This is often faster.
>
> Isn't it the case that all the data for a partition ID is stored sequentially
> on disk?
> Yes and no.
> In each file all the columns of one partition / row are stored in comparator
> order. But there may be many files.
>
> If that is so, then why does fetching this data take such a long
> amount of time?
> You need to work out where the time is being spent.
> Add timing to your app, use nodetool proxyhistograms to see how long the
> requests take at the co-ordinator, and use nodetool cfhistograms to see how long
> they take at the disk level.
>
> Look at your data model: are you reading data in the natural order of the
> comparator?
>
> If disk throughput is 40 MB/s, then assuming sequential
> reads, the response should come pretty quickly.
> There is more involved than doing one read from disk and returning it.
>
> If it is stored
> sequentially, why does C* take so much time to return the records?
> It is always going to take time to read 500,000 columns. It will take time
> on the client to allocate the 2 to 4 million objects needed to represent
> them. And once it comes to allocating those objects it will probably take
> more than 40 MB in RAM.
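Aaron's object-count figure can be sanity-checked with a back-of-envelope calculation. The per-column object count and per-object size below are assumed, illustrative numbers (name, value, timestamp plus a wrapper, at JVM-like overheads), not measurements:

```python
# Rough client-side heap estimate for materialising a wide row.
# objects_per_column and bytes_per_object are guesses for illustration.

def client_heap_estimate(n_columns, objects_per_column=4, bytes_per_object=48):
    n_objects = n_columns * objects_per_column
    return n_objects, n_objects * bytes_per_object

objects, heap_bytes = client_heap_estimate(500_000)
# At these assumed overheads: 2,000,000 objects and ~96 MB of heap,
# consistent with "2 to 4 million objects" and "more than 40 MB in RAM".
```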
>
> Do some tests at a smaller scale: start with 500 or 1000 columns, then get
> bigger, to get a feel for what is practical in your environment. Often it's
> better to make many smaller / constant-size requests.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/03/2013, at 9:38 PM, Pushkar Prasad <pushkar.prasad@airtightnetworks.net> wrote:
>
> Aaron,
>
> Thanks for your reply. Here are the answers to the questions you had asked:
>
> I am trying to read all the rows which have a particular TimeStamp. In my
> database, there are 500K entries for a particular TimeStamp. That means
> about 40 MB of data.
>
> The query returns fine if I request a smaller number of entries (takes 15
> seconds for returning 20K records). However, as I increase the limit on the
> number of entries, the response begins to slow down. It results in
> TimedOutException.
>
> Isn't it the case that all the data for a partition ID is stored sequentially
> on disk? If that is so, then why does fetching this data take such a long
> amount of time? If disk throughput is 40 MB/s, then assuming sequential
> reads, the response should come pretty quickly. Is it not the case that the
> data I am trying to fetch would be sequentially stored? If it is stored
> sequentially, why does C* take so much time to return the records? And if
> data is stored sequentially, is there any alternative that would allow me to
> fetch all the records quickly (by sequential disk fetch)?
>
> Thanks
> Pushkar
>
> -----Original Message-----
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: 19 March 2013 13:11
> To: user@cassandra.apache.org
> Subject: Re: Unable to fetch large amount of rows
>
> I have 1000 timestamps, and for each timestamp, I have 500K different
> MACAddresses.
> So you are trying to read about 2 million columns?
> 500K MACAddresses, each with 3 other columns?
>
> When I run the following query, I get RPC Timeout exceptions:
> What is the exception?
> Is it a client-side socket timeout or a server-side TimedOutException?
>
> If my understanding is correct, then try reading fewer columns and/or check
> the server side for logs. It sounds like you are trying to read too much,
> though.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/03/2013, at 3:51 AM, Pushkar Prasad <pushkar.prasad@airtightnetworks.net> wrote:
>
> Hi,
>
> I have the following schema:
>
> TimeStamp
> MACAddress
> Data Transfer
> Data Rate
> LocationID
>
> PKEY is (TimeStamp, MACAddress). That means partitioning is on TimeStamp,
> and data is ordered by MACAddress and stored together physically (let me
> know if my understanding is wrong). I have 1000 timestamps, and for each
> timestamp, I have 500K different MACAddresses.
>
> When I run the following query, I get RPC Timeout exceptions:
>
> Select * from db_table where Timestamp='...'
>
> From my understanding, this should give all the rows with just one disk
> seek, as all the records are for a particular TimeStamp. This should be very
> quick; however, clearly, that doesn't seem to be the case. Is there
> something I am missing here? Your help would be greatly appreciated.
>
> Thanks
> PP
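The mental model behind PKEY (TimeStamp, MACAddress), which the thread confirms as broadly correct, can be written down as a toy sketch: TimeStamp is the partition key (it picks the replica nodes), MACAddress is the clustering column (it orders cells within the partition). This is purely illustrative, not Cassandra internals:

```python
# Toy model of the storage layout for PKEY (TimeStamp, MACAddress).
from collections import defaultdict

table = defaultdict(dict)  # partition key -> {clustering key: other columns}

def insert_row(ts, mac, data_transfer, data_rate, location_id):
    table[ts][mac] = (data_transfer, data_rate, location_id)

def select_by_timestamp(ts):
    """Like `select * from db_table where Timestamp = ts`: one partition,
    returned in MACAddress (comparator) order, mirroring on-disk layout."""
    return sorted(table[ts].items())

insert_row("t1", "aa:02", 10, 54, "L1")
insert_row("t1", "aa:01", 20, 11, "L2")
insert_row("t2", "aa:01", 30, 54, "L3")

rows = select_by_timestamp("t1")
```

The catch discussed above is that one logical partition may still be spread over several sstables on disk, so "one disk seek" is the best case, not a guarantee.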