incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Unable to fetch large amount of rows
Date Wed, 20 Mar 2013 09:32:23 GMT
> The query returns fine if I request a smaller number of entries (it takes 15
> seconds to return 20K records).
That feels a little slow, but it depends on the data model, the query type, the server,
and a bunch of other things.

> However, as I increase the limit on the
> number of entries, the response begins to slow down. It results in a
> TimedOutException.
Make many smaller requests. 
This is often faster.
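A minimal sketch of that client-side paging, in Python. The `fetch` callable, the page size, and the `mac` field are assumptions standing in for your real query (e.g. `SELECT ... WHERE TimeStamp = ? AND MACAddress > ? LIMIT ?`); the toy in-memory data just exercises the loop.

```python
def page_all(fetch, page_size=1000):
    """Collect every row by issuing many small LIMITed requests,
    resuming each request after the last clustering column seen."""
    rows, last = [], None
    while True:
        page = fetch(last, page_size)
        if not page:
            break
        rows.extend(page)
        last = page[-1]["mac"]      # resume point for the next request
        if len(page) < page_size:   # a short page means we are done
            break
    return rows

# Toy stand-in for a 2500-row partition, ordered by "mac".
data = [{"mac": "%012x" % i, "bytes": i} for i in range(2500)]

def fake_fetch(start_after, limit):
    remaining = [r for r in data if start_after is None or r["mac"] > start_after]
    return remaining[:limit]

result = page_all(fake_fetch, page_size=1000)
print(len(result))  # 2500
```

Each request stays small and constant-size, so no single call risks the server-side timeout.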

> Isn't it the case that all the data for a partitionID is stored sequentially
> on disk?
Yes and no.
Within each SSTable (data file), all the columns for one partition / row are stored in
comparator order. But there may be many files.

> If that is so, then why does fetching this data take such a long
> time?
You need to work out where the time is being spent.
Add timing to your app, use nodetool proxyhistograms to see how long requests take at
the coordinator, and use nodetool cfhistograms to see how long they take at the disk level.

Look at your data model: are you reading data in the natural order of the comparator?

> If disk throughput is 40 MB/s, then assuming sequential
> reads, the response should come pretty quickly.
There is more involved than doing one read from disk and returning it. 

> If it is stored
> sequentially, why does C* take so much time to return the records?
It is always going to take time to read 500,000 columns. It will take time on the client to
allocate the 2 to 4 million objects needed to represent them, and once allocated those
objects will probably take more than 40 MB of RAM.
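A back-of-envelope version of that arithmetic. The per-object figures below are illustrative assumptions (rough JVM object overheads), not measurements; the point is that the in-memory footprint is a multiple of the on-disk payload.

```python
# Why 40 MB on disk is not 40 MB in the client (illustrative numbers only).
columns = 500_000
objects_per_column = 4            # name, value, timestamp, wrapper (assumption)
overhead_per_object = 48          # rough per-object heap cost in bytes (assumption)
payload_bytes = 40 * 1024 * 1024  # the 40 MB of raw data from the question

objects = columns * objects_per_column
heap = payload_bytes + objects * overhead_per_object
print(objects)                    # two million objects to allocate
print(heap // (1024 * 1024))     # 131 (MB) -- several times the raw payload
```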

Do some tests at a smaller scale: start with 500 or 1000 columns, then get bigger, to get a
feel for what is practical in your environment. Often it's better to make many smaller,
constant-size requests.
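A small harness for that scale-up test: time the same query at increasing LIMITs and watch where latency stops growing roughly linearly. `run_query` is a placeholder for your real client call (an assumption); the lambda below is just a toy workload to exercise the harness.

```python
import time

def probe(run_query, limits=(500, 1000, 5000, 20000)):
    """Run the query at each limit and record wall-clock latency."""
    timings = []
    for limit in limits:
        start = time.perf_counter()
        run_query(limit)
        elapsed = time.perf_counter() - start
        timings.append((limit, elapsed))
        print("limit=%-6d %.1f ms" % (limit, elapsed * 1000))
    return timings

# Toy stand-in whose cost grows with the limit.
timings = probe(lambda n: sum(range(n * 100)))
```

If latency jumps disproportionately at some limit, that is roughly where a single request stops being practical and paging should take over.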

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/03/2013, at 9:38 PM, Pushkar Prasad <pushkar.prasad@airtightnetworks.net> wrote:

> Aaron,
> 
> Thanks for your reply. Here are the answers to questions you had asked:
> 
> I am trying to read all the rows that have a particular TimeStamp. In my
> database, there are 500K entries for a particular TimeStamp. That means
> about 40 MB of data.
> 
> The query returns fine if I request a smaller number of entries (it takes 15
> seconds to return 20K records). However, as I increase the limit on the
> number of entries, the response begins to slow down. It results in a
> TimedOutException.
> 
> Isn't it the case that all the data for a partitionID is stored sequentially
> on disk? If that is so, then why does fetching this data take such a long
> time? If disk throughput is 40 MB/s, then assuming sequential
> reads, the response should come pretty quickly. Is it not the case that the
> data I am trying to fetch is stored sequentially? If it is stored
> sequentially, why does C* take so much time to return the records? And if
> the data is stored sequentially, is there any alternative that would allow me to
> fetch all the records quickly (by a sequential disk fetch)?
> 
> Thanks
> Pushkar
> 
> -----Original Message-----
> From: aaron morton [mailto:aaron@thelastpickle.com] 
> Sent: 19 March 2013 13:11
> To: user@cassandra.apache.org
> Subject: Re: Unable to fetch large amount of rows
> 
>> I have 1000 timestamps, and for each timestamp, I have 500K different
> MACAddress.
> So you are trying to read about 2 million columns ? 
> 500K MACAddresses each with 3 other columns? 
> 
>> When I run the following query, I get RPC Timeout exceptions:
> What is the exception?
> Is it a client-side socket timeout or a server-side TimedOutException?
> 
> If my understanding is correct, then try reading fewer columns and/or check
> the server-side logs. It sounds like you are trying to read too much,
> though.
> 
> Cheers
> 
> 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19/03/2013, at 3:51 AM, Pushkar Prasad
> <pushkar.prasad@airtightnetworks.net> wrote:
> 
>> Hi,
>> 
>> I have following schema:
>> 
>> TimeStamp
>> MACAddress
>> Data Transfer
>> Data Rate
>> LocationID
>> 
>> PKEY is (TimeStamp, MACAddress). That means partitioning is on TimeStamp,
> data is ordered by MACAddress, and it is stored together physically (let me
> know if my understanding is wrong). I have 1000 timestamps, and for each
> timestamp, I have 500K different MACAddresses.
>> 
>> When I run the following query, I get RPC Timeout exceptions:
>> 
>> 
>> Select * from db_table where Timestamp='...'
>> 
>> From my understanding, this should give all the rows with just one disk
> seek, as all the records for a particular TimeStamp are stored together. This
> should be very quick; however, that clearly doesn't seem to be the case. Is
> there something I am missing here? Your help would be greatly appreciated.
>> 
>> Thanks
>> PP
> 
> 
> 

