incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Huge query Cassandra limits
Date Sun, 21 Jul 2013 18:40:48 GMT
> The combination that performed best was querying for 500 rows at a time with 1000 columns, while other combinations, such as 125 rows with 4000 columns or 1000 rows with 500 columns, were about 15% slower.
I would rarely go above 100 rows, especially if you are asking for 1000 columns.

> If you consider that it also depends on the number of nodes in the cluster, the memory available, and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.

It sounds like you are targeting single-read-thread performance.
If you want to go faster, make your client do smaller requests in parallel.
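
As a minimal sketch of that approach, assuming the DataStax Java driver discussed later in this thread (the keyspace, table, and class names are illustrative, not from the thread):

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.querybuilder.QueryBuilder;
    import java.util.ArrayList;
    import java.util.List;

    // Split one huge read into chunks of at most 100 row keys and issue
    // them concurrently; each future completes independently.
    public final class ChunkedReads {
        private static final int CHUNK_SIZE = 100;

        public static List<ResultSet> read(Session session, List<String> keys) {
            List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
            for (int i = 0; i < keys.size(); i += CHUNK_SIZE) {
                List<String> chunk = keys.subList(i, Math.min(i + CHUNK_SIZE, keys.size()));
                futures.add(session.executeAsync(
                        QueryBuilder.select().all()
                                .from("my_keyspace", "my_table")   // placeholder names
                                .where(QueryBuilder.in("key", chunk.toArray()))));
            }
            List<ResultSet> results = new ArrayList<ResultSet>();
            for (ResultSetFuture f : futures) {
                results.add(f.getUninterruptibly());               // wait for each chunk
            }
            return results;
        }
    }

Note this sketch puts every chunk in flight at once; for very large key lists you would bound the number of outstanding futures (e.g. with a semaphore) so you do not recreate the original problem.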

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/07/2013, at 12:26 AM, cesare cugnasco <cesare.cugnasco@gmail.com> wrote:

> Thank you Aaron, your advice about a newer client is really interesting. We will take it into account!
> 
> Here are some numbers from our tests: we found that the inflection point is at roughly 500k elements (rows multiplied by columns requested); asking for more only decreases performance. The combination that performed best was querying for 500 rows at a time with 1000 columns, while other combinations, such as 125 rows with 4000 columns or 1000 rows with 500 columns, were about 15% slower. Other combinations show even bigger differences.
> 
> It was a cluster of 16 nodes, with 24 GB of RAM, SATA-2 SSDs, and 8-core CPUs at 2.6 GHz.
> 
> The issue is that this memory limit can be reached with many combinations of rows and columns. Broadly speaking, in choosing between more rows or more columns there is a trade-off between better parallelization and higher overhead.
> If you consider that it also depends on the number of nodes in the cluster, the memory available, and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.
>  
> Do these numbers make sense to you?
> 
> Cheers
> 
> 
> 2013/7/17 aaron morton <aaron@thelastpickle.com>
> > In our tests, we found there's a significant performance difference between the various configurations, and we are studying a policy to optimize it. Our doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
> If you provide your numbers, we can see whether you are getting expected results.
> 
> There are some limiting factors. Using the Thrift API, the max message size is 15 MB. And each row you ask for becomes (roughly) RF tasks in the thread pools on the replicas. When you ask for 1000 rows at RF = 3, that creates (roughly) 3,000 tasks on the replicas. If you have other clients trying to do reads at the same time, this can delay their reads.
> 
> Like everything in computing, more is not always better. Run some tests that try multigets of different sizes and see where improvements in the overall throughput begin to decline.
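> 
> A rough sketch of that sweep (Lists.partition comes from Guava, which the driver already ships; the keyspace and table names are placeholders):
> 
>     import com.datastax.driver.core.Session;
>     import com.datastax.driver.core.Statement;
>     import com.datastax.driver.core.querybuilder.QueryBuilder;
>     import com.google.common.collect.Lists;
>     import java.util.List;
> 
>     // Time the same total workload at several multiget sizes and watch
>     // where overall throughput stops improving.
>     public final class MultigetSweep {
>         public static void sweep(Session session, List<String> allKeys) {
>             for (int size : new int[] {50, 100, 250, 500, 1000}) {
>                 long start = System.nanoTime();
>                 for (List<String> chunk : Lists.partition(allKeys, size)) {
>                     session.execute(select(chunk));
>                 }
>                 double secs = (System.nanoTime() - start) / 1e9;
>                 System.out.printf("size=%d rows/sec=%.0f%n", size, allKeys.size() / secs);
>             }
>         }
> 
>         private static Statement select(List<String> keys) {
>             return QueryBuilder.select().all()
>                     .from("my_keyspace", "my_table")   // placeholder names
>                     .where(QueryBuilder.in("key", keys.toArray()));
>         }
>     }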
> 
> Also consider using a newer client with token-aware balancing and async networking. Again though, if you try to read everything at once you are going to have a bad day.
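> 
> For example, a minimal sketch of that setup with the DataStax Java driver (the contact point and keyspace are placeholders):
> 
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.Session;
>     import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
>     import com.datastax.driver.core.policies.TokenAwarePolicy;
> 
>     // Token-aware routing sends each request straight to a replica that
>     // owns the key; executeAsync() on the Session gives async networking.
>     public final class TokenAwareClient {
>         public static void main(String[] args) {
>             Cluster cluster = Cluster.builder()
>                     .addContactPoint("10.0.0.1")       // placeholder address
>                     .withLoadBalancingPolicy(
>                             new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
>                     .build();
>             Session session = cluster.connect("my_keyspace"); // placeholder keyspace
>             // ... issue session.executeAsync(...) per chunk here ...
>             cluster.close();
>         }
>     }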
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 17/07/2013, at 8:24 PM, cesare cugnasco <cesare.cugnasco@gmail.com> wrote:
> 
> > Hi Rob,
> > Of course, we could issue multiple requests, but then we would have to consider the optimal way to split the query into smaller ones. Moreover, we would have to choose how many sub-queries to run in parallel.
> > In our tests, we found there's a significant performance difference between the various configurations, and we are studying a policy to optimize it. Our doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
> >
> > Has anyone done a similar analysis?
> >
> >
> > 2013/7/16 Robert Coli <rcoli@eventbrite.com>
> >
> > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco <cesare.cugnasco@gmail.com>
wrote:
> > We are working on porting some life science applications to Cassandra, but we have to deal with its limits on huge queries. Our queries are usually multiget_slice ones: many rows with many columns each.
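> > For reference, such a call looks roughly like this with the raw Thrift client (a sketch only: "my_cf" and the 1000-column cap are placeholders, and client is an already-connected Cassandra.Client):
> > 
> >     import org.apache.cassandra.thrift.*;
> >     import java.nio.ByteBuffer;
> >     import java.util.List;
> >     import java.util.Map;
> > 
> >     // One multiget_slice: many row keys at once, up to `count` columns per row.
> >     SlicePredicate predicate = new SlicePredicate().setSlice_range(
> >             new SliceRange(ByteBuffer.allocate(0),  // empty start = first column
> >                            ByteBuffer.allocate(0),  // empty finish = last column
> >                            false,                   // not reversed
> >                            1000));                  // per-row column cap
> >     Map<ByteBuffer, List<ColumnOrSuperColumn>> rows = client.multiget_slice(
> >             keys, new ColumnParent("my_cf"), predicate, ConsistencyLevel.ONE);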
> >
> > You are not getting much "win" by increasing request size in Cassandra, and you expose yourself to "lose," as you have experienced.
> >
> > Is there some reason you cannot just issue multiple requests?
> >
> > =Rob
> >
> 
> 

