cassandra-user mailing list archives

From Fernando Jimenez <fernando.jime...@wealth-port.com>
Subject Re: Practical limit on number of column families
Date Tue, 01 Mar 2016 14:11:45 GMT
Hi Jack

Being purposefully developed to only handle up to “a few hundred” tables is reason enough.
I accept that, and likely a use case with many tables was never really considered. But I would
still like to understand the design choices made, so that we can gain some confidence
in this upper limit on the number of tables. The best estimate we have so far is “a few
hundred”, which is a bit vague.

Regarding scaling, I’m not talking about scaling in terms of data volume, but in terms of how the
data is structured. One thousand tables with one row each is the same data volume as one table
with one thousand rows, excluding any data structures required to maintain the extra tables.
But whereas the first seems likely to bring a Cassandra cluster to its knees, the second will
run happily on a single-node cluster on a low-end machine.
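
To make that comparison concrete, here is a toy model of the reasoning, nothing more. It assumes each table carries some fixed cost on top of its data. The function name and both constants are placeholders I made up for illustration, not measured Cassandra figures.

```python
def estimated_footprint(tables, rows_per_table, row_bytes, per_table_overhead_bytes):
    """Toy model: raw data volume plus a fixed cost paid once per table."""
    data_bytes = tables * rows_per_table * row_bytes
    overhead_bytes = tables * per_table_overhead_bytes
    return data_bytes + overhead_bytes


# Placeholder values chosen only to show the shape of the argument.
ROW_BYTES = 100
PER_TABLE_OVERHEAD = 1_000_000  # hypothetical, NOT a measured Cassandra number

many_tables = estimated_footprint(1000, 1, ROW_BYTES, PER_TABLE_OVERHEAD)
one_table = estimated_footprint(1, 1000, ROW_BYTES, PER_TABLE_OVERHEAD)

# The data volume is identical; only the per-table term differs.
print(many_tables - one_table)
```

Whatever the real per-table cost turns out to be, the difference between the two layouts is that cost multiplied by the number of extra tables, which is why I would like to know the actual figure.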

We will design our code to use a single table to avoid having nightmares with this issue.
But if there is any authoritative documentation on this characteristic of Cassandra, I would
love to know more.
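
For what it is worth, this is roughly the shape of the single-table design we have in mind, as a minimal sketch only. The keyspace `app`, the table `entries`, the column names and the helper functions are all hypothetical, and putting the logical table name in the partition key is just one possible layout. The statements use `?` bind markers, as CQL prepared statements do.

```python
# One physical table; the former "table name" becomes a key column.
# All identifiers below are illustrative assumptions, not an existing schema.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS app.entries (
    logical_table text,
    row_key text,
    payload text,
    PRIMARY KEY ((logical_table), row_key)
)
"""

INSERT = "INSERT INTO app.entries (logical_table, row_key, payload) VALUES (?, ?, ?)"
SELECT_ONE = "SELECT payload FROM app.entries WHERE logical_table = ? AND row_key = ?"


def insert_params(logical_table, row_key, payload):
    """Return the statement and the values to bind for writing one row."""
    return (INSERT, (logical_table, row_key, payload))


def select_params(logical_table, row_key):
    """Return the statement and the values to bind for reading one row."""
    return (SELECT_ONE, (logical_table, row_key))


statement, values = insert_params("customers", "42", "some payload")
print(statement, values)
```

With this layout every logical table is a single partition, which suits our 3,000 rows per table but would need rethinking for much larger ones.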

FJ


> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
> 
> I don't think there are any "reasons behind it." It is simply empirical experience -
> as reported here.
> 
> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some
> source of information led you to believe otherwise, please point out the source so that we
> can endeavor to correct it.
> 
> The exact number of rows per node and tables per node will always have to be evaluated
> empirically, with a proof-of-concept implementation, since it all depends on the mix of capabilities
> of your hardware combined with your specific data model, your specific data values, your specific
> access patterns, and your specific load. And it also depends on your own personal tolerance
> for degradation of latency and throughput - some people might find a given set of performance
> metrics acceptable while others might not.
> 
> -- Jack Krupansky
> 
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
> Hi Tommaso
> 
> It’s not that I _need_ a large number of tables. This approach maps easily to the problem
> we are trying to solve, but it’s becoming clear it’s not the right approach.
> 
> At the moment I’m trying to understand the limitations in Cassandra regarding the number
> of tables and the reasons behind them. I’ve come to the email list as my Google-fu is not
> giving me what I’m looking for :(
> 
> FJ
> 
> 
> 
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbugli@gmail.com> wrote:
>> 
>> Hi Fernando,
>> 
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain
>> in terms of operations. Repairs were terribly slow, boot of C* slowed down, and in general
>> tracking table metrics became a bit more work. Why do you need this high number of tables?
>> 
>> Tommaso
>> 
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>> Hi Jack
>> 
>> By entry I mean row
>> 
>> Apologies for the “obsolete terminology”. When I first looked at Cassandra it
>> was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I
>> already knew. I will bear it in mind and call them tables from now on.
>> 
>> Is there any documentation about this limit? For example, I’d be keen to know how
>> much memory is consumed per table, and I’m also curious about the reasons for keeping this
>> in memory. I’m trying to understand the limitations here, rather than challenge them.
>> 
>> So far I have found nothing in my search, which is why I had to resort to some “load testing”
>> to see what happens when you push the table count high.
>> 
>> Thanks
>> FJ
>> 
>> 
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
>>> 
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>> 
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With
>>> CQL3 you should be creating "tables". The practical recommendation of an upper limit of a
>>> few hundred tables across all keyspaces remains.
>>> 
>>> Technically you can go higher and technically you can reduce the overhead per
>>> table (an undocumented Jira - intentionally undocumented since it is strongly not recommended),
>>> but... it is unlikely that you will be happy with the results.
>>> 
>>> What is the nature of the use case?
>>> 
>>> You basically have two choices: an additional clustering column to distinguish categories
>>> of table, or a separate cluster for every few hundred tables.
>>> 
>>> 
>>> -- Jack Krupansky
>>> 
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>>> Hi all
>>> 
>>> I have a use case for Cassandra that would require creating a large number of
>>> column families. I have found references to early versions of Cassandra where each column
>>> family would require a fixed amount of memory on all nodes, effectively imposing an upper
>>> limit on the total number of CFs. I have also seen rumblings that this may have been fixed
>>> in later versions.
>>> 
>>> To put the question to rest, I have set up a DSE sandbox and created some code
>>> to generate column families populated with 3,000 entries each.
>>> 
>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
>>> 
>>> So I will have to retest against Cassandra 3.0 instead.
>>> 
>>> However, I would like to understand the limitations regarding creation of column
>>> families.
>>> 
>>> 	* Is there a practical upper limit?
>>> 	* Is this a fixed limit, or does it scale as more nodes are added into the cluster?
>>> 	* Is there a difference between one keyspace with thousands of column families,
>>> 	  vs thousands of keyspaces with only a few column families each?
>>> 
>>> I haven’t found any hard evidence/documentation to help me here, but if you
>>> can point me in the right direction, I will oblige and RTFM away.
>>> 
>>> Many thanks for your help!
>>> 
>>> Cheers
>>> FJ
>>> 
>>> 
>>> 
>> 
>> 
> 
> 

