From: Fernando Jimenez <fernando.jimenez@wealth-port.com>
Subject: Re: Practical limit on number of column families
Date: Tue, 1 Mar 2016 15:11:45 +0100
To: user@cassandra.apache.org
Message-Id: <78A013D0-3EB7-4E62-95EA-F87A26959738@wealth-port.com>
Hi Jack

Being purposefully developed to only handle up to "a few hundred" tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices made, so perhaps we can gain some confidence in this upper limit on the number of tables. The best estimate we have so far is "a few hundred", which is a bit vague.

Regarding scaling, I'm not talking about scaling in terms of data volume, but in how the data is structured. One thousand tables with one row each hold the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. But whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single-node cluster on a low-end machine.

We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.

FJ

> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.
>
> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to be evaluated empirically - a proof-of-concept implementation - since it all depends on the capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. And it also depends on your own personal tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
> Hi Tommaso
>
> It's not that I _need_ a large number of tables. This approach maps easily to the problem we are trying to solve, but it's becoming clear it's not the right approach.
>
> At the moment I'm trying to understand the limitations in Cassandra regarding the number of tables, and the reasons behind them. I've come to the mailing list as my Google-fu is not giving me what I'm looking for :(
>
> FJ
>
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbugli@gmail.com> wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain in terms of operations. Repairs were terribly slow, boot of C* slowed down, and in general tracking table metrics becomes a bit more work. Why do you need this high number of tables?
>>
>> Tommaso
>>
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>> Hi Jack
>>
>> By entry I mean row.
>>
>> Apologies for the "obsolete terminology". When I first looked at Cassandra it was still on CQL2, and now that I'm looking at it again I've defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
>>
>> Is there any documentation about this limit? For example, I'd be keen to know how much memory is consumed per table, and I'm also curious about the reasons for keeping this in memory. I'm trying to understand the limitations here, rather than challenge them.
>>
>> So far I have found nothing in my search, hence why I had to resort to some "load testing" to see what happens when you push the table count high.
>>
>> Thanks
>> FJ
>>
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
>>>
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all keyspaces remains.
>>>
>>> Technically you can go higher, and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
>>>
>>> What is the nature of the use case?
>>>
>>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>>> Hi all
>>>
>>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>>>
>>> To put the question to rest, I have set up a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>>>
>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>
>>> So I will have to retest against Cassandra 3.0 instead.
>>>
>>> However, I would like to understand the limitations regarding creation of column families:
>>>
>>> * Is there a practical upper limit?
>>> * Is this a fixed limit, or does it scale as more nodes are added to the cluster?
>>> * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>>>
>>> I haven't found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>>>
>>> Many thanks for your help!
>>>
>>> Cheers
>>> FJ
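[Editor's note] The single-table design the thread converges on - folding the would-be table name into the partition key instead of creating thousands of tables - can be sketched as below. All keyspace, table, and column names here are illustrative assumptions, not taken from the thread; the CQL is built as plain strings so no cluster is needed.

```python
# Sketch of the single-table alternative: one Cassandra table whose
# partition key carries the "logical table" name, instead of one real
# table per dataset. Names (app.items, dataset, item_id, payload) are
# hypothetical placeholders.

SCHEMA = """
CREATE TABLE IF NOT EXISTS app.items (
    dataset text,   -- the would-be table name, now part of the key
    item_id text,
    payload text,
    PRIMARY KEY ((dataset), item_id)
)
""".strip()

def insert_stmt(dataset: str, item_id: str, payload: str) -> str:
    """Build a (minimally escaped) INSERT for one row of a logical dataset."""
    def esc(s: str) -> str:
        return s.replace("'", "''")  # CQL doubles single quotes in literals
    return (
        "INSERT INTO app.items (dataset, item_id, payload) "
        f"VALUES ('{esc(dataset)}', '{esc(item_id)}', '{esc(payload)}')"
    )

if __name__ == "__main__":
    # A thousand logical "tables" become a thousand partitions of one table.
    print(SCHEMA)
    print(insert_stmt("customers_0042", "row-1", "hello"))
```

With this shape, queries for one logical dataset stay single-partition (`WHERE dataset = ?`), and the per-table fixed overhead the thread worries about is paid once.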
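[Editor's note] FJ asks how much memory each table consumes. A figure often quoted for Cassandra 2.x is on the order of 1 MB of heap per table per node for its memtable arena - that number is an assumption used here only for back-of-envelope illustration, not something stated in the thread; measure on your own cluster.

```python
# Rough per-node heap cost of table count alone, independent of row count.
# HEAP_PER_TABLE_MB is an assumed rule-of-thumb value, not a measured figure.

HEAP_PER_TABLE_MB = 1.0  # assumed fixed overhead per table, per node

def table_overhead_mb(num_tables: int,
                      per_table_mb: float = HEAP_PER_TABLE_MB) -> float:
    """Fixed heap consumed by table metadata/memtables on every node."""
    return num_tables * per_table_mb

# "A few hundred" tables costs a few hundred MB of heap on every node;
# thousands of one-row tables can eat a large slice of a typical 8 GB heap,
# while one table with thousands of rows pays the fixed cost exactly once.
print(table_overhead_mb(300))   # 300.0
print(table_overhead_mb(3000))  # 3000.0
```

This is why a thousand tables with one row each behaves so differently from one table with a thousand rows, even though the data volume is identical.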