Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTinBxPTHRvzt4XaEXsLNg2CHMSuuD0fK7=RxdgLd@mail.gmail.com>
References: <AANLkTins4=jbpwFZn-kfaQXOU8fHeseg6XJ2wdB-8r3W@mail.gmail.com>
	<AANLkTinvah0VEeCYQFoXZXFcbLkJcwxo+32Z8mZBAsRM@mail.gmail.com>
	<AANLkTi=rFmgrMKdJf1jKTOkEqJpCSPyiOXWhg7quGtp_@mail.gmail.com>
	<AANLkTi=OagFC1M104Fq8Od09-+WMAndPm1m=nvhNPBOq@mail.gmail.com>
	<AANLkTi=dfYnSp9cpqoXXvS1bd9zFufQ3FL5suHF6KJnO@mail.gmail.com>
	<AANLkTinBxPTHRvzt4XaEXsLNg2CHMSuuD0fK7=RxdgLd@mail.gmail.com>
Date: Thu, 3 Feb 2011 17:18:32 +0200
Message-ID: <AANLkTinqzGTAM1Dj56+G3Vrn85L81Z9k5qqJjF6XeDvY@mail.gmail.com>
Subject: Re: Do supercolumns have a purpose?
From: David Boxenhorn <david@lookin2.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0003255760f64f27b2049b62471e

--0003255760f64f27b2049b62471e
Content-Type: text/plain; charset=ISO-8859-1

Well, I am an "actual active developer" and I have "managed to do pretty
nice stuffs with Cassandra" - without secondary indexes so far. But I'm
looking forward to having secondary indexes in my arsenal when new
functional requirements come up, and I'm bummed out that my early design
decision to use supercolumns wherever I could, instead of concatenating keys,
has closed off a whole lot of possibilities. I knew when I started that
secondary indexes were in the future. If I had known that they would be only
for regular column families, I wouldn't have used supercolumn families in the
first place. Now I'm pretty much stuck (too late to go back - we're
launching in March).
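
To make the two layouts concrete for anyone reading the archive, here is a
toy sketch in Python. Plain dicts stand in for a single row; the names
super_row, concat_row and the helper functions are made up for illustration
and are not part of any Cassandra API. It only mirrors the three points
Sylvain raised in the quoted mail below (what "count" counts, fetching super
columns by name, and deleting a whole super column), under the assumption
that "@" is the separator as in my original example.

```python
# Toy model only: plain dicts stand in for one Cassandra row.
# Nothing here talks to a real cluster or uses a client library.
SEP = "@"  # separator taken from the superColumnName@columnName example

# Layout A: super column family row -> {superColumnName: {columnName: value}}
super_row = {
    "alice": {"age": "30", "city": "Paris"},
    "bob": {"age": "25", "city": "Oslo"},
    "carol": {"age": "41", "city": "Rome"},
}

# Layout B: regular column family row with concatenated column names
concat_row = {
    f"{sc}{SEP}{col}": val
    for sc, cols in super_row.items()
    for col, val in cols.items()
}


def first_supers(row, count):
    """Layout A: 'count' applies to super columns."""
    return sorted(row)[:count]


def first_columns(row, count):
    """Layout B: 'count' applies to simple (concatenated) columns."""
    return sorted(row)[:count]


def slice_by_prefix(row, super_name):
    """Layout B: stand-in for one slice query per super column name."""
    prefix = super_name + SEP
    return {k: v for k, v in row.items() if k.startswith(prefix)}


def delete_super_concat(row, super_name):
    """Layout B: with no range delete, every column is removed one by one."""
    doomed = list(slice_by_prefix(row, super_name))
    for name in doomed:
        del row[name]
    return len(doomed)


if __name__ == "__main__":
    print(first_supers(super_row, 2))    # two distinct super columns
    print(first_columns(concat_row, 2))  # two columns of the same "super column"
    print(slice_by_prefix(concat_row, "bob"))
```

With count=2, Layout A returns two super columns, while Layout B returns two
simple columns that both belong to the first super column. That is the gap
behind point 1), and the per-name slicing and per-column deletes correspond
to points 2) and 3).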


On Thu, Feb 3, 2011 at 4:44 PM, Sylvain Lebresne <sylvain@datastax.com> wrote:

> On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn <david@lookin2.com> wrote:
>
>> The advantage would be to enable secondary indexes on supercolumn
>> families.
>>
>
> Then I suggest opening a ticket for adding secondary indexes to supercolumn
> families and voting on it. This will be 1 or 2 orders of magnitude less work
> than getting rid of super columns internally, and probably a much better
> solution anyway.
>
>
>> I understand from this thread that indexes on supercolumn families are
>> not going to happen:
>>
>> http://www.mail-archive.com/user@cassandra.apache.org/msg09527.html
>>
>
> I should maybe let Jonathan answer this one, but the way I understand it is
> that adding secondary indexes to super columns is not a top priority for
> actual active developers. Not that it will never ever happen. And voting for
> tickets in JIRA is one way to help raise its priority.
>
> In any case, if the goal you're pursuing is adding secondary indexes to
> super columns, then that's the ticket you should open, and if after careful
> consideration it is decided that getting rid of super columns is the best way
> to reach that goal then so be it (spoiler: it is not).
>
>
>> Which, it seems to me, effectively deprecates supercolumn families. (I
>> don't see any of the three problems you brought up as overcoming this
>> problem, except, perhaps, for special cases.)
>>
>
> You're entitled to your opinions obviously but I doubt everyone shares that
> feeling (I don't for instance). Before 0.7, there were no secondary indexes
> at all and still a bunch of people managed to do pretty nice stuffs with
> Cassandra. In particular, denormalized views are sometimes (often?)
> preferable to secondary indexes for performance reasons. For that, super
> columns are quite handy.
>
> --
> Sylvain
>
>
>>
>>
>>  On Thu, Feb 3, 2011 at 3:32 PM, Sylvain Lebresne <sylvain@datastax.com> wrote:
>>
>>> On Thu, Feb 3, 2011 at 1:33 PM, David Boxenhorn <david@lookin2.com> wrote:
>>>
>>>> Thanks Sylvain!
>>>>
>>>> Can I vote for internally implementing supercolumn families as regular
>>>> column families? (With a smooth upgrade process that doesn't require
>>>> shutting down a live cluster.)
>>>>
>>>
>>> I forgot to add that I don't know if this makes a lot of sense. That would
>>> be a fairly major refactor (so error prone), you'd still have to deal with
>>> the point I mentioned in my previous mail (for range deletes you would have
>>> to change the on-disk format for instance), and all this for no actual
>>> benefits, even downsides actually (encoded supercolumns will take more
>>> space on disk and in memory). Super columns are there and work fairly well,
>>> so what would be the point?
>>>
>>> I'm only saying that 'in theory', super columns are not the super
>>> shiny magical feature that gives you stuff you can't hope to have with only
>>> regular column families. That doesn't make them any less nice.
>>>
>>> That being said, you are free to create whatever ticket you want and vote
>>> for it. Don't expect too much support though :)
>>>
>>>
>>>> What if supercolumn families were supported as regular column families +
>>>> an index (on what used to be supercolumn keys)? Would that solve some
>>>> problems?
>>>>
>>>
>>> You'd still have to remember for each CF whether it has this index on what
>>> used to be supercolumn keys and handle those differently. I'm really not
>>> convinced this would make the code cleaner than it is now. And making the
>>> code cleaner is really the only reason I can think of for wanting to get rid
>>> of super columns internally, so ...
>>>
>>>
>>>>
>>>>
>>>> On Thu, Feb 3, 2011 at 2:00 PM, Sylvain Lebresne <sylvain@datastax.com> wrote:
>>>>
>>>>> > Is there any advantage to using supercolumns
>>>>> > (columnFamilyName[superColumnName[columnName[val]]]) instead of
>>>>> > regular columns with concatenated keys
>>>>> > (columnFamilyName[superColumnName@columnName[val]])?
>>>>> >
>>>>> > When I designed my data model, I used supercolumns wherever I needed
>>>>> > two levels of key depth - just because they were there, and I figured
>>>>> > that they must be there for a reason.
>>>>> >
>>>>> > Now I see that in 0.7 secondary indexes don't work on supercolumns or
>>>>> > subcolumns (is that right?), which seems to me like a very serious
>>>>> > limitation of supercolumn families.
>>>>> >
>>>>> > It raises the question: Is there anything that supercolumn families
>>>>> > are good for?
>>>>>
>>>>> There are a bunch of queries that you cannot do (or can do only less
>>>>> conveniently) if you encode super columns using regular columns with
>>>>> concatenated keys:
>>>>>
>>>>> 1) If you use regular columns with concatenated keys, the count
>>>>> argument counts simple columns. With super columns it counts super
>>>>> columns. It means that you can't do "give me the first 10 super columns
>>>>> of this row".
>>>>>
>>>>> 2) If you need to get x super columns by name, you'll have to issue x
>>>>> get_slice queries (one for each super column). On the client side it
>>>>> sucks. Internally in Cassandra we could do it reasonably well though.
>>>>>
>>>>> 3) You cannot remove entire super columns since there is no support for
>>>>> range deletions.
>>>>>
>>>>> Moreover, the encoding with concatenated keys uses more disk space (and
>>>>> less disk used for the same information means fewer things to read, so
>>>>> it may have a slight impact on read performance too -- it's probably
>>>>> really slight for most usage, but nevertheless).
>>>>>
>>>>> > And here's a related question: Why can't Cassandra implement
>>>>> > supercolumn families as regular column families, internally, and give
>>>>> > you that functionality?
>>>>>
>>>>> For 1) and 2) above, we could deal with those internally fairly easily
>>>>> I think, and rather well (which means it wouldn't be much worse
>>>>> performance-wise than with the current implementation of super columns,
>>>>> not that it would be better). For 3), range deletes are harder and would
>>>>> require more significant changes (that doesn't mean that Cassandra will
>>>>> never have it). Even without that, there would be the disk space lost.
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>>
>>>>
>>>
>>
>

--0003255760f64f27b2049b62471e--