From: Adam Venturella <aventurella@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 20 Dec 2012 13:18:10 -0800
Subject: Re: Data Model Review

In the case without CQL3, where I would use composite columns, I see how this roughly lines up with what CQL3 is doing. I don't have the ability to use CQL3 since I am using pycassa for my client, so that leaves me with composite columns.

Under composite columns I would have one row, stored on one node, with a lot of columns. Basically that single node would be hit frequently and the other nodes would be ignored (assuming I have it right that a row lives on a single node).

I can then get a slice of columns using the composite prefix (username,), with the comparator reversed on photo_seq, which would give me the proper order. As I understand it, that would return the same data as the PRIMARY KEY approach, but it would be reading one row on one node, unlike the PK solution, so I would have a hotspot.
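To make the composite-column version concrete, here is roughly what I would do with pycassa. This is only a sketch: the keyspace, CF, and row key names ('Instagram', 'InstagramPhotos', 'all_photos') are placeholders, and I have not verified the exact prefix-slice semantics.

# Rough pycassa sketch (names are placeholders): one wide row, columns named
# (username, posted_at) under a CompositeType comparator, newest-first.
import json
from datetime import datetime

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.system_manager import SystemManager
from pycassa.types import CompositeType, UTF8Type, DateType

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family(
    'Instagram', 'InstagramPhotos',
    comparator_type=CompositeType(UTF8Type(), DateType(reversed=True)))

pool = ConnectionPool('Instagram', ['localhost:9160'])
photos = ColumnFamily(pool, 'InstagramPhotos')

# write: every photo becomes one column in the single wide row
photos.insert('all_photos',
              {('adam', datetime.utcnow()): json.dumps({'filter': 'lofi'})})

# read: slice on the username prefix; newest first because DateType is reversed
recent = photos.get('all_photos',
                    column_start=('adam',),
                    column_finish=('adam',),
                    column_count=20)

Every write and read above lands on the same row key, which is exactly the hotspot I am worried about.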
The PRIMARY KEY solution creates multiple rows, but they effectively act as one wide row, with the benefit of being distributed across the nodes since they are independent rows (using username as the partition key) instead of living in one row on one node. If my assumptions above are correct, the PK solution is clearly better than the single-row solution.

In doing some reading, I have come across a solution where you manually partition the row keys so you spread the load more evenly. The DataStax docs talk about this approach under "High Throughput Timelines": http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

Would you advise the manual partitioning approach? A rough sketch of what I have in mind is below.

My other option is to store all of the photos by their id, or generate my own canonical id based on their id and some other factors, into rows, and then have a kind of hybrid index row per username that not only references the photo_id rows but potentially contains some extra information for rendering a result set.
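Here is a minimal sketch of the manual partitioning idea with pycassa, bucketing the row key by month along the lines of the time-series post. The bucket format and CF setup are my own assumptions.

# Minimal sketch of manual partitioning (key format is an assumption):
# the row key is "<username>:<YYYY-MM>", so one user's photos spread across
# one row per month instead of one giant row.
import json
import pycassa


def photo_row_key(username, posted_at):
    return '%s:%s' % (username, posted_at.strftime('%Y-%m'))


def store_photo(photos_cf, username, posted_at, photo_dict):
    # column name is the timestamp, column value is the JSON blob
    photos_cf.insert(photo_row_key(username, posted_at),
                     {posted_at: json.dumps(photo_dict)})


def recent_photos(photos_cf, username, month_buckets, count=20):
    # walk the newest buckets first until we have enough columns;
    # assumes the comparator stores columns newest-first (reversed DateType)
    results = []
    for bucket in month_buckets:
        key = '%s:%s' % (username, bucket)
        try:
            cols = photos_cf.get(key, column_count=count - len(results))
        except pycassa.NotFoundException:
            continue
        results.extend(cols.items())
        if len(results) >= count:
            break
    return results

The trade-off, as I understand the post, is that reads may have to touch several row keys to assemble one timeline.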
On Tue, Dec 18, 2012 at 8:13 PM, aaron morton <aaron@thelastpickle.com> wrote:

> I have heard it best to try and avoid the use of super columns for now.
>
> Yup.
>
> Your model makes sense. If you are creating the CF using the cassandra-cli
> you will probably want to reverse order the column names, see
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> If you want to use CQL 3 you could do something like this:
>
> CREATE TABLE InstagramPhotos (
>     user_name text,
>     photo_seq timestamp,
>     meta_1 text,
>     meta_2 text,
>     PRIMARY KEY (user_name, photo_seq)
> );
>
> That's pretty much the same. user_name is the row key, and photo_seq will
> be used as part of a composite column name internally.
> (You can do the same thing without CQL, just look up composite columns.)
>
> You can do something similar for the annotations.
>
> Depending on your use case I would use UNIX epoch time if possible rather
> than a time uuid.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/12/2012, at 4:35 AM, Adam Venturella <aventurella@gmail.com> wrote:
>
> My use case is capturing some information about Instagram photos from the
> API. I have two use cases. One, I need to capture all of the media data for
> an account, and two, I need to be able to privately annotate that data.
> There is some nuance in this, multiple HTTP queries for example, but
> ignoring that, and assuming I have obtained all of the data surrounding an
> account's photos, here is how I was thinking of storing that information
> for use case 1.
>
> ColumnFamily: InstagramPhotos
>
> Row Key: <account_username>
>
> Columns:
> Column Name: <date_posted_timestamp>
> Column Value: JSON representing the data for the individual photo (filter,
> comments, likes etc, not the binary photo data).
>
> So the idea would be to keep adding columns to the row that contain the
> serialized data (in JSON) with their timestamps as the names. Timestamps as
> the column names, I figure, should help to perform range queries, where the
> first column inserted has the earliest timestamp and the last column
> inserted the most recent. I could probably also use TimeUUIDs here since I
> will have things ordered prior to inserting.
>
> The question here: does this approach make sense? Is it common to store
> JSON in columns like this? I know there are super columns as well, so I
> could use those instead of JSON, I suppose. The extra level of indexing
> would probably be useful to query specific photos for use case 2. I have
> heard it is best to try and avoid the use of super columns for now. I have
> no information to back that claim up other than some time spent in IRC, so
> feel free to debunk that statement if it is false.
>
> So that is use case one; use case two covers the private annotations.
>
> I figured here:
>
> ColumnFamily: InstagramAnnotations
> Row Key: Canonical Media Id
>
> Column Name: TimeUUID
> Column Value: JSON representing an annotation/internal comment
>
> Writing out the above, I can actually see where I might need to tighten
> some things up around how I store the photos. I am clearly missing an
> obvious connection between the InstagramPhotos and the InstagramAnnotations;
> maybe super columns would help with the photos instead of JSON? Otherwise I
> would need to build an index row where I tie the canonical photo id to a
> timestamp (column name) in the InstagramPhotos. I could also try to figure
> out how to make a TimeUUID of my own that can double as the media's
> canonical id, or further look at Instagram's canonical id for photos and
> see if it already counts up, in which case I could use that in place of a
> timestamp.
>
> Anyway, I figured I would see if anyone might help flesh out other
> potential pitfalls in the above. I am definitely new to Cassandra and I am
> using this project as a way to learn more about assembling systems using it.
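For the InstagramAnnotations CF described above, this is roughly how I was planning to write and read annotations with pycassa. Again just a sketch; the keyspace name and the assumption of a TimeUUIDType comparator are mine.

# Rough sketch of the InstagramAnnotations CF (keyspace name is a placeholder);
# column names are TimeUUIDs so annotations sort chronologically,
# column values are JSON blobs.
import json
import time

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.util import convert_time_to_uuid

pool = ConnectionPool('Instagram', ['localhost:9160'])
annotations = ColumnFamily(pool, 'InstagramAnnotations')  # assumes TimeUUIDType comparator


def annotate(media_id, note_dict):
    # generate a TimeUUID column name for "now"
    col_name = convert_time_to_uuid(time.time(), randomize=True)
    annotations.insert(media_id, {col_name: json.dumps(note_dict)})


def annotations_for(media_id, count=50):
    # newest annotations first
    return annotations.get(media_id, column_count=count, column_reversed=True)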