incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Venturella <aventure...@gmail.com>
Subject Re: Correct way to design a cassandra database
Date Fri, 21 Dec 2012 14:44:05 GMT
I am pretty new to cassandra as well. But here goes nothing:

Assumptions:
- You are using a CQL3 client

- Remember I am a n00bsauce at this as well, so another member of the list
may, and probably does, have a better more enlightened answer than I.
Everyone was new to this a one time though, and you gotta start somewhere,
so here goes:

- This is a long-ish message as it represents a train of thought.

Based on reading here:
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

Store the user accounts:

CREATE TABLE Users (
    user_name text,
    password text,
    PRIMARY KEY (user_name)
);


Store the Users Photos

CREATE TABLE Photos (
    user_name text,
    created_time timestamp,
    image_url text,
    meta_data1 text,
    meta_dataN text,
    PRIMARY KEY (user_name, created_time)
) WITH CLUSTERING ORDER BY (created_time DESC);

Uses a compound Primary key to make a wide row, allows you to: 'SELECT *
FROM Photos WHERE user_name = 'the_user'; and just get the users's photos
ordered by most recent. The meta columns are just for you to store whatever
you like, or you can have 1 meta column for example: 'data' and just store
some JSON that represents more info about the photo. Something along those
lines.

For albums we will see a bit of data duplication, which I think is par for
the course in something like this. The idea with cassandra is storage is
cheap so duplicating info a bit here and there is acceptable. This is also
the part where I feel like someone with more experience may have a better
answer than I.

CREATE TABLE PhotosAlbums (
    user_name text,
    album_name text,
    image_url text,
    PRIMARY KEY (user_name, album_name)
);


So we are duplicating user_name and image_url. The limitation here is you
would not be getting any user defined ordering of the album, but you would
be able to run just one query and get all of the photos in an album:

SELECT * FROM PhotoAlbums WHERE user_name = "the_user" AND album_name =
"the_album_name"

If you needed to get a list of all of the photo albums for a user you just
need to:

SELECT * FROM PhotoAlbums WHERE user_name = "the_user"

This would give you results, the issue with this PhotoAlbums
ColumnFamily/Table is that deleting individual photos from it I think is
problematic with the definition I have provided. I'm pretty sure I didn't
leave a way to delete an individual photo. It may be that you need:

PRIMARY KEY (user_name, album_name, image_url)

Hmmm, I don't like that, lemme try again.

I think, and again someone else here knows more than me about this, I could
easily be wrong here, you could add a level sorting per album like this?


CREATE TABLE PhotosAlbums (
    user_name text,
    album_name text,
    seq int,
    image_url text,
    poster_image_url text,
    PRIMARY KEY (user_name, album_name, seq)
) WITH CLUSTERING ORDER BY (seq ASC);

Note the addition of 'seq'

Actually, that may be better as your sequence number basically acts like a
unique key for just the album itself, instead a relying on the image_url
like I previously mentioned.  That way you could delete like this I think:

DELETE FROM PhotoAlbums WHERE user_name = "the_user" AND album_name =
"the_album_name" AND seq = 4

The problem then becomes, after you delete you would need to re-sequence
all the images in the album. Admittedly, I don't know how to best handle
that without running an update each Album entry to re-sequence it. The same
would apply if you where to reorder images, you would need to re-sequence a
set of them as well, or all of them if you made a new image #1.


Another option is to store the actual album in JSON and have your
application manipulate that and save it back to the album:


CREATE TABLE PhotosAlbums (
    user_name text,
    album_name text,
    poster_image_url text,
    data text
    PRIMARY KEY (user_name, album_name)
);

Here we do away with the 'seq' column and add a 'data' and 'poster_image'
column (so we can give the album a representation to the user).  Data would
just be JSON that looks something like this:

[{'image_url':, ...other data you might want ...},
{'image_url':, ...other data you might want ...},
{'image_url':, ...other data you might want ...},
{'image_url':, ...other data you might want ...},
...]


Now your application would be responsible for sorting the images in the
album and updating the whole JSON blob for the album.

You would now be able to get all of the user's photo albums with:

SELECT * FROM PhotoAlbums WHERE user_name = "the_user"

You can use the poster_image column to render a nice representation image
of the album.  You can get an individual album with:

SELECT * FROM PhotoAlbums WHERE user_name = "the_user" AND album_name =
"the_album_name"

You just need to deserialize the 'data' column to get all of your photo
data. If the user makes an update your application needs to update that
JSON.

My hunch is this JSON based Album might be the way to go.
Again, take this with a grain of salt, I am new to this as well.







On Fri, Dec 21, 2012 at 5:07 AM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:

> I you have a way to partition tables, relational can be ok.  Thing of a
> business that has trillions of clients as customers and clients have a
> whole slew of things they are related to.  Partitioning by client can be a
> good way to go.  Here are some patterns we have seen in nosql and perhaps
> they can help your situation….
>
> https://github.com/deanhiller/playorm/wiki/Patterns-Page
>
> Later,
> Dean
>
> From: David Mohl <dave@dave.cx<mailto:dave@dave.cx>>
> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <
> user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
> Date: Friday, December 21, 2012 4:49 AM
> To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <
> user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
> Subject: Correct way to design a cassandra database
>
> Hello!
>
> I've recently started learning cassandra but still have troubles
> understanding the best way to design a cassandra database.
> I've posted my question already on stackoverflow but because this would
> very likely result in a discussion, it got closed. Orginal question here:
> http://stackoverflow.com/questions/13975868/correct-way-to-design-a-cassandra-database
>
>
> Assuming you have 3 types of objects: User, Photo and Album. Obviously a
> photo belongs to a user and can be part of a album. For querying, assume we
> just want to order by "last goes first". Paging by 10 elements should be
> possible.
>
> Would you go like every document has all the informations needed for a
> correct output. Something like this:
>
>     -- User
>        | -- Name
>        | -- ...
>        | -- Photos
>             | -- Photoname
>             | -- Uploaded at
>
> Or go a more relational way (while having a secondary index on the
> "belongs_to" columns:
>
>     -- User (userid is the row key)
>        | -- Name
>        | -- ...
>
>     -- Photoid
>        | -- belongs_to (userid)
>        | -- belongs_to_album (albumid)
>        | -- ...
>
>     -- Albumid
>        | -- belongs_to (userid)
>        | -- ...
>
> Another way that came in my mind would be kind of a mix:
>
>     -- User
>        | -- Name
>        | -- ...
>        | -- Photoids (e.g. 1,2,3,4,5)
>        | -- Albumids (e.g. 1,2,3,4,5)
>
>     -- Photoid (photoid is the row key)
>        | -- Name
>        | -- Uploaded at
>        | -- ...
>
>     -- Albumid (albumid is the row key)
>        | -- Name
>        | -- Photoids (e.g. 1,2,3,4,5)
>        | -- ...
>
> When using a random partitioner, the last example would be (IMO) the way
> to go. I can query the user object (out of a session id or something) and
> would get all the row keys I need for fetching photo / album data. However
> this would result in veeery large columns. Another down point would be
> inconsistency and identification problems. A photo (or a album) could not
> be identified by the row itself.
> Example: If I fetch a photo with ID 3456, I don't know in which albums it
> is part nor which user owns it. Adding this kind of information would
> result in a fairly large stack of points I have to alter on creation /
> update.
>
> The second example has all the information needed. However, if I want to
> fetch all photos that are part of album x, I have to query by a secondary
> index that COULD contain millions of entries over the whole cluster. And I
> guess I can forget the random partitioner on this example.
>
> Am I thinking to relational?
> It'd be great to hear some other opinions on this topic
>
> ---
> David
>
>

Mime
View raw message