cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryce Allen <>
Subject Re: two dimensional slicing
Date Mon, 30 Jan 2012 17:14:37 GMT
On Sun, 29 Jan 2012 23:26:52 +1300
aaron morton <> wrote:
> > and compare them, but at this point I need to focus on one to get
> > things working, so I'm trying to make a best initial guess.
> I would go for RP then, BOP may look like less work to start with but
> it *will* bite you later. If you use an increasing version number as
> a key you will get a hot spot. Get it working with RP and Standard
> CF's, accept the extra lookups, and then see if where you are
> performance / complexity wise. Cassandra can be pretty fast.
The keys are (random uuid)-(version), because there are many lists and
they already have a random id associated with them. Some of the lists
will be much larger than others, but with the random prefix the large
lists will be evenly distributed across the cluster. This is pretty
much the same as having some rows that are bigger than others with RP.
There is a small amount of other data that has non-random keys and
would require an artificial MD5(key) prefix, but it's (at least
currently) an insignificant subset of the total data. I do appreciate
the warning though - if things change and we end up with a lot of keys
that aren't naturally random, I can see how it would be a pain to

The reason I'm concerned about one more query (especially when it can't
be done in parallel), is that the overarching structure is actually a
tree, and the data payload under a name will often be a pointer to
another list. Each query required in the list lookup will be repeated
at each level.

Anyway I don't want to turn this into a BOP vs RP thread. I'm really
interested in the underlying modeling issues, and how it plays out
using different partitioning is instructive. I'm willing to use BOP _if_
it has real concrete advantages, because it seems very unlikely to
cause balance/hotspot issues for our application. That being said all
other things being equal (or almost equal), I would use RP, and
actually our latest design uses RP...

> I still don't really understand the problem, but I think you have
> many lists of names and when each list is updated you consider it a
> version. 
> You then want to answer a query such as "get all the names between
> foo and bar that were written to between version 100 and 200". Can
> this query can be re-written as "get all the names between foo and
> bar that existed at version 200 and were created on or after version
> 100" ?
There are two queries we need to answer - one is get the first N names
from version V of list L starting at name n0 (chunked listing). The
second is get name n from list L version V (or determine that it
doesn't exist). In many cases the list is too big to re-write on every
update, so storing deltas instead of the whole thing becomes
attractive. There can be additions, deletes, updates, and renames
(which can be modeled as deletion + addition). A background process
creates complete lists from the deltas at certain versions (compacts),
to prevent having to replay the entire history.

The queries are actually done based on timestamp, not on a specific
version (e.g. what was the state of the list at time T). The passed
timestamp won't in general correspond to the time of an update.

With this model, fetching a chunk of a list version requires pulling
the range of names from the most recent complete/compacted list less
than or equal to the desired version, and fetching the relevent deltas
between that and the desired version. Fetching the "relevent" deltas is
where it gets complicated.

We've gone through many iterations - this is our latest model (very
much still subject to change):

CF: List
row key: list id (random uuid)
columns: latest version and unversioned meta data about the list

CF: ListVersionIndex
row key: (list id)
columns: ts -> version, compact?

CF: ListCompact
row key: (list id)-(version)
columns: name -> associated data

CF: ListDelta
row key: (list id)-(version)
columns: name -> operation (create, delete, update) + associated data

With BOP and timestamp versions, ListVersionIndex isn't necessary - a
row range scan can be done to get the latest compact list, and then
another to get all the deltas since compaction, all with an appropriate
column offset and limit. Timestamp versions make cleaning up partial
updates more complicated though, since the versions numbers aren't

With RP, the idea is to query many versions in ListVersionIndex starting
at the desired version going backward, hoping that it will hit a
compact version. We could also maintain a separate CompactVersion
index, and accept another query.

In any case I think this model demonstrates a key point about two
dimensional range queries - RP really only requires one extra query on
an index to do get the row range, and then replaces the BOP row range
query with a multi get. Multi get can be done in parallel (correct me
if I'm wrong?), so it seems reasonable that in some cases it could
actually be faster than the row range query (but still at the cost of
the extra RTT to do the index lookup). It's definitely not much more
complicated when using RP; I was caught up in some nuances of our old
model when I wrote the last email.


> Could you re-write the entire list every version update?
> CF: VersionedList
> row: <list_name:version>
> col_name: name
> col_value: last updated version
> So you slice one row at the upper version and discard all the columns
> where the value is less than the lower version ? 
> Cheers
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 27/01/2012, at 5:31 AM, Bryce Allen wrote:
> > Thanks, comments inline:
> > 
> > On Mon, 23 Jan 2012 20:59:34 +1300
> > aaron morton <> wrote:
> >> It depends a bit on the data and the query patterns. 
> >> 
> >> * How many versions do you have ? 
> > We may have 10k versions in some cases, with up to a million names
> > total in any given version but more often <10K. To manage this we
> > are currently using two CFs, one for storing compacted complete
> > lists and one for storing deltas on the compacted list. Based on
> > usage, we will create a new compacted list and start writing deltas
> > against that. We should be able to limit the number of deltas in a
> > single row to below 100; I'd like to be able to keep it lower but
> > I'm not sure we can maintain that under all load scenarios. The
> > compacted lists are straightforward, but there are many ways to
> > structure the deltas and they all have trade offs. A CF with
> > composite columns that supported two dimensional slicing would be
> > perfect.
> > 
> >> * How many names in each version ?
> > We plan on limiting to a total of 1 million names, and around
> > 10,000 per version (by limiting the batch size), but many deltas
> > will have <10 names.
> > 
> >> * When querying do you know the versions numbers you want to query
> >> from ? How many are there normally?
> > Currently we don't know the version numbers in advance - they are
> > timestamps, and we are querying for versions less than or equal to
> > the desired timestamp. We have talked about using vector clock
> > versions and maintaining an index mapping time to version numbers,
> > in which case we would know the exact versions after the index
> > lookup, at the expense of another RTT on every operation.
> > 
> >> * How frequent are the updates and the reads ?
> > We expect reads to be more frequent than writes. Unfortunately we
> > don't have solid numbers on what to expect, but I would guess 20x.
> > Update operations will involve several reads to determine where to
> > write.
> > 
> > 
> >> I would lean towards using two standard CF's, one to list all the
> >> version numbers (in a single row probably) and one to hold the
> >> names in a particular version. 
> >> 
> >> To do your query slice the first CF and then run multi gets to the
> >> second. 
> >> 
> >> Thats probably not the best solution, if you can add some more info
> >> it may get better.
> > I'm actually leaning back toward BOP, as I run into more issues
> > and complexity with the RP models. I'd really like to implement both
> > and compare them, but at this point I need to focus on one to get
> > things working, so I'm trying to make a best initial guess.
> > 
> > 
> >> 
> >> On 21/01/2012, at 6:20 AM, Bryce Allen wrote:
> >> 
> >>> I'm storing very large versioned lists of names, and I'd like to
> >>> query a range of names within a given range of versions, which is
> >>> a two dimensional slice, in a single query. This is easy to do
> >>> using ByteOrderedPartitioner, but seems to require multiple (non
> >>> parallel) queries and extra CFs when using RandomPartitioner.
> >>> 
> >>> I see two approaches when using RP:
> >>> 
> >>> 1) Data is stored in a super column family, with one dimension
> >>> being the super column names and the other the sub column names.
> >>> Since slicing on sub columns requires a list of super column
> >>> names, a second standard CF is needed to get a range of names
> >>> before doing a query on the main super CF. With CASSANDRA-2710,
> >>> the same is possible using a standard CF with composite types
> >>> instead of a super CF.
> >>> 
> >>> 2) If one of the dimensions is small, a two dimensional slice
> >>> isn't required. The data can be stored in a standard CF with
> >>> linear ordering on a composite type (large_dimension,
> >>> small_dimension). Data is queried based on the large dimension,
> >>> and the client throws out the extra data in the other dimension.
> >>> 
> >>> Neither of the above solutions are ideal. Does anyone else have a
> >>> use case where two dimensional slicing is useful? Given the
> >>> disadvantages of BOP, is it practical to make the composite column
> >>> query model richer to support this sort of use case?
> >>> 
> >>> Thanks,
> >>> Bryce
> >> 

View raw message