cassandra-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How to model data to achieve specific data locality
Date Sun, 07 Dec 2014 15:32:48 GMT
It would be helpful to look at some specific examples of sequences, showing how they grow.
I suspect that the term “sequence” is being overloaded in some subtly misleading way here.

Besides, we’ve already answered the headline question – data locality is achieved by having
a common partition key. So, we need some clarity as to what question we are really focusing
on.

And, of course, we should be asking the “Cassandra Data Modeling 101” question: what do you want your queries to look like, and how exactly do you want to access your data? Only after we have a handle on how you need to read your data can we decide how it should be stored.

My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo-language or even plain English, not belabored with full CQL syntax) would be very helpful.
I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are
we really talking about?

Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.

-- Jack Krupansky

From: Eric Stevens 
Sent: Sunday, December 7, 2014 10:12 AM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

> Also new seq_types can be added and old seq_types can be deleted. This means I often
need to ALTER TABLE to add and drop columns. 

Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to
add a new seq type.  From a data model perspective, these are just new values in a row.  

If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types. Or you can define a schema which includes the set of all possible columns (that's when you get into ALTERs whenever a column comes or goes).
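For example, something along these lines - just a sketch, with made-up table and column names:

    CREATE TABLE sequences (
        seq_id   text,
        seq_type text,
        value    blob,
        -- per-seq_type attributes live in a map instead of static columns,
        -- so a new seq_type never requires an ALTER TABLE
        attrs    map<text, text>,
        PRIMARY KEY ((seq_id), seq_type)
    );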

> All sequences with the same seq_id tend to grow at the same rate.


Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a subpartitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
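To make that concrete, here's one possible shape for a day-bucketed table (the names and the bucket granularity are made up for illustration):

    CREATE TABLE sequences_by_day (
        seq_id   text,
        day      text,     -- bucket computed at write time, e.g. '2014-12-07'
        seq_type text,
        value    blob,
        PRIMARY KEY ((seq_id, day), seq_type)
    );

    -- writes always target the current bucket, so older partitions stop growing
    INSERT INTO sequences_by_day (seq_id, day, seq_type, value)
    VALUES ('seq42', '2014-12-07', 'type-a', 0x00);

Once a bucket rolls over, its partition stops growing and compaction can settle it into a few sstables; reads then fan out over the buckets in the range of interest.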


I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you want to work toward this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data. With Cassandra's default partitioner of murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it makes no attempt to make collisions hard to force), but it is meant to have good distribution (so it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work out a good ring-balancing strategy to distribute your data evenly over the ring.
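As a rough sketch of the ByteOrderedPartitioner route (the partitioner is configured cluster-wide in cassandra.yaml, not per table, and this schema is made up for illustration):

    -- with ByteOrderedPartitioner, placement on the ring follows the bytes of
    -- the partition key, so leading with seq_id keeps all the seq_types for a
    -- given seq_id adjacent on the ring
    CREATE TABLE sequences_bop (
        seq_id   text,
        seq_type text,
        value    blob,
        PRIMARY KEY ((seq_id, seq_type))
    );

The cost, as noted above, is that key distribution becomes your problem: you'd have to balance the ring yourself.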

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduyhai@gmail.com> wrote:

  "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same
rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"

  --> Then use bucketing to avoid overly wide partitions


  "Also new seq_types can be added and old seq_types can be deleted. This means I often need
to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation
point of view."


  --> I don't understand why altering the table is necessary to add seq_types. If "seq_type" is defined as your clustering column, you can have many of them using the same table structure ...
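  In other words, with a clustering column, adding or retiring a seq_type is just a row operation - for instance, against the illustrative sequences table sketched earlier in the thread:

      -- a brand-new seq_type is just a new row; no schema change
      INSERT INTO sequences (seq_id, seq_type, value)
      VALUES ('seq42', 'new-type', 0x00);

      -- retiring an old seq_type is a row delete, not an ALTER TABLE
      DELETE FROM sequences WHERE seq_id = 'seq42' AND seq_type = 'old-type';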

  On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <depend@gmail.com> wrote:

    On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <mightye@gmail.com> wrote:

      It depends on the size of your data, but if your data is reasonably small, there should
be no trouble including thousands of records on the same partition key.  So a data model using
PRIMARY KEY ((seq_id), seq_type) ought to work fine.  
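      For illustration (using the made-up sequences schema sketched earlier in the thread), reads against such a model are single-partition queries:

          -- all seq_types for a seq_id live in one partition
          SELECT seq_type, value FROM sequences WHERE seq_id = 'seq42';

          -- a contiguous slice of seq_types is still a single-partition read
          SELECT seq_type, value FROM sequences
          WHERE seq_id = 'seq42' AND seq_type >= 'a' AND seq_type < 'f';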


      If the data size per partition exceeds some threshold that represents the right tradeoff among increasing repair cost, GC pressure, the threat of unbalanced loads, and the other issues that come with wide partitions, then you can subpartition via some means consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

      For example, if the seq_types for a given seq_id can be processed in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute subpartition deterministically (e.g. from a hash of seq_type). Or if you only ever need to read all values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.

      If the values for the seq_types of a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, say seq_type were an incrementing integer; then your subpartition could be seq_type / 100, as in the sketch below.
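      A sketch of that contrived example, with the subpartition computed client-side before the write (names are illustrative):

          CREATE TABLE sequences_sub (
              seq_id       text,
              subpartition int,  -- computed in the application as seq_type / 100
              seq_type     int,
              value        blob,
              PRIMARY KEY ((seq_id, subpartition), seq_type)
          );

          INSERT INTO sequences_sub (seq_id, subpartition, seq_type, value)
          VALUES ('seq42', 1, 123, 0x00);  -- 123 / 100 = 1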

      On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <depend@gmail.com> wrote:

        I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there can be an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally I would like all the sequences with the same seq_id to co-locate on the same node(s).

        However, I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:

        1. there could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.

        2. each seq_id might have different sets of seq_types.

        3. each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.

        As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use just part of the field (say 64 bytes) to compute the token (for placement) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe any new or upcoming features in C* 3.0?

        Thanks.

    Thanks, Eric.


    Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also, new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.


    I thought about your subpartition idea. If there were only a few applications and each of them used a fixed subset of seq_types, I could easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case data scientists need to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.


    -Kai
