cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Dejanovski <a...@thelastpickle.com>
Subject Re: Help with data modelling (from MySQL to Cassandra)
Date Mon, 27 Mar 2017 06:47:50 GMT
Hi Zoltan,

you must try to avoid multi partition queries as much as possible. Instead,
use asynchronous queries to grab several partitions concurrently.
Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
your cluster.
This would leave you roughly with 1000+ async queries groups to run.
Performance will really depend on your hardware, consistency level, load
balancing policy, partition fragmentation (how many updates you'll run on
each element over time) and the SLA you're expecting.

If that approach doesn't meet your SLA requirements, you can try to use
wide partitions and group elements under buckets :

CREATE TABLE elements (
doc_id long,
bucket long,
element_id long,
element_content text,
PRIMARY KEY((doc_id, bucket), element_id)
)

The bucket here could be a modulus of the element_id (or of the hash of
element_id if it is not a numerical value). This way you can spread
elements over the cluster and access them directly if you have the doc_id
and the element_id to perform updates.
You'll get to run less queries concurrently but they'll take more time than
individual ones in the first scenario (1 partition per element). You should
benchmark both solutions to see which one gives best performance.
Bucket your elements so that your partitions don't grow over 100MB. Large
partitions are silent cluster killers (1GB+ partitions are a direct threat
to cluster stability)...

To ensure best performance, use prepared statements along with the
TokenAwarePolicy
<http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
to
avoid unnecessary coordination.

Cheers,


On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zollor@gmail.com> wrote:

> Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
> the real question is, will it be efficient to query 100k+ primary keys in
> the elements table?
> e.g.
>
> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
> element_id3, .... element_id100K+)  ?
>
> The elements_id is a primary key.
>
> Thank you?
>
>
> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0204@gmail.com>
> wrote:
>
> Have one table hold document metadata (doc_id, title, description, ...)
> and have another table elements where partition key is doc_id and
> clustering key is element_id.
> Only problem here is if you need to query and/or update element just by
> element_id but I don't know your queries up front.
>
> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zollor@gmail.com> wrote:
>
> Dear cassandra users,
>
> We have the following structure in MySql:
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(index), title, description]
>
> Notation: table name->[column1(key or index), column2, …]
>
> We want to transfer the data to Cassandra.
>
> Each document can contain a large number of elements (between 1 and 100k+)
>
> We have two requirements:
> a) Load all elements for a given doc_id quickly
> b) Update the value of one individual element quickly
>
>
> We were thinking on the following cassandra configurations:
>
> Option A
>
> documents->[doc_id(primary key), title, description, elements] (elements
> could be a SET or a TEXT, each time new elements are added (they are never
> removed) we would append it to this column)
> elements->[element_id(primary key), title, description]
>
> Loading a document:
>
>  a) Load document with given <doc_id> and get all element ids
>     SELECT * from documents where doc_id=‘id’
>
>  b) Load all elements with the given ids
>     SELECT * FROM elements where element_id IN (ids loaded from query a)
>
>
> Option B
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(secondary index), title,
> description]
>
> Loading a document:
>  a) SELECT * from elements where doc_id=‘id’
>
>
> Neither solutions doesn’t seem to be good, in Option A, even if we are
> querying by Primary keys, the second query will have 100k+ primary key id’s
> in the WHERE clause, and the second solution looks like an anti pattern in
> cassandra.
>
> Could anyone give any advice how would we create a model for our use case?
>
> Thank you in advance,
> Zoltan.
>
>
>
> --
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Mime
View raw message