cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zoltan Lorincz <zol...@gmail.com>
Subject Re: Help with data modelling (from MySQL to Cassandra)
Date Mon, 27 Mar 2017 10:08:06 GMT
Hi Alexander,

thank you for your help! I think we found the answer:

CREATE TABLE documents (
    doc_id uuid,
    description text,
    title text,
    PRIMARY KEY (doc_id)
 );

CREATE TABLE nodes (
    doc_id uuid,
    element_id uuid,
    title text,
    PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements with the following query:
 SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
 UPDATE elements SET title='Hello' WHERE
doc_id=131cfa55-181e-431e-7956-fe449139d613 AND
element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;

Zoltan.


On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
alex@thelastpickle.com> wrote:

> Hi Zoltan,
>
> you must try to avoid multi partition queries as much as possible.
> Instead, use asynchronous queries to grab several partitions concurrently.
> Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
> your cluster.
> This would leave you roughly with 1000+ async queries groups to run.
> Performance will really depend on your hardware, consistency level, load
> balancing policy, partition fragmentation (how many updates you'll run on
> each element over time) and the SLA you're expecting.
>
> If that approach doesn't meet your SLA requirements, you can try to use
> wide partitions and group elements under buckets :
>
> CREATE TABLE elements (
> doc_id long,
> bucket long,
> element_id long,
> element_content text,
> PRIMARY KEY((doc_id, bucket), element_id)
> )
>
> The bucket here could be a modulus of the element_id (or of the hash of
> element_id if it is not a numerical value). This way you can spread
> elements over the cluster and access them directly if you have the doc_id
> and the element_id to perform updates.
> You'll get to run less queries concurrently but they'll take more time
> than individual ones in the first scenario (1 partition per element). You
> should benchmark both solutions to see which one gives best performance.
> Bucket your elements so that your partitions don't grow over 100MB. Large
> partitions are silent cluster killers (1GB+ partitions are a direct threat
> to cluster stability)...
>
> To ensure best performance, use prepared statements along with the
> TokenAwarePolicy
> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
to
> avoid unnecessary coordination.
>
> Cheers,
>
>
> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zollor@gmail.com> wrote:
>
>> Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
>> the real question is, will it be efficient to query 100k+ primary keys in
>> the elements table?
>> e.g.
>>
>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>> element_id3, .... element_id100K+)  ?
>>
>> The elements_id is a primary key.
>>
>> Thank you?
>>
>>
>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0204@gmail.com>
>> wrote:
>>
>> Have one table hold document metadata (doc_id, title, description, ...)
>> and have another table elements where partition key is doc_id and
>> clustering key is element_id.
>> Only problem here is if you need to query and/or update element just by
>> element_id but I don't know your queries up front.
>>
>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zollor@gmail.com>
>> wrote:
>>
>> Dear cassandra users,
>>
>> We have the following structure in MySql:
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(index), title, description]
>>
>> Notation: table name->[column1(key or index), column2, …]
>>
>> We want to transfer the data to Cassandra.
>>
>> Each document can contain a large number of elements (between 1 and
>> 100k+)
>>
>> We have two requirements:
>> a) Load all elements for a given doc_id quickly
>> b) Update the value of one individual element quickly
>>
>>
>> We were thinking on the following cassandra configurations:
>>
>> Option A
>>
>> documents->[doc_id(primary key), title, description, elements] (elements
>> could be a SET or a TEXT, each time new elements are added (they are never
>> removed) we would append it to this column)
>> elements->[element_id(primary key), title, description]
>>
>> Loading a document:
>>
>>  a) Load document with given <doc_id> and get all element ids
>>     SELECT * from documents where doc_id=‘id’
>>
>>  b) Load all elements with the given ids
>>     SELECT * FROM elements where element_id IN (ids loaded from query a)
>>
>>
>> Option B
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(secondary index), title,
>> description]
>>
>> Loading a document:
>>  a) SELECT * from elements where doc_id=‘id’
>>
>>
>> Neither solutions doesn’t seem to be good, in Option A, even if we are
>> querying by Primary keys, the second query will have 100k+ primary key id’s
>> in the WHERE clause, and the second solution looks like an anti pattern in
>> cassandra.
>>
>> Could anyone give any advice how would we create a model for our use case?
>>
>> Thank you in advance,
>> Zoltan.
>>
>>
>>
>> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>

Mime
View raw message