cassandra-user mailing list archives

From Mehdi Bada <mehdi.b...@dbi-services.com>
Subject Re: Cassandra data model right definition
Date Tue, 04 Oct 2016 08:33:14 GMT
Hi all, 

Just to refocus the debate (since I am the one who started this very interesting exchange).

I think that, for a good understanding of the data model of any DBMS, we (as technical experts)
have to decompose the data objects of the model and understand precisely how the data is stored
and what kinds of mechanisms are used.
In this respect, I think Russell has described the situation very well, and we can say that the
Apache Cassandra data model can be defined as a Partitioned Row Store.

Many thanks for all your feedback and contributions.

Best Regards 
Mehdi Bada 

--- 
Mehdi Bada | Consultant 
Phone: +41 32 422 96 00 | Mobile: +41 79 928 75 48 | Fax: +41 32 499 96 15 
dbi services, Rue de la Jeunesse 2, CH-2800 Delémont 
mehdi.bada@dbi-services.com 
www.dbi-services.com 




From: "Edward Capriolo" <edlinuxguru@gmail.com> 
To: "user" <user@cassandra.apache.org> 
Sent: Monday, October 3, 2016 4:53:16 PM 
Subject: Re: Cassandra data model right definition 

My original point can be summed up as: 

Do not define Cassandra in terms of SIMILES & METAPHORS. Such words include "like" and "close
relative".

For the specifics: 

Any relational db could (and I'm sure one does!) allow for sparse fields as well. MySQL can
be backed by rocksdb now, does that make it not a row store? 

Let's draw some lines: a relational database is clearly defined.

https://en.wikipedia.org/wiki/Edgar_F._Codd 


Codd's theorem, a result proven in his seminal work on the relational model, equates the
expressive power of relational algebra and relational calculus (both of which, lacking recursion,
are strictly less powerful than first-order logic). [citation needed]

As the relational model started to become fashionable in the early 1980s, Codd fought a sometimes
bitter campaign to prevent the term being misused by database vendors who had merely added
a relational veneer to older technology. As part of this campaign, he published his 12 rules
to define what constituted a relational database. This made his position in IBM increasingly
difficult, so he left to form his own consulting company with Chris Date and others. 

Cassandra is not a relational database. 



I have attempted to illustrate that a "row store" is defined as well. I do not believe
Cassandra is a "row store".

"Just because it uses log structured storage, sparse fields, and semi-flexible collections
doesn't disqualify it from calling it a "row store""

What is the definition of "row store"? Is it a logical construct or a physical one?

Why isn't MongoDB a "row store"? I can drop a schema on top of MongoDB and present it as rows
and columns. It seems to pass the litmus test being presented.

https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage 







On Mon, Oct 3, 2016 at 10:02 AM, Jonathan Haddad < jon@jonhaddad.com > wrote: 


Sorry Ed, but you're really stretching here. A table in Cassandra is structured by a schema
with the data for each row stored together in each data file. Just because it uses log structured
storage, sparse fields, and semi-flexible collections doesn't disqualify it from calling it
a "row store" 

Postgres added flexible storage through hstore, I don't hear anyone arguing that it needs
to be renamed. 

Any relational db could (and I'm sure one does!) allow for sparse fields as well. MySQL can
be backed by rocksdb now, does that make it not a row store? 

You're arguing that everything is wrong but you're not proposing an alternative, which is
not productive. 
On Mon, Oct 3, 2016 at 9:40 AM Edward Capriolo < edlinuxguru@gmail.com > wrote: 

BQ_BEGIN

Also, every piece of technical information that describes a row store

http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod08-slides.pdf 
https://en.wikipedia.org/wiki/Column-oriented_DBMS#Row-oriented_systems 

depicts it like this:
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000; 


They never depict a scenario where the data looks like this on disk:

001:10,Smith 
001:10,40000; 
Which is much closer to how Cassandra stores its data.
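
As a rough illustration of the two layouts contrasted above, the following Python sketch serializes the same records first as whole rows and then as one entry per non-key value. The function names and the reading of the sample fields are assumptions, and the second layout is only a simplified picture of a cell-oriented format, not Cassandra's actual SSTable encoding.

```python
# Hypothetical sample records: (row id, employee id, last name, first name, salary).
RECORDS = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
]


def row_layout(records):
    """Serialize each record as one contiguous row: rowid:field,field,...;"""
    return [
        f"{rowid}:" + ",".join(str(v) for v in fields) + ";"
        for rowid, *fields in records
    ]


def cell_layout(records):
    """Serialize each non-key value as its own entry, repeating the key each time."""
    cells = []
    for rowid, key, *values in records:
        for value in values:
            cells.append(f"{rowid}:{key},{value}")
    return cells


print(row_layout(RECORDS))
print(cell_layout(RECORDS))
```

In the first layout a record is read back with a single contiguous read; in the second, the key is repeated for every value and a record has to be reassembled from its entries.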



On Fri, Sep 30, 2016 at 5:12 PM, Benedict Elliott Smith < benedict@apache.org > wrote:


BQ_BEGIN

Absolutely. A "partitioned row store" is exactly what I would call it. As it happens, our
README thinks the same, which is fantastic. 

I thought I'd take a look at the rest of our cohort, and didn't get far before disappointment.
HBase literally calls itself a "column-oriented store" - which is so totally wrong it's simultaneously
hilarious and tragic. 

I guess we can't blame the wider internet for misunderstanding/misnaming us poor "wide column
stores" if even one of the major examples doesn't know what it, itself, is! 




On 30 September 2016 at 21:47, Jonathan Haddad < jon@jonhaddad.com > wrote: 

BQ_BEGIN
+1000 to what Benedict says. I usually call it a "partitioned row store" which usually needs
some extra explanation but is more accurate than "column family" or whatever other thrift
era terminology people still use. 
On Fri, Sep 30, 2016 at 1:53 PM DuyHai Doan < doanduyhai@gmail.com > wrote: 

BQ_BEGIN

I used to present Cassandra as a NoSQL datastore with "distributed" tables. This definition
is closer to CQL and has some academic background (distributed hash table). 


On Fri, Sep 30, 2016 at 7:43 PM, Benedict Elliott Smith < benedict@apache.org > wrote:


BQ_BEGIN

Cassandra is not a "wide column store" anymore. It has a schema. Only thrift users still
think they have no schema (though they do), and thrift is being deprecated. 

I really wish everyone would kill the term "wide column store" with fire. It seems to have
never meant anything beyond "schema-less, row-oriented", and a "column store" means literally
the opposite of this. 

Not only that, but people don't even seem to realise the term "column store" existed long
before "wide column store" and the latter is often abbreviated to the former, as here: http://www.planetcassandra.org/what-is-nosql/


Since it no longer applies, let's all agree as a community to forget this awful nomenclature
ever existed. 



On 30 September 2016 at 18:09, Joaquin Casares < joaquin@thelastpickle.com > wrote:


BQ_BEGIN

Hi Mehdi, 

I can help clarify a few things. 

As Carlos said, Cassandra is a Wide Column Store. Theoretically a row can have 2 billion columns,
but in practice it shouldn't have more than 100 million columns. 

Cassandra partitions data to certain nodes based on the partition key(s), but does provide
the option of setting zero or more clustering keys. Together, the partition key(s) and clustering
key(s) form the primary key. 

When writing to Cassandra, you will need to provide the full primary key; however, when reading
from Cassandra, you only need to provide the full partition key. 

When you only provide the partition key for a read operation, you're able to return all columns
that exist on that partition with low latency. These columns are displayed as "CQL rows" to
make it easier to reason about. 

Consider the schema: 


BQ_BEGIN

CREATE TABLE foo (
    bar uuid,
    boz uuid,
    baz timeuuid,
    data1 text,
    data2 text,
    PRIMARY KEY ((bar, boz), baz)
);

When you write to Cassandra you will need to send bar, boz, and baz and optionally data*,
if it's relevant for that CQL row. If you choose not to set a data* field for a particular
CQL row, then nothing is stored or allocated on disk for it. But I wouldn't consider that caveat
to be "schema-less". 
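
The write behaviour described above can be sketched with a toy in-memory model in Python. This is only an illustration under stated assumptions: the `table` structure and the `write` helper are hypothetical, plain strings and integers stand in for uuid and timeuuid values, and nothing here reflects Cassandra's real storage engine.

```python
# Toy in-memory model of table foo: partition key (bar, boz), clustering key baz.
# Column names mirror the schema above; the storage structure itself is an assumption.
from collections import defaultdict

# partition key -> clustering key -> {column name: value}
table = defaultdict(dict)


def write(bar, boz, baz, **data):
    """Write one CQL row; the full primary key is required, data columns are optional."""
    unknown = set(data) - {"data1", "data2"}
    if unknown:
        # The schema is still enforced: only declared columns are accepted.
        raise ValueError(f"columns not in schema: {sorted(unknown)}")
    # Only the columns actually supplied are stored; absent ones take no space.
    table[(bar, boz)].setdefault(baz, {}).update(data)


write("bar-1", "boz-1", 1, data1="hello")
write("bar-1", "boz-1", 2, data1="a", data2="b")
print(table[("bar-1", "boz-1")])
```

The first row holds no entry at all for data2, yet an undeclared column is still rejected, which is the sense in which sparse storage is not the same as being schema-less.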

However, all writes to the same bar/boz will end up on the same Cassandra replica set (a configurable
number of nodes) and be stored on the same place(s) on disk within the SSTable(s). And on
disk, each field that's not a partition key is stored as a column, including clustering keys
(this is optimized in Cassandra 3+, but now we're getting deep into internals). 

In this way you can get fast responses for all activity for bar/boz either over time, or for
a specific time, with roughly the same number of disk seeks, with varying lengths on the disk
scans. 
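
A minimal Python sketch of the read pattern described above, assuming a single partition whose rows are kept sorted by the clustering key baz. The `Partition` class and its method names are hypothetical, and integers stand in for timeuuid values; it only shows why a whole-partition read and a time-range read both start from one lookup and differ mainly in scan length.

```python
import bisect


class Partition:
    """Rows of one (bar, boz) partition, kept sorted by the clustering key baz."""

    def __init__(self):
        self._keys = []   # sorted clustering keys
        self._rows = {}   # clustering key -> row data

    def put(self, baz, row):
        if baz not in self._rows:
            bisect.insort(self._keys, baz)
        self._rows[baz] = row

    def all_rows(self):
        """Read with only the partition key: every row, in clustering order."""
        return [(k, self._rows[k]) for k in self._keys]

    def slice(self, start, end):
        """Read a clustering range start <= baz < end with one lookup plus a scan."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


p = Partition()
for ts in (30, 10, 20, 40):
    p.put(ts, {"data1": f"event-{ts}"})

print(p.all_rows())
print(p.slice(15, 35))
```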

Hope that helps! 

Joaquin Casares 
Consultant 
Austin, TX 

Apache Cassandra Consulting 
http://www.thelastpickle.com 

On Fri, Sep 30, 2016 at 11:40 AM, Carlos Alonso < info@mrcalonso.com > wrote: 

BQ_BEGIN

Cassandra is a Wide Column Store http://db-engines.com/en/system/Cassandra 

Carlos Alonso | Software Engineer | @calonso 

On 30 September 2016 at 18:24, Mehdi Bada < mehdi.bada@dbi-services.com > wrote: 

BQ_BEGIN

Hi all, 

I have a theoretical question: 
- Is Apache Cassandra really a column store? 
A column store means storing the data as columns rather than as rows. 

In fact, C* stores the data as rows, and the data is partitioned by row key. 

So, for me, Cassandra is a row-oriented, schema-less DBMS. Do you agree?


Many thanks in advance for your reply 

Best Regards 
Mehdi Bada 
---- 

Mehdi Bada | Consultant 
Phone: +41 32 422 96 00 | Mobile: +41 79 928 75 48 | Fax: +41 32 422 96 15 
dbi services, Rue de la Jeunesse 2, CH-2800 Delémont 
mehdi.bada@dbi-services.com 
www.dbi-services.com 




BQ_END



BQ_END



BQ_END



BQ_END



BQ_END


BQ_END



BQ_END



BQ_END


BQ_END


