cassandra-user mailing list archives

From Benedict Elliott Smith <bened...@apache.org>
Subject Re: Cassandra data model right definition
Date Mon, 03 Oct 2016 13:53:20 GMT
The equivalent statement would be:  "Like a bike, a scooter has wheels."

This is a really important linguistic distinction you seem to be glossing
over.  It is not saying "A is like X," it is saying "A has specific traits
in common with X."

For example, "Like cancer, heart disease is a leading cause of mortality."
Cancer is really very unlike heart disease, but the two are similar in that
both cause death.  This kind of phraseology is tremendously common, and I see
nothing wrong with it.

This conversation suggests people like yourself are indeed confused by
these constructs, so we should perhaps avoid them where that confusion can
confound further understanding.  So, as already suggested, file a
ticket/pull request to update the phrasing.

----

To respond to your second email:  That is simply not how C* stores its data
on disk.  It does, as of 3.x, store it almost exactly (in general terms;
the minutiae obviously differ from system to system) like an RDBMS.  But
even before this, the inefficiency of the storage format doesn't change the
fact it is a "row store" - the literature makes no prescriptions on the
data format besides the spatial locality of rows vs columns.

This all ignores the LSMT confounder, which I am unsure is what you were
referring to.  That is largely orthogonal to this discussion AFAICT, though
if you wanted to call C* a "partitioned LSMT row store" I certainly
wouldn't object.  Of course the more qualifiers you add, the more it starts
to become a description rather than a named category / shorthand.  It's
also not clear C* will remain exclusively LSMT based indefinitely.

It seems like this conversation is a bit of a dead end to me, so I will try
really hard not to respond to further follow ups.  Regrettably,
https://xkcd.com/386/




On 3 October 2016 at 14:25, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> The phrase is defensible, but that is the root of the problem. Take for
> example a skateboard.
>
> "A skateboard is like a bike because it has wheels and you ride on it."
>
> That is true and defensibly true. :) However, with not much more text you
> can accurately describe what it is, as opposed to something it is almost
> like.
>
> "A skateboard is a thin piece of wood on top of four small wheels that you
> stand on and ride"
>
> The old Cassandra statement was something to the effect of "with
> the storage model of Bigtable and the consistency model of Dynamo". This
> accurately described the system and gave reference to specific known
> quantities (Bigtable/Dynamo) for which white papers existed for further
> reading.
>
> On Mon, Oct 3, 2016 at 6:24 AM, Benedict Elliott Smith <
> benedict@apache.org> wrote:
>
>> While that sentence leaves a lot to be desired (for me because it confers
>> a different meaning on row store), it doesn't say "Cassandra is like a
>> RDBMS" - it says "like an RDBMS, it organises data by rows and columns" -
>> i.e., in this regard only it is like an RDBMS, not more generally.
>>
>> I believe it was meant to help people, especially those afraid of the
>> NoSQL thrift world, understand that it still uses the basic concept of
>> rows and columns they are used to.  I agree it could be improved to
>> minimise the chance of misreading it, and I'm certain contributions would
>> be welcome here.
>>
>> I don't personally want to get bogged down in analysing every piece of
>> text anyone has ever written, so I'll bow out of further discussion on
>> this.  These phrases may all be suboptimal, but they are certainly
>> defensible.  Column store is not, that's all I wanted to contribute here.
>>
>>
>>
>>
>>
>> On 1 October 2016 at 19:35, Peter Lin <woolfel@gmail.com> wrote:
>>
>>> I'll second Ed's comment.
>>>
>>> The documentation should be more careful when using phrases such as "like
>>> relational databases". When we look at the history of relational databases,
>>> people expect certain things like ACID transactions, primary/foreign key
>>> constraints, query planners, joins and relational algebra. Clearly
>>> Cassandra's storage engine does not follow most of those principles, for
>>> good reason.
>>>
>>> The term "row-oriented storage" would be more descriptive and appropriate.
>>> It avoids conflating Cassandra's storage engine with "traditional" relational
>>> storage engines. Those of us who have spent over a decade using IBM DB2,
>>> Oracle, SQL Server and Sybase tend to think of relational databases in a
>>> certain way. If we go back to 1998, most RDBMS storage engines had a max row
>>> size limit. Databases like Sybase before version 9 preferred raw disk for
>>> optimal performance. I can go on and on, but there's no point really.
>>>
>>> Cassandra's storage engine is "row oriented", but it's not relational in
>>> the RDBMS sense. We do everyone a huge disservice by using confusing
>>> terminology and then making fun of those who get confused. No one wins when
>>> that happens. At the end of the day, what differentiates Cassandra's
>>> storage engine is that it supports static and dynamic columns, which
>>> traditional RDBMSs don't support today. Calling Cassandra storage
>>> "distributed tables" doesn't really help, in my biased opinion.
>>>
>>> For example, if you tell a SQL Server or Oracle RAC admin "Cassandra uses
>>> distributed tables" they might answer "so what, SQL Server and Oracle can
>>> do that too." The difference is that with an RDBMS the partitioning is
>>> optional and requires more work to configure. With Cassandra you can still
>>> have everything on 1 node, which means there is only 1 partition and it is
>>> no different from 1 instance of SQL Server. Where you win is when you need
>>> to add 2 more nodes: Cassandra makes this easier, whereas with SQL Server
>>> and Oracle you have to do a little bit more work. I've lost count of how
>>> many times I've explained NoSQL databases to RDBMS admins and had to
>>> explain that the official docs are stupid.
>>>
>>>
>>>
>>> On Sat, Oct 1, 2016 at 11:31 AM, Edward Capriolo <edlinuxguru@gmail.com>
>>> wrote:
>>>
>>>> https://github.com/apache/cassandra
>>>>
>>>> Row store <http://wiki.apache.org/cassandra/DataModel> means that like
>>>> relational databases, Cassandra organizes data by rows and columns. The
>>>> Cassandra Query Language (CQL) is a close relative of SQL.
>>>>
>>>> I generally do not know what to say about these high level
>>>> "oversimplifications" like "firewalls block hackers". Are there "firewalls"
>>>> or do they mean IP routers with layer 4 packet inspections and layer 3
>>>> Access Control Lists?
>>>>
>>>> We say (and I catch myself doing it all the time) "like relational
>>>> databases" often, as if all relational databases work alike. A columnar
>>>> store like HP Vertica is a relational database. MySQL has different storage
>>>> engines; does MyISAM work like InnoDB?
>>>>
>>>> Google Docs organizes data by rows and columns as well. You can wrap
>>>> any storage system in an API that makes it look like rows and columns.
>>>> Microsoft LINQ can enumerate your network cards and query them
>>>> https://msdn.microsoft.com/en-us/library/bb308959.aspx ; that really
>>>> does not make your network cards a "row store".
>>>>
>>>> "Theoretically a row can have 2 billion columns, but in practice it
>>>> shouldn't have more than 100 million columns."
>>>> In practice (in my experience) the number is much lower than 100
>>>> million, and if the data is actually deleted and re-added frequently the
>>>> number of live columns (rows, whatever) you can use happily is even lower.
>>>>
>>>>
>>>> I believe on twitter (I am unable to find the tweet) someone was trying
>>>> to convince me Cassandra was a "columnar analytic database".  ROFL
>>>>
>>>> I believe telling someone it is a "row store" "like a database" is not a
>>>> good idea. They might walk away content with that explanation. You are
>>>> setting them up to walk into an anti-pattern, like a case where the user is
>>>> writing and deleting 1 row and 1 column 6 billion times a day.
>>>> Then you end up explaining to them
>>>> http://stackoverflow.com/questions/21755286/what-exactly-happens-when-tombstone-limit-is-reached
>>>>
>>>>
>>>> and how the cassandra storage model is not "like a relational
>>>> database".
>>>>
>>>> On Fri, Sep 30, 2016 at 9:22 PM, Edward Capriolo <edlinuxguru@gmail.com
>>>> > wrote:
>>>>
>>>>> I can iterate over JSON data stored in mongo and present it as a table
>>>>> with rows and columns. It does not make mongo a rowstore.
>>>>>
>>>>> On Fri, Sep 30, 2016 at 9:16 PM, Edward Capriolo <
>>>>> edlinuxguru@gmail.com> wrote:
>>>>>
>>>>>> The problem with calling it a row store:
>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Row_(database)
>>>>>>
>>>>>> In the context of a relational database
>>>>>> <https://en.wikipedia.org/wiki/Relational_database>, a *row*—also
>>>>>> called a record
>>>>>> <https://en.wikipedia.org/wiki/Record_(computer_science)> or tuple
>>>>>> <https://en.wikipedia.org/wiki/Tuple>—represents a single,
>>>>>> implicitly structured data <https://en.wikipedia.org/wiki/Data> item
>>>>>> in a table <https://en.wikipedia.org/wiki/Table_(database)>. In
>>>>>> simple terms, a database table can be thought of as consisting of
>>>>>> *rows* and columns <https://en.wikipedia.org/wiki/Column_(database)>
>>>>>> or fields <https://en.wikipedia.org/wiki/Field_(computer_science)>.[1]
>>>>>> <https://en.wikipedia.org/wiki/Row_(database)#cite_note-1> Each
>>>>>> row in a table represents a set of related data, and every row in the
>>>>>> table has the same structure.
>>>>>>
>>>>>> When you have static columns and rows with maps and lists, it is
>>>>>> hard to argue that every row has the same structure. Physically, at the
>>>>>> storage layer, they do not have the same structure, and logically, when
>>>>>> accessing the data, they barely have the same structure, as the static
>>>>>> column just appears inside each row it is actually not contained in.
>>>>>>
>>>>>> On Fri, Sep 30, 2016 at 4:47 PM, Jonathan Haddad <jon@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1000 to what Benedict says. I usually call it a "partitioned row
>>>>>>> store" which usually needs some extra explanation but is more accurate
>>>>>>> than "column family" or whatever other thrift era terminology people
>>>>>>> still use.
>>>>>>> On Fri, Sep 30, 2016 at 1:53 PM DuyHai Doan <doanduyhai@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I used to present Cassandra as a NoSQL datastore with "distributed"
>>>>>>>> table. This definition is closer to CQL and has some academic
>>>>>>>> background (distributed hash table).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 30, 2016 at 7:43 PM, Benedict Elliott Smith <
>>>>>>>> benedict@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Cassandra is not a "wide column store" anymore.  It has a schema.
>>>>>>>>> Only thrift users no longer think they have a schema (though they
>>>>>>>>> do), and thrift is being deprecated.
>>>>>>>>>
>>>>>>>>> I really wish everyone would kill the term "wide column store"
>>>>>>>>> with fire.  It seems to have never meant anything beyond
>>>>>>>>> "schema-less, row-oriented", and a "column store" means literally
>>>>>>>>> the opposite of this.
>>>>>>>>>
>>>>>>>>> Not only that, but people don't even seem to realise the term
>>>>>>>>> "column store" existed long before "wide column store" and the
>>>>>>>>> latter is often abbreviated to the former, as here:
>>>>>>>>> http://www.planetcassandra.org/what-is-nosql/
>>>>>>>>>
>>>>>>>>> Since it no longer applies, let's all agree as a community to
>>>>>>>>> forget this awful nomenclature ever existed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 30 September 2016 at 18:09, Joaquin Casares <
>>>>>>>>> joaquin@thelastpickle.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mehdi,
>>>>>>>>>>
>>>>>>>>>> I can help clarify a few things.
>>>>>>>>>>
>>>>>>>>>> As Carlos said, Cassandra is a Wide Column Store. Theoretically a
>>>>>>>>>> row can have 2 billion columns, but in practice it shouldn't have
>>>>>>>>>> more than 100 million columns.
>>>>>>>>>>
>>>>>>>>>> Cassandra partitions data to certain nodes based on the partition
>>>>>>>>>> key(s), but does provide the option of setting zero or more
>>>>>>>>>> clustering keys. Together, the partition key(s) and clustering
>>>>>>>>>> key(s) form the primary key.
>>>>>>>>>>
>>>>>>>>>> When writing to Cassandra, you will need to provide the full
>>>>>>>>>> primary key, however, when reading from Cassandra, you only need
>>>>>>>>>> to provide the full partition key.
>>>>>>>>>>
>>>>>>>>>> When you only provide the partition key for a read operation,
>>>>>>>>>> you're able to return all columns that exist on that partition
>>>>>>>>>> with low latency. These columns are displayed as "CQL rows" to
>>>>>>>>>> make it easier to reason about.
>>>>>>>>>>
>>>>>>>>>> Consider the schema:
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE foo (
>>>>>>>>>>   bar uuid,
>>>>>>>>>>   boz uuid,
>>>>>>>>>>   baz timeuuid,
>>>>>>>>>>   data1 text,
>>>>>>>>>>   data2 text,
>>>>>>>>>>   PRIMARY KEY ((bar, boz), baz)
>>>>>>>>>> );
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When you write to Cassandra you will need to send bar, boz, and
>>>>>>>>>> baz and optionally data*, if it's relevant for that CQL row. If
>>>>>>>>>> you choose not to define a data* field for a particular CQL row,
>>>>>>>>>> then nothing is stored nor allocated on disk. But I wouldn't
>>>>>>>>>> consider that caveat to be "schema-less".
>>>>>>>>>>
>>>>>>>>>> However, all writes to the same bar/boz will end up on the same
>>>>>>>>>> Cassandra replica set (a configurable number of nodes) and be
>>>>>>>>>> stored on the same place(s) on disk within the SSTable(s). And on
>>>>>>>>>> disk, each field that's not a partition key is stored as a
>>>>>>>>>> column, including clustering keys (this is optimized in Cassandra
>>>>>>>>>> 3+, but now we're getting deep into internals).
>>>>>>>>>>
>>>>>>>>>> In this way you can get fast responses for all activity for
>>>>>>>>>> bar/boz either over time, or for a specific time, with roughly
>>>>>>>>>> the same number of disk seeks, with varying lengths on the disk
>>>>>>>>>> scans.
>>>>>>>>>> Hope that helps!
>>>>>>>>>>
>>>>>>>>>> Joaquin Casares
>>>>>>>>>> Consultant
>>>>>>>>>> Austin, TX
>>>>>>>>>>
>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>> http://www.thelastpickle.com
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 30, 2016 at 11:40 AM, Carlos Alonso <
>>>>>>>>>> info@mrcalonso.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Cassandra is a Wide Column Store
>>>>>>>>>>> http://db-engines.com/en/system/Cassandra
>>>>>>>>>>>
>>>>>>>>>>> Carlos Alonso | Software Engineer | @calonso
>>>>>>>>>>> <https://twitter.com/calonso>
>>>>>>>>>>>
>>>>>>>>>>> On 30 September 2016 at 18:24, Mehdi Bada <
>>>>>>>>>>> mehdi.bada@dbi-services.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a theoretical question:
>>>>>>>>>>>> - Is Apache Cassandra really a column store?
>>>>>>>>>>>> Column store means storing the data as columns rather than as
>>>>>>>>>>>> rows.
>>>>>>>>>>>>
>>>>>>>>>>>> In fact C* stores the data as rows, and data is partitioned by
>>>>>>>>>>>> row key.
>>>>>>>>>>>>
>>>>>>>>>>>> Finally, for me, Cassandra is a row-oriented, schema-less
>>>>>>>>>>>> DBMS.... Is it true for you also???
>>>>>>>>>>>>
>>>>>>>>>>>> Many thanks in advance for your reply
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards
>>>>>>>>>>>> Mehdi Bada
>>>>>>>>>>>> ----
>>>>>>>>>>>>
>>>>>>>>>>>> *Mehdi Bada* | Consultant
>>>>>>>>>>>> Phone: +41 32 422 96 00 | Mobile: +41 79 928 75 48 | Fax: +41
>>>>>>>>>>>> 32 422 96 15
>>>>>>>>>>>> dbi services, Rue de la Jeunesse 2, CH-2800 Delémont
>>>>>>>>>>>> mehdi.bada@dbi-services.com
>>>>>>>>>>>> www.dbi-services.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *⇒ dbi services is recruiting Oracle & SQL Server experts ! –
>>>>>>>>>>>> Join the team
>>>>>>>>>>>> <http://www.dbi-services.com/fr/dbi-services-et-ses-collaborateurs/offres-emplois-opportunites-carrieres/>*
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
