cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Trivial Update of "ThomasBoose/EERD model components to Cassandra Column family's" by ThomasBoose
Date Sat, 11 Dec 2010 18:42:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "ThomasBoose/EERD model components to Cassandra Column family's" page has been changed
by ThomasBoose.
http://wiki.apache.org/cassandra/ThomasBoose/EERD%20model%20components%20to%20Cassandra%20Column%20family%27s?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  If this is not yet making sence, read on.
  
+ 
+ 
- == Indexing ==
+ === Indexing ===
  In order to add an index to a column, other then the ColumnFamily key, we should to create
a second ColumnFamily. Every insert, which can be either an insert or update in Cassandra,
on the original ColumnFamily we will update the corresponding index.
  
  Think of a ColumnFamily cf_Person (examples in Python using pycassa)
@@ -25, +27 @@

  cfi_Person_City.insert ('Haarlem', {'234':''})
  }}}
  This way a hash will be created containing columns for every person's key that lives in
a specific City. The ColumnFamily architecture of Cassandra can store a unlimited number of
columns for each key. This meens that when deleting a person it's reference in the cfi_Person_City
index should be removed first. When updating a person, maybe moving to anothor City, we have
to remove the element from the cfi_Person_City first and then store it with the corresponding
new City.'' ''
+ 
+ === Deleting values ===
+ Because of the way Cassandra clusters operate it is nearly impossible to delete values and
know for sure the values are deleted on everynode. If values would simply be deleted and afterwards
a node turns up which still holds the given value it would replicate ths value back to the
existing nodes. Read more about deleting values at: DistributedDeletes
+ 
+ Cassandra on the other hand is isanely fast at inserting and updating values. This is why
I would advise any programmer trying to build DBMS logic to introduce a value that "meens
" deleted and have you DBMS tier respond "Does not exist" in both cases, when values actualy
do not exist or contain the "deleted" value. Cassandra makes no distinction between updates
and inserts so updating from "deleted" and inserting can be achieved the same way.
  
  == Relations ==
  === 1 on 1 ===
@@ -45, +52 @@

  ||John ||0555-123456 ||10.000 ||
  ||<style="text-align: center;" |2>321-21-4321 ||name ||phone ||salary ||
  ||Jane ||0555-654321 ||12.000 ||
- 
  
  
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: center;">CF_Phone
''' ''' ||
@@ -92, +98 @@

  ||Jane ||SE ||Amsterdam ||
  
  
- 
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: center;">CF_School_Unit
''' ''' ||
  ||<style="text-align: center;" |2>SE ||name ||loc ||
  ||software engineering ||hsl ||
  
  
- 
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: center;">CFK_School_Unit_Student
''' ''' ||
  ||<style="text-align: center;" |2>SE ||123-12-1234 ||321-21-4321 ||
- ||             ||             ||
+ || || ||
  
  
  
@@ -118, +122 @@

  As it is perfectly valid in a relational database to have a key value composed of several
columns, in cassandra there is only one key per ColumnFamily. I did already discuss the need
 to create seperate  ColumnFamily's for one to many relationships given the fact that you
can never tell for sure whether or not maybe in the future a new relation will popup to another
entity sharing the same keyvalues. This means that we will need 5 ColumnFamily's to implement
the model above:
  
  CF_Order is a straight forward  ColumnFamily, Modeled after the design
- 
  ||||||||<style="text-align: center;">CF_Order ''' ''' ||
  ||<style="text-align: center;" |2>1234 ||order_date ||order_discount ||customer_id
||
  ||20100808 ||0.4 ||1234 ||
- ||<style="text-align: center;" |2>4321 ||customer_id ||order_date ||             ||
+ ||<style="text-align: center;" |2>4321 ||customer_id ||order_date || ||
- ||3451 ||20100802 ||             ||
+ ||3451 ||20100802 || ||
- ||<style="text-align: center;" |2>1354 ||order_discount ||order_date ||          
  customer_id||
+ ||<style="text-align: center;" |2>1354 ||order_discount ||order_date ||customer_id
||
- ||3 ||20100802 ||             3451||
+ ||3 ||20100802 ||3451 ||
  
  
  
  
  The same for CF_Product
- 
  ||||||||<style="text-align: center;">CF_Product ''' ''' ||
  ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||description ||image ||list_price
||
- ||5 GB Seagate harddisk ||             ||130 ||
+ ||5 GB Seagate harddisk || ||130 ||
  ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||description ||image ||list_price
||
- ||Wireless keyboard ||             ||75 ||
+ ||Wireless keyboard || ||75 ||
  
  
  
  
  As we cannot itterate throu a ColumnFamily based on any attribute except its key and we
wan't to use a solution to relations that is generic we create a seperate ColumnFamily for
each one to many relationship. As we know that a many to many relationship is typicly solved
by introducing 2 one to many relationships to a newly created ColumnFamily. So there will
be one ColumnFamily to store which orders a product can be found on.
- 
  ||||||||<style="text-align: center;">CF_Product_Order ''' ''' ||
- ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||1234 ||4321 ||             1354||
+ ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||1234 ||4321 ||1354 ||
- ||             ||             ||             ||
+ || || || ||
- ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||4321 ||             ||       
     ||
+ ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||4321 || || ||
- ||             ||             ||             ||
+ || || || ||
  
  
  
  
  And we'll create one ColumnFamily that stores which products can be found on each order.
We can see that this solution tends to invite anomalies. product that have 3 order columns
but no reference to the product in the CF_Order_Product ColumnFamily. It up to the programmers
of the DBMS tier to make sure genic code is available to keep the ColumnFamily's consistent
al of the time.
- 
- ||||||||<style="text-align: center;">CF_Order_Product||
+ ||||||||<style="text-align: center;">CF_Order_Product ||
  ||<style="text-align: center;" |2>1234 ||DSK_SGT_5GB_7200 || || ||
  || || || ||
  ||<style="text-align: center;" |2>4321 ||KBD_LGT_WRL_350 ||DSK_SGT_5GB_7200 || ||
@@ -163, +163 @@

  || || || ||
  
  
+ 
+ 
  ''' '''Last but not least we'll introduce a ColumnFamily to store the attributes connected
to the relationship between Product and Order. As mentioned it is not possible to create a
key containing 2 seperate values. So we'll have to introduce a new value that conatenates
both values using a strikt format. These formats should be part of a design if implementing
a (E)ERD in Cassandra.
- 
  ||||||||<style="text-align: center;">CF_order_line''' ''' ||
- ||<style="text-align: center;" |2>1234_DSK_SGT_5GB_7200 ||number ||delivery_date ||
            ||
+ ||<style="text-align: center;" |2>1234_DSK_SGT_5GB_7200 ||number ||delivery_date ||
||
- ||5 ||             20101215||             ||
+ ||5 ||20101215 || ||
  ||<style="text-align: center;" |2>4321_KBD_LGT_WRL_350 ||product_discount ||number
||delivery_date ||
- ||0.3 ||             1||20100901 ||
+ ||0.3 ||1 ||20100901 ||
  ||<style="text-align: center;" |2>4321_DSK_SGT_5GB_7200 ||product_discount ||number
||delivery_date ||
- ||5 ||2||20100901 ||
+ ||5 ||2 ||20100901 ||
  ||<style="text-align: center;" |2>1354_DSK_SGT_5GB_7200 ||product_discount ||number
||delivery_date ||
- ||5 ||8||20100901 ||
+ ||5 ||8 ||20100901 ||
  
  
  

Mime
View raw message