incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Data Craftsman <database.crafts...@gmail.com>
Subject Materialized Views or Index CF - data model question
Date Thu, 05 Apr 2012 18:46:35 GMT
Howdy,

Can I ask a data model question here?

We have a book table with 20 columns, 300 million rows, average row
size is 1500 bytes.

create table book(
 book_id,
 isbn,
 price,
 author,
 titile,
 ...
 col_n1,
 col_n2,
 ...
 col_nm
);

Data usage:

We need to query data by each column, do pagination as below,

select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
select * from book where price  < 992   order by price  descending limit 30;
select * from book where col_n1 < 789   order by col_n1 descending limit 30;
select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
...
select * from book where col_nm < 978 order by col_nm descending limit 30;

Write: 100 million updates a day.
Read : 16  million queries a day. 200 queries per second, one query
returns 30 rows.

***
Materialized Views approach

{"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
...
We will end up with 20 timelines.


***
Index approach - create 2nd Column Family as Index

'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
...
'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'

This way, we will create 20 index Column Family(s).

---

If we choose Materialized Views approach, we have to update all
20 Materialized View column family(s), for each base row update.
Will the Cassandra write performance acceptable?

Redis recommend building an index for the query on each column, that
is your 1st strategy - create 2nd index CF:
http://redis.io/topics/data-types-intro
(see section [ Pushing IDs instead of the actual data in Redis lists ]

Should we just normalize the data, create base book table with book_id
as primary key, and then
build 20 index column family(s), use wide row column slicing approach,
with index column data value as column name and book_id as value?
This way, we only need to update fewer affected column family that
column value changed, but not all 20 Materialized Views CF(s).

Another option would be using Redis to store master book data, using
Cassandra Column Family to maintain 2nd index.

What will you recommend?

Charlie (@mujiang) 一个 木匠
=======
Data Architect Developer
http://mujiang.blogspot.com


p.s.

Gist from datastax dev blog (
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
"
 If the same event is tracked in multiple timelines,
 it’s okay to denormalize and store all of the event data in each of
those timelines.
 One of the main principles that Cassandra was built on is that disk
space is very cheap resource;
 minimizing disk seeks at the cost of higher space consumption is a
good tradeoff.
 Unless the data for each event is ^very large^, I always prefer this
strategy over the index strategy.
"

Will 1500 bytes row size be large or small for Cassandra from your
understanding?

Mime
View raw message