incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject Re: Materialized Views or Index CF - data model question
Date Wed, 11 Apr 2012 18:49:11 GMT
> a) "These queries are not easily supported on standard Cassandra"
> select * from book where price  < 992   order by price descending limit 30;
> 
> This is a typical (time series data)timeline query well supported by
> Cassandra, from my understanding.
Queries that use a secondary index (on price) must include an equality operator. 

> 
> b) "You do not need a different CF for each custom secondary index.
> Try putting the name of the index in the row key. "
> 
> I couldn't understand it. Can you help to build an demo with CF
> structure and some sample data?
You can have one CF that contains multiple secondary indexes. 

key: col_1:value_1
col_name: entity_id_1

key: col_2:value_2
col_name: entity_id_1

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/04/2012, at 7:24 AM, Data Craftsman wrote:

> Hi Aaron,
> 
> Thanks for the quick answer, I'll build a prototype to benchmark each
> approach next week.
> 
> Here are more questions based on your reply:
> 
> a) "These queries are not easily supported on standard Cassandra"
> select * from book where price  < 992   order by price descending limit 30;
> 
> This is a typical (time series data)timeline query well supported by
> Cassandra, from my understanding.
> 
> b) "You do not need a different CF for each custom secondary index.
> Try putting the name of the index in the row key. "
> 
> I couldn't understand it. Can you help to build an demo with CF
> structure and some sample data?
> 
> Thanks,
> Charlie | DBA developer
> 
> 
> 
> On Sun, Apr 8, 2012 at 2:30 PM, aaron morton <aaron@thelastpickle.com> wrote:
>> We need to query data by each column, do pagination as below,
>> 
>> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
>> select * from book where price  < 992   order by price  descending limit 30;
>> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
>> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
>> ...
>> select * from book where col_nm < 978 order by col_nm descending limit 30;
>> 
>> These queries are not easily supported on standard Cassandra. If you need
>> this level of query complexity consider Data Stax Enterprise, Solr, or a
>> RDBMS.
>> 
>> If we choose Materialized Views approach, we have to update all
>> 20 Materialized View column family(s), for each base row update.
>> Will the Cassandra write performance acceptable?
>> 
>> Yes, depending on the size of the cluster and the machine spec.
>> 
>> It's often a good idea to design CF's to match the workloads. If you have
>> some data that changes faster than other, consider splitting them into
>> different CFs.
>> 
>> Should we just normalize the data, create base book table with book_id
>> as primary key, and then
>> build 20 index column family(s), use wide row column slicing approach,
>> with index column data value as column name and book_id as value?
>> 
>> You do not need a different CF for each custom secondary index. Try putting
>> the name of the index in the row key.
>> 
>> What will you recommend?
>> 
>> Take another look at the queries you *need* to support. Then build a small
>> proof of concept to see if Cassandra will work for you.
>> 
>> Hope that helps.
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 6/04/2012, at 6:46 AM, Data Craftsman wrote:
>> 
>> Howdy,
>> 
>> Can I ask a data model question here?
>> 
>> We have a book table with 20 columns, 300 million rows, average row
>> size is 1500 bytes.
>> 
>> create table book(
>> book_id,
>> isbn,
>> price,
>> author,
>> titile,
>> ...
>> col_n1,
>> col_n2,
>> ...
>> col_nm
>> );
>> 
>> Data usage:
>> 
>> We need to query data by each column, do pagination as below,
>> 
>> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
>> select * from book where price  < 992   order by price  descending limit 30;
>> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
>> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
>> ...
>> select * from book where col_nm < 978 order by col_nm descending limit 30;
>> 
>> Write: 100 million updates a day.
>> Read : 16  million queries a day. 200 queries per second, one query
>> returns 30 rows.
>> 
>> ***
>> Materialized Views approach
>> 
>> {"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
>> ...
>> We will end up with 20 timelines.
>> 
>> 
>> ***
>> Index approach - create 2nd Column Family as Index
>> 
>> 'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
>> 'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
>> ...
>> 'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'
>> 
>> This way, we will create 20 index Column Family(s).
>> 
>> ---
>> 
>> If we choose Materialized Views approach, we have to update all
>> 20 Materialized View column family(s), for each base row update.
>> Will the Cassandra write performance acceptable?
>> 
>> Redis recommend building an index for the query on each column, that
>> is your 1st strategy - create 2nd index CF:
>> http://redis.io/topics/data-types-intro
>> (see section [ Pushing IDs instead of the actual data in Redis lists ]
>> 
>> Should we just normalize the data, create base book table with book_id
>> as primary key, and then
>> build 20 index column family(s), use wide row column slicing approach,
>> with index column data value as column name and book_id as value?
>> This way, we only need to update fewer affected column family that
>> column value changed, but not all 20 Materialized Views CF(s).
>> 
>> Another option would be using Redis to store master book data, using
>> Cassandra Column Family to maintain 2nd index.
>> 
>> What will you recommend?
>> 
>> Charlie (@mujiang) 一个 木匠
>> =======
>> Data Architect Developer
>> http://mujiang.blogspot.com
>> 
>> 
>> p.s.
>> 
>> Gist from datastax dev blog (
>> http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
>> "
>> If the same event is tracked in multiple timelines,
>> it's okay to denormalize and store all of the event data in each of
>> those timelines.
>> One of the main principles that Cassandra was built on is that disk
>> space is very cheap resource;
>> minimizing disk seeks at the cost of higher space consumption is a
>> good tradeoff.
>> Unless the data for each event is ^very large^, I always prefer this
>> strategy over the index strategy.
>> "
>> 
>> Will 1500 bytes row size be large or small for Cassandra from your
>> understanding?
>> 
>> 
> 
> 
> 
> -- 
> --
> Thanks,
> 
> Charlie (@mujiang) 一个 木匠
> =======
> Data Architect Developer
> http://mujiang.blogspot.com


Mime
View raw message