cassandra-user mailing list archives

From Carlos Alonso <i...@mrcalonso.com>
Subject Re: Efficient Paging Option in Wide Rows
Date Sun, 24 Apr 2016 17:16:01 GMT
Hi Anuj,

That's a very good question, and I'd like to hear from anyone who can
give a detailed answer, but in the meantime I'll try to give my two
cents.

First of all, I would rather split the values across different
partition keys, for two reasons:
1.- Since you always access all the data at the same time, you can
parallelize the queries and hit more nodes in your cluster, rather than
creating a hotspot on the owner(s) of a single partition.
2.- It is recommended practice to keep partitions reasonably small.
Check whether your partition would stay within the guidelines by
applying the formulae from this video:
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
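
To make point 1 a bit more concrete, here is a rough sketch of the
fan-out idea in plain Python. It is only an illustration under stated
assumptions, not driver code: the in-memory dict stands in for the index
table, and make_fake_table, fetch_page, read_bucket and read_all_buckets
are hypothetical names. A real client would page with the driver's
paging state rather than with offsets.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 500  # page size mentioned in the thread


def make_fake_table(num_buckets, rows_per_bucket):
    """Hypothetical stand-in for the index table: bucket id -> row keys."""
    return {
        b: [f"rowkey-{b}-{i}" for i in range(rows_per_bucket)]
        for b in range(num_buckets)
    }


def fetch_page(table, bucket, offset, limit=PAGE_SIZE):
    """Stand-in for one paged query against one partition (assumption:
    offsets replace the paging state a real driver would hand back)."""
    return table[bucket][offset:offset + limit]


def read_bucket(table, bucket):
    """Page through a single partition sequentially, one page per call."""
    rows, offset = [], 0
    while True:
        page = fetch_page(table, bucket, offset)
        if not page:
            break
        rows.extend(page)
        offset += len(page)
    return rows


def read_all_buckets(table, max_workers=8):
    """Run one task per partition so different partitions are read
    concurrently; pages inside a partition are still read in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_bucket = pool.map(lambda b: read_bucket(table, b), sorted(table))
        return [row for rows in per_bucket for row in rows]
```

With 100 buckets of 500 keys each, every bucket is one page and the 100
reads can be in flight at the same time, possibly on different nodes.
With a single partition of 50000 keys, the 100 pages have to be fetched
one after another because each page depends on where the previous one
ended.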

Cheers!

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 23 April 2016 at 20:25, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:

> Hi,
>
> Can anyone take this question?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>
> On Sat, 23 Apr, 2016 at 2:30 PM, Anuj Wadehra
> <anujw_2003@yahoo.co.in> wrote:
> I think I complicated the question, so I am trying to state it more
> crisply.
>
> We have a table defined with a clustering key/column. We have 50000
> different clustering key values.
>
> If we want to fetch all 50000 rows, which query option would be faster
> and why?
>
> 1. Given a single partition key with 50000 clustering keys, we will
> page through the single partition 500 records at a time. Thus, we will
> do 50000/500=100 db hits, all for the same partition key.
>
> 2. Given 100 different partition keys, each having just 500 clustering
> key columns. Here also we will need 100 db hits, but for different
> partitions.
>
>
> Basically I want to understand any optimizations built into CQL/Cassandra
> which make paging through a single partition more efficient than querying
> data from different partitions.
>
>
> Thanks
> Anuj
>
>
> On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra
> <anujw_2003@yahoo.co.in> wrote:
> Hi,
>
> I have a wide row index table so that I can fetch all row keys
> corresponding to a column value.
>
> A row of index_table will look like:
>
> ColValue1:bucket1 >> rowkey1, rowkey2.. rowkeyn
> ......
> ColValue1:bucketn >> rowkey1, rowkey2.. rowkeyn
>
> We will have buckets to avoid hotspots. Row keys of the main table are
> random numbers, and we will never do a column slice like:
>
> SELECT * FROM index_table WHERE key = xxx AND
> col > rowkey1 AND col < rowkey10
>
> Also, we will ALWAYS fetch all data for a given value of the index
> column. Thus all buckets have to be read.
>
> Each index column value can map to thousands or even millions of row
> keys in the main table.
>
> Based on our use case, there are two design choices in front of me:
>
> 1. Have a large number of buckets/rows for an index column value and
> less data (around a few thousand entries) in each row.
>
> Thus, every time we want to fetch all row keys for an index col value, we
> will query more rows and for each row we will have to page through data 500
> records at a time.
>
> 2. Have fewer buckets/rows for an index column value.
>
> Every time we want to fetch all row keys for an index col value, we will
> query a smaller number of wider rows and then page through each wide row,
> reading 500 columns at a time.
>
>
> Which approach is more efficient?
>
> Approach 1: a larger number of rows with less data in each row
>
> OR
>
> Approach 2: a smaller number of rows with more data in each row
>
>
> Either way, we are fetching only 500 records at a time in a query. Even
> in approach 2 (wider rows), each query reads only a small slice of 500
> records.
>
>
> Thanks
> Anuj
>
>
>
>
>
>
