cassandra-user mailing list archives

From Leena Ghatpande <lghatpa...@hotmail.com>
Subject Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
Date Tue, 12 Mar 2019 13:02:29 GMT
Our data model cannot follow the recommendation quoted below, because the majority of our reads
need to select the data by the partition key (id) only, not by date.

"You could remodel your data in such a way that the primary key looks like this
((date), hour-minute, id)
or
((date, hour-minute), id)"


By adding the date as a clustering column, yes, the idea was to use ALLOW FILTERING on the
date to pull the records. We understand that this is not recommended, but we have been doing
it on another existing large table and have not run into any issues so far. Still, we would
like to understand whether there is a better approach.

Thanks

________________________________
From: Stefan Miklosovic <stefan.miklosovic@instaclustr.com>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org
Subject: Re: Migrate large volume of data from one table to another table within the same
cluster when COPY is not an option.

The query that does not work should look like this; I made a mistake in my previous message:

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this
query as it might involve data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <stefan.miklosovic@instaclustr.com> wrote:
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering column to be able
to query for date ranges, but partition key to clustering key will be 1-1. Is this a good
approach?"

If you want to select by a time range, I am wondering how making the datetime a clustering
column would help you here. You still have to provide the partition key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date
< tomorrow (you get the idea)

If you make my_date a clustering column, you cannot run the query below, because you still have
to specify the partition key in full and then, optionally, the clustering key, on which you can
further order and do ranges. You cannot run a query without specifying the partition key. Well,
you can with ALLOW FILTERING, but you do not want to do that at all in your situation, as it
would scan everything.

select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow
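
If a scan cannot be avoided, one way to keep each request bounded is to walk the table in
token-range slices instead of issuing a single ALLOW FILTERING query. The Python sketch below
is only an illustration: it assumes the default Murmur3Partitioner (tokens from -2^63 to
2^63 - 1), and the helper names and the slice count are made up for the example. Filtering on
my_date would then happen per slice, on the client side.

```python
# Token bounds of Murmur3Partitioner (assumed to be the cluster's partitioner).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(pieces):
    """Split the whole token ring into `pieces` contiguous (start, end) pairs."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // pieces for i in range(pieces)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def slice_query(table, partition_key, start, end, first):
    """Build the SELECT for one slice; only the first slice includes its lower bound."""
    lower = ">=" if first else ">"
    return (
        f"SELECT * FROM {table} "
        f"WHERE token({partition_key}) {lower} {start} "
        f"AND token({partition_key}) <= {end}"
    )


# Hypothetical usage: four slices over the table from the example above.
slices = split_token_ring(4)
queries = [
    slice_query("your_keyspace.your_table", "id", start, end, first=(i == 0))
    for i, (start, end) in enumerate(slices)
]
```

Each slice can be run and paged independently, so the work can be throttled or resumed; the
total amount of data read is still the whole table.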

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f'
and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant
(6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f
and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such a way that the primary key looks like this

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one. With the first one, if you expect a lot of data per day, it
would all end up on the same set of replicas, because the hash of the partition key is the same
for the whole day, so I think you would end up with hotspots. You want to have your data spread
more evenly, so the second one seems better to me.
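
To make the second layout concrete: with ((date, hour-minute), id), a date-range read turns
into one query per (date, hour-minute) partition in the range. A minimal Python sketch,
assuming one-minute buckets, a 'YYYY-MM-DD' / 'HH:MM' text encoding, and a hypothetical table
my_table_by_time with columns day and hour_minute (none of these names come from the thread):

```python
from datetime import datetime, timedelta


def minute_buckets(start, end):
    """Return the (day, hour_minute) partition keys covering [start, end)."""
    # Truncate to the minute so the first bucket is aligned to a partition boundary.
    current = start.replace(second=0, microsecond=0)
    buckets = []
    while current < end:
        buckets.append((current.strftime("%Y-%m-%d"), current.strftime("%H:%M")))
        current += timedelta(minutes=1)
    return buckets


# One fully specified partition per query (hypothetical table and column names).
QUERY = "SELECT * FROM my_keyspace.my_table_by_time WHERE day = ? AND hour_minute = ?"

# A three-minute window that crosses midnight touches three partitions.
buckets = minute_buckets(datetime(2019, 3, 11, 23, 58), datetime(2019, 3, 12, 0, 1))
```

At this granularity a full day is 1440 partitions, so a coarser bucket (hourly, for example)
may suit ad hoc range reads better, at the cost of larger partitions.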

You can also investigate how to do this with a materialized view, but I am not sure about the
performance here.

If you want to copy the data, you can do this with, e.g., the Spark Cassandra Connector: you
would just read the table and, as you read it, write to the other one. That is IMHO the fastest
and least error-prone approach. You can do that on live production data and just make a
"switch" afterwards. I am not sure about TTLs, but that should be transparent while copying.
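
On the TTL point: CQL can return the remaining TTL and the write timestamp of a
non-primary-key column with TTL(col) and WRITETIME(col), and a write can set both with
USING TTL ... AND TIMESTAMP .... Whatever tool does the copy, the per-row write could look
roughly like the statement built below. This is a Python sketch with hypothetical names
(build_copy_insert, table1, table2, the created column); it assumes every column in a row
shares one TTL and write time, which may not hold if columns were updated separately.

```python
def build_copy_insert(table, columns, remaining_ttl, writetime):
    """Build an INSERT that carries over a row's remaining TTL and write time.

    remaining_ttl: seconds as returned by TTL(col), or None if the row never expires.
    writetime:     microseconds since the epoch as returned by WRITETIME(col).
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    options = []
    if remaining_ttl is not None:
        options.append(f"TTL {int(remaining_ttl)}")
    # Keeping the original timestamp lets a newer live write to the target table
    # win over the copied value under last-write-wins.
    options.append(f"TIMESTAMP {int(writetime)}")
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"USING {' AND '.join(options)}"
    )


# Source side of the copy (hypothetical table and column names).
SOURCE_QUERY = "SELECT id, created, TTL(created), WRITETIME(created) FROM my_keyspace.table1"

statement = build_copy_insert("my_keyspace.table2", ["id", "created"], 3600, 1552395749000000)
```

Because the TTL read from the source is the remaining lifetime, the copied row should expire
at about the same time as the original rather than restarting its full TTL.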

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpande@hotmail.com> wrote:

We have a table with over 70M rows with a partition key that is unique. We have a created
datetime stamp on each record, and we need to select all rows created within a date range.
A secondary index is not an option, as the column has high cardinality and the index could
slow performance by doing a full scan on 70M rows.


We are thinking of creating a new table with a date field as a clustering column to be able
to query for date ranges, but partition key to clustering key will be 1-1. Is this a good
approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same
cluster, while updates are still happening to table1. We need to do this in real time without
impacting our customers. COPY is not an option, as we have TTLs on each row in table1 that
need to be applied to table2 as well.


So what would be the best approach:

  1.  To be able to select data using a date range without impacting performance. This operation
will be needed only on an ad hoc basis and it won't be frequent.
  2.  To migrate a large volume of data with TTLs from one table to another within the
same cluster.


Any other suggestions also will be greatly appreciated.



Stefan Miklosovic
