cassandra-user mailing list archives

From Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
Subject Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
Date Tue, 12 Mar 2019 22:46:18 GMT
Hi Sean,

for sure, the best approach would be to create another table which would
serve just that specific query.

How do I set the flag in cassandra.yaml to disallow ALLOW FILTERING? I
read the docs and there seems to be nothing about that.

Regards

On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SEAN_R_DURITY@homedepot.com>
wrote:

> If there are 2 access patterns, I would consider having 2 tables. The
> first one keyed by the ID, which you say is the majority use case. Then
> have a second table that uses a time-bucket approach, as others have
> suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
>
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look-up the data in the primary table as
> needed.
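>
> For illustration, a minimal sketch of such a second table, assuming a
> day-sized bucket (table and column names here are hypothetical):
>
> CREATE TABLE my_keyspace.my_table_by_bucket (
>     bucket date,         -- the time bucket, e.g. one partition per day
>     id uuid,             -- the original partition key, now clustering
>     created timestamp,
>     PRIMARY KEY ((bucket), id)
> );
>
> -- one partition read per bucket, no ALLOW FILTERING needed:
> SELECT id FROM my_keyspace.my_table_by_bucket WHERE bucket = '2019-03-12';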
>
>
>
> If you use allow filtering, you are setting yourself up for failure to
> scale. I tell my developers, “if you use allow filtering, you are doing it
> wrong.” In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
>
>
>
>
>
>
>
> *From:* Leena Ghatpande <lghatpande@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <stefan.miklosovic@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Our data model cannot be what you have recommended below, as the
> majority of reads need to select the data by the partition key (id) only,
> not by date.
>
> You could remodel your data in such a way that you make the primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
>
>
>
>
> By adding the date as a clustering column, yes, the idea was to use
> ALLOW FILTERING on the date to pull the records. We understand that this
> is not recommended, but we have been doing it on another existing large
> table and have not run into any issues so far. But we want to understand
> whether there is a better approach to this.
>
>
>
> Thanks
>
>
> ------------------------------
>
> *From:* Stefan Miklosovic <stefan.miklosovic@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
>
>
> The query which does not work should look like this; I made a mistake
> there:
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
>
>
>
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Leena,
>
>
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
>
>
> If you want to select by some time range here, I am wondering how making
> the datetime a clustering column would help you. You still have to provide
> the partition key, right?
>
>
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you get the idea).
>
>
>
> If you make my_date a clustering column, you cannot do the query below,
> because you still have to specify the partition key fully and then,
> optionally, the clustering key, on which you can further order and do
> ranges. You can't run a query without specifying the partition key at all.
> Well, you can with ALLOW FILTERING, but you do not want to do that in your
> situation, as it would scan everything.
>
>
>
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
>
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
>
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>
>
> You could remodel your data in such a way that you make the primary key
> like this
>
>
>
> ((date), hour-minute, id)
>
>
>
> or
>
>
>
> ((date, hour-minute), id)
>
>
>
> I would prefer the second one. If you expect a lot of data per day, with
> the first key it would all end up on the same set of replicas, because the
> hash of the partition key would stay the same for the whole day, so I
> think you would end up with hotspots. You want your data spread more
> evenly, so the second one seems better to me.
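>
> As a sketch (names are placeholders; pick whatever bucket granularity
> fits your volume):
>
> CREATE TABLE your_keyspace.your_table_by_time (
>     day date,
>     hour_minute text,   -- e.g. '09:02'; spreads one day over many partitions
>     id uuid,
>     PRIMARY KEY ((day, hour_minute), id)
> );
>
> -- each query hits exactly one bucket; covering a whole day means
> -- iterating over the day's buckets client-side:
> SELECT * FROM your_keyspace.your_table_by_time
> WHERE day = '2019-03-12' AND hour_minute = '09:02';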
>
>
>
> You can also investigate how to do this with a materialized view, but I
> am not sure about the performance there.
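>
> A sketch of what that might look like, assuming a created timestamp
> column on the base table (all names hypothetical):
>
> CREATE MATERIALIZED VIEW your_keyspace.your_table_by_date AS
>     SELECT * FROM your_keyspace.your_table
>     WHERE created IS NOT NULL AND id IS NOT NULL
>     PRIMARY KEY ((created), id);
>
> -- note: a raw timestamp partition key still cannot be range-queried, so
> -- in practice the view would need a bucketed column in its key as well.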
>
>
>
> If you want to copy the data, you can do it e.g. with the Cassandra
> Spark connector: you would just read the table and, as you read it, write
> to the other one. That is imho the fastest and least error-prone approach.
> You can do it on live production data and just make a "switch"
> afterwards. Not sure about TTLs, but those should be transparent while
> copying.
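>
> A rough sketch of that with the Spark Cassandra connector, in Scala (the
> host, keyspace and table names are assumptions, and both tables are
> assumed to share the same column names):
>
> import com.datastax.spark.connector._
> import org.apache.spark.{SparkConf, SparkContext}
>
> object CopyTable {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("table1-to-table2-copy")
>       .set("spark.cassandra.connection.host", "127.0.0.1")
>     val sc = new SparkContext(conf)
>
>     // read every row of table1 and write it straight into table2
>     sc.cassandraTable("my_keyspace", "table1")
>       .saveToCassandra("my_keyspace", "table2")
>
>     sc.stop()
>   }
> }
>
> One caveat: this does not carry the remaining per-cell TTLs across by
> itself; the connector's WriteConf/TTLOption settings would have to be
> used for that.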
>
>
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpande@hotmail.com>
> wrote:
>
> We have a table with over 70M rows and a partition key that is unique.
> We have a created datetime stamp on each record, and we need to select all
> rows created within a date range. A secondary index is not an option, as
> the column is high cardinality and a full scan of the 70M rows could hurt
> performance.
>
>
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this in real time without impacting our customers.
> COPY is not an option, as we have TTLs on each row of table1 that need to
> be applied to table2 as well.
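>
> For reference, a remaining TTL can be read per cell with the ttl()
> function and re-applied with USING TTL, which is what any copy of table1
> has to do row by row (column names here are hypothetical):
>
> SELECT id, created, ttl(created) FROM my_keyspace.table1;
>
> -- re-insert into the new table, carrying the TTL over (in seconds):
> INSERT INTO my_keyspace.table2 (id, created)
> VALUES (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f, '2019-03-12 00:00:00')
> USING TTL 86400;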
>
>
>
> So what would be the best approach?
>
>    1. To be able to select data using a date range without impacting
>    performance. This operation will be needed only on an ad hoc basis and
>    it won't be frequent.
>    2. The best way to migrate a large volume of data with TTLs from one
>    table to another within the same cluster.
>
>
>
> Any other suggestions also will be greatly appreciated.
>
>
>
>
>
>
> Stefan Miklosovic
>
>
>
> Stefan Miklosovic
>

Stefan Miklosovic
