cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Li, George" <guangxing...@pearson.com>
Subject Re: Questions about C* performance related to tombstone
Date Wed, 10 Apr 2019 13:52:32 GMT
Thanks for the information.

George.

On Wed, Apr 10, 2019 at 3:14 AM Alok Dwivedi <alok.dwivedi@instaclustr.com>
wrote:

> Your delete query
>
> "DELETE FROM myTable WHERE course_id = 'C' AND assignment_id = 'A1';”.
>
> will generate multi row range tombstones. Since you are reading entire
> partition which effectively will be read in pages (slice query equivalent)
> you may get tombstones in certain pages depending upon how much deletes you
> are doing. However looking at your use case I don’t think you will end with
> very high ratio of deleted to live data so normal deletes should be fine as
> is already pointed out below. Note that range tombstones are more effective
> storage space wise as they have start/end range rather than deleted info
> for every deleted row. So I also don’t think  your workaround of using
> ‘active’ flag is really needed unless its for auditing. Another thing to
> note is if you have a use case where you want to be more aggressive in
> evicting tombstones then here are some settings worth exploring
> - tombstone_threshold
> - unchecked_tombstone_compaction
> -tombstone_compaction_interval
> Additionally gc_grace_seconds can be looked at but it must be handled very
> carefully as we must ensure that repair completes in an interval less than
> this setting to prevent any deleted data reappearing.
>
> Regards
> Alok
>
>
> On 9 Apr 2019, at 15:56, Jon Haddad <jon@jonhaddad.com> wrote:
>
> Normal deletes are fine.
>
> Sadly there's a lot of hand wringing about tombstones in the generic
> sense which leads people to try to work around *every* case where
> they're used.  This is unnecessary.  A tombstone over a single row
> isn't a problem, especially if you're only fetching that one row back.
> Tombstones can be quite terrible under a few conditions:
>
> 1. When a range tombstone shadows hundreds / thousands / millions of
> rows.  This wasn't even detectable prior to Cassandra 3 unless you
> were either looking for it specifically or were doing CPU profiling:
>
> http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__thelastpickle.com_blog_2018_07_05_undetectable-2Dtombstones-2Din-2Dapache-2Dcassandra.html&d=DwMFaQ&c=0YLnzTkWOdJlub_y7qAx8Q&r=rLBSjDttDAvJ4dNFETxOxW7tkSzdYwi8u8cKGod45fQ&m=P5LBsQs63k7WyPdn_oNSktA-LysOkndY9ur93lYKgwY&s=UWdZkSOzMZSzoOqC7gp1euyRNifD-RO_5fnX4y78k0c&e=>
> 2. When rows were frequently created then deleted, and scanned over.
> This is the queue pattern that we detest so much.
> 3. When they'd be created as a side effect from over writing
> collections.  This is an accident typically.
>
> The 'active' flag is good if you want to be able to go back and look
> at old deleted assignments.  If you don't care about that, use a
> normal delete.
>
> Jon
>
> On Tue, Apr 9, 2019 at 7:00 AM Li, George <guangxing.li@pearson.com>
> wrote:
>
>
> Hi,
>
> I have a table defined like this:
>
> CREATE TABLE myTable (
> course_id text,
> assignment_id text,
> assignment_item_id text,
> data text,
> boolean active,
> PRIMARY KEY (course_id, assignment_id, assignment_item_id)
> );
> i.e. course_id as the partition key and assignment_id, assignment_item_id
> as clustering keys.
>
> After data is populated, some delete queries by course_id and
> assignment_id occurs, e.g. "DELETE FROM myTable WHERE course_id = 'C' AND
> assignment_id = 'A1';". This would create tombstones so query "SELECT *
> FROM myTable WHERE course_id = 'C';" would be affected, right? Would query
> "SELECT * FROM myTable WHERE course_id = 'C' AND assignment_id = 'A2';" be
> affected too?
>
> For query "SELECT * FROM myTable WHERE course_id = 'C';", to workaround
> the tombstone problem, we are thinking about not doing hard deletes,
> instead doing soft deletes. So instead of doing "DELETE FROM myTable WHERE
> course_id = 'C' AND assignment_id = 'A1';", we do "UPDATE myTable SET
> active = false WHERE course_id = 'C' AND assignment_id = 'A1';". Then in
> the application, we do query "SELECT * FROM myTable WHERE course_id = 'C';"
> and filter out records that have "active" equal to "false". I am not really
> sure this would improve performance because C* still has to scan through
> all records with the partition key "C". It is just instead of scanning
> through X records + Y tombstone records with hard deletes that generate
> tombstones, it now scans through X + Y records with soft deletes and no
> tombstones. Am I right?
>
> Thanks.
>
> George
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>
>

Mime
View raw message