cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DuyHai Doan <doanduy...@gmail.com>
Subject Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
Date Fri, 04 Jan 2019 13:20:29 GMT
"The problem is I can't know the combination of set/unset values" --> Just
for this requirement, Achilles has a working solution for many years using
INSERT_NOT_NULL_FIELDS strategy:

https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy

Or you can use the Update API that by design only perform update on not
null fields:
https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity


Behind the scene, for each new combination of INSERT INTO table(x,y,z)
statement, Achilles will check its prepared statement cache and if the
statement does not exist yet, create a new prepared statement and put it
into the cache for later re-use for you

Disclaiment: I'm the creator of Achilles



On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <tomas.bartalos@gmail.com>
wrote:

> Hello,
>
> The problem is I can't know the combination of set/unset values. From my
> perspective every value should be set. The event from Kafka represents the
> complete state of the happening at certain point in time. In my table I
> want to store the latest event so the most recent state of the happening
> (in this table I don't care about the history). Actually I used wrong
> expression since its just the opposite of "incremental update", every event
> carries all data (state) for specific point of time.
>
> The event is represented with nested json structure. Top level elements of
> the json are table fields with type like text, boolean, timestamp, list and
> the nested elements are UDT fields.
>
> Simplified example:
> There is a new purchase for the happening, event:
> {total_amount: 50, items : [A, B, C, new_item], purchase_time :
> '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...}
> I don't know what actually happened for this event, maybe there is a new
> item purchased, maybe some customer info have been changed, maybe the
> specials have been revoked and I have to reset them. I just need to store
> the state as it artived from Kafka, there might already be an event for
> this happening saved before, or maybe this is the first one.
>
> BR,
> Tomas
>
>
> On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <mightye@gmail.com wrote:
>
>> Depending on the use case, creating separate prepared statements for each
>> combination of set / unset values in large INSERT/UPDATE statements may be
>> prohibitive.
>>
>> Instead, you can look into driver level support for UNSET values.
>> Requires Cassandra 2.2 or later IIRC.
>>
>> See:
>> Java Driver:
>> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
>> Python Driver:
>> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
>> Node Driver:
>> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
>>
>> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R <
>> SEAN_R_DURITY@homedepot.com> wrote:
>>
>>> You say the events are incremental updates. I am interpreting this to
>>> mean only some columns are updated. Others should keep their original
>>> values.
>>>
>>> You are correct that inserting null creates a tombstone.
>>>
>>> Can you only insert the columns that actually have new values? Just skip
>>> the columns with no information. (Make the insert generator a bit smarter.)
>>>
>>> Create table happening (id text primary key, event text, a text, b text,
>>> c text);
>>> Insert into table happening (id, event, a, b, c) values
>>> ("MainEvent","The most complete info we have right now","Priceless","10
>>> pm","Grand Ballroom");
>>> -- b changes
>>> Insert into happening (id, b) values ("MainEvent","9:30 pm");
>>>
>>>
>>> Sean Durity
>>>
>>>
>>> -----Original Message-----
>>> From: Tomas Bartalos <tomas.bartalos@gmail.com>
>>> Sent: Thursday, December 27, 2018 9:27 AM
>>> To: user@cassandra.apache.org
>>> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
>>>
>>> Hello,
>>>
>>> I’d start with describing my use case and how I’d like to use Cassandra
>>> to solve my storage needs.
>>> We're processing a stream of events for various happenings. Every event
>>> have a unique happening_id.
>>> One happening may have many events, usually ~ 20-100 events. I’d like to
>>> store only the latest event for the same happening (Event is an incremental
>>> update and it contains all up-to date data about happening).
>>> Technically the events are streamed from Kafka, processed with Spark an
>>> saved to Cassandra.
>>> In Cassandra we use upserts (insert with same primary key).  So far so
>>> good, however there comes the tombstone...
>>>
>>> When I’m inserting field with NULL value, Cassandra creates tombstone
>>> for this field. As I understood this is due to space efficiency, Cassandra
>>> doesn’t have to remember there is a NULL value, she just deletes the
>>> respective column and a delete creates a ... tombstone.
>>> I was hoping there could be an option to tell Cassandra not to be so
>>> space effective and store “unset" info without generating tombstones.
>>> Something similar to inserting empty strings instead of null values:
>>>
>>> CREATE TABLE happening (id text PRIMARY KEY, event text); insert into
>>> happening (‘1’, ‘event1’); — tombstone is generated insert into happening
>>> (‘1’, null); — tombstone is not generated insert into happening (‘1’,
'’);
>>>
>>> Possible solutions:
>>> 1. Disable tombstones with gc_grace_seconds = 0 or set to reasonable low
>>> value (1 hour ?) . Not good, since phantom data may re-appear 2. ignore
>>> NULLs on spark side with “spark.cassandra.output.ignoreNulls=true”. Not
>>> good since this will never overwrite previously inserted event field with
>>> “empty” one.
>>> 3. On inserts with spark, find all NULL values and replace them with
>>> “empty” equivalent (empty string for text, 0 for integer). Very inefficient
>>> and problematic to find “empty” equivalent for some data types.
>>>
>>> Until tombstones appeared Cassandra was the right fit for our use case,
>>> however now I’m not sure if we’re heading the right direction.
>>> Could you please give me some advice how to solve this problem ?
>>>
>>> Thank you,
>>> Tomas
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>
>>>
>>> ________________________________
>>>
>>> The information in this Internet Email is confidential and may be
>>> legally privileged. It is intended solely for the addressee. Access to this
>>> Email by anyone else is unauthorized. If you are not the intended
>>> recipient, any disclosure, copying, distribution or any action taken or
>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>> When addressed to our clients any opinions or advice contained in this
>>> Email are subject to the terms and conditions expressed in any applicable
>>> governing The Home Depot terms of business or client engagement letter. The
>>> Home Depot disclaims all responsibility and liability for the accuracy and
>>> content of this attachment and for any damages or losses arising from any
>>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>>> items of a destructive nature, which may be contained in this attachment
>>> and shall not be liable for direct, indirect, consequential or special
>>> damages in connection with this e-mail message or its attachment.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>
>>

Mime
View raw message