cassandra-commits mailing list archives

From "Alex Petrov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-13004) Corruption while adding a column to a table
Date Wed, 31 May 2017 19:39:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031813#comment-16031813 ]

Alex Petrov edited comment on CASSANDRA-13004 at 5/31/17 7:38 PM:
------------------------------------------------------------------

I've been able to narrow this down a little and wanted to elaborate on why I've decided
to take some extra time and caution on this issue to avoid protocol changes. One concern
is that if we fix the problem in {{SerializationHeader}} rather than in {{ColumnFilter}},
we may still end up with other issues in {{ColumnFilter}}, since it would retain an
inconsistent view of the columns on the replica side. To sum up, I agree that the issue is
related to the initial exchange of columns, but I think the problem is hiding within
{{ColumnFilter}}. Two problems, in fact.

It looks like {{ColumnFilter}} can be created with one set of metadata and afterwards used
with another. Since {{ColumnFilter}} is used for both serialisation and deserialisation,
the column set may change between creating the filter and deserialising the responses from
a replica, at which point {{partitionColumns}} of {{CfMetadata}} will be a different
collection. This has a negative impact on deserialisation: the binary data will be validated
and processed as a different set of columns. In some cases this results in corrupt flags, in
other cases in an EOF, and in the worst case the data actually passes all validations and
the mutation lands in the sstable. When read back, it will contain data from the other
column, while that column in turn will contain less data. To solve this problem, it is
sufficient to cache the columns at the time the column filter is created.
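The race described above can be sketched with hypothetical stand-ins (none of these types are Cassandra's actual classes; {{partitionColumns}} here is just a mutable shared list standing in for the live table metadata):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the bug: a filter that re-reads the mutable table metadata at
// deserialisation time can see a different column set than the one it was
// created (and serialised) with. Snapshotting the columns at construction
// time removes the race.
public class ColumnFilterSketch {
    // Mutable, shared table metadata (stand-in for CfMetadata.partitionColumns)
    static final List<String> partitionColumns =
            new CopyOnWriteArrayList<>(List.of("id", "name"));

    // Buggy variant: resolves the columns lazily from the live metadata.
    static List<String> liveColumns() {
        return partitionColumns;
    }

    // Fixed variant: snapshots the columns when the filter is created.
    static List<String> snapshotColumns() {
        return List.copyOf(partitionColumns);
    }

    public static void main(String[] args) {
        List<String> live = liveColumns();
        List<String> cached = snapshotColumns();

        // A concurrent ALTER TABLE ... ADD lands between request and response.
        partitionColumns.add("application_id");

        // The lazy view now disagrees with what was serialised against.
        System.out.println("live:   " + live.size() + " columns");   // 3
        System.out.println("cached: " + cached.size() + " columns"); // 2
    }
}
```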

The second problem, as Jeff has noted, is hiding in the fact that we cannot rely on the
output of the subset deserialisation in the header [here|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/db/SerializationHeader.java#L401],
because it happens on the replica and, in the {{fetchesAll}} case, it will rely on the
local replica's metadata. However, I would suggest we resolve it a bit differently: by
changing the way we serialise the column filter, namely by not relying on the synchronised
metadata in the fetch-all case and encoding all the columns instead. If this is considered
too big an overhead (I can imagine large column families with many columns could generate
quite some traffic), we can at least use a hash of the concatenated column names.
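The cheaper alternative mentioned above could look roughly like this (a sketch only; the digest algorithm, separator, and method names are assumptions, not an actual Cassandra API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Instead of serialising every column name with a fetch-all filter, ship a
// digest of the concatenated names so coordinator and replica can detect
// that their schemas diverged before interpreting the binary payload.
public class ColumnSetDigest {
    static byte[] digest(List<String> columnNames) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String name : columnNames) {
                md5.update(name.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ["ab","c"] != ["a","bc"]
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] before = digest(List.of("id", "name", "topic"));
        byte[] after  = digest(List.of("id", "name", "topic", "application_id"));
        // A mismatch tells the replica that its local metadata no longer
        // matches the column set the coordinator serialised against.
        System.out.println(java.util.Arrays.equals(before, after)); // false
    }
}
```

A fixed-size digest keeps the on-wire cost constant regardless of how many columns the table has, at the price of only detecting, rather than repairing, the divergence.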

On trunk, problem (1) can be fixed more simply, by passing the command's cf reference
instead of grabbing the base cf reference, which is fetched later.

I have a preliminary patch that addresses (1) and (2) for 3.0 and 3.11, and a separate
patch for trunk, but since the logic of {{ColumnFilter}} gets somewhat out of hand with all
the exclusion cases, I would like to take a bit more time to simplify it. I hope I can
submit it tomorrow.





> Corruption while adding a column to a table
> -------------------------------------------
>
>                 Key: CASSANDRA-13004
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13004
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Stanislav Vishnevskiy
>            Priority: Blocker
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We had the following schema in production. 
> {code:none}
> CREATE TYPE IF NOT EXISTS discord_channels.channel_recipient (
>     nick text
> );
> CREATE TYPE IF NOT EXISTS discord_channels.channel_permission_overwrite (
>     id bigint,
>     type int,
>     allow_ int,
>     deny int
> );
> CREATE TABLE IF NOT EXISTS discord_channels.channels (
>     id bigint,
>     guild_id bigint,
>     type tinyint,
>     name text,
>     topic text,
>     position int,
>     owner_id bigint,
>     icon_hash text,
>     recipients map<bigint, frozen<channel_recipient>>,
>     permission_overwrites map<bigint, frozen<channel_permission_overwrite>>,
>     bitrate int,
>     user_limit int,
>     last_pin_timestamp timestamp,
>     last_message_id bigint,
>     PRIMARY KEY (id)
> );
> {code}
> And then we executed the following alter.
> {code:none}
> ALTER TABLE discord_channels.channels ADD application_id bigint;
> {code}
> And one row (that we can tell) got corrupted at the same time and could no longer be
read from the Python driver. 
> {code:none}
> [E 161206 01:56:58 geventreactor:141] Error decoding response from Cassandra. ver(4);
flags(0000); stream(27); op(8); offset(9); len(887); buffer: '\x84\x00\x00\x1b\x08\x00\x00\x03w\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x0f\x00\x10discord_channels\x00\x08channels\x00\x02id\x00\x02\x00\x0eapplication_id\x00\x02\x00\x07bitrate\x00\t\x00\x08guild_id\x00\x02\x00\ticon_hash\x00\r\x00\x0flast_message_id\x00\x02\x00\x12last_pin_timestamp\x00\x0b\x00\x04name\x00\r\x00\x08owner_id\x00\x02\x00\x15permission_overwrites\x00!\x00\x02\x000\x00\x10discord_channels\x00\x1cchannel_permission_overwrite\x00\x04\x00\x02id\x00\x02\x00\x04type\x00\t\x00\x06allow_\x00\t\x00\x04deny\x00\t\x00\x08position\x00\t\x00\nrecipients\x00!\x00\x02\x000\x00\x10discord_channels\x00\x11channel_recipient\x00\x01\x00\x04nick\x00\r\x00\x05topic\x00\r\x00\x04type\x00\x14\x00\nuser_limit\x00\t\x00\x00\x00\x01\x00\x00\x00\x08\x03\x8a\x19\x8e\xf8\x82\x00\x01\xff\xff\xff\xff\x00\x00\x00\x04\x00\x00\xfa\x00\x00\x00\x00\x08\x00\x00\xfa\x00\x00\xf8G\xc5\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8b\xc0\xb5nB\x00\x02\x00\x00\x00\x08G\xc5\xffI\x98\xc4\xb4(\x00\x00\x00\x03\x8b\xc0\xa8\xff\xff\xff\xff\x00\x00\x01<\x00\x00\x00\x06\x00\x00\x00\x08\x03\x81L\xea\xfc\x82\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x81L\xea\xfc\x82\x00\n\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x00\x04\x00\x00\x08\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1e\xe6\x8b\x80\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1e\xe6\x8b\x80\x00\n\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x040\x07\xf8Q\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1f\x1b{\x82\x00\x00\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1f\x1b{\x82\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x07\xf8Q\x00\x00\x00\x04\x10\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1fH6\x82\x00\x01\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1fH6\x82\x00\x01\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x05\xe8A\x00\x00\x00\x04\x10\x02\x00\x00\x00\x00\x00\x08\x03\x8a+=\xca\xc0\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a+=\xca\xc0\x00\n\x00\x00\x00\x04\x0
0\x00\x00\x00\x00\x00\x00\x04\x00\x00\x08\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x8f\x979\x80\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x8f\x979\x80\x00\n\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00
\x08\x01\x00\x00\x00\x04\xc4\xb4(\x00\xff\xff\xff\xff\x00\x00\x00O[f\x80Q\x07general\x05\xf8G\xc5\xffI\x98\xc4\xb4(\x00\xf8O[f\x80Q\x00\x00\x00\x02\x04\xf8O[f\x80Q\x00\xf8G\xc5\xffI\x98\x01\x00\x00\xf8O[f\x80Q\x00\x00\x00\x00\xf8G\xc5\xffI\x97\xc4\xb4(\x06\x00\xf8O\x7fe\x1fm\x08\x03\x00\x00\x00\x01\x00\x00\x00\x00\x04\x00\x00\x00\x00'
> {code}
> And then in cqlsh when trying to read the row we got this. 
> {code:none}
> /usr/bin/cqlsh.py:632: DateOverFlowWarning: Some timestamps are larger than Python datetime
can represent. Timestamps are displayed in milliseconds from epoch.
> Traceback (most recent call last):
>   File "/usr/bin/cqlsh.py", line 1301, in perform_simple_statement
>     result = future.result()
>   File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.5.0.post0-d8d0456.zip/cassandra-driver-3.5.0.post0-d8d0456/cassandra/cluster.py",
line 3650, in result
>     raise self._final_exception
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 2: invalid start
byte
> {code}
> We tried to read the data and it would refuse to read the name column (the UTF8 error)
and the last_pin_timestamp column had an absurdly large value.
> We ended up rewriting the whole row as we had the data in another place and it fixed
the problem. However there is clearly a race condition in the schema change sub-system.
> Any ideas?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
