cassandra-user mailing list archives

From Sagar Jambhulkar <sagar.jambhul...@gmail.com>
Subject Re: Cassandra Writes Duplicated/Concatenated List Data
Date Sat, 19 Aug 2017 01:46:15 GMT
For the example you provided, are you saying you are getting two rows
for the same pk1, pk2, time?
It may be a problem with your inserts when you are inserting multiple
distinct rows. To validate that all nodes are in sync, try fetching with
CONSISTENCY ALL in cqlsh.
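
The same check can be scripted from Python. Below is a minimal sketch, not a diagnosis: it reads one row at consistency ALL with the DataStax Python driver and flags a list that is the same sequence repeated back to back. The names `is_doubled`, `fetch_at_all` and `my_table` are made up for illustration, and `session` is assumed to be an already-connected driver session.

```python
from typing import Sequence


def is_doubled(values: Sequence[float]) -> bool:
    """Return True if `values` is a non-empty sequence repeated exactly twice."""
    n = len(values)
    if n == 0 or n % 2:
        return False
    half = n // 2
    return list(values[:half]) == list(values[half:])


def fetch_at_all(session, pk1, pk2, time):
    """Read one row at consistency ALL (requires the cassandra-driver package)."""
    # Imported here so is_doubled() stays usable without the driver installed.
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    stmt = SimpleStatement(
        # my_table is a placeholder for the real table name.
        "SELECT pk1, pk2, time, probability FROM my_table "
        "WHERE pk1 = %s AND pk2 = %s AND time = %s",
        consistency_level=ConsistencyLevel.ALL,
    )
    return list(session.execute(stmt, (pk1, pk2, time)))
```

Each returned row could then be passed through `is_doubled(row.probability or [])`. A legitimately symmetric list such as [0.5, 0.5] would also be flagged, so it may help to additionally check whether the list sums to roughly 2.0 instead of 1.0.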

On 18-Aug-2017 9:37 PM, "Nathan McLean" <nmclean@kinsolresearch.com> wrote:

> @Sagar,
>
> A query to get the data looks like this (primary key values included in
> the query).
>
> SELECT * FROM table WHERE pk1='2269202-onstreet_high' AND pk2=2017 AND
> time='2017-07-18 03:15:00+0000';
>
> (In actual practice, the queries in our code would query a range of
> time values.)
>
> @Cristophe
>
> I actually haven't been able to reproduce this problem while testing. Rows
> like the example I gave just seem to show up very occasionally in our
> production data.
>
> On Wed, Aug 16, 2017 at 9:11 PM, Sagar Jambhulkar <
> sagar.jambhulkar@gmail.com> wrote:
>
>> What is your query to fetch rows? Can you share pk1, pk2, time for the
>> sample rows you pasted?
>>
>> On 17-Aug-2017 2:20 AM, "Nathan McLean" <nmclean@kinsolresearch.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I have a Cassandra cluster with a table similar to the following:
>>>
>>> ```
>>> CREATE TABLE table (
>>>     pk1 text,
>>>     pk2 int,
>>>     time timestamp,
>>>     ...
>>>     probability list<double>,
>>>     PRIMARY KEY ((pk1, pk2), time)
>>> ) WITH CLUSTERING ORDER BY (time DESC)
>>> ```
>>>
>>> Python processes write to this table using the DataStax python Cassandra
>>> driver package. I am occasionally seeing rows written to the table where
>>> the "probability" column list is the same list, duplicated and concatenated.
>>>
>>> e.g.
>>>
>>> probability
>>> ---------------
>>> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>>>  3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>>>
>>> The code that writes to Cassandra uses "INSERT" statements and validates
>>> that "probability" lists must always approximately sum to 1.0, so it does
>>> not seem possible that the python code that writes to Cassandra has a bug
>>> which is generating this data. The code may occasionally write to the same
>>> row multiple times.
>>>
>>> It appears that there may be a bug in either Cassandra or the python
>>> driver package which results in this list column being written to and
>>> appended to with the same data.
>>>
>>> Similar invalid data was also generated by a PySpark data migration
>>> script (using the DataStax spark Cassandra connector) that copied this list
>>> data to a new table.
>>>
>>> Here are the versions of libraries we are using:
>>>
>>> Cassandra version 3.6
>>> Spark version 1.6.0-hadoop2.6
>>> Python Cassandra driver 3.7.1
>>> (https://github.com/datastax/python-driver)
>>>
>>> Any help/insight into this problem would be greatly appreciated.
>>>
>>> Regards,
>>>
>>> Nathan
>>>
>>
>
