cassandra-user mailing list archives

From Nathan McLean <nmcl...@kinsolresearch.com>
Subject Re: Cassandra Writes Duplicated/Concatenated List Data
Date Fri, 18 Aug 2017 16:07:49 GMT
@Sagar,

A query to get the data looks like this (primary key values included in the
query).

SELECT * FROM table WHERE pk1='2269202-onstreet_high' AND pk2=2017 AND
time='2017-07-18 03:15:00+0000';

(In actual practice, the queries in our code would query a range of
time values.)

@Cristophe

I actually haven't been able to reproduce this problem while testing. Rows
like the example I gave just seem to show up very occasionally in our
production data.
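
Since I can't trigger the problem on demand, one option is to scan rows that
have already been read and flag the suspicious ones. Below is a minimal
sketch of such a check, not code from our pipeline. The function name
looks_duplicated and the tolerance values are assumptions made for
illustration. It relies only on the fact that a valid "probability" list sums
to approximately 1.0, so a list that sums to approximately k (k >= 2) and
consists of k identical consecutive chunks is almost certainly a concatenated
duplicate.

```python
import math
from typing import Sequence


def looks_duplicated(probability: Sequence[float], tol: float = 1e-3) -> bool:
    """Return True if the list looks like one valid list repeated k >= 2 times.

    Hypothetical helper: a valid list sums to ~1.0, so k concatenated copies
    sum to ~k and split into k identical chunks.
    """
    total = math.fsum(probability)
    copies = round(total)  # nearest whole number of copies
    n = len(probability)

    # A normal row sums to ~1.0, so copies == 1 and it is not flagged.
    if copies < 2:
        return False
    # The sum must be close to a whole number of copies.
    if abs(total - copies) > tol * copies:
        return False
    # The list must split evenly into that many chunks.
    if n % copies != 0:
        return False

    period = n // copies
    # Every element must match the corresponding element of the first chunk.
    return all(
        math.isclose(probability[i], probability[i % period],
                     rel_tol=1e-9, abs_tol=1e-12)
        for i in range(period, n)
    )


# The list from the example row in my first message, before duplication.
base = [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
        1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]

print(looks_duplicated(base))         # a single valid list
print(looks_duplicated(base + base))  # the same list concatenated with itself
```

This only identifies affected rows after the fact. It does not explain why
the duplicate gets written.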

On Wed, Aug 16, 2017 at 9:11 PM, Sagar Jambhulkar <
sagar.jambhulkar@gmail.com> wrote:

> What is your query to fetch rows? Can you share pk1, pk2, time for the sample
> rows you pasted?
>
> On 17-Aug-2017 2:20 AM, "Nathan McLean" <nmclean@kinsolresearch.com>
> wrote:
>
>> Hello All,
>>
>> I have a Cassandra cluster with a table similar to the following:
>>
>> ```
>> CREATE TABLE table (
>>     pk1 text,
>>     pk2 int,
>>     time timestamp,
>>     ...
>>     probability list<double>,
>>     PRIMARY KEY ((pk1, pk2), time)
>> ) WITH CLUSTERING ORDER BY (time DESC)
>> ```
>>
>> Python processes write to this table using the DataStax python Cassandra
>> driver package. I am occasionally seeing rows written to the table where
>> the "probability" column list is the same list, duplicated and concatenated.
>>
>> e.g.
>>
>> probability
>> ---------------
>> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>>  3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>>
>> The code that writes to Cassandra uses "INSERT" statements and validates
>> that "probability" lists must always approximately sum to 1.0, so it does
>> not seem possible that the python code that writes to Cassandra has a bug
>> which is generating this data. The code may occasionally write to the same
>> row multiple times.
>>
>> It appears that there may be a bug in either Cassandra or the python
>> driver package which results in this list column being written to and
>> appended to with the same data.
>>
>> Similar invalid data was also generated by a PySpark data migration
>> script (using the DataStax spark Cassandra connector) that copied this list
>> data to a new table.
>>
>> Here are the versions of libraries we are using:
>>
>> Cassandra version 3.6
>> Spark version 1.6.0-hadoop2.6
>> Python Cassandra driver 3.7.1
>> (https://github.com/datastax/python-driver)
>>
>> Any help/insight into this problem would be greatly appreciated.
>>
>> Regards,
>>
>> Nathan
>>
>
