From user-return-65182-archive-asf-public=cust-asf.ponee.io@cassandra.apache.org Tue Feb 11 15:51:17 2020 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 35B0818063F for ; Tue, 11 Feb 2020 16:51:17 +0100 (CET) Received: (qmail 24779 invoked by uid 500); 11 Feb 2020 15:51:14 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 24762 invoked by uid 99); 11 Feb 2020 15:51:13 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Feb 2020 15:51:13 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 4D8CDC1CFA for ; Tue, 11 Feb 2020 15:51:13 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.689 X-Spam-Level: X-Spam-Status: No, score=-0.689 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, RCVD_IN_DNSWL_LOW=-0.7, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=datastax.com header.b=Xih3agVr; dkim=pass (1024-bit key) header.d=datastax.com header.b=XHs8SMTT Received: from mx1-he-de.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 4Jktp7HTPLmE for ; Tue, 11 Feb 2020 15:51:08 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=148.163.151.30; helo=mx0a-002bad01.pphosted.com; envelope-from=cedrick.lunven@datastax.com; receiver= Received: from mx0a-002bad01.pphosted.com (mx0a-002bad01.pphosted.com [148.163.151.30]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 0B9FF7E133 for ; Tue, 11 Feb 2020 15:51:07 +0000 (UTC) Received: from pps.filterd (m0121912.ppops.net [127.0.0.1]) by mx0a-002bad01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 01BFfroA023499 for ; Tue, 11 Feb 2020 07:51:06 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=datastax.com; h=mime-version : references : in-reply-to : from : date : message-id : subject : to : content-type; s=proofpoint20180122; bh=KZ6xsEi91YSJqiPFY+PGTE/U+K2RoGs3Gi1yvpNIwKU=; b=Xih3agVrEAvZMblCCwSnCdoMz24ooHs9nyS+ovW6G83bT1Z6aW4v51hrMyR+gXSZHEKF v6nWoQ3XawYK1IAYi8cOvaJ1evmvTKtSaVRRAlUxVJ6eK6d0j2ac+Aeb1ijqfnlP8J/0 gpPaY0wwOvaHhasR7pa4ZX5J+vnFz7i0fUAR8zrg2R8W/hughc4IWlXwBhcdCzYHMUIH geC6n9olS/lCg8xlclDlAMVHNW0Vy6bCPPhZrKgpSMFue4p4JE8l5pcfAP2kGs3VKhUz P9s+oLOTFlIf7931R/Y59RGvfMDRRqEFk7hfg1W+kxIjhx0Oz0Xv8JmEB94oUV4Z2Tkb IA== Received: from mail-vk1-f197.google.com (mail-vk1-f197.google.com [209.85.221.197]) by mx0a-002bad01.pphosted.com with ESMTP id 2y3f05ru11-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Tue, 11 Feb 2020 07:51:05 -0800 Received: by mail-vk1-f197.google.com with SMTP id s205so3599985vka.17 for ; Tue, 11 Feb 2020 07:51:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=datastax.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=KZ6xsEi91YSJqiPFY+PGTE/U+K2RoGs3Gi1yvpNIwKU=; b=XHs8SMTTjzFocod9GuqbfsMOlZit8VsRB8UkyzxWPXoc5DopjQiFA20/lb3zkz+4aU UOHc0dW4DK6ZxganRKCEn5D/5wvoWka/as67uu3pzF5oFjR0Rw/J2KL6EG/SGGfl3iYk R3oXBhJgY6Q/ey/xRWxit6QL+CVhZWdE6eL0o= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=KZ6xsEi91YSJqiPFY+PGTE/U+K2RoGs3Gi1yvpNIwKU=; b=of38ez1zLBcUr38lLVNeXrifbfZ6u0SYkockF9t/9nbVM9lSKLbwkkzmsFjRoCB4mE nUTEuP85sbfsbUA3rFqEMcehCPKuysj8qC3f4V0AS4w8suZ4ywOQBDlZvTBeJJU1lSJI fGaQJSsQCSONiY5HfGilo97EYhoHSWJ7CZIOmdDK0Mwk/fSYvMQroRWLxEyQbR9NAXkn njO2BdloTOYggeu/4hi1EEQeHY8H7rt+wG87f8ZXdvFM4KMoQm/axZSSsN85ZBxY0hMG VmUfW1Kv6cKAD0IXsBOOv3MScEot3veNTqg7cUZI4VcmIBzpRQUnu4J0Vry5GrxFHBUq AA3w== X-Gm-Message-State: APjAAAWDMlurwUvRJ0JJTatOANQ1yvtCNWUXJ9qI/z8UA0WJfRjUZjBi Xa1wmWNQ4hOG/EMku+gnPiDFezF1lj1IqlkOaYLeYNKF8O3kwQZteA9bx5wmqxR79CfHJ53bHHB EepqEoU39k+tiDQzkE4Oeo1zuXIYZKAV9XHVW6aHM X-Received: by 2002:a67:f758:: with SMTP id w24mr9748434vso.5.1581436264388; Tue, 11 Feb 2020 07:51:04 -0800 (PST) X-Google-Smtp-Source: APXvYqwYz7/RTAfYtbC7FHFUaRpmay+8cvNxUEBcGpeqcZeg9S4o8zj2ugjDUW60z6VztX+5i7ObBF4R3Ppq5YviRVA= X-Received: by 2002:a67:f758:: with SMTP id w24mr9748413vso.5.1581436263880; Tue, 11 Feb 2020 07:51:03 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Cedrick Lunven Date: Tue, 11 Feb 2020 16:50:52 +0100 Message-ID: Subject: Re: COPY FROM ordering To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary="000000000000bec327059e4ed2ec" X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.138,18.0.572 definitions=2020-02-11_04:2020-02-10,2020-02-11 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 mlxlogscore=999 malwarescore=0 bulkscore=0 phishscore=0 adultscore=0 mlxscore=0 lowpriorityscore=0 impostorscore=0 clxscore=1011 suspectscore=1 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2001150001 definitions=main-2002110114 --000000000000bec327059e4ed2ec Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Bogdan, First - I am probably captain obvious here - but you will lose some data in the process. as *interaction_timestamp *is not part of the primary key anymore you will upsert the new row multiple times, last one win. The CSV rows are loaded in parallel with multiple threads and multiple chunks as such the order is* NOT guaranteed* indeed. I have to go back to older documentation to get the proper description : COPY FROM loads rows from a CSV file in a parallel non-deterministic order. https://docs.datastax.com/en/cql-oss/3.1/cql/cql_reference/copy_r.html You can still play with CHUNKSIZE but this is not want you want. https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshCopy.html Solution here would be to work on in the dump I am afraid (items are ordered in first table and keep only latest *interaction_timestamp)* Cedrick On Tue, Feb 11, 2020 at 1:17 PM Bogdan Gherca wrote: > Hey Cassandra folks, > > I'm trying to change the schema of an existing table by creating a new on= e > and migrating the data. > > The initial table schema looks like this: > > > > > > > *CREATE TABLE IF NOT EXISTS initial_table ( user_id > text, message_id timeuuid, interaction_state > text, interaction_timestamp timestamp, PRIMARY KEY ((user_id), > message_id, interaction_state, interaction_timestamp));* > > We're trying to remove interaction timestamp from the PK - same schema bu= t > with *PRIMARY KEY ((user_id), message_id, interaction_state)* > > When importing the .csv dump obtained from the *initial_table, *the > timestamp column seems to be written in a weird way. Multiple rows from t= he > old schema need to be merged to a single entry of the new schema. For mos= t > cases, it seems the last entry entry gets copied over to the new table > while for others a random one gets copied. Check out the below csv sample > and the copy from result. > > *initial_table_dump.csv* > > *123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 > 17:50:59+0000123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-1= 0 > 00:05:41+0000* > > *copy new_table(user_id, message_id, interaction_state, > interaction_timestamp) from '~/initial_table_dump.csv';* > > *Result:* > > > > *user_id | message_id | interaction_state | > interaction_timestamp---------+--------------------------------------+---= ----------------+-------------------------- 123 > | ed6c69a0-0add-11b2-8080-808080808080 | DISMISSED | 2020-01-03 > 17:50:59+0000* > > Notice the first row from the csv gets written into the new table in this > case - here there are only two rows, but for multiple ones it seems a > random one would be copied over, not the first/last one necessarily. When > updating the interaction_timestamp column value as below, it seems to cop= y > the latest entry to the new table. > > *initial_table_dump_2.csv* > > *123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 > 17:50:59+0000123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-0= 5 > 00:05:41+0000* > > *- perform same copy from operation - * > > *Result:* > > > *user_id | message_id | interaction_state | > interaction_timestamp---------+--------------------------------------+---= ----------------+-------------------------- 123 > | ed6c69a0-0add-11b2-8080-808080808080 | DISMISSED | **2020-01-05 > 00:05:41+0000* > > Could someone help me understand why this might happen? Does the 'copy > from' follow the order from the csv when doing the import or there are no > order guarantees? > > I'm using the below cqlsh and Cassandra versions: > *[cqlsh 5.0.1 | Cassandra 2.2.15 | CQL spec 3.3.1 | Native protocol v4]* > > Thanks, > Bogdan > --=20 C=C3=A9drick Lunven *EMEA Developer Advocate Manager * =F0=9F=8E=93Free Trainings : *DataStax Academy * =E2=9D=93Ask us your questions : *DataStax Community * =F0=9F=94=ACTest our new products : *DataStax Labs * --000000000000bec327059e4ed2ec Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Bogdan,

First - = I am probably captain obvious here - but you will lose=C2=A0some data in th= e process. as=C2=A0=C2=A0interaction_timestamp=C2=A0is not part of the pri= mary key anymore you will upsert the new row multiple times, last one win.<= /div>

The CSV rows are loaded in parallel with mul= tiple threads and multiple chunks as such the order is NOT guaranteed indeed. I have to go back to older documentation to get the proper descri= ption :=C2=A0COPY FROM=C2=A0loads rows from a CSV= file in a parallel non-deterministic order.=C2=A0

You can still play with CHUNKSIZE but t= his is not want you want.

<= /div>
Solution here would be to work on in the dump I am afraid (ite= ms are ordered in first table and keep only latest=C2=A0interaction_timestamp)=

Cedrick

On Tue, Feb 11, 20= 20 at 1:17 PM Bogdan Gherca <bogdan.gherca@gmail.com> wrote:
Hey Cassandra fol= ks,

I'm trying to change the schema of an existing t= able by creating a new one and migrating the data.

The initial table schema looks like this:
CREATE TABLE IF NOT= EXISTS initial_table (
=C2=A0 =C2=A0 user_id =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 text,
=C2=A0 =C2=A0 message_id =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0timeuuid,
=C2=A0 =C2=A0 interac= tion_state =C2=A0 =C2=A0 =C2=A0 text,
=C2=A0 =C2=A0 interaction_timestam= p =C2=A0 timestamp,
=C2=A0 =C2=A0 PRIMARY KEY ((user_id), message_id, in= teraction_state, interaction_timestamp)
);


=
We're trying to remove interaction timestamp from the PK - s= ame schema but with=C2=A0PRIMARY KEY ((user_id), message_id, interaction= _state)

When importing the .csv dump ob= tained from the=C2=A0initial_table,=C2=A0the timestamp column seems = to be written in a weird=C2=A0way. Multiple rows from the old schema need t= o be merged to a single entry of the new schema. For most cases, it seems t= he last entry entry gets copied over to the new table while for others a ra= ndom one gets copied. Check out the below csv sample and the copy from resu= lt.

initial_table_dump.csv
123= ,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-10 00:05:41+000= 0

copy new_table(user_id, message_id, i= nteraction_state, interaction_timestamp) from '~/initial_table_dump.csv= ';

Result:
= user_id | message_id =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | interaction_state | interaction_ti= mestamp
---------+--------------------------------------+---------------= ----+--------------------------
=C2=A0123 | ed6c69a0-0add-11b2-8080-8080= 80808080 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 DISMISSED |=C2=A02020-01-03 17:50= :59+0000

Notice the first row from the= csv gets written into the new table in this case - here there are only two= rows, but for multiple ones it seems a random one would be copied over, no= t the first/last one necessarily. When updating the interaction_timestamp c= olumn value as below, it seems to copy the latest entry to the new table.= =C2=A0

initial_table_dump_2.csv
<= div>123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:= 59+0000
123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-05= 00:05:41+0000

- pe= rform same copy from operation -=C2=A0

=
Result:
user_id | message_id =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | int= eraction_state | interaction_timestamp
---------+-----------------------= ---------------+-------------------+--------------------------
=C2=A0123= | ed6c69a0-0add-11b2-8080-808080808080 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 DISMI= SSED |=C2=A0
2020-01-05 00:05:41+0000

Could someone help me understand why this might happen? D= oes the 'copy from' follow the order from the csv when doing the im= port or there are no order guarantees?

I'm usi= ng the below cqlsh and Cassandra versions:
[cqlsh 5.0.1 | = Cassandra 2.2.15 | CQL spec 3.3.1 | Native protocol v4]
<= br>
Thanks,
Bogdan


--
<= div dir=3D"ltr">



C=C3=A9drick Lunven

EMEA Developer Advocate= Manager

=F0=9F= =8E=93Fre= e Trainings : DataStax Academy

=E2=9D=93Ask us your questions= : DataStax Community


= =F0=9F=94=ACTest our new products : DataStax Labs


= --000000000000bec327059e4ed2ec--