From: java8964 java8964
To: user@cassandra.apache.org
Subject: Questions related to the data in SSTable files
Date: Tue, 22 Oct 2013 17:29:10 -0400

Hi, I have some questions related to the data in the SSTable files.

Our production environment has 36 boxes, so in theory 12 of them will make up one group of data without replication.

Right now, I have all the SSTable files from 12 nodes of the cluster (based on my understanding, these 12 nodes are one replication group; they were NOT picked at random by our Cassandra admin), from one full snapshot and one incremental backup taken after the snapshot, for one column family.

This column family stores time series data only, so there are no update/delete actions in Cassandra, only inserts. But when I use sstable2json to parse all the data out of both the snapshot and the incremental backup, I get the following cases, which I cannot explain. In this column family, we have the following schema structure:

key is the composite key (entity_1_id, entity_2_id)
column is the composite column with name (entity_3_id, entity_4_id, reverse(date as create_on_timestamp)), and JSON data as the value.
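
To make this layout concrete, here is a minimal Python sketch of one flattened record (one column of one row). It is only an illustration under assumptions: the `Cell` type, the `explode` helper, the `{"key": ..., "columns": [[name, value, timestamp], ...]}` row shape, and the ':'-joined rendering of composite keys and column names are all hypothetical, and the real sstable2json output may differ by Cassandra version.

```python
from typing import Iterable, Iterator, NamedTuple


class Cell(NamedTuple):
    """One column of one row, flattened (hypothetical record type)."""
    entity_1_id: str
    entity_2_id: str
    entity_3_id: str
    entity_4_id: str
    created_on: str        # left as text; the real encoding is not known here
    column_timestamp: int  # write timestamp supplied by the client


def explode(rows: Iterable[dict]) -> Iterator[Cell]:
    """Yield one Cell per column, assuming ':'-joined composite strings."""
    for row in rows:
        e1, e2 = row["key"].split(":", 1)
        for column in row["columns"]:
            name, timestamp = column[0], column[2]
            # maxsplit=2 keeps any ':' inside the date part intact
            e3, e4, created_on = name.split(":", 2)
            yield Cell(e1, e2, e3, e4, created_on, int(timestamp))


# Made-up sample row with two columns, which flattens into two records.
rows = [{"key": "a1:b1",
         "columns": [["c1:d1:2012-12-22", "{}", 1381708800000000],
                     ["c2:d2:2013-10-11", "{}", 1381708800000001]]}]
cells = list(explode(rows))
```

Because each `Cell` is a hashable tuple, the flattened records can be counted or put into sets directly when looking for duplicates.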

I use sstable2json to parse all the data out, and also parse the column timestamp in the output, just to understand the data better. I also explode the data, which means that if one row has 10 columns, I flatten them into 10 rows so I can check for duplication. But when I check the output from all 12 nodes, I see the following cases, and I don't know why they happen in the SSTable file data:

1) In the data of the full snapshot, I see more than 10% duplicated data. By duplication I mean that there are event_activities with the same (entity_1_id, entity_2_id, entity_3_id, entity_4_id, created_on_timestamp, column_timestamp). I am surprised to see such a high level of duplication, especially when the column_timestamp is included in the comparison. As I understand it, the column_timestamp is provided by the client when Cassandra stores the column in the row. So if there were a small amount of duplication, I could explain it as an application bug, or as duplication that comes from replication. But more than 10% is too much to explain this way.
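
One way to measure this, and to tell copies that sit on several nodes apart from copies that repeat within a single node, could look like the sketch below. The `duplicate_report` helper and the `(node, cell)` input shape are assumptions for illustration, not part of any Cassandra tool; `cell` stands for the six-field tuple described above.

```python
from collections import Counter, defaultdict


def duplicate_report(records):
    """Summarize duplication over (node, cell) pairs; cell must be hashable."""
    counts = Counter()
    nodes = defaultdict(set)
    for node, cell in records:
        counts[cell] += 1
        nodes[cell].add(node)
    total = sum(counts.values())
    extra = total - len(counts)  # copies beyond the first one of each cell
    return {
        "total": total,
        "distinct": len(counts),
        "duplicate_ratio": extra / total if total else 0.0,
        # cells seen on more than one node
        "cells_on_multiple_nodes": sum(1 for c in counts if len(nodes[c]) > 1),
        # cells seen more than once on the same single node
        "cells_repeated_on_one_node": sum(
            1 for c in counts if counts[c] > 1 and len(nodes[c]) == 1
        ),
    }


# Made-up input: cell "a" is on two nodes, cell "b" repeats on one node.
records = [
    ("node1", ("a", 1)), ("node2", ("a", 1)),
    ("node1", ("b", 2)), ("node1", ("b", 2)),
    ("node3", ("c", 3)),
]
report = duplicate_report(records)
```

If most duplicates show up in `cells_on_multiple_nodes`, the selected nodes probably overlap in the ranges they hold; if they show up in `cells_repeated_on_one_node`, the same cell is present in more than one SSTable file on that node.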

2) The more puzzling output comes when I parse the incremental backup data. In the output, I found a lot of data in the following format:

(entity_1_id, entity_2_id, entity_3_id, entity_4_id, created_on_timestamp as (Dec-22-2012), column_timestamp as (Oct-14-2013)).

The snapshot was taken on Oct 12th, 2013, and the incremental backup was taken on Oct 15th, 2013. So the above records showing up in the incremental backup makes sense based on the column_timestamp, as it falls between these 2 dates. But the event_activity date is too old: the event happened in Dec 2012, which is almost 10 months ago. First, I searched the snapshot output for the above record and could not find this event activity based on the UUIDs given, but I cannot imagine an event that happened 10 months ago being flushed to SSTable files only now. Records of this kind are not few, but quite a lot. Their event activity created_on dates vary from Dec 2012 to Oct 11th, 2013. Why is that? I know from the business side that there is NO update of any existing record in Cassandra. I also checked the JSON output, and there is NO delete-type record, which confirms my understanding that there is no delete action in the Cassandra system. But "no update" is just based on our understanding of the business side.
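
The gap between when an event happened and when its column was written can be computed per record, as in the rough sketch below. It assumes the column timestamp is in microseconds since the Unix epoch, which is a common client convention but is ultimately whatever the client supplied; `write_time`, `late_writes`, and the input tuple shape are hypothetical names for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def write_time(column_timestamp_us):
    """Convert a column timestamp (assumed microseconds since epoch) to UTC."""
    return EPOCH + timedelta(microseconds=column_timestamp_us)


def late_writes(cells, min_lag=timedelta(days=1)):
    """Yield (cell_id, lag) for cells written at least min_lag after the event.

    cells: iterable of (cell_id, created_on, column_timestamp_us), where
    created_on is a timezone-aware datetime.
    """
    for cell_id, created_on, ts_us in cells:
        lag = write_time(ts_us) - created_on
        if lag >= min_lag:
            yield cell_id, lag


# Made-up records: both written on Oct 14, 2013 (UTC), events on different days.
oct_14_2013_us = 1381708800000000
cells = [
    ("old-event", datetime(2012, 12, 22, tzinfo=timezone.utc), oct_14_2013_us),
    ("fresh-event", datetime(2013, 10, 13, 23, tzinfo=timezone.utc), oct_14_2013_us),
]
late = list(late_writes(cells))
```

Grouping the resulting lags (for example by day or by entity) would show whether the old events were written in a few bursts or spread evenly over the backup window.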

I cannot explain why the above 2 cases appear in the data parsed out of the snapshot and backups. One possible reason is that I was given the wrong nodes, so replication makes the duplication count so high. Even so, that still does not explain why case 2 shows up, with so many occurrences. Does anyone have any hint as to what could cause case 2?

Thanks

Yong
