Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of java8964@hotmail.com
 designates 65.55.90.156 as permitted sender)
Message-ID: <SNT149-W7955ED1DA0851A09561191D0020@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_4043d42a-3016-4c2a-9182-d5a043e53e20_"
From: java8964 java8964 <java8964@hotmail.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Questions related to the data in SSTable files
Date: Tue, 22 Oct 2013 17:29:10 -0400
Importance: Normal
MIME-Version: 1.0

--_4043d42a-3016-4c2a-9182-d5a043e53e20_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi, I have some questions about the data in the SSTable files.

Our production environment has 36 boxes, so in theory 12 of them make
up one group holding the data without replication.

Right now, I have all the SSTable files for one column family from 12
nodes of the cluster, taken from one full snapshot and one incremental
backup made after that snapshot. (Based on my understanding, these 12
nodes form one replication group; they were NOT picked at random by our
Cassandra admin.)

This column family stores time series data only, so there are no
Update/Delete actions in Cassandra, only inserts. But when I use
sstable2json to parse out all the data for both the snapshot and the
incremental backup, I get the following cases, which I cannot explain.
In this column family, we have the following schema structure:

key is the composite key (entity_1_id, entity_2_id)
column is the composite column with name (entity_3_id, entity_4_id,
reverse(date as created_on_timestamp)), and JSON data as the value.

I use sstable2json to parse out all the data, and I also include the
column timestamp in the output, just to understand the data better. I
also explode the data, meaning that if one row has 10 columns, I
flatten it into 10 rows, so I can check for duplication. But when I
check the output from all 12 nodes, I see the following cases, and I
don't know why they occur in the SSTable file data:

1) In the data of the full snapshot, I see more than 10% duplicated
data. By duplication I mean that there are event_activities with the
same (entity_1_id, entity_2_id, entity_3_id, entity_4_id,
created_on_timestamp, column_timestamp). I am surprised to see such a
high level of duplication, especially with the column_timestamp
included in the comparison. As I understand it, the column_timestamp is
provided by the client when Cassandra stores the column under the row
key. So if there were only a small amount of duplication, I could
explain it as an application bug, or as duplication coming from
replication. But more than 10% is too much to explain this way.

2) A more puzzling result comes from parsing the incremental backup
data. In the output, I found a lot of data in the following format:

 (entity_1_id, entity_2_id, entity_3_id, entity_4_id,
  created_on_timestamp as (Dec-22-2012),
  column_timestamp as (Oct-14-2013)).

The snapshot was taken on Oct 12th, 2013, and the incremental backup
was taken on Oct 15th, 2013. So it makes sense, based on the
column_timestamp, that the above records show up in the incremental
backup, as it falls between these 2 dates. But the event_activity date
is too old: it means the event happened in Dec 2012, almost 10 months
ago. First, I searched the snapshot output for the above record and
could not find this event activity based on the UUIDs given. But I
cannot imagine why an event that happened 10 months ago would be
flushed to SSTable files only now. Records of this kind are not a small
amount, but quite a lot. Their event activity created_on dates vary
from Dec 2012 to Oct 11th, 2013. Why is that? I know from the business
side that there is NO update for any existing record in Cassandra. I
also checked the JSON output: there is NO delete-type record, which
confirms my understanding that there is no delete action in the
Cassandra system. But "no update" is based only on our understanding of
the business logic.

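As a rough illustration of the flatten-and-count check described above
(a sketch, not the exact script used), the Python below explodes
sstable2json output into one tuple per column and counts repeats. It
assumes the dump is a JSON list of objects with "key" and "columns"
fields, where each column is [name, value, timestamp, ...]; that layout
differs between Cassandra versions, so treat it as an assumption. The
function names are made up for this example.

```python
import json
from collections import Counter


def flatten(sstable_json_text):
    """Yield one (row_key, column_name, column_timestamp) per column.

    Assumed input layout (varies by Cassandra version):
    [{"key": "...", "columns": [[name, value, timestamp, ...], ...]}, ...]
    """
    for row in json.loads(sstable_json_text):
        for column in row["columns"]:
            name, timestamp = column[0], column[2]
            yield (row["key"], name, timestamp)


def count_duplicates(dumps):
    """Return (total columns, columns repeating an already-seen tuple)."""
    seen = Counter()
    for text in dumps:
        seen.update(flatten(text))
    total = sum(seen.values())
    return total, total - len(seen)
```

Running count_duplicates on a single node's dumps first, and then on
all 12 nodes together, separates duplicates inside one node's SSTables
from duplicates that only appear across nodes.
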
I cannot explain why the above 2 cases appear in the data parsed out of
the snapshot and backups. One possible reason is that the wrong nodes
were given to me, so replication makes the duplication count this high.
Even so, that still does not explain why case 2 shows up, with so many
occurrences. Does anyone have any hint as to what could cause case 2?

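To put a number on case 2, a check along the following lines could list
the records whose write time falls after the snapshot while their
created_on date is much older. This is only a sketch: it assumes the
column timestamp is microseconds since the Unix epoch (the usual client
convention, but set by the client), that created_on has already been
parsed into a timezone-aware datetime, and the names late_writes and
min_lag_days are hypothetical.

```python
from datetime import datetime, timezone


def write_time(column_timestamp_us):
    """Convert a column timestamp (assumed microseconds) to UTC."""
    return datetime.fromtimestamp(column_timestamp_us / 1_000_000,
                                  tz=timezone.utc)


def late_writes(records, snapshot_time, min_lag_days=30):
    """Return (created_on, written) pairs written after the snapshot
    whose created_on is at least min_lag_days older than the write.

    records: iterable of (created_on datetime, column_timestamp_us int)
    """
    late = []
    for created_on, timestamp_us in records:
        written = write_time(timestamp_us)
        lag_days = (written - created_on).days
        if written > snapshot_time and lag_days >= min_lag_days:
            late.append((created_on, written))
    return late
```
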
Thanks
Yong

--_4043d42a-3016-4c2a-9182-d5a043e53e20_--