Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of java8964@hotmail.com
 designates 65.54.51.88 as permitted sender)
Message-ID: <SNT149-W6165F8BAB1F5327F609800D0270@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_725f04c5-2f5c-4004-b581-913b3944648e_"
From: java8964 java8964 <java8964@hotmail.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: questions related to the SSTable file
Date: Tue, 17 Sep 2013 09:50:40 -0400
Importance: Normal
In-Reply-To: <CE5DB765.322DC%Dean.Hiller@nrel.gov>
References: 
 <SNT149-W715BC8EA89A3AF7F608BD6D0270@phx.gbl>,<CE5DB765.322DC%Dean.Hiller@nrel.gov>
MIME-Version: 1.0

--_725f04c5-2f5c-4004-b581-913b3944648e_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Thanks, Dean, for the clarification.

But if I write hundreds of megabytes of data for one row in a single put, do you mean Cassandra will put all of it into one SSTable, even though the data is very big? Let's assume the memtable reaches its limit because of this one change. What I want to know is whether two SSTables could be generated in that case and, if so, where the boundary between them would be.

I understand that if later changes apply to the same row key as in the example above, an additional SSTable file could be generated. That is clear to me.

Yong

> From: Dean.Hiller@nrel.gov
> To: user@cassandra.apache.org
> Date: Tue, 17 Sep 2013 07:39:48 -0600
> Subject: Re: questions related to the SSTable file
>
> You have to first understand the rules:
>
>  1.  SSTables are immutable, so Color-1-Data.db will not be modified; it is only deleted once it has been compacted.
>  2.  Memtables are flushed when they reach a limit, so if Blue:{hex} is modified, the change is made in the in-memory memtable, which is eventually flushed.
>  3.  Once flushed, it is an SSTable on disk, and you have two values for "hex", each with its own timestamp, so we know which one is the current value.
>
> When it finally compacts, the old value can go away.
>
> Dean
>
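
To make rule 3 concrete, here is a minimal Python sketch of picking the current value when the same column shows up in more than one SSTable. It is not Cassandra code: `current_value`, the second `hex` value, and the integer timestamps are all made up for illustration.

```python
from typing import Iterable, Tuple

# One version of a column: (value, write timestamp).
Version = Tuple[str, int]


def current_value(versions: Iterable[Version]) -> str:
    """Return the value with the highest timestamp (last write wins)."""
    value, _timestamp = max(versions, key=lambda version: version[1])
    return value


# Blue:hex as first written, then a hypothetical later update that was
# flushed into a newer SSTable. Both copies stay on disk until compaction.
versions_of_blue_hex = [
    ("#0000FF", 1),  # from Color-1-Data.db
    ("#0000EE", 5),  # hypothetical update in a later SSTable
]
latest = current_value(versions_of_blue_hex)
```

Under these assumptions, a read sees only the newer value, while the older copy simply waits for compaction to remove it.
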
> From: java8964 java8964 <java8964@hotmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, September 17, 2013 7:32 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: questions related to the SSTable file
>
> Hi, Takenori:
>
> Thanks for your quick reply. Your explanation makes clear to me what compaction means, and I now understand that the same row key can exist in multiple SSTable files.
>
> Beyond that, I want to know what happens if one row's data is too large to fit in one SSTable file. In your example, the same row exists in multiple SSTable files because it keeps changing and being flushed to disk at runtime. That's fine. In this case, none of the 4 SSTable files contains the whole data of that row, but each one does contain a complete picture of an individual unit (I don't know what I should call this unit, but it will be larger than one column, right?). So, taking your example, there is no way that at any time we could have SSTable files like the following, right?
>
> - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000}}]
> - Color-1-Data_1.db: [{Blue: {hex: FF}}]
> - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> - Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}}, {Blue: {}}]
> - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
>
> I don't see any reason Cassandra would ever do that, but I just want to confirm, because your "No" answer to my question 2 is confusing.
>
> Another question from my original email: I may already have the answer from your example, but I just want to confirm it.
> Using your example, let's say that after the first 2 steps we have:
>
> - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
> - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> An incremental backup is taken at this point. After that, the following changes come in:
>
> - Add a column of (key, column, column_value = Green, hex2, #32CD32)
> - Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
> - Delete a row of (key = Blue)
> ---- memtable is flushed => Color-3-Data.db ----
> Another incremental backup is taken right now.
>
> Now, in this case, my assumption is that only Color-3-Data.db will be in this backup, right? Even though Color-1-Data.db and Color-2-Data.db contain data for the same row keys as Color-3-Data.db, from an incremental backup point of view only Color-3-Data.db will be stored.
>
> The reason I ask these questions is that I am thinking of using MapReduce jobs to parse the incremental backup files and rebuild the snapshot on the Hadoop side. Of course, the column families I am working with hold pure fact data, so there are no deletes or updates in Cassandra for this kind of data, only appends. But it is still important for me to understand the content of the SSTable files.
>
> Thanks
>
> Yong
>
>
> ________________________________
> Date: Tue, 17 Sep 2013 11:12:01 +0900
> From: tsato@cloudian.com
> To: user@cassandra.apache.org
> Subject: Re: questions related to the SSTable file
>
> Hi,
>
> > 1) I expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right?
>
> Yes.
>
> > 2) If so, what is the boundary? Does Cassandra guarantee the column level as the boundary? What I mean is that one column's data is guaranteed to be either in the first file or in the second file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct?
>
> No.
>
> > 3) If we are talking only about the SSTable files in a snapshot or an incremental backup, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case?
> > 4) If I want to use incremental backup SSTable files as a way to catch data being changed, is that a good way to do what I am trying to achieve? In this case, what happens in the following example:
>
> I don't fully understand, but a snapshot will do. It creates hard links to all the SSTable files present at the time of the snapshot.
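
As a rough illustration of why hard links work well here, the following Python sketch links every data file into a snapshot directory and then deletes one of the originals, the way compaction eventually would. It is only a sketch under assumptions: the `snapshot` and `demo` functions, the `snapshots/demo` path, and the file contents are hypothetical and do not reproduce Cassandra's real directory layout.

```python
import os
import tempfile


def snapshot(data_dir, snapshot_dir):
    """Hard-link every *-Data.db file in data_dir into snapshot_dir."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src) and name.endswith("-Data.db"):
            # A hard link adds a second name for the same file; no data is copied.
            os.link(src, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


def demo():
    with tempfile.TemporaryDirectory() as data_dir:
        for i in (1, 2):
            with open(os.path.join(data_dir, f"Color-{i}-Data.db"), "w") as f:
                f.write(f"sstable {i}")
        snap_dir = os.path.join(data_dir, "snapshots", "demo")
        linked = snapshot(data_dir, snap_dir)
        # Simulate compaction removing an obsolete SSTable from the live directory.
        os.remove(os.path.join(data_dir, "Color-1-Data.db"))
        # The snapshot name still refers to the same data.
        with open(os.path.join(snap_dir, "Color-1-Data.db")) as f:
            return linked, f.read()


linked_files, surviving_content = demo()
```

Because SSTables are never modified after they are written, a link taken at snapshot time keeps pointing at exactly the data that existed then.
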
>
>
> Let me explain how SSTables and compaction work.
>
> Suppose we have 4 files being compacted (the last one has just been flushed, which then triggered the compaction). Note that the file names are simplified.
>
> - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
> - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> - Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}}, {Blue: {}}]
> - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
>
> They are created by the following operations.
>
> - Add a row of (key, column, column_value = Blue, hex, #0000FF)
> - Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
> ---- memtable is flushed => Color-1-Data.db ----
> - Add a row of (key, column, column_value = Green, hex, #008000)
> - Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
> ---- memtable is flushed => Color-2-Data.db ----
> - Add a column of (key, column, column_value = Green, hex2, #32CD32)
> - Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
> - Delete a row of (key = Blue)
> ---- memtable is flushed => Color-3-Data.db ----
> - Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
> - Add a row of (key, column, column_value = Gold, hex, #FFD700)
> ---- memtable is flushed => Color-4-Data.db ----
>
> Then, a compaction will merge all those fragments together into the latest ones, as follows.
>
> - Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00FFFF}}, {Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
> * assuming RandomPartitioner is used
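
The flush-and-compact sequence above can be replayed with a small Python simulation. This is a simplified model, not Cassandra's implementation: the `put`, `delete_row`, and `compact` helpers and the integer logical clock are hypothetical, and tombstones are dropped immediately, as in the simplified example, whereas a real cluster keeps them until `gc_grace_seconds` has passed. The rows also come out in insertion order rather than in partitioner token order.

```python
from itertools import count

clock = count(1)  # hypothetical logical clock standing in for write timestamps


def put(memtable, key, column, value):
    """Record a column write with its timestamp in the in-memory table."""
    row = memtable.setdefault(key, {"deleted_at": 0, "cells": {}})
    row["cells"][column] = (value, next(clock))


def delete_row(memtable, key):
    """Record a row tombstone; earlier cells for this key in the memtable go away."""
    memtable[key] = {"deleted_at": next(clock), "cells": {}}


def compact(sstables):
    """Merge row fragments: newest cell per column wins, deleted data is dropped."""
    merged = {}
    for sstable in sstables:
        for key, row in sstable.items():
            out = merged.setdefault(key, {"deleted_at": 0, "cells": {}})
            out["deleted_at"] = max(out["deleted_at"], row["deleted_at"])
            for column, (value, ts) in row["cells"].items():
                if column not in out["cells"] or ts > out["cells"][column][1]:
                    out["cells"][column] = (value, ts)
    result = {}
    for key, row in merged.items():
        # Keep only cells written after the newest row tombstone.
        live = {c: v for c, (v, ts) in row["cells"].items() if ts > row["deleted_at"]}
        if live:
            result[key] = live
    return result


def build_example():
    """Replay the operations above; each dict plays the role of one flushed SSTable."""
    color_1, color_2, color_3, color_4 = {}, {}, {}, {}
    put(color_1, "Blue", "hex", "#0000FF")
    put(color_1, "Lavender", "hex", "#E6E6FA")
    put(color_2, "Green", "hex", "#008000")
    put(color_2, "Blue", "hex2", "#2c86ff")
    put(color_3, "Green", "hex2", "#32CD32")
    put(color_3, "Aqua", "hex", "#00FFFF")
    delete_row(color_3, "Blue")
    put(color_4, "Magenta", "hex", "#FF00FF")
    put(color_4, "Gold", "hex", "#FFD700")
    return [color_1, color_2, color_3, color_4]


color_5 = compact(build_example())
```

In this model each flushed file holds whole columns of a row, and the merged result has no Blue row because the tombstone in the third file is newer than both Blue columns.
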
>
> Hope this helps.
>
> - Takenori
>
> (2013/09/17 10:51), java8964 java8964 wrote:
> Hi, I have some questions about SSTables in Cassandra, as I am working on a project that uses it, and I hope someone on this list can share some thoughts.
>
> My understanding is that SSTables are per column family, but each column family can have multiple SSTable files. At runtime, one row COULD be split across more than one SSTable file; this is not good for performance, but it does happen, and Cassandra will try to merge one row's data into a single SSTable file during compaction.
>
> The question is: when one row is split across multiple SSTable files, what is the boundary? Or let me ask it this way: if one row exists in 2 SSTable files, and I run the sstable2json tool on each SSTable file individually:
>
> 1) I expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right?
> 2) If so, what is the boundary? Does Cassandra guarantee the column level as the boundary? What I mean is that one column's data is guaranteed to be either in the first file or in the second file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct?
> 3) If we are talking only about the SSTable files in a snapshot or an incremental backup, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case?
> 4) If I want to use incremental backup SSTable files as a way to catch data being changed, is that a good way to do what I am trying to achieve? In this case, what happens in the following example:
>
> For column family A:
> At Time 0, one row key (key1) has some data. It is stored and backed up in SSTable file 1.
> At Time 1, if any column for key1 changes (a new column is inserted, a column is updated or deleted, or even the whole row is deleted), I would expect this whole row to exist in any incremental backup SSTable files taken after Time 1, right?
>
> What happens if the above row happens to be stored in more than one SSTable file?
> At Time 0, one row key (key1) has some data, which is stored in SSTable file 1 and file 2 and backed up.
> At Time 1, if one column is added to row key1, and the change in fact happens only in SSTable file 2, and we then take an incremental backup, which SSTable files should I expect in this backup? Both SSTable files? Or just SSTable file 2?
>
> I was thinking that incremental backup SSTable files are a good candidate for catching changed data, but the fact that one row's data can exist in multiple SSTable files makes things complex. Does anyone have experience using SSTable files in this way? What are the lessons learned?
>
> Thanks
>
> Yong
>

--_725f04c5-2f5c-4004-b581-913b3944648e_--