incubator-cassandra-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject RE: questions related to the SSTable file
Date Tue, 17 Sep 2013 13:42:42 GMT
Hi, Dean:
Can you explain a little more about what you mean?
If I change the example a little bit:
Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
Now suppose we add a new Green column and update the Blue column, and that data is flushed to another
SSTable file:
Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex: #2c86ff}}]
So you mean that at this point I could have 2 SSTable files that both contain column "Blue" for the
same row key, right? In that case I should be fine, since the value of the "Blue" column carries
a timestamp that tells me which change is the latest, right? In the MR world, each file
COULD be processed by a different Mapper, but both records will be sent to the same reducer because
they share the same key.
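
To make the reducer idea concrete, here is a minimal sketch in Python. It assumes each record has already been parsed into a (row_key, column, value, timestamp) tuple; that layout and the timestamps 100 and 200 are made up for illustration and are not the exact sstable2json output.

```python
from collections import defaultdict

# Hypothetical records parsed from the two SSTable dumps:
# (row_key, column, value, timestamp). Timestamps are invented.
sstable_1 = [("Lavender", "hex", "#E6E6FA", 100), ("Blue", "hex", "#0000FF", 100)]
sstable_2 = [("Green", "hex", "#008000", 200), ("Blue", "hex", "#2c86ff", 200)]

def reduce_latest(*fragments):
    """Group records by row key and keep the newest value per column."""
    rows = defaultdict(dict)
    for fragment in fragments:
        for row_key, column, value, ts in fragment:
            current = rows[row_key].get(column)
            # Last write wins: a higher timestamp replaces the older value.
            if current is None or ts > current[1]:
                rows[row_key][column] = (value, ts)
    # Drop the timestamps from the final view.
    return {k: {c: v for c, (v, _) in cols.items()} for k, cols in rows.items()}

merged = reduce_latest(sstable_1, sstable_2)
```

Because the comparison uses only the timestamp, the result does not depend on which file is read first.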
Yong

> From: Dean.Hiller@nrel.gov
> To: user@cassandra.apache.org
> Date: Tue, 17 Sep 2013 06:32:03 -0600
> Subject: Re: questions related to the SSTable file
> 
> You may want to be careful: when column 1 has been changed, it can be stored in both files until compaction. Cassandra returns the latest version of column 1, but two sstables contain column 1. (At least that is the way I understand it.)
> 
> Later,
> Dean
> 
> From: "Takenori Sato (Cloudian)" <tsato@cloudian.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 16, 2013 8:12 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: questions related to the SSTable file
> 
> Hi,
> 
> > 1) I expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right?
> 
> Yes.
> 
> > 2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary? What I mean is that the data for one column is guaranteed to be either in the first file or in the 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct?
> 
> No.
> 
> > 3) If we are talking only about the SSTable files in a snapshot or incremental backup, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case?
> > 4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I am trying to achieve? In this case, what happens in the following example:
> 
> I don't fully understand the question, but a snapshot will do. It creates hard links to all the SSTable files present at the time of the snapshot.
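
The effect of hard-linking can be tried outside Cassandra with a minimal Python sketch. The directory layout and file contents here are hypothetical stand-ins, not Cassandra's real on-disk layout; the point is only that a hard link keeps the data reachable after the original name is removed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical stand-in for a live SSTable file.
    live = os.path.join(data_dir, "Color-1-Data.db")
    with open(live, "w") as f:
        f.write("sstable contents")

    # "Snapshot": a hard link in another directory, no data is copied.
    snap_dir = os.path.join(data_dir, "snapshots")
    os.mkdir(snap_dir)
    snap = os.path.join(snap_dir, "Color-1-Data.db")
    os.link(live, snap)
    same_inode = os.stat(live).st_ino == os.stat(snap).st_ino

    # Removing the live name (as happens to compacted files)
    # leaves the snapshot link readable.
    os.remove(live)
    with open(snap) as f:
        snapshot_contents = f.read()
```

This assumes a filesystem that supports hard links and that both paths are on the same volume.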
> 
> 
> Let me explain how SSTables and compaction work.
> 
> Suppose we have 4 files being compacted (the last one has just been flushed, which then triggered the compaction). Note that file names are simplified.
> 
> - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
> - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> - Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}}, {Blue: {}}]
> - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
> 
> They are created by the following operations.
> 
> - Add a row of (key, column, column_value = Blue, hex, #0000FF)
> - Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
> ---- memtable is flushed => Color-1-Data.db ----
> - Add a row of (key, column, column_value = Green, hex, #008000)
> - Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
> ---- memtable is flushed => Color-2-Data.db ----
> - Add a column of (key, column, column_value = Green, hex2, #32CD32)
> - Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
> - Delete a row of (key = Blue)
> ---- memtable is flushed => Color-3-Data.db ----
> - Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
> - Add a row of (key, column, column_value = Gold, hex, #FFD700)
> ---- memtable is flushed => Color-4-Data.db ----
> 
> Then, a compaction will merge all those fragments together into the latest version of each row, as follows.
> 
> - Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00FFFF}}, {Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
> * assuming RandomPartitioner is used
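
The merge described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: each SSTable is a dict of row fragments, the timestamps 1 to 9 are invented to follow the order of the operations, and a deleted row is dropped immediately (real Cassandra keeps tombstones for a grace period). Row ordering by partitioner token is not modeled.

```python
def compact(sstables):
    """Merge row fragments: newest column wins, row deletions hide older columns."""
    merged = {}
    for table in sstables:
        for key, frag in table.items():
            row = merged.setdefault(key, {"columns": {}, "deleted_at": None})
            for name, (value, ts) in frag.get("columns", {}).items():
                old = row["columns"].get(name)
                if old is None or ts > old[1]:
                    row["columns"][name] = (value, ts)
            deleted = frag.get("deleted_at")
            if deleted is not None and (row["deleted_at"] is None or deleted > row["deleted_at"]):
                row["deleted_at"] = deleted
    result = {}
    for key, row in merged.items():
        # Keep only columns written after the row-level deletion, if any.
        live = {name: value for name, (value, ts) in row["columns"].items()
                if row["deleted_at"] is None or ts > row["deleted_at"]}
        if live:
            result[key] = live
    return result

# Hypothetical timestamps 1..9 follow the order of the operations in the example.
color_1 = {"Blue": {"columns": {"hex": ("#0000FF", 1)}},
           "Lavender": {"columns": {"hex": ("#E6E6FA", 2)}}}
color_2 = {"Green": {"columns": {"hex": ("#008000", 3)}},
           "Blue": {"columns": {"hex2": ("#2c86ff", 4)}}}
color_3 = {"Green": {"columns": {"hex2": ("#32CD32", 5)}},
           "Aqua": {"columns": {"hex": ("#00FFFF", 6)}},
           "Blue": {"deleted_at": 7}}
color_4 = {"Magenta": {"columns": {"hex": ("#FF00FF", 8)}},
           "Gold": {"columns": {"hex": ("#FFD700", 9)}}}

color_5 = compact([color_1, color_2, color_3, color_4])
```

In this model Blue disappears because its row deletion is newer than both of its columns, and Green ends up with both hex and hex2, matching Color-5-Data.db above.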
> 
> Hope this helps.
> 
> - Takenori
> 
> (2013/09/17 10:51), java8964 java8964 wrote:
> Hi, I have some questions related to SSTables in Cassandra, as I am doing a project that uses them, and I hope someone on this list can share some thoughts.
> 
> My understanding is that SSTables are per column family, but each column family can have multiple SSTable files. At runtime, one row COULD be split across more than one SSTable file. This is not good for performance, but it does happen, and Cassandra will try to merge and store one row's data in one SSTable file during compaction.
> 
> The question is: when one row is split across multiple SSTable files, what is the boundary? Or let me ask it this way: if one row exists in 2 SSTable files, and I run the sstable2json tool on both SSTable files individually:
> 
> 1) I expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right?
> 2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary? What I mean is that the data for one column is guaranteed to be either in the first file or in the 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct?
> 3) If we are talking only about the SSTable files in a snapshot or incremental backup, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case?
> 4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I am trying to achieve? In this case, what happens in the following example:
> 
> For column family A:
> at Time 0, one row key (key1) has some data. It is stored and backed up in SSTable file 1.
> at Time 1, if any column for key1 has changed (a new column inserted, a column updated/deleted, or even the whole row deleted), I would expect this whole row to exist in any incremental backup SSTable files taken after time 1, right?
> 
> What happens if the above row happens to be stored in more than one SSTable file?
> at Time 0, one row key (key1) has some data, which is stored in SSTable file1 and file2 and backed up.
> at Time 1, if one column is added to row key1, and the change in fact happens in SSTable file2 only, and we do an incremental backup after that, which SSTable files should I expect in this backup? Both SSTable files? Or just SSTable file 2?
> 
> I was thinking incremental backup SSTable files would be a good candidate for catching data being changed, but the fact that one row's data can exist in multiple SSTable files makes things complex now. Does anyone have experience using SSTable files in this way? What are the lessons?
> 
> Thanks
> 
> Yong
> 
 		 	   		  