incubator-cassandra-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: questions related to the SSTable file
Date Tue, 17 Sep 2013 13:54:52 GMT
java8964, are you basically asking what happens if we put a large amount of
data into one column of one row at once? Will that blob of data, representing
one column of one row (i.e. one cell), be split across multiple SSTables? Or
will it always end up in one extra-large SSTable in such cases? I am also
interested in knowing the answer.

Regards,
Shahab


On Tue, Sep 17, 2013 at 9:50 AM, java8964 java8964 <java8964@hotmail.com> wrote:

> Thanks, Dean, for the clarification.
>
> But if I put hundreds of megabytes of data for one row through one put, do
> you mean Cassandra will put all of it into one SSTable, even though the data
> is very big? Let's assume the memtable in memory reaches its limit because
> of this change.
> What I want to know is whether 2 SSTables could be generated in the above
> case and, if so, what the boundary is.
>
> I understand that if later changes apply to the same row key as in the
> above example, an additional SSTable file could be generated. That is clear
> to me.
>
> Yong
>
> > From: Dean.Hiller@nrel.gov
> > To: user@cassandra.apache.org
> > Date: Tue, 17 Sep 2013 07:39:48 -0600
> > Subject: Re: questions related to the SSTable file
> >
> > You first have to understand these rules:
> >
> > 1. SSTables are immutable, so Color-1-Data.db will not be modified, only
> > deleted once compacted.
> > 2. Memtables are flushed when they reach a limit, so if Blue:{hex} is
> > modified, the change is made in the in-memory memtable, which is
> > eventually flushed.
> > 3. Once flushed, it is an SSTable on disk, and you have two values for
> > "hex", each with its own timestamp, so we know which one is the current
> > value.
> >
> > When it finally compacts, the old value can go away.
> >
> > Dean
> >
> > From: java8964 java8964 <java8964@hotmail.com>
> > Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> > Date: Tuesday, September 17, 2013 7:32 AM
> > To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> > Subject: RE: questions related to the SSTable file
> >
> > Hi, Takenori:
> >
> > Thanks for your quick reply. Your explanation makes it clear to me what
> > compaction means, and I now also understand that the same row key can
> > exist in multiple SSTable files.
> >
> > But beyond that, I want to know what happens if one row's data is too
> > large to fit in one SSTable file. In your example, the same row exists in
> > multiple SSTable files because it keeps changing and being flushed to disk
> > at runtime. That's fine. In this case, none of the 4 SSTable files
> > contains the whole data of that row, but each one does contain a complete
> > individual unit (I don't know what I should call this unit, but it will be
> > larger than one column, right?). In your example, there is no way that we
> > could ever have SSTable files like the following, right?
> >
> > - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000}}]
> > - Color-1-Data_1.db: [{Blue: {hex:FF}}]
> > - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> > - Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}},
> {Blue: {}}]
> > - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
> >
> > I don't see any reason Cassandra would ever do that, but I just want to
> > confirm, as your 'no' answer to my question 2 is confusing.
> >
> > Another question from my original email: I may already have the answer
> > from your example, but I just want to confirm it.
> > Using your example, let's say that after the first 2 steps we have:
> >
> > - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
> > - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> > There is an incremental backup. After that, the following changes come
> > in:
> >
> > - Add a column of (key, column, column_value = Green, hex2, #32CD32)
> > - Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
> > - Delete a row of (key = Blue)
> > ---- memtable is flushed => Color-3-Data.db ----
> > Another incremental backup is taken right now.
> >
> > Now in this case, my assumption is that only Color-3-Data.db will be in
> > this backup, right? Even though Color-1-Data.db and Color-2-Data.db
> > contain data for the same row keys as Color-3-Data.db, from an incremental
> > backup point of view only Color-3-Data.db will be stored.
> >
> > The reason I ask these questions is that I am thinking of using MapReduce
> > jobs to parse the incremental backup files and rebuild the snapshot on the
> > Hadoop side. Of course, the column families I am working with are pure
> > fact data, so there are no deletes/updates in Cassandra for this kind of
> > data, just appends. But it is still important for me to understand the
> > SSTable file's content.
> >
> > Thanks
> >
> > Yong
> >
> >
> > ________________________________
> > Date: Tue, 17 Sep 2013 11:12:01 +0900
> > From: tsato@cloudian.com
> > To: user@cassandra.apache.org
> > Subject: Re: questions related to the SSTable file
> >
> > Hi,
> >
> > > 1) I would expect the same row key to show up in both sstable2json
> > > outputs, as this one row exists in both SSTable files, right?
> >
> > Yes.
> >
> > > 2) If so, what is the boundary? Will Cassandra guarantee the column
> > > level as the boundary? What I mean is that one column's data will be
> > > guaranteed to be either in the first file or in the 2nd file, right?
> > > There is no chance that Cassandra will cut the data of one column into
> > > 2 parts, with one part stored in the first SSTable file and the other
> > > part stored in the second SSTable file. Is my understanding correct?
> >
> > No.
> >
> > > 3) If we are talking only about the SSTable files in snapshots and
> > > incremental backups, excluding the runtime SSTable files, will anything
> > > change? For snapshot or incremental backup SSTable files, can one row's
> > > data still exist in more than one SSTable file? And does the boundary
> > > change in this case?
> > > 4) If I want to use incremental backup SSTable files as the way to
> > > catch changed data, is that a good way to do what I am trying to
> > > achieve? In this case, what happens in the following example:
> >
> > I don't fully understand, but a snapshot will do. It will create hard
> > links to all the SSTable files present at the time of the snapshot.
> >
> >
> > Let me explain how SSTables and compaction work.
> >
> > Suppose we have 4 files being compacted (the last one has just been
> > flushed, which then triggered the compaction). Note that file names are
> > simplified.
> >
> > - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
> > - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
> > - Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}},
> {Blue: {}}]
> > - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
> >
> > They are created by the following operations.
> >
> > - Add a row of (key, column, column_value = Blue, hex, #0000FF)
> > - Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
> > ---- memtable is flushed => Color-1-Data.db ----
> > - Add a row of (key, column, column_value = Green, hex, #008000)
> > - Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
> > ---- memtable is flushed => Color-2-Data.db ----
> > - Add a column of (key, column, column_value = Green, hex2, #32CD32)
> > - Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
> > - Delete a row of (key = Blue)
> > ---- memtable is flushed => Color-3-Data.db ----
> > - Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
> > - Add a row of (key, column, column_value = Gold, hex, #FFD700)
> > ---- memtable is flushed => Color-4-Data.db ----
> >
> > Then, a compaction will merge all those fragments together into the
> latest ones as follows.
> >
> > - Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00FFFF}},
> > {Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, {Gold:
> > {hex: #FFD700}}]
> > * assuming RandomPartitioner is used
> >
> > Hope this helps.
> >
> > - Takenori
> >
> > (2013/09/17 10:51), java8964 java8964 wrote:
> > Hi, I have some questions related to SSTables in Cassandra, as I am
> > working on a project that uses it, and I hope someone on this list can
> > share some thoughts.
> >
> > My understanding is that SSTables are per column family, but each column
> > family can have multiple SSTable files. At runtime, one row COULD be split
> > across more than one SSTable file. Even though this is not good for
> > performance, it does happen, and Cassandra will try to merge and store one
> > row's data in one SSTable file during compaction.
> >
> > The question is: when one row is split across multiple SSTable files, what
> > is the boundary? Or let me ask it this way: if one row exists in 2 SSTable
> > files, and I run the sstable2json tool on both SSTable files individually:
> >
> > 1) I would expect the same row key to show up in both sstable2json
> > outputs, as this one row exists in both SSTable files, right?
> > 2) If so, what is the boundary? Will Cassandra guarantee the column level
> > as the boundary? What I mean is that one column's data will be guaranteed
> > to be either in the first file or in the 2nd file, right? There is no
> > chance that Cassandra will cut the data of one column into 2 parts, with
> > one part stored in the first SSTable file and the other part stored in the
> > second SSTable file. Is my understanding correct?
> > 3) If we are talking only about the SSTable files in snapshots and
> > incremental backups, excluding the runtime SSTable files, will anything
> > change? For snapshot or incremental backup SSTable files, can one row's
> > data still exist in more than one SSTable file? And does the boundary
> > change in this case?
> > 4) If I want to use incremental backup SSTable files as the way to catch
> > changed data, is that a good way to do what I am trying to achieve? In
> > this case, what happens in the following example:
> >
> > For column family A:
> > at Time 0, one row key (key1) has some data. It is stored and backed up
> > in SSTable file 1.
> > at Time 1, if any column for key1 changes (a new column inserted, a column
> > updated/deleted, or even the whole row deleted), I would expect this whole
> > row to exist in any incremental backup SSTable files taken after time 1,
> > right?
> >
> > What happens if the above row just happens to be stored in more than one
> > SSTable file?
> > at Time 0, one row key (key1) has some data, which happens to be stored in
> > SSTable file1 and file2, and is backed up.
> > at Time 1, if one column is added to row key1, and the change in fact
> > happens in SSTable file2 only, and we then do an incremental backup, which
> > SSTable files should I expect in this backup? Both SSTable files? Or just
> > SSTable file 2?
> >
> > I was thinking incremental backup SSTable files are a good candidate for
> > catching changed data, but the fact that one row's data can exist in
> > multiple SSTable files makes things complex now. Does anyone have any
> > experience using SSTable files in this way? What are the lessons?
> >
> > Thanks
> >
> > Yong
> >
>
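
The flush and compaction walkthrough quoted above (Takenori's Color example,
plus Dean's point that timestamps decide which value is current) can be
sketched as a toy model. This is not Cassandra code: the names Row, flush and
compact are made up for illustration, integer timestamps stand in for write
times, partitioner ordering is ignored, and the sketch assumes the row
tombstone for Blue is already eligible to be purged when compaction runs.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    # column name -> (value, write timestamp)
    columns: dict = field(default_factory=dict)
    # timestamp of a row-level delete (tombstone), -1 if there is none
    deleted_at: int = -1


def flush(ops):
    """Build one immutable 'SSTable' (a dict of rows) from memtable operations.

    Each op is (timestamp, kind, row_key, column, value), in timestamp order.
    """
    sstable = {}
    for ts, kind, key, col, val in ops:
        row = sstable.setdefault(key, Row())
        if kind == "put":
            row.columns[col] = (val, ts)
        else:  # "delete_row": keep only a tombstone for this row
            row.columns.clear()
            row.deleted_at = ts
    return sstable


def compact(sstables):
    """Merge row fragments; the newest timestamp wins, deleted data is dropped."""
    merged = {}
    for sstable in sstables:
        for key, row in sstable.items():
            out = merged.setdefault(key, Row())
            out.deleted_at = max(out.deleted_at, row.deleted_at)
            for col, (val, ts) in row.columns.items():
                if col not in out.columns or ts > out.columns[col][1]:
                    out.columns[col] = (val, ts)
    result = {}
    for key, row in merged.items():
        # a column survives only if it was written after the row tombstone
        live = {c: v for c, (v, ts) in row.columns.items() if ts > row.deleted_at}
        if live:
            result[key] = live
    return result


# The four flushes from the quoted example
color1 = flush([(1, "put", "Blue", "hex", "#0000FF"),
                (2, "put", "Lavender", "hex", "#E6E6FA")])
color2 = flush([(3, "put", "Green", "hex", "#008000"),
                (4, "put", "Blue", "hex2", "#2c86ff")])
color3 = flush([(5, "put", "Green", "hex2", "#32CD32"),
                (6, "put", "Aqua", "hex", "#00FFFF"),
                (7, "delete_row", "Blue", None, None)])
color4 = flush([(8, "put", "Magenta", "hex", "#FF00FF"),
                (9, "put", "Gold", "hex", "#FFD700")])

color5 = compact([color1, color2, color3, color4])
print(color5)
```

In this model a row can be spread over several files, but each column value
is always written whole into exactly one of them, and color3 on its own holds
only the changes made since the previous flush.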
