Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 655B1D943 for ; Mon, 17 Dec 2012 01:41:50 +0000 (UTC) Received: (qmail 61676 invoked by uid 500); 17 Dec 2012 01:41:47 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 61649 invoked by uid 500); 17 Dec 2012 01:41:47 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 61640 invoked by uid 99); 17 Dec 2012 01:41:47 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 Dec 2012 01:41:47 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [208.113.200.5] (HELO homiemail-a93.g.dreamhost.com) (208.113.200.5) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 Dec 2012 01:41:38 +0000 Received: from homiemail-a93.g.dreamhost.com (localhost [127.0.0.1]) by homiemail-a93.g.dreamhost.com (Postfix) with ESMTP id 3CD3F84065 for ; Sun, 16 Dec 2012 17:41:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=thelastpickle.com; h=from :content-type:message-id:mime-version:subject:date:references:to :in-reply-to; s=thelastpickle.com; bh=NTYjDalPpDsodyY1WwkRDd3T6g E=; b=zB+KE2gcTwxPYuUiYvGUsq+EXk++hws3Za0OQTwOIkhHeAeUx//OayAjPK 5jAg53TTceuAcx0f6gS26hJRphgXG0Z7j2kkWHJcT3hQOBeDeIVDNkrgF6RvcHiQ iJ+0BkT1QlABuQ8Ic3p18yL6gvfMIeTYFXXsJYUU/Er3G13OU= Received: from [192.168.2.13] (unknown [116.90.132.105]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: aaron@thelastpickle.com) by homiemail-a93.g.dreamhost.com (Postfix) with ESMTPSA id 797518405B for ; Sun, 16 Dec 2012 17:41:15 -0800 (PST) From: aaron morton Content-Type: multipart/alternative; boundary="Apple-Mail=_C74F1ED7-A8A7-45BF-918A-795CF34F5E0A" Message-Id: Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Read operations resulting in a write? Date: Mon, 17 Dec 2012 14:41:13 +1300 References: <1355529494.21459.YahooMailNeo@web164602.mail.gq1.yahoo.com> To: Cassandra User In-Reply-To: <1355529494.21459.YahooMailNeo@web164602.mail.gq1.yahoo.com> X-Mailer: Apple Mail (2.1499) X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail=_C74F1ED7-A8A7-45BF-918A-795CF34F5E0A Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 > 1) Am I reading things correctly? Yes.=20 If you do a read/slice by name and more than min compaction level nodes = where read the data is re-written so that the next read uses fewer = SSTables. > 2) What is really happening here? Essentially minor compactions can = occur between 4 and 32 memtable flushes. Looking through the code, this = seems to only effect a couple types of select statements (when selecting = a specific column on a specific key being one of them). During the time = between these two values, every "select" statement will perform a write. Yup, only for readying a row where the column names are specified. Remember minor compaction when using SizedTiered Compaction (the = default) works on buckets of the same size.=20 Imagine a row that had been around for a while and had fragments in more = than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd = tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable = reads. If this row is read it will get hoisted back up.=20 But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st = tier it will not hoisted.=20 There are a few short circuits in the SliceByName read path. One of them = is to end the search when we know that no other SSTables contain columns = that should be considered. So if the 4 columns you read frequently are = hoisted into the 1st bucket your reads will get handled by that one = bucket.=20 It's not every select. Just those that touched more the min compaction = sstables.=20 > 3) Is this desired behavior? Is there something else I should be = looking at that could be causing this behavior? Yes. https://issues.apache.org/jira/browse/CASSANDRA-2503 Cheers =20 ----------------- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/12/2012, at 12:58 PM, Michael Theroux wrote: > Hello, >=20 > We have an unusual situation that I believe I've reproduced, at least = temporarily, in a test environment. I also think I see where this issue = is occurring in the code. >=20 > We have a specific column family that is under heavy read and write = load on a nightly basis. For the purposes of this description, I'll = refer to this column family as "Bob". During this nightly processing, = sometimes Bob is under very write load, other times it is very heavy = read load. >=20 > The application is such that when something is written to Bob, a write = is made to one of two other tables. We've witnessed a situation where = the write count on Bob far outstrips the write count on either of the = other tables, by a factor of 3->10. This is based on the WriteCount = available on the column family JMX MBean. We have not been able to find = where in our code this is happening, and we have gone as far as tracing = our CQL calls to determine that the relationship between Bob and the = other tables are what we expect. >=20 > I brought up a test node to experiment, and see a situation where, = when a "select" statement is executed, a write will occur. >=20 > In my test, I perform the following (switching between nodetool and = cqlsh): >=20 > update bob set 'about'=3D'coworker' where key=3D''; =20 > nodetool flush > update bob set 'about'=3D'coworker' where key=3D''; =20 > nodetool flush > update bob set 'about'=3D'coworker' where key=3D''; =20 > nodetool flush > update bob set 'about'=3D'coworker' where key=3D''; =20 > nodetool flush > update bob set 'about'=3D'coworker' where key=3D''; =20 > nodetool flush >=20 > Then, for a period of time (before a minor compaction occurs), a = select statement that selects specific columns will cause writes to = occur in the write count of the column family: >=20 > select about,changed,data from bob where key=3D''; >=20 > This situation will continue until a minor compaction is completed. >=20 > I went into the code and added some traces to = CollationController.java: >=20 > private ColumnFamily collectTimeOrderedData() > { > logger.debug("collectTimeOrderedData"); >=20 > ... ... >=20 > ---> HERE logger.debug( "tables iterated: " + sstablesIterated + " = Min compact: " + cfs.getMinimumCompactionThreshold() ); > // "hoist up" the requested data into a more recent = sstable > if (sstablesIterated > cfs.getMinimumCompactionThreshold() > && !cfs.isCompactionDisabled() > && cfs.getCompactionStrategy() instanceof = SizeTieredCompactionStrategy) > { > RowMutation rm =3D new RowMutation(cfs.table.name, new = Row(filter.key, returnCF.cloneMe())); > try > { > ---> HERE logger.debug( "Apply hoisted up row mutation" = );=09 > // skipping commitlog and index updates is fine = since we're just de-fragmenting existing data > Table.open(rm.getTable()).apply(rm, false, false); > } > catch (IOException e) > { > // log and allow the result to be returned > logger.error("Error re-writing read results", e); > } > }=20 > ... ... >=20 > Performing the steps above, I see the following traces (in the test = environment I decreased the minimum compaction threshold to make this = easier to reproduce). After I do a couple of update/flush, I see this = in the log: >=20 > DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java = (line 117) Scheduling a background task check for bob with = SizeTieredCompactionStrategy >=20 > Then, until compaction occurs, I see (when performing a select): >=20 > DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java = (line 86) Disseminating load info ... > DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line = 1227) execute_cql_query > DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line = 445) CQL statement type: SELECT > DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) = Command/ConsistencyLevel is SliceByNamesReadCommand(table=3D'open', = key=3D804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, = columnParent=3D'QueryPath(columnFamilyName=3D'bob', = superColumnName=3D'null', columnName=3D'null')', = columns=3D[about,changed,data,])/ONE > DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) = Blockfor is 1; setting up requests to /10.0.4.20 > DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) = reading data locally > DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line = 813) LocalReadRunnable reading SliceByNamesReadCommand(table=3D'open', = key=3D804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, = columnParent=3D'QueryPath(columnFamilyName=3D'bob', = superColumnName=3D'null', columnName=3D'null')', = columns=3D[about,changed,data,]) > DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java = (line 68) In get top level columns: class = org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: = class org.apache.cassandra.db.marshal.BytesType > DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java = (line 84) collectTimeOrderedData > ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 = CollationController.java (line 188) tables iterated: 4 Min compact: 2 > ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 = CollationController.java (line 198) Apply hoisted up row mutation > DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) = applying mutation of row = 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342 >=20 > The above traces will occur every time I repeat the above select = statement. >=20 > Minor compaction doesn't start until a few minutes after the request = was submitted above (note, this is an unloaded test node): >=20 > DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 = IntervalNode.java (line 45) Creating IntervalNode from = [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded1006957= 3003e6dbb1ce... >=20 > Once minor compaction occurs, the behavior around write count being = incremented stops, until more than the minimum compaction threshold = memtables are flush to disk. >=20 > So, my questions are: >=20 > 1) Am I reading things correctly? >=20 > 2) What is really happening here? Essentially minor compactions can = occur between 4 and 32 memtable flushes. Looking through the code, this = seems to only effect a couple types of select statements (when selecting = a specific column on a specific key being one of them). During the time = between these two values, every "select" statement will perform a write. >=20 > 3) Is this desired behavior? Is there something else I should be = looking at that could be causing this behavior? >=20 > We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy. =20 > Any help is appreciated, > Thanks, > -Mike >=20 >=20 > =20 --Apple-Mail=_C74F1ED7-A8A7-45BF-918A-795CF34F5E0A Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=iso-8859-1
1) Am I reading things = correctly?
Yes. 
If you do a read/slice = by name and more than min compaction level nodes where read the data is = re-written so that the next read uses fewer = SSTables.

2) What is really happening here?  Essentially minor = compactions can occur between 4 and 32 memtable flushes.  Looking = through the code, this seems to only effect a couple types of select = statements (when selecting a specific column on a specific key being one = of them). During the time between these two values, every "select" = statement will perform a write.
Yup, only = for readying a row where the column names are = specified.
Remember minor compaction when using SizedTiered = Compaction (the default) works on buckets of the same = size. 

Imagine a row that had been around = for a while and had fragments in more than Min Compaction Threshold = sstables. Say it is 3 SSTables in the 2nd tier and 2 sstables in the = 1st. So it takes (potentially) 5 SSTable reads. If this row is read it = will get hoisted back up. 

But the row has = is in only 1 SSTable in the 2nd tier and 2 in the 1st tier it will not = hoisted. 

There are a few short circuits = in the SliceByName read path. One of them is to end the search when we = know that no other SSTables contain columns that should be considered. = So if the 4 columns you read frequently are hoisted into the 1st bucket = your reads will get handled by that one = bucket. 

It's not every select. Just those = that touched more the min compaction = sstables. 


3) Is this desired behavior? =  Is there something else I should be looking at that could be = causing this behavior?
Yes.

Chee= rs

   
http://www.thelastpickle.com

On 15/12/2012, at 12:58 PM, Michael Theroux <mtheroux2@yahoo.com> = wrote:

We have an unusual = situation that I believe I've reproduced, at least temporarily, in a = test environment.  I also think I see where this issue is occurring = in the code.

We have a specific = column family that is under heavy read and write load on a nightly = basis.   For the purposes of this description, I'll refer to this column family as "Bob".  During this nightly = processing, sometimes Bob is under very write load, other times it is = very heavy read load.

The application is = such that when something is written to Bob, a write is made to one of = two other tables.  We've witnessed a situation where the write = count on Bob far outstrips the write count on either of the other = tables, by a factor of 3->10.  This is based on the WriteCount = available on the column family JMX MBean.  We have not been able to = find where in our code this is happening, and we have gone as far as = tracing our CQL calls to determine that the relationship between Bob and the other tables are = what we expect.

I brought up a test = node to experiment, and see a situation where, when a "select" statement = is executed, a write will occur.

In my = test, I perform the following (switching between nodetool and cqlsh):

update bob set = 'about'=3D'coworker' where key=3D'<hex key>';   =  
nodetool flush
update bob set = 'about'=3D'coworker' where key=3D'<hex key>';   =  
nodetool flush
update bob set 'about'=3D'coworker' where key=3D'<hex = key>';    
nodetool flush
update bob set = 'about'=3D'coworker' where key=3D'<hex key>';   =  
nodetool flush
update bob set 'about'=3D'coworker' where key=3D'<hex = key>';    
nodetool = flush

Then, for a period of time (before a = minor compaction occurs), a select statement that selects specific = columns will cause writes to occur in the write count of the column = family:

select about,changed,data from bob = where key=3D'<hex key>';

This situation will continue until a minor compaction is = completed.

I went into the code and added some traces to = CollationController.java:

  private ColumnFamily =
collectTimeOrderedData()
    {
        logger.debug("collectTimeOrderedData");

      ... <snip> ...

            // =
"hoist up" the requested data into a more recent sstable
            if (sstablesIterated > =
cfs.getMinimumCompactionThreshold()
                && !cfs.isCompactionDisabled()
                && cfs.getCompactionStrategy() instanceof =
SizeTieredCompactionStrategy)
            {
                RowMutation rm =3D new RowMutation(cfs.table.name, new =
Row(filter.key, returnCF.cloneMe()));
                try
                {
---> HERE       =
        logger.debug( "Apply hoisted up row mutation" );	
                    // skipping commitlog and index updates =
is fine since we're just de-fragmenting existing data
                    Table.open(rm.getTable()).apply(rm, false, false);
                }
                catch (IOException e)
                {
                    // log and allow the result to be returned
                    logger.error("Error re-writing read results", e);
                }
            } 
... <snip> = ...

Performing the steps above, I = see the following traces (in the test environment I decreased the = minimum compaction threshold to make this easier to reproduce). After I = do a couple of update/flush, I see this in the = log:

DEBUG [FlushWriter:7] = 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a = background task check for bob with SizeTieredCompactionStrategy

Then, until = compaction occurs, I see (when performing a = select):

DEBUG [ScheduledTasks:1] = 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating = load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 = CassandraServer.java (line 1227) execute_cql_query
DEBUG = [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL = statement type: SELECT
DEBUG [Thrift:12] 2012-12-14 = 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is = SliceByNamesReadCommand(table=3D'open', = key=3D804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, = columnParent=3D'QueryPath(columnFamilyName=3D'bob', = superColumnName=3D'null', columnName=3D'null')', columns=3D[about,changed,data,])/ONE
DEBUG [Thrift:12] = 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; = setting up requests to /10.0.4.20
DEBUG [Thrift:12] 2012-12-14 = 22:55:16,992 StorageProxy.java (line 669) reading data = locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 = StorageProxy.java (line 813) LocalReadRunnable reading = SliceByNamesReadCommand(table=3D'open', = key=3D804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, = columnParent=3D'QueryPath(columnFamilyName=3D'bob', = superColumnName=3D'null', columnName=3D'null')', = columns=3D[about,changed,data,])
DEBUG [ReadStage:61] = 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top = level columns: class org.apache.cassandra.db.filter.NamesQueryFilter = type: Standard valid: class = org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] = 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 = CollationController.java (line 188) tables iterated: 4 Min compact: = 2
----> DEBUG [ReadStage:61] 2012-12-14 = 22:55:17,192 CollationController.java (line 198) Apply hoisted up row = mutation
DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 = Table.java (line 395) applying mutation of row = 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342

The above traces will occur every time I repeat the = above select statement.

Minor compaction = doesn't start until a few minutes after the request was submitted above = (note, this is an unloaded test node):

DEBUG = [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line = 45) Creating IntervalNode from = [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded1006957= 3003e6dbb1ce...

Once minor compaction = occurs, the behavior around write count being incremented stops, until = more than the minimum compaction threshold memtables are flush to = disk.

So, my questions = are:

1) Am I reading things = correctly?

2) What is really happening here? =  Essentially minor compactions can occur between 4 and 32 memtable = flushes.  Looking through the code, this seems to only effect a = couple types of select statements (when selecting a specific column on a = specific key being one of them). During the time between these two = values, every "select" statement will perform a = write.

3) Is this desired behavior?  Is = there something else I should be looking at that could be causing this = behavior?

We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy. =  
Any help is = appreciated,
Thanks,
-Mike