incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Theroux <>
Subject Read operations resulting in a write?
Date Fri, 14 Dec 2012 23:58:14 GMT

We have an unusual situation that I believe I've reproduced, at least temporarily, in a test
environment.  I also think I see where this issue is occurring in the code.

We have a specific column family that is under heavy read and write load on a nightly basis.
  For the purposes of this description, I'll refer to this column family as "Bob".  During
this nightly processing, sometimes Bob is under very write load, other times it is very heavy
read load.

The application is such that when something is written to Bob, a write is made to one of two
other tables.  We've witnessed a situation where the write count on Bob far outstrips the
write count on either of the other tables, by a factor of 3->10.  This is based on the
WriteCount available on the column family JMX MBean.  We have not been able to find where
in our code this is happening, and we have gone as far as tracing our CQL calls to determine
that the relationship between Bob and the other tables are what we expect.

I brought up a test node to experiment, and see a situation where, when a "select" statement
is executed, a write will occur.

In my test, I perform the following (switching between nodetool and cqlsh):

update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush

Then, for a period of time (before a minor compaction occurs), a select statement that selects
specific columns will cause writes to occur in the write count of the column family:

select about,changed,data from bob where key='<hex key>';

This situation will continue until a minor compaction is completed.

I went into the code and added some traces to

private ColumnFamily collectTimeOrderedData() { logger.debug("collectTimeOrderedData"); ...
<snip> ... 

---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " +
cfs.getMinimumCompactionThreshold() );
// "hoist up" the requested data into a more recent sstable if (sstablesIterated > cfs.getMinimumCompactionThreshold()
&& !cfs.isCompactionDisabled() && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
{ RowMutation rm = new RowMutation(, new Row(filter.key, returnCF.cloneMe()));
try {
---> HERE  logger.debug( "Apply hoisted up row mutation" );
// skipping commitlog and index updates is fine since we're just de-fragmenting existing data, false, false); } catch (IOException e) { // log and allow
the result to be returned logger.error("Error re-writing read results", e); } } 
... <snip> ...

Performing the steps above, I see the following traces (in the test environment I decreased
the minimum compaction threshold to make this easier to reproduce).  After I do a couple of
update/flush, I see this in the log:

DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 (line 117) Scheduling
a background task check for bob with SizeTieredCompactionStrategy

Then, until compaction occurs, I see (when performing a select):

DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 (line 86) Disseminating
load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 (line 1227) execute_cql_query
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 (line 445) CQL statement type:
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 (line 653) Command/ConsistencyLevel
is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')',
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 (line 79) Blockfor is 1; setting
up requests to /
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 (line 669) reading data locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 (line 813) LocalReadRunnable
reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')',
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 (line 68) In get top
level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid:
class org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 (line 84) collectTimeOrderedData
---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 (line 188) tables
iterated: 4 Min compact: 2

----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 (line 198)
Apply hoisted up row mutation
DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 (line 395) applying mutation of row

The above traces will occur every time I repeat the above select statement.

Minor compaction doesn't start until a few minutes after the request was submitted above (note,
this is an unloaded test node):

DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 (line 45) Creating
IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...

Once minor compaction occurs, the behavior around write count being incremented stops, until
more than the minimum compaction threshold memtables are flush to disk.

So, my questions are:

1) Am I reading things correctly?

2) What is really happening here?  Essentially minor compactions can occur between 4 and
32 memtable flushes.  Looking through the code, this seems to only effect a couple types
of select statements (when selecting a specific column on a specific key being one of them).
During the time between these two values, every "select" statement will perform a write.

3) Is this desired behavior?  Is there something else I should be looking at that could be
causing this behavior?

We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
Any help is appreciated,
View raw message