cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-2660) BRAF.sync() bug can cause massive commit log write magnification
Date Tue, 17 May 2011 12:23:47 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Peter Schuller updated CASSANDRA-2660:
--------------------------------------

    Attachment: CASSANDRA-2660-075.txt

> BRAF.sync() bug can cause massive commit log write magnification
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2660
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>         Attachments: CASSANDRA-2660-075.txt
>
>
> This was discovered, fixed and tested on 0.7.5. Cursory examination shows it should still
be an issue on trunk/0.8. If people otherwise agree with the patch I can rebase if necessary.
> Problem:
> BRAF.flush() is actually broken in the sense that it cannot be called without close co-operation
with the caller. rebuffer() does the co-op by adjusting bufferOffset and validateBufferBytes
appropriately, by sync() doesn't. This means sync() is broken, and sync() is used by the commit
log.
> The attached patch moves the bufferOffset/validateBufferBytes handling out into resetBuffer()
and has both flush() and rebuffer() call that. This makes sync() safe.
> What happened was that for batched commit log mode, every time sync() was called the
data buffered so far would get written to the OS and fsync():ed. But until rebuffer() is called
for other reasons as part of the write path, all subsequent sync():s would result in the very
same data (plus whatever was written since last time) being re-written and fsync():ed again.
So first you write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until
at some point you trigger a rebuffer() and it starts all over again.
> The result is that you see *a lot* more writes to the commit log than are in fact written
to the BRAF. And these writes translate into actual real writes to the underlying storage
device due to fsync(). We had crazy numbers where we saw spikes upwards of 80 mb/second where
the actual throughput was more like ~ 1 mb second of data to the commit log.
> (One can make a possibly weak argument that it is also functionally incorrect as I can
imagine implementations where re-writing the same blocks does copy-on-write in such a way
that you're not necessarily guaranteed to see before-or-after data, particularly in case of
partial page writes. However that's probably not a practical issue.)
> Worthy of noting is that this probably causes added difficulties in fsync() latencies
since the average fsync() will contain a lot more data. Depending on I/O scheduler and underlying
device characteristics, the extra writes *may* not have a detrimental effect, but I think
it's pretty easy to point to cases where it will be detrimental - in particular if the commit
log is on a non-battery backed drive. Even with a nice battery backed RAID with the commit
log on, the size of the writes probably contributes to difficulty in making the write requests
propagate down without being starved by reads (but this is speculation, not tested, other
than that I've observed commit log writer starvation that seemed excessive).
> This isn't the first subtle BRAF bug. What are people's thoughts on creating separate
abstractions for streaming I/O that can perhaps be a lot more simple, and use BRAF only for
random reads in response to live traffic? (Not as part of this JIRA, just asking in general.)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message