couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: About possibly reverting COUCHDB-767
Date Sun, 07 Nov 2010 20:49:59 GMT
On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:

> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>> 
>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>> 
>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>> 1' command) that when running a writes-only test with relaximation
>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>> this patch, there's continuous (every 1 second) write activity.
>>>> 
>>>> I'm confused by this statement. You must be talking about relaximation runs
>>>> with delayed_commits = true, right?  Why do you think you see larger intervals
>>>> between write activity with the optimization from COUCHDB-767?  Have you measured
>>>> the time it takes to open the extra FD?  In my tests that was a sub-millisecond
>>>> operation, but maybe you've uncovered something else.
>>> 
>>> No, it happens for tests with delayed_commits = false. The only
>>> possible explanation I can see is that the variance might be related to
>>> the Erlang VM scheduler's decisions about when to start/run that process.
>>> Nevertheless, I don't know the exact cause, but the fsync frequency
>>> varies a lot.
>> 
>> I think it's worth investigating.  I couldn't reproduce it on my plain-old
>> spinning disk MacBook with 200 writers in relaximation; the IOPS reported by
>> iostat stayed very uniform.
>> 
>>>>> To avoid having readers blocked by fsync calls (and write calls),
>>>>> I would propose using a separate couch_file process just for read
>>>>> operations. I have a branch on my GitHub for this (with COUCHDB-767
>>>>> reverted). It needs to be polished, but the relaximation tests are
>>>>> very positive: both reads and writes get better response times and
>>>>> throughput:
>>>>> 
>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
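
As a rough, self-contained sketch of that split (illustrative only, not the code in
the branch above): two independent file-serving processes are opened over the same
database file, one used only for reads and one only for writes, so an fsync issued
through the write side never queues up in front of a reader's pread. The module name
and API below are made up for illustration.

-module(split_file_sketch).
-behaviour(gen_server).

%% A tiny couch_file-like server opened in either read or write mode; a
%% database would start one of each over the same path.
-export([start_link/2, pread/3, append/2, sync/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(Path, Mode) when Mode =:= read; Mode =:= write ->
    gen_server:start_link(?MODULE, {Path, Mode}, []).

pread(Pid, Pos, Len) -> gen_server:call(Pid, {pread, Pos, Len}, infinity).
append(Pid, IoData)  -> gen_server:call(Pid, {append, IoData}, infinity).
sync(Pid)            -> gen_server:call(Pid, sync, infinity).

init({Path, read}) ->
    %% Read-only handle: this process never fsyncs, so readers only ever
    %% queue behind other (cheap) preads.
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    {ok, Fd};
init({Path, write}) ->
    %% Write handle: all appends and fsyncs funnel through this process.
    {ok, Fd} = file:open(Path, [append, raw, binary]),
    {ok, Fd}.

handle_call({pread, Pos, Len}, _From, Fd) -> {reply, file:pread(Fd, Pos, Len), Fd};
handle_call({append, IoData}, _From, Fd)  -> {reply, file:write(Fd, IoData), Fd};
handle_call(sync, _From, Fd)              -> {reply, file:sync(Fd), Fd}.

handle_cast(_Msg, Fd)            -> {noreply, Fd}.
handle_info(_Msg, Fd)            -> {noreply, Fd}.
terminate(_Reason, Fd)           -> file:close(Fd).
code_change(_OldVsn, Fd, _Extra) -> {ok, Fd}.

A database would then hold one pid of each: readers call pread/3 on the read-side
process while the updater appends and syncs through the write-side one, which is
roughly the shape of the two-couch_file approach above.
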
>>>> 
>>>> I'd like to propose an alternative optimization, which is to keep a dedicated
>>>> file descriptor open in the couch_db_updater process and use that file descriptor
>>>> for _all_ IO initiated by the db_updater.  The advantage is that the db_updater
>>>> does not need to do any message passing for disk IO, and thus does not slow down
>>>> when the incoming message queue is large.  A message queue much much larger than
>>>> the number of concurrent writers can occur if a user writes with batch=ok, and it
>>>> can also happen rather easily in a BigCouch cluster.
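
A minimal sketch of the dedicated-descriptor idea (again illustrative only, the
module and function names are invented): the updater opens its own raw fd once and
does its IO inline, so none of its reads, appends, or fsyncs involve a gen_server
call, and therefore none of them require a selective receive over its own backlog.

-module(updater_fd_sketch).
-export([open/1, pread/3, append/2, commit/1]).

%% Meant to be called from inside the updater process itself. In 'raw' mode
%% only the opening process can use the descriptor, which is exactly the
%% point: there is no intermediate io-server process and no message passing
%% for the updater's own IO.
open(Path) ->
    file:open(Path, [read, append, raw, binary]).

%% In-process read, e.g. of a btree node the updater is about to modify.
pread(Fd, Pos, Len) ->
    file:pread(Fd, Pos, Len).

%% In-process append; nothing goes through the mailbox.
append(Fd, IoData) ->
    file:write(Fd, IoData).

%% Flush on commit (the delayed_commits = false case).
commit(Fd) ->
    file:sync(Fd).

Other processes would presumably keep going through the shared couch_file as
before; only IO initiated by the updater itself would bypass the message passing.
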
>>> 
>>> I don't see how that will improve things, since all write operations
>>> will still be done in a serialized manner. Since only couch_db_updater
>>> writes to the DB file, and since access to the couch_db_updater is
>>> serialized, it seems to me that your solution only avoids one level
>>> of indirection (the couch_file process). I don't see how, when using a
>>> couch_file only for writes, you get the message queue for that
>>> couch_file process full of write messages.
>> 
>> It's the db_updater which gets a large message queue, not the couch_file.  The
>> db_updater ends up with a big backlog of update_docs messages that get in the way
>> when it needs to make gen_server calls to the couch_file process for IO.  It's a
>> significant problem in R13B, probably less so in R14B because of some cool
>> optimizations by the OTP team.
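
A tiny standalone illustration of that cost (a hypothetical module, not CouchDB
code): fill a mailbox with dummy update_docs-style messages, then time a single
reply-style selective receive. On R13 the receive has to walk the whole backlog;
on R14 the compiler's new reference optimization should let it skip every message
queued before the reference was created.

-module(mailbox_cost).
-export([demo/1]).

%% demo(200000) floods this process's own mailbox, then measures one
%% selective receive for a tagged reply, which is roughly the shape of
%% gen_server:call/2 waiting for its response.
demo(Backlog) ->
    Self = self(),
    [Self ! {update_docs, N} || N <- lists:seq(1, Backlog)],
    Ref = make_ref(),
    Self ! {Ref, reply},                  %% the one message we actually want
    T0 = os:timestamp(),
    receive
        {Ref, reply} -> ok                %% selective receive over the backlog
    end,
    Micros = timer:now_diff(os:timestamp(), T0),
    drain(Backlog),
    Micros.

%% Clean the dummy messages back out of the mailbox.
drain(0) -> ok;
drain(N) -> receive {update_docs, _} -> drain(N - 1) end.

gen_server:call builds exactly this pattern internally (a monitor reference plus a
receive that matches on it), which is why the R14 compiler change helps it.
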
> 
> So, let me see if I get it. The couch_db_updater process is slow to
> pick up the results of its calls to the couch_file process because its
> mailbox is full of update_docs messages?

Correct.  Each call to the couch_file requires a selective receive on the part of the db_updater
in order to get the response, and prior to R14 that selective receive needed to match against
every message in the mailbox.  It's really a bigger problem in couch_server, which uses a
gen_server call to increment a reference counter before handing the #db{} to the client, since
every request to any DB has to talk to couch_server first.

Best,

Adam