couchdb-dev mailing list archives

From Randall Leeds <>
Subject Re: About possibly reverting COUCHDB-767
Date Mon, 08 Nov 2010 20:04:45 GMT
Whoops. Hit send too early, but I think I got everything in there that
I wanted to say.

As for the ref counter bottleneck, I just pushed a branch that uses a
public ets table for the ref_counter. I think I managed to linearize
the updates over the {total, RefCtr} keys in the ets table such that
there should be no race conditions, but please, please take a look at
this if you have time.
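
For anyone who doesn't want to read the whole branch yet, the core of
the idea is roughly this (module and function names here are my
shorthand for the email, not necessarily what's in the branch):

    -module(refctr_ets_sketch).
    -export([init/0, incr/1, decr/1]).

    %% One public ets table shared by every couch_ref_counter process.
    init() ->
        ets:new(couch_ref_counters, [set, public, named_table]).

    %% ets:update_counter/3 is atomic, so concurrent updates to the
    %% same {total, RefCtr} key are serialized by the VM and can't race.
    incr(RefCtr) ->
        Key = {total, RefCtr},
        %% insert_new/2 is also atomic; it's a no-op if the key exists.
        ets:insert_new(couch_ref_counters, {Key, 0}),
        ets:update_counter(couch_ref_counters, Key, 1).

    decr(RefCtr) ->
        ets:update_counter(couch_ref_counters, {total, RefCtr}, -1).

The point is that clients bump the counter with a single atomic ets
operation instead of a gen_server call, so the hot path never touches a
process mailbox.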

It seems to pass the ref_counter tests, but I still need to handle
giving away ownership of the ets table. Right now I use couch_server
as the heir so that I can use only one ets table for all
couch_ref_counter processes, but couch_server just crashes if it
actually receives the 'ETS-TRANSFER' message. If I can't find an easy
way to hand the table to another couch_ref_counter whenever the owner
exits, I may just break the encapsulation of the module a bit by
leaving couch_server as the owner and ignoring that message.
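
Concretely, the heir wiring I'm describing looks something like this
(again a sketch with illustrative names, not the branch verbatim):

    -module(refctr_heir_sketch).
    -export([new_table/0, handle_info/2]).

    -record(state, {refctr_tab}).  %% stand-in for couch_server's state

    %% Create the shared table with couch_server as heir; the atom
    %% refctr_table tags the eventual 'ETS-TRANSFER' message.
    new_table() ->
        ets:new(couch_ref_counters,
                [set, public, named_table,
                 {heir, whereis(couch_server), refctr_table}]).

    %% The clause couch_server needs so the transfer doesn't crash it.
    %% From here we could keep ownership, or pick a surviving
    %% couch_ref_counter and hand the table over with ets:give_away/3.
    handle_info({'ETS-TRANSFER', Tab, _DyingOwner, refctr_table}, State) ->
        {noreply, State#state{refctr_tab = Tab}}.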

Thanks, guys. My gut says we're going to get some nice numbers when
all this is done.


On Mon, Nov 8, 2010 at 11:56, Randall Leeds <> wrote:
> Thanks to both of you for getting this conversation going again and
> for the work on the patch and testing, Filipe.
> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <> wrote:
>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <> wrote:
>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <> wrote:
>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool
>>>>>>> iostat 1' command) that when running a writes-only test with
>>>>>>> relaximation (200 write processes), disk write activity is not
>>>>>>> continuous. Without this patch, there's continuous (every 1
>>>>>>> second) write activity.
>>>>>> I'm confused by this statement. You must be talking about
>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>> think you see larger intervals between write activity with the
>>>>>> optimization from COUCHDB-767? Have you measured the time it takes
>>>>>> to open the extra FD? In my tests that was a sub-millisecond
>>>>>> operation, but maybe you've uncovered something else.
>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>> possible explanation I see for the variance might be related to the
>>>>> Erlang VM scheduler's decisions about when to start/run that process.
>>>>> Nevertheless, I don't know the exact cause, but the fsync run
>>>>> frequency varies a lot.
>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>> plain-old spinning-disk MacBook with 200 writers in relaximation; the
>>>> IOPS reported by iostat stayed very uniform.
>>>>>>> For the goal of not having readers getting blocked by fsync calls
>>>>>>> (and write calls), I would propose using a separate couch_file
>>>>>>> process for read operations. I have a branch in my github for this
>>>>>>> (with COUCHDB-767 reverted). It needs to be polished, but the
>>>>>>> relaximation tests are very positive: both reads and writes get
>>>>>>> better response times and throughput.
>>>>>> I'd like to propose an alternative optimization, which is to keep a
>>>>>> dedicated file descriptor open in the couch_db_updater process and
>>>>>> use that file descriptor for _all_ IO initiated by the db_updater.
>>>>>> The advantage is that the db_updater does not need to do any message
>>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>>> message queue is large. A message queue much, much larger than the
>>>>>> number of concurrent writers can occur if a user writes with
>>>>>> batch=ok, and it can also happen rather easily in a BigCouch cluster.
>>>>> I don't see how that will improve things, since all write operations
>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>> serialized, to me it only seems that your solution avoids one level
>>>>> of indirection (the couch_file process). I don't see how, when using
>>>>> couch_file only for writes, you get the message queue for that
>>>>> couch_file process full of write messages.
>>>> It's the db_updater which gets a large message queue, not the
>>>> couch_file. The db_updater ends up with a big backlog of update_docs
>>>> messages that get in the way when it needs to make gen_server calls to
>>>> the couch_file process for IO. It's a significant problem in R13B,
>>>> probably less so in R14B because of some cool optimizations by the OTP
>>>> team.
>>> So, let me see if I get it. The couch_db_updater process is slow at
>>> picking up the results of its calls to the couch_file process because
>>> its mailbox is full of update_docs messages?
>> Correct. Each call to the couch_file requires a selective receive on
>> the part of the db_updater in order to get the response, and prior to
>> R14 that selective receive needed to match against every message in the
>> mailbox. It's really a bigger problem in couch_server, which uses a
>> gen_server call to increment a reference counter before handing the
>> #db{} to the client, since every request to any DB has to talk to
>> couch_server first.
>>
>> Best,
>> Adam
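
To make the selective receive cost concrete for anyone following
along: this is roughly the shape of what every gen_server:call does
internally (simplified; not the actual OTP code):

    -module(selective_receive_sketch).
    -export([call/2]).

    %% The receive below is selective: prior to the R14 optimization the
    %% VM matched these patterns against every message already queued,
    %% so a mailbox full of update_docs messages made each call slow.
    call(Pid, Request) ->
        MRef = erlang:monitor(process, Pid),
        Pid ! {'$gen_call', {self(), MRef}, Request},
        receive
            {MRef, Reply} ->
                erlang:demonitor(MRef, [flush]),
                Reply;
            {'DOWN', MRef, process, Pid, Reason} ->
                exit({Reason, {gen_server, call, [Pid, Request]}})
        end.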
> Adam,
> I think the problem is made worse by a backed up db_updater, but the
> db_updater becomes backed up because it makes more synchronous calls
> to the couch_file than a reader does, handling only one update
> operation at a time while readers queue up on the couch_file in
> parallel.
> Filipe,
> Using a separate fd for writes at the couch_file level is not the
> answer. The db_updater has to read the btree before it can write,
> incurring multiple trips through the couch_file message queue between
> queuing append_term requests and processing its message queue for new
> updates. Using two file descriptors keeps the readers out of the way
> of the writers only if you select which fd to use at the db-operation
> level and not the file-operation level. Perhaps two couch_file
> processes is better. Fairness should be left to the operating system
> I/O scheduler once reads don't queue behind writes. This seems like the best way
> forward to me right now. Let's try to crunch some numbers on it soon.
> I couldn't find a solution I liked that was fair to readers and
> writers at any workload with only one file descriptor. The btree cache
> alleviates this problem a bit because the read path becomes much
> faster, which speeds up both database reads and writes.
> As to the patch, I'd think we need the readers and writers separated
> into two separate couch_files. That way the updater can perform its
> reads on the "writer" fd, otherwise writers suffer starvation because
> readers go directly into the couch_file queue in parallel instead of
> serializing through something like db_updater.
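
To make the fd-selection point concrete, the split I have in mind looks
roughly like this (a sketch only; the module and record names are made
up, though open/1, pread_term/2 and append_term/2 are, I believe, the
existing couch_file calls):

    -module(db_fd_split_sketch).
    -export([open/1, read_term/2, append_term/2]).

    %% Hypothetical: each db carries two couch_file processes, one for
    %% client reads and one reserved for the db_updater.
    -record(db, {reader_fd, updater_fd}).

    open(FilePath) ->
        {ok, ReaderFd}  = couch_file:open(FilePath),
        {ok, UpdaterFd} = couch_file:open(FilePath),
        {ok, #db{reader_fd = ReaderFd, updater_fd = UpdaterFd}}.

    %% Client reads only ever touch the reader fd, so they never queue
    %% behind the updater's traffic.
    read_term(#db{reader_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    %% The updater does all of its IO here, including the btree reads it
    %% needs before it can write, so it never waits on readers either.
    append_term(#db{updater_fd = Fd}, Term) ->
        couch_file:append_term(Fd, Term).

The choice of fd happens at the db-operation level, not inside
couch_file, which is the distinction I was trying to draw above.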
