couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <>
Subject Re: About possibly reverting COUCHDB-767
Date Tue, 09 Nov 2010 01:18:06 GMT
On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:

> On Mon, Nov 8, 2010 at 12:22, Paul Davis <> wrote:
>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <> wrote:
>>> Whoops. Hit send too early, but I think I got everything in there that
>>> I wanted to say.
>>> As for the ref counter bottleneck, I just pushed to
>>> This branch uses a public ets for the ref_counter. I think I managed
>>> to linear the updates over the {total, RefCtr} keys in the ets table
>>> such that there should be no race conditions but please, please take a
>>> look at this if you have time.
>>> It seems to pass the ref_counter tests, but I still need to handle
>>> giving away ownership of the ets table. Right now I use couch_server
>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>> processes, but the couch_server just crashes if it actually receives
>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>> table to another couch_ref_counter whenever the owner exits I may just
>>> break the encapsulation of the module a bit by leaving couch_server as
>>> the owner and ignoring that message.
>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>> all this is done.
>>> -Randall
>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <> wrote:
>>>> Thanks to both of you for getting this conversation going again and
>>>> for the work on the patch and testing, Filipe.
>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <>
>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <>
>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <>
>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>> Also, with this patch I verified (on Solaris, with
the 'zpool iostat
>>>>>>>>>> 1' command) that when running a writes only test
with relaximation
>>>>>>>>>> (200 write processes), disk write activity is not
continuous. Without
>>>>>>>>>> this patch, there's continuous (every 1 second) write
>>>>>>>>> I'm confused by this statement. You must be talking about
relaximation runs with delayed_commits = true, right?  Why do you think you see larger intervals
between write activity with the optimization from COUCHDB-767?  Have you measured the time
it takes to open the extra FD?  In my tests that was a sub-millisecond operation, but maybe
you've uncovered something else.
>>>>>>>> No, it happens for tests with delayed_commits = false. The
>>>>>>>> possible explanation I see for the variance might be related
to the
>>>>>>>> Erlang VM scheduler decisions about when to start/run that
>>>>>>>> Nevertheless, I dont know the exact cause, but the fsync
run frequency
>>>>>>>> varies a lot.
>>>>>>> I think it's worth investigating.  I couldn't reproduce it on
my plain-old spinning disk MacBook with 200 writers in relaximation; the IOPS reported by
iostat stayed very uniform.
>>>>>>>>>> For the goal of not having readers getting blocked
by fsync calls (and
>>>>>>>>>> write calls), I would propose using a separate couch_file
process just
>>>>>>>>>> for read operations. I have a branch in my github
for this (with
>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but
the relaximation
>>>>>>>>>> tests are very positive, both reads and writes get
better response
>>>>>>>>>> times and throughput:
>>>>>>>>> I'd like to propose an alternative optimization, which
is to keep a dedicated file descriptor open in the couch_db_updater process and use that file
descriptor for _all_ IO initiated by the db_updater.  The advantage is that the db_updater
does not need to do any message passing for disk IO, and thus does not slow down when the
incoming message queue is large.  A message queue much much larger than the number of concurrent
writers can occur if a user writes with batch=ok, and it can also happen rather easily in
a BigCouch cluster.
>>>>>>>> I don't see how that will improve things, since all write
>>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>>> writes to the DB file, and since access to the couch_db_updater
>>>>>>>> serialized, to me it only seems that you're solution avoids
one level
>>>>>>>> of indirection (the couch_file process). I don't see how,
when using a
>>>>>>>> couch_file only for writes, you get the message queue for
>>>>>>>> couc_file process full of write messages.
>>>>>>> It's the db_updater which gets a large message queue, not the
couch_file.  The db_updater ends up with a big backlog of update_docs messages that get in
the way when it needs to make gen_server calls to the couch_file process for IO.  It's a significant
problem in R13B, probably less so in R14B because of some cool optimizations by the OTP team.
>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>> picking the results of the calls to the couch_file process because
>>>>>> mailbox is full of update_docs messages?
>>>>> Correct.  Each call to the couch_file requires a selective receive on
the part of the db_updater in order to get the response, and prior to R14 that selective receive
needed to match against every message in the mailbox.  It's really a bigger problem in couch_server,
which uses a gen_server call to increment a reference counter before handing the #db{} to
the client, since every request to any DB has to talk to couch_server first.  Best,
>>>>> Adam
>>>> Adam,
>>>> I think the problem is made worse by a backed up db_updater, but the
>>>> db_updater becomes backed up because it makes more synchronous calls
>>>> to the couch_file than a reader does, handling only one update
>>>> operation at a time while readers queue up on the couch_file in
>>>> parallel.
>>>> Filipe,
>>>> Using a separate fd for writes at the couch_file level is not the
>>>> answer. The db_updater has to read the btree before it can write,
>>>> incurring multiple trips through the couch_file message queue between
>>>> queuing append_term requests and processing its message queue for new
>>>> updates. Using two file descriptors keeps the readers out of the way
>>>> of the writers only if you select which fd to use at the db-operation
>>>> level and not the file-operation level. Perhaps two couch_file
>>>> processes is better. Fairness should be left to the operating system
>>>> I/O scheduler once reads don'. This seems seems like the best way
>>>> forward to me right now. Let's try to crunch some numbers on it soon.
>>>> I couldn't find a solution I liked that was fair to readers and
>>>> writers at any workload with only one file descriptor. The btree cache
>>>> alleviates this problem a bit because the read path becomes much
>>>> faster and therefore improves database reads and writes.
>>>> As to the patch, I'd think we need the readers and writers separated
>>>> into two separate couch_files. That way the updater can perform its
>>>> reads on the "writer" fd, otherwise writers suffer starvation because
>>>> readers go directly into the couch_file queue in parallel instead of
>>>> serializing through something like db_updater.
>> Wasn't there a branch or patch somehwere that just removed the
>> ref_counter code entirely and used monitors/links to make sure
>> everything behaved correctly? I'm not sure I ever saw it to see how
>> dramatic and/or scary it was, but it might be another approach to
>> consider.
> Adam should chime in. I think BigCouch got rid of the ref counter in
> favor of something else, but last I asked him about it he said there
> might be a small edge case race condition. How critical that is I
> can't say. It may be that that edge case is tolerable and recoverable.

BigCouch uses a protected couch_dbs ets table to store all open #dbs and a public couch_lru
ets table for LRU updates.  Reference counting is accomplished by having clients monitor and
demonitor the couch_file.  The least recently used database is determined by folding over
the couch_lru table.  The race condition Randall is referring to is the following:

1) client process grabs a #db{} record from the couch_dbs ets table
2) couch_server calculates that the LRU DB is the one the client just looked up
3) couch_server checks the monitored_by field for the #db.fd process and finds no clients
are monitoring it
4) couch_server kills the DB
5) client updates the LRU time (too late)
6) client monitors the #db.fd, but it's already dead

Practically speaking, it's been a non-issue, but it does exist.  For what it's worth, a similar
race condition existed in older (0.10.x and below) versions of CouchDB, where the client would
increment the ref counter after receiving the DB from couch_server.  The current implementation
in CouchDB avoids the race condition by having couch_server increment the ref counter on behalf
of the client, but that approach has serious performance and stability implications for BigCouch.

I'll try to take a look at Randall's github branch tonight.  Using monitors for reference
counting feels really right to me, but as it's not possible to monitor a process on behalf
of someone else I'm not sure it's possible to avoid the race that I described with monitors.

View raw message