couchdb-dev mailing list archives

From Randall Leeds <randall.leeds@gmail.com>
Subject Re: About possibly reverting COUCHDB-767
Date Mon, 08 Nov 2010 20:28:44 GMT
On Mon, Nov 8, 2010 at 12:22, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <randall.leeds@gmail.com> wrote:
>> Whoops. Hit send too early, but I think I got everything in there that
>> I wanted to say.
>>
>> As for the ref counter bottleneck, I just pushed to
>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>> This branch uses a public ets table for the ref_counter. I think I managed
>> to linearize the updates over the {total, RefCtr} keys in the ets table
>> such that there should be no race conditions, but please, please take a
>> look at this if you have time.
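
A minimal sketch of a public-ets ref counter along these lines (the module
and function names are illustrative, not the actual code on the branch):

    -module(ref_counter_sketch).
    -export([init/0, incr/1, decr/1]).

    init() ->
        %% public + write_concurrency: callers update counters directly,
        %% with no gen_server round trip through couch_ref_counter.
        ets:new(couch_ref_counters,
                [set, public, named_table, {write_concurrency, true}]).

    incr(RefCtr) ->
        Key = {total, RefCtr},
        %% insert_new/2 and update_counter/3 are each atomic, so
        %% concurrent increments on the same key cannot race.
        ets:insert_new(couch_ref_counters, {Key, 0}),
        ets:update_counter(couch_ref_counters, Key, 1).

    decr(RefCtr) ->
        %% Zeroed entries are left in place; deleting them without
        %% racing a concurrent incr is the hard part.
        ets:update_counter(couch_ref_counters, {total, RefCtr}, -1).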
>>
>> It seems to pass the ref_counter tests, but I still need to handle
>> giving away ownership of the ets table. Right now I use couch_server
>> as the heir so I can use only one ETS table for all couch_ref_counter
>> processes, but the couch_server just crashes if it actually receives
>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>> table to another couch_ref_counter whenever the owner exits I may just
>> break the encapsulation of the module a bit by leaving couch_server as
>> the owner and ignoring that message.
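
For context, the hand-off described above relies on the ets heir mechanism;
a non-crashing handle_info clause for couch_server might look roughly like
this sketch (pick_live_ref_counter/1 is a hypothetical helper):

    %% Sketch: couch_server is heir for the shared table. Instead of
    %% crashing on 'ETS-TRANSFER', it could adopt the table or give it
    %% away to a surviving couch_ref_counter process.
    handle_info({'ETS-TRANSFER', Tab, _FromPid, _HeirData}, State) ->
        case pick_live_ref_counter(State) of  % hypothetical helper
            {ok, Pid} -> ets:give_away(Tab, Pid, ref_counters);
            none      -> ok  % keep ownership until a new owner appears
        end,
        {noreply, State}.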
>>
>> Thanks, guys. My gut says we're going to get some nice numbers when
>> all this is done.
>>
>> -Randall
>>
>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com> wrote:
>>> Thanks to both of you for getting this conversation going again and
>>> for the work on the patch and testing, Filipe.
>>>
>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <kocolosk@apache.org> wrote:
>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>
>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>
>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>
>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool
>>>>>>>>> iostat 1' command) that when running a writes-only test with
>>>>>>>>> relaximation (200 write processes), disk write activity is not
>>>>>>>>> continuous. Without this patch, there's continuous (every 1
>>>>>>>>> second) write activity.
>>>>>>>>
>>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>>> think you see larger intervals between write activity with the
>>>>>>>> optimization from COUCHDB-767? Have you measured the time it takes
>>>>>>>> to open the extra FD? In my tests that was a sub-millisecond
>>>>>>>> operation, but maybe you've uncovered something else.
>>>>>>>
>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>> Erlang VM scheduler's decisions about when to start/run that process.
>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run
>>>>>>> frequency varies a lot.
>>>>>>
>>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>>> plain-old spinning-disk MacBook with 200 writers in relaximation; the
>>>>>> IOPS reported by iostat stayed very uniform.
>>>>>>
>>>>>>>>> For the goal of not having readers getting blocked by fsync
>>>>>>>>> calls (and write calls), I would propose using a separate
>>>>>>>>> couch_file process just for read operations. I have a branch in
>>>>>>>>> my github for this (with COUCHDB-767 reverted). It needs to be
>>>>>>>>> polished, but the relaximation tests are very positive; both
>>>>>>>>> reads and writes get better response times and throughput:
>>>>>>>>>
>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>
>>>>>>>> I'd like to propose an alternative optimization, which is to keep
>>>>>>>> a dedicated file descriptor open in the couch_db_updater process and
>>>>>>>> use that file descriptor for _all_ IO initiated by the db_updater.
>>>>>>>> The advantage is that the db_updater does not need to do any message
>>>>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>>>>> message queue is large. A message queue much, much larger than the
>>>>>>>> number of concurrent writers can occur if a user writes with
>>>>>>>> batch=ok, and it can also happen rather easily in a BigCouch cluster.
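
In file-level terms the proposal amounts to something like the following
sketch (names are illustrative and CouchDB's actual on-disk framing is
elided; raw mode sends reads and writes straight to the file driver rather
than through an intermediate process):

    %% Sketch: a db_updater-private fd. IO happens inline in the updater,
    %% so no gen_server call (and no selective receive against a large
    %% mailbox) is needed for each btree read or append.
    open_updater_fd(Filepath) ->
        {ok, Fd} = file:open(Filepath, [read, write, raw, binary]),
        Fd.

    append_term(Fd, Term) ->
        Bin = term_to_binary(Term),
        {ok, Pos} = file:position(Fd, eof),
        ok = file:write(Fd, Bin),
        {ok, Pos}.  % offset for the btree node that references Term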
>>>>>>>
>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>> serialized, to me it only seems that your solution avoids one level
>>>>>>> of indirection (the couch_file process). I don't see how, when using
>>>>>>> a couch_file only for writes, you get the message queue for that
>>>>>>> couch_file process full of write messages.
>>>>>>
>>>>>> It's the db_updater which gets a large message queue, not the
>>>>>> couch_file. The db_updater ends up with a big backlog of update_docs
>>>>>> messages that get in the way when it needs to make gen_server calls to
>>>>>> the couch_file process for IO. It's a significant problem in R13B,
>>>>>> probably less so in R14B because of some cool optimizations by the OTP
>>>>>> team.
>>>>>
>>>>> So, let me see if I get it. The couch_db_updater process is slow to
>>>>> pick up the results of its calls to the couch_file process because its
>>>>> mailbox is full of update_docs messages?
>>>>
>>>> Correct. Each call to the couch_file requires a selective receive on
>>>> the part of the db_updater in order to get the response, and prior to
>>>> R14 that selective receive needed to match against every message in the
>>>> mailbox. It's really a bigger problem in couch_server, which uses a
>>>> gen_server call to increment a reference counter before handing the
>>>> #db{} to the client, since every request to any DB has to talk to
>>>> couch_server first.
>>>>
>>>> Best,
>>>> Adam
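
To make that cost concrete, here is roughly what a gen_server call does
under the hood (this uses the real '$gen_call' protocol; production code
would just use gen_server:call/2):

    %% Pre-R14, this receive scans every queued message (e.g. thousands
    %% of update_docs requests) before finding the reply. R14 recognizes
    %% that nothing sent before the monitor was created can match MRef,
    %% and skips straight past the backlog.
    call(FilePid, Request) ->
        MRef = erlang:monitor(process, FilePid),
        FilePid ! {'$gen_call', {self(), MRef}, Request},
        receive
            {MRef, Reply} ->
                erlang:demonitor(MRef, [flush]),
                Reply;
            {'DOWN', MRef, process, _Pid, Reason} ->
                exit(Reason)
        end.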
>>>
>>> Adam,
>>> I think the problem is made worse by a backed-up db_updater, but the
>>> db_updater becomes backed up because it makes more synchronous calls
>>> to the couch_file than a reader does, handling only one update
>>> operation at a time while readers queue up on the couch_file in
>>> parallel.
>>>
>>> Filipe,
>>> Using a separate fd for writes at the couch_file level is not the
>>> answer. The db_updater has to read the btree before it can write,
>>> incurring multiple trips through the couch_file message queue between
>>> queuing append_term requests and processing its message queue for new
>>> updates. Using two file descriptors keeps the readers out of the way
>>> of the writers only if you select which fd to use at the db-operation
>>> level and not the file-operation level. Perhaps two couch_file
>>> processes would be better. Fairness should be left to the operating
>>> system's I/O scheduler once reads don't queue behind writes. This
>>> seems like the best way forward to me right now. Let's try to crunch
>>> some numbers on it soon.
>>>
>>> I couldn't find a solution I liked that was fair to readers and
>>> writers under any workload with only one file descriptor. The btree
>>> cache alleviates this problem a bit because the read path becomes much
>>> faster, which improves both database reads and writes.
>>>
>>> As to the patch, I'd think we need the readers and writers separated
>>> into two separate couch_files. That way the updater can perform its
>>> reads on the "writer" fd, otherwise writers suffer starvation because
>>> readers go directly into the couch_file queue in parallel instead of
>>> serializing through something like db_updater.
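
Concretely, the split being suggested could look something like this
sketch (the #db{} fields are hypothetical):

    %% Sketch: two couch_file processes per database, chosen at the
    %% db-operation level. The updater does all of its IO, btree reads
    %% included, on the writer fd, so client reads never queue ahead
    %% of an update in progress.
    -record(db, {reader_fd, updater_fd}).

    fd_for(#db{updater_fd = Fd}, updater) -> Fd;  % writes + updater's reads
    fd_for(#db{reader_fd = Fd},  client)  -> Fd.  % external read requests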
>>>
>>
>
> Wasn't there a branch or patch somewhere that just removed the
> ref_counter code entirely and used monitors/links to make sure
> everything behaved correctly? I'm not sure I ever saw it to see how
> dramatic and/or scary it was, but it might be another approach to
> consider.
>
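
For reference, a monitor-based replacement for the ref counter would look
roughly like this (illustrative names; the #state{} record is hypothetical):

    %% Sketch: the set of live monitors *is* the reference count, and the
    %% VM delivers 'DOWN' whether a client exits normally or crashes, so
    %% no explicit decrement call can be forgotten or lost.
    handle_call({monitor_client, Pid}, _From, #state{refs = Refs} = State) ->
        MRef = erlang:monitor(process, Pid),
        {reply, ok, State#state{refs = [MRef | Refs]}}.

    handle_info({'DOWN', MRef, process, _Pid, _Reason},
                #state{refs = Refs} = State) ->
        case lists:delete(MRef, Refs) of
            []   -> {stop, normal, State#state{refs = []}};
            Rest -> {noreply, State#state{refs = Rest}}
        end.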

Adam should chime in. I think BigCouch got rid of the ref counter in
favor of something else, but last I asked him about it he said there
might be a small edge case race condition. How critical that is I
can't say. It may be that that edge case is tolerable and recoverable.
