couchdb-dev mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: About possibly reverting COUCHDB-767
Date Mon, 08 Nov 2010 19:56:09 GMT
Thanks to both of you for getting this conversation going again and
for the work on the patch and testing, Filipe.

On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <kocolosk@apache.org> wrote:
> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>
>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>
>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>
>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>> 1' command) that when running a writes-only test with relaximation
>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>
>>>>> I'm confused by this statement. You must be talking about relaximation
>>>>> runs with delayed_commits = true, right?  Why do you think you see
>>>>> larger intervals between write activity with the optimization from
>>>>> COUCHDB-767?  Have you measured the time it takes to open the extra FD?
>>>>> In my tests that was a sub-millisecond operation, but maybe you've
>>>>> uncovered something else.
>>>>
>>>> No, it happens for tests with delayed_commits = false. The only
>>>> explanation I can think of for the variance is the Erlang VM
>>>> scheduler's decisions about when to start/run that process.
>>>> Nevertheless, I don't know the exact cause, but the fsync run
>>>> frequency varies a lot.
>>>
>>> I think it's worth investigating.  I couldn't reproduce it on my
>>> plain-old spinning disk MacBook with 200 writers in relaximation; the
>>> IOPS reported by iostat stayed very uniform.
>>>>>> To avoid having readers blocked by fsync calls (and write calls), I
>>>>>> would propose using a separate couch_file process just for read
>>>>>> operations. I have a branch in my github for this (with COUCHDB-767
>>>>>> reverted). It needs to be polished, but the relaximation tests are
>>>>>> very positive; both reads and writes get better response times and
>>>>>> throughput:
>>>>>>
>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>
>>>>> I'd like to propose an alternative optimization, which is to keep a
>>>>> dedicated file descriptor open in the couch_db_updater process and use
>>>>> that file descriptor for _all_ IO initiated by the db_updater.  The
>>>>> advantage is that the db_updater does not need to do any message
>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>> message queue is large.  A message queue much much larger than the
>>>>> number of concurrent writers can occur if a user writes with batch=ok,
>>>>> and it can also happen rather easily in a BigCouch cluster.
>>>>
>>>> I don't see how that will improve things, since all write operations
>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>> writes to the DB file, and since access to the couch_db_updater is
>>>> serialized, to me it only seems that your solution avoids one level
>>>> of indirection (the couch_file process). I don't see how, when using a
>>>> couch_file only for writes, you get the message queue for that
>>>> couch_file process full of write messages.
>>>
>>> It's the db_updater which gets a large message queue, not the
>>> couch_file.  The db_updater ends up with a big backlog of update_docs
>>> messages that get in the way when it needs to make gen_server calls to
>>> the couch_file process for IO.  It's a significant problem in R13B,
>>> probably less so in R14B because of some cool optimizations by the OTP
>>> team.
>>
>> So, let me see if I get it. The couch_db_updater process is slow to
>> pick up the results of its calls to the couch_file process because its
>> mailbox is full of update_docs messages?
>
> Correct.  Each call to the couch_file requires a selective receive on
> the part of the db_updater in order to get the response, and prior to
> R14 that selective receive needed to match against every message in the
> mailbox.  It's really a bigger problem in couch_server, which uses a
> gen_server call to increment a reference counter before handing the
> #db{} to the client, since every request to any DB has to talk to
> couch_server first.  Best,
>
> Adam

Adam,
I think the problem is made worse by a backed-up db_updater, but the
db_updater becomes backed up in the first place because it makes more
synchronous calls to the couch_file than a reader does: it handles only
one update operation at a time, while readers queue up on the
couch_file in parallel.
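
To make the mailbox effect concrete, here is a toy sketch (not CouchDB
code; the module and message names are made up) of the selective
receive that a gen_server-style call performs while the caller's own
mailbox is stuffed with unrelated messages, much like update_docs
messages piling up in the db_updater:

    -module(mailbox_demo).
    -export([run/1]).

    run(Backlog) ->
        Parent = self(),
        spawn(fun() ->
            Echo = spawn(fun echo/0),
            %% Flood the caller's own mailbox first, the way
            %% update_docs messages flood the db_updater.
            [self() ! {update_docs, N} || N <- lists:seq(1, Backlog)],
            T0 = erlang:now(),
            pong = call(Echo, ping),
            Parent ! {elapsed, timer:now_diff(erlang:now(), T0)}
        end),
        receive
            {elapsed, Micros} ->
                io:format("call with ~p queued messages: ~p us~n",
                          [Backlog, Micros])
        end.

    %% A hand-rolled gen_server:call-style round trip.
    call(Pid, Msg) ->
        Ref = make_ref(),
        Pid ! {self(), Ref, Msg},
        receive
            %% R13B scans every queued update_docs message to find
            %% this match; R14B can skip messages that arrived before
            %% Ref was created.
            {Ref, Reply} -> Reply
        end.

    echo() ->
        receive
            {From, Ref, ping} -> From ! {Ref, pong}, echo()
        end.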

Filipe,
Using a separate fd for writes at the couch_file level is not the
answer. The db_updater has to read the btree before it can write,
incurring multiple trips through the couch_file message queue between
queuing append_term requests and processing its own message queue for
new updates. Using two file descriptors keeps the readers out of the
way of the writers only if you select which fd to use at the
db-operation level, not the file-operation level. Perhaps using two
couch_file processes is better: fairness should be left to the
operating system I/O scheduler once reads no longer queue behind
writes. This seems like the best way forward to me right now. Let's
try to crunch some numbers on it soon.

I couldn't find a solution I liked that was fair to both readers and
writers under every workload with only one file descriptor. The btree
cache alleviates the problem somewhat, because a much faster read path
speeds up both database reads and writes.

As to the patch, I think we need the readers and writers separated
into two separate couch_files. That way the updater can perform its
reads on the "writer" fd; otherwise writers suffer starvation, because
readers go directly into the couch_file queue in parallel instead of
serializing through something like the db_updater.
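
A sketch of the routing rule I have in mind (again, the names are
illustrative only): the fd is chosen per database operation, not per
file operation, and the updater keeps even its btree reads on its own
fd so they never wait behind a crowd of client reads.

    %% All db_updater IO, reads included, goes through updater_fd.
    updater_pread(#db{updater_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    updater_append(#db{updater_fd = Fd}, Term) ->
        couch_file:append_term(Fd, Term).

    %% Interactive client reads use the shared reader fd.
    client_pread(#db{fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).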
