Subject: Re: About possibly reverting COUCHDB-767
From: Randall Leeds <randall.leeds@gmail.com>
To: dev@couchdb.apache.org
Date: Mon, 8 Nov 2010 18:24:44 -0800

On Mon, Nov 8, 2010 at 17:18, Adam Kocoloski wrote:
> On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:
>
>> On Mon, Nov 8, 2010 at 12:22, Paul Davis wrote:
>>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds wrote:
>>>> Whoops. Hit send too early, but I think I got everything in there that
>>>> I wanted to say.
>>>>
>>>> As for the ref counter bottleneck, I just pushed to
>>>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>>>> This branch uses a public ets table for the ref_counter. I think I managed
>>>> to linearize the updates over the {total, RefCtr} keys in the ets table
>>>> such that there should be no race conditions, but please, please take a
>>>> look at this if you have time.
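A minimal sketch of the kind of public-ETS counter described above, for readers following along; the module, table, and function names are illustrative and this is not the code in the ets_ref_count branch:

    %% Sketch only: the shape of a public-ETS ref counter.
    -module(ets_refcount_sketch).
    -export([init/0, incr/1, decr/1, count/1]).

    -define(TABLE, couch_ref_counters).

    init() ->
        %% A single public table lets any client process bump counters directly,
        %% without a gen_server call to a couch_ref_counter process.
        ets:new(?TABLE, [named_table, public, set]).

    incr(RefCtr) ->
        Key = {total, RefCtr},
        %% insert_new/2 and update_counter/3 are each atomic, so concurrent
        %% increments on the same key are linearized by ets itself.
        ets:insert_new(?TABLE, {Key, 0}),
        ets:update_counter(?TABLE, Key, 1).

    decr(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, -1).

    count(RefCtr) ->
        case ets:lookup(?TABLE, {total, RefCtr}) of
            [{_, N}] -> N;
            []       -> 0
        end.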
>>>>
>>>> It seems to pass the ref_counter tests, but I still need to handle
>>>> giving away ownership of the ets table. Right now I use couch_server
>>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>>> processes, but the couch_server just crashes if it actually receives
>>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>>> table to another couch_ref_counter whenever the owner exits I may just
>>>> break the encapsulation of the module a bit by leaving couch_server as
>>>> the owner and ignoring that message.
>>>>
>>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>>> all this is done.
>>>>
>>>> -Randall
>>>>
>>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds wrote:
>>>>> Thanks to both of you for getting this conversation going again and
>>>>> for the work on the patch and testing, Filipe.
>>>>>
>>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski wrote:
>>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>>
>>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski wrote:
>>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>>
>>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski wrote:
>>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>>
>>>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>>>>>>> 1' command) that when running a writes-only test with relaximation
>>>>>>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>>>>>>
>>>>>>>>>> I'm confused by this statement. You must be talking about relaximation runs with delayed_commits = true, right? Why do you think you see larger intervals between write activity with the optimization from COUCHDB-767? Have you measured the time it takes to open the extra FD? In my tests that was a sub-millisecond operation, but maybe you've uncovered something else.
>>>>>>>>>
>>>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>>>> Erlang VM scheduler decisions about when to start/run that process.
>>>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run frequency
>>>>>>>>> varies a lot.
>>>>>>>>
>>>>>>>> I think it's worth investigating. I couldn't reproduce it on my plain old spinning-disk MacBook with 200 writers in relaximation; the IOPS reported by iostat stayed very uniform.
>>>>>>>>
>>>>>>>>>>> For the goal of not having readers getting blocked by fsync calls (and
>>>>>>>>>>> write calls), I would propose using a separate couch_file process just
>>>>>>>>>>> for read operations. I have a branch in my github for this (with
>>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>>>>>>>>>> tests are very positive; both reads and writes get better response
>>>>>>>>>>> times and throughput:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>>>
>>>>>>>>>> I'd like to propose an alternative optimization, which is to keep a dedicated file descriptor open in the couch_db_updater process and use that file descriptor for _all_ IO initiated by the db_updater. The advantage is that the db_updater does not need to do any message passing for disk IO, and thus does not slow down when the incoming message queue is large. A message queue much, much larger than the number of concurrent writers can occur if a user writes with batch=ok, and it can also happen rather easily in a BigCouch cluster.
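As a rough illustration of that dedicated-descriptor idea (illustrative module and function names, not Adam's patch), the updater could hold a raw fd of its own and read and write it directly, so none of its IO has to pass through a busy mailbox:

    %% Sketch only: a private, raw descriptor owned by the updater, so its IO
    %% bypasses the couch_file gen_server entirely.
    -module(updater_fd_sketch).
    -export([open/1, append_term/2, pread_term/2]).

    open(FilePath) ->
        %% raw + binary: reads and writes go straight to the file driver.
        %% A raw file can only be used by the process that opened it, which
        %% is exactly the point: the descriptor is private to the updater.
        file:open(FilePath, [read, write, raw, binary]).

    append_term(Fd, Term) ->
        Bin = term_to_binary(Term),
        {ok, Pos} = file:position(Fd, eof),
        %% Length-prefixed append; returns the offset the term was written at.
        ok = file:write(Fd, [<<(byte_size(Bin)):32>>, Bin]),
        {ok, Pos}.

    pread_term(Fd, Pos) ->
        {ok, <<Size:32>>} = file:pread(Fd, Pos, 4),
        {ok, Bin} = file:pread(Fd, Pos + 4, Size),
        {ok, binary_to_term(Bin)}.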
>>>>>>>>>
>>>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>>>> serialized, to me it only seems that your solution avoids one level
>>>>>>>>> of indirection (the couch_file process). I don't see how, when using a
>>>>>>>>> couch_file only for writes, you get the message queue for that
>>>>>>>>> couch_file process full of write messages.
>>>>>>>>
>>>>>>>> It's the db_updater which gets a large message queue, not the couch_file. The db_updater ends up with a big backlog of update_docs messages that get in the way when it needs to make gen_server calls to the couch_file process for IO. It's a significant problem in R13B, probably less so in R14B because of some cool optimizations by the OTP team.
>>>>>>>
>>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>>> picking up the results of the calls to the couch_file process because its
>>>>>>> mailbox is full of update_docs messages?
>>>>>>
>>>>>> Correct. Each call to the couch_file requires a selective receive on the part of the db_updater in order to get the response, and prior to R14 that selective receive needed to match against every message in the mailbox. It's really a bigger problem in couch_server, which uses a gen_server call to increment a reference counter before handing the #db{} to the client, since every request to any DB has to talk to couch_server first. Best,
>>>>>>
>>>>>> Adam
>>>>>
>>>>> Adam,
>>>>> I think the problem is made worse by a backed-up db_updater, but the
>>>>> db_updater becomes backed up because it makes more synchronous calls
>>>>> to the couch_file than a reader does, handling only one update
>>>>> operation at a time while readers queue up on the couch_file in
>>>>> parallel.
>>>>>
>>>>> Filipe,
>>>>> Using a separate fd for writes at the couch_file level is not the
>>>>> answer. The db_updater has to read the btree before it can write,
>>>>> incurring multiple trips through the couch_file message queue between
>>>>> queuing append_term requests and processing its message queue for new
>>>>> updates. Using two file descriptors keeps the readers out of the way
>>>>> of the writers only if you select which fd to use at the db-operation
>>>>> level and not the file-operation level. Perhaps two couch_file
>>>>> processes is better. Fairness should be left to the operating system
>>>>> I/O scheduler once reads don't have to queue behind writes. This seems
>>>>> like the best way forward to me right now. Let's try to crunch some
>>>>> numbers on it soon.
>>>>>
>>>>> I couldn't find a solution I liked that was fair to readers and
>>>>> writers at any workload with only one file descriptor. The btree cache
>>>>> alleviates this problem a bit because the read path becomes much
>>>>> faster and therefore improves database reads and writes.
>>>>>
>>>>> As to the patch, I'd think we need the readers and writers separated
>>>>> into two separate couch_files. That way the updater can perform its
>>>>> reads on the "writer" fd; otherwise writers suffer starvation because
>>>>> readers go directly into the couch_file queue in parallel instead of
>>>>> serializing through something like db_updater.
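A tiny sketch of what choosing the fd at the db-operation level (rather than the file-operation level) could look like; the record and function names are illustrative and not taken from either branch:

    %% Sketch only: routing chosen per database operation, not per file operation.
    -module(fd_routing_sketch).
    -export([fd_for/2]).

    -record(db_fds, {reader_fd,    %% couch_file process serving client reads
                     updater_fd}). %% couch_file process used only by the db_updater

    %% Client reads share one process; everything the updater does, including
    %% the btree reads it performs before appending, stays on the other, so a
    %% burst of client reads cannot starve an in-flight update.
    fd_for(client_read,  #db_fds{reader_fd = Fd})  -> Fd;
    fd_for(updater_read, #db_fds{updater_fd = Fd}) -> Fd;
    fd_for(write,        #db_fds{updater_fd = Fd}) -> Fd.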
>>>>>
>>>>
>>>
>>> Wasn't there a branch or patch somewhere that just removed the
>>> ref_counter code entirely and used monitors/links to make sure
>>> everything behaved correctly? I'm not sure I ever saw it to see how
>>> dramatic and/or scary it was, but it might be another approach to
>>> consider.
>>>
>>
>> Adam should chime in. I think BigCouch got rid of the ref counter in
>> favor of something else, but last I asked him about it he said there
>> might be a small edge-case race condition. How critical that is I
>> can't say. It may be that that edge case is tolerable and recoverable.
>
> BigCouch uses a protected couch_dbs ets table to store all open #dbs and a public couch_lru ets table for LRU updates. Reference counting is accomplished by having clients monitor and demonitor the couch_file. The least recently used database is determined by folding over the couch_lru table. The race condition Randall is referring to is the following:
>
> 1) client process grabs a #db{} record from the couch_dbs ets table
> 2) couch_server calculates that the LRU DB is the one the client just looked up
> 3) couch_server checks the monitored_by field for the #db.fd process and finds no clients are monitoring it
> 4) couch_server kills the DB
> 5) client updates the LRU time (too late)
> 6) client monitors the #db.fd, but it's already dead
>
> Practically speaking, it's been a non-issue, but it does exist. For what it's worth, a similar race condition existed in older (0.10.x and below) versions of CouchDB, where the client would increment the ref counter after receiving the DB from couch_server. The current implementation in CouchDB avoids the race condition by having couch_server increment the ref counter on behalf of the client, but that approach has serious performance and stability implications for BigCouch.
>
> I'll try to take a look at Randall's github branch tonight. Using monitors for reference counting feels really right to me, but as it's not possible to monitor a process on behalf of someone else I'm not sure it's possible to avoid the race that I described with monitors. Regards,
>
> Adam

That seems like a pretty easy thing to detect (monitor fails) and handle
gracefully, but check over my code and let me know what you all want to do.
I just pushed some changes to squash out the last little edge problems with it.
I added one extra test case to see that it cleans itself up properly, and
now make check and Futon both pass.

I should be around on IRC most of the night (GMT-8), so ping me if
something doesn't make sense.

Cheers,
Randall
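For readers unfamiliar with the pattern Adam describes above, a minimal sketch of monitor-based reference counting with the race window marked; the module and function names are illustrative and this is not the BigCouch code:

    %% Sketch only: "ref counting" via monitors on the db's fd process.
    -module(monitor_refcount_sketch).
    -export([checkout/1, checkin/1, idle/1]).

    %% Client side: "incrementing" the count is simply monitoring the fd.
    checkout(Fd) ->
        %% Race window: between couch_server handing out the #db{} and this
        %% monitor call, couch_server may observe zero monitors and kill the db.
        erlang:monitor(process, Fd).

    %% Client side: "decrementing" is demonitoring.
    checkin(MonRef) ->
        erlang:demonitor(MonRef, [flush]).

    %% couch_server side: a db counts as unused when nothing monitors its fd.
    idle(Fd) ->
        case erlang:process_info(Fd, monitored_by) of
            {monitored_by, []} -> true;
            {monitored_by, _}  -> false;
            undefined          -> true    %% the fd process is already dead
        end.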