Subject: Re: About possibly reverting COUCHDB-767
From: Randall Leeds <randall.leeds@gmail.com>
To: dev@couchdb.apache.org
Date: Mon, 8 Nov 2010 18:24:44 -0800

On Mon, Nov 8, 2010 at 17:18, Adam Kocoloski wrote:
> On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:
>
>> On Mon, Nov 8, 2010 at 12:22, Paul Davis wrote:
>>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds wrote:
>>>> Whoops. Hit send too early, but I think I got everything in there that
>>>> I wanted to say.
>>>>
>>>> As for the ref counter bottleneck, I just pushed to
>>>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>>>> This branch uses a public ets table for the ref_counter. I think I managed
>>>> to linearize the updates over the {total, RefCtr} keys in the ets table
>>>> such that there should be no race conditions, but please, please take a
>>>> look at this if you have time.
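A minimal sketch of the kind of public-ETS counter described above, for readers following along; the module, table, and function names are illustrative and this is not the code in the ets_ref_count branch:

    %% Sketch only: the shape of a public-ETS ref counter.
    -module(ets_refcount_sketch).
    -export([init/0, incr/1, decr/1, count/1]).

    -define(TABLE, couch_ref_counters).

    init() ->
        %% A single public table lets any client process bump counters directly,
        %% without a gen_server call to a couch_ref_counter process.
        ets:new(?TABLE, [named_table, public, set]).

    incr(RefCtr) ->
        Key = {total, RefCtr},
        %% insert_new/2 and update_counter/3 are each atomic, so concurrent
        %% increments on the same key are linearized by ets itself.
        ets:insert_new(?TABLE, {Key, 0}),
        ets:update_counter(?TABLE, Key, 1).

    decr(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, -1).

    count(RefCtr) ->
        case ets:lookup(?TABLE, {total, RefCtr}) of
            [{_, N}] -> N;
            []       -> 0
        end.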
>>>>
>>>> It seems to pass the ref_counter tests, but I still need to handle
>>>> giving away ownership of the ets table. Right now I use couch_server
>>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>>> processes, but the couch_server just crashes if it actually receives
>>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>>> table to another couch_ref_counter whenever the owner exits I may just
>>>> break the encapsulation of the module a bit by leaving couch_server as
>>>> the owner and ignoring that message.
>>>>
>>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>>> all this is done.
>>>>
>>>> -Randall
>>>>
>>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds wrote:
>>>>> Thanks to both of you for getting this conversation going again and
>>>>> for the work on the patch and testing, Filipe.
>>>>>
>>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski wrote:
>>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>>
>>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski wrote:
>>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>>
>>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski wrote:
>>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>>
>>>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>>>>>>> 1' command) that when running a writes-only test with relaximation
>>>>>>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>>>>>>
>>>>>>>>>> I'm confused by this statement. You must be talking about relaximation runs with delayed_commits = true, right? Why do you think you see larger intervals between write activity with the optimization from COUCHDB-767? Have you measured the time it takes to open the extra FD? In my tests that was a sub-millisecond operation, but maybe you've uncovered something else.
>>>>>>>>>
>>>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>>>> Erlang VM scheduler decisions about when to start/run that process.
>>>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run frequency
>>>>>>>>> varies a lot.
>>>>>>>>
>>>>>>>> I think it's worth investigating. I couldn't reproduce it on my plain old spinning-disk MacBook with 200 writers in relaximation; the IOPS reported by iostat stayed very uniform.
>>>>>>>>
>>>>>>>>>>> For the goal of not having readers getting blocked by fsync calls (and
>>>>>>>>>>> write calls), I would propose using a separate couch_file process just
>>>>>>>>>>> for read operations. I have a branch in my github for this (with
>>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>>>>>>>>>> tests are very positive; both reads and writes get better response
>>>>>>>>>>> times and throughput:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>>>
>>>>>>>>>> I'd like to propose an alternative optimization, which is to keep a dedicated file descriptor open in the couch_db_updater process and use that file descriptor for _all_ IO initiated by the db_updater. The advantage is that the db_updater does not need to do any message passing for disk IO, and thus does not slow down when the incoming message queue is large. A message queue much, much larger than the number of concurrent writers can occur if a user writes with batch=ok, and it can also happen rather easily in a BigCouch cluster.
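As a rough illustration of that dedicated-descriptor idea (illustrative module and function names, not Adam's patch), the updater could hold a raw fd of its own and read and write it directly, so none of its IO has to pass through a busy mailbox:

    %% Sketch only: a private, raw descriptor owned by the updater, so its IO
    %% bypasses the couch_file gen_server entirely.
    -module(updater_fd_sketch).
    -export([open/1, append_term/2, pread_term/2]).

    open(FilePath) ->
        %% raw + binary: reads and writes go straight to the file driver.
        %% A raw file can only be used by the process that opened it, which
        %% is exactly the point: the descriptor is private to the updater.
        file:open(FilePath, [read, write, raw, binary]).

    append_term(Fd, Term) ->
        Bin = term_to_binary(Term),
        {ok, Pos} = file:position(Fd, eof),
        %% Length-prefixed append; returns the offset the term was written at.
        ok = file:write(Fd, [<<(byte_size(Bin)):32>>, Bin]),
        {ok, Pos}.

    pread_term(Fd, Pos) ->
        {ok, <<Size:32>>} = file:pread(Fd, Pos, 4),
        {ok, Bin} = file:pread(Fd, Pos + 4, Size),
        {ok, binary_to_term(Bin)}.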
>>>>>>>>>
>>>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>>>> serialized, to me it only seems that your solution avoids one level
>>>>>>>>> of indirection (the couch_file process). I don't see how, when using a
>>>>>>>>> couch_file only for writes, you get the message queue for that
>>>>>>>>> couch_file process full of write messages.
>>>>>>>>
>>>>>>>> It's the db_updater which gets a large message queue, not the couch_file. The db_updater ends up with a big backlog of update_docs messages that get in the way when it needs to make gen_server calls to the couch_file process for IO. It's a significant problem in R13B, probably less so in R14B because of some cool optimizations by the OTP team.
>>>>>>>
>>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>>> picking up the results of the calls to the couch_file process because its
>>>>>>> mailbox is full of update_docs messages?
>>>>>>
>>>>>> Correct. Each call to the couch_file requires a selective receive on the part of the db_updater in order to get the response, and prior to R14 that selective receive needed to match against every message in the mailbox. It's really a bigger problem in couch_server, which uses a gen_server call to increment a reference counter before handing the #db{} to the client, since every request to any DB has to talk to couch_server first. Best,
>>>>>>
>>>>>> Adam
>>>>>
>>>>> Adam,
>>>>> I think the problem is made worse by a backed-up db_updater, but the
>>>>> db_updater becomes backed up because it makes more synchronous calls
>>>>> to the couch_file than a reader does, handling only one update
>>>>> operation at a time while readers queue up on the couch_file in
>>>>> parallel.
>>>>>
>>>>> Filipe,
>>>>> Using a separate fd for writes at the couch_file level is not the
>>>>> answer. The db_updater has to read the btree before it can write,
>>>>> incurring multiple trips through the couch_file message queue between
>>>>> queuing append_term requests and processing its message queue for new
>>>>> updates. Using two file descriptors keeps the readers out of the way
>>>>> of the writers only if you select which fd to use at the db-operation
>>>>> level and not the file-operation level. Perhaps two couch_file
>>>>> processes is better. Fairness should be left to the operating system
>>>>> I/O scheduler once reads don't have to queue behind writes. This seems
>>>>> like the best way forward to me right now. Let's try to crunch some
>>>>> numbers on it soon.
>>>>>
>>>>> I couldn't find a solution I liked that was fair to readers and
>>>>> writers at any workload with only one file descriptor. The btree cache
>>>>> alleviates this problem a bit because the read path becomes much
>>>>> faster and therefore improves database reads and writes.
>>>>>
>>>>> As to the patch, I'd think we need the readers and writers separated
>>>>> into two separate couch_files. That way the updater can perform its
>>>>> reads on the "writer" fd; otherwise writers suffer starvation because
>>>>> readers go directly into the couch_file queue in parallel instead of
>>>>> serializing through something like db_updater.
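A tiny sketch of what choosing the fd at the db-operation level (rather than the file-operation level) could look like; the record and function names are illustrative and not taken from either branch:

    %% Sketch only: routing chosen per database operation, not per file operation.
    -module(fd_routing_sketch).
    -export([fd_for/2]).

    -record(db_fds, {reader_fd,    %% couch_file process serving client reads
                     updater_fd}). %% couch_file process used only by the db_updater

    %% Client reads share one process; everything the updater does, including
    %% the btree reads it performs before appending, stays on the other, so a
    %% burst of client reads cannot starve an in-flight update.
    fd_for(client_read,  #db_fds{reader_fd = Fd})  -> Fd;
    fd_for(updater_read, #db_fds{updater_fd = Fd}) -> Fd;
    fd_for(write,        #db_fds{updater_fd = Fd}) -> Fd.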
>>>>>
>>>>
>>>
>>> Wasn't there a branch or patch somewhere that just removed the
>>> ref_counter code entirely and used monitors/links to make sure
>>> everything behaved correctly? I'm not sure I ever saw it to see how
>>> dramatic and/or scary it was, but it might be another approach to
>>> consider.
>>>
>>
>> Adam should chime in. I think BigCouch got rid of the ref counter in
>> favor of something else, but last I asked him about it he said there
>> might be a small edge-case race condition. How critical that is I
>> can't say. It may be that that edge case is tolerable and recoverable.
>
> BigCouch uses a protected couch_dbs ets table to store all open #dbs and a public couch_lru ets table for LRU updates. Reference counting is accomplished by having clients monitor and demonitor the couch_file. The least recently used database is determined by folding over the couch_lru table. The race condition Randall is referring to is the following:
>
> 1) client process grabs a #db{} record from the couch_dbs ets table
> 2) couch_server calculates that the LRU DB is the one the client just looked up
> 3) couch_server checks the monitored_by field for the #db.fd process and finds no clients are monitoring it
> 4) couch_server kills the DB
> 5) client updates the LRU time (too late)
> 6) client monitors the #db.fd, but it's already dead
>
> Practically speaking, it's been a non-issue, but it does exist. For what it's worth, a similar race condition existed in older (0.10.x and below) versions of CouchDB, where the client would increment the ref counter after receiving the DB from couch_server. The current implementation in CouchDB avoids the race condition by having couch_server increment the ref counter on behalf of the client, but that approach has serious performance and stability implications for BigCouch.
>
> I'll try to take a look at Randall's github branch tonight. Using monitors for reference counting feels really right to me, but as it's not possible to monitor a process on behalf of someone else I'm not sure it's possible to avoid the race that I described with monitors. Regards,
>
> Adam

That seems like a pretty easy thing to detect (monitor fails) and handle
gracefully, but check over my code and let me know what you all want to do.
I just pushed some changes to squash out the last little edge problems with it.
I added one extra test case to see that it cleans itself up properly, and
now make check and Futon both pass.

I should be around on IRC most of the night (GMT-8), so ping me if
something doesn't make sense.

Cheers,
Randall
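For readers unfamiliar with the pattern Adam describes above, a minimal sketch of monitor-based reference counting with the race window marked; the module and function names are illustrative and this is not the BigCouch code:

    %% Sketch only: "ref counting" via monitors on the db's fd process.
    -module(monitor_refcount_sketch).
    -export([checkout/1, checkin/1, idle/1]).

    %% Client side: "incrementing" the count is simply monitoring the fd.
    checkout(Fd) ->
        %% Race window: between couch_server handing out the #db{} and this
        %% monitor call, couch_server may observe zero monitors and kill the db.
        erlang:monitor(process, Fd).

    %% Client side: "decrementing" is demonitoring.
    checkin(MonRef) ->
        erlang:demonitor(MonRef, [flush]).

    %% couch_server side: a db counts as unused when nothing monitors its fd.
    idle(Fd) ->
        case erlang:process_info(Fd, monitored_by) of
            {monitored_by, []} -> true;
            {monitored_by, _}  -> false;
            undefined          -> true    %% the fd process is already dead
        end.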