Date: Mon, 15 Nov 2010 05:45:59 -0500
From: Matthew Sinclair-Day
Subject: Change filters and heartbeats
To: dev@couchdb.apache.org

Hi,

I think this issue was raised a couple of months ago, but I am not sure what, if anything, has been done to address it, or even whether it was formally entered into JIRA. Given the push for a 1.0.2 release, I wanted to raise it for consideration.

The problem is basically this: when a change filter is busy processing documents, the heartbeat is not sent across the wire. This plays havoc with clients listening on those feeds, because their sockets can eventually time out.

This might seem like an unlikely, pathological situation, but consider a network of couch servers fronted by one or more application servers. The app servers listen on change notifications to pick up messages from other apps/couch servers. An app server does not want to "read" its own writes, so a filter is used.

In a busy system in which one server is producing most or all of the documents, the change feed can die quickly and often.
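To make the "don't read your own writes" setup concrete, here is a minimal sketch. The design doc name (`app`), filter name (`not_mine`), and the `origin` field are all hypothetical application conventions, not anything CouchDB defines; the filter only relies on CouchDB passing extra query string parameters through to the filter via `req.query`.

```shell
# Hypothetical sketch: each app server tags its writes with an "origin"
# field and filters those docs back out of its own feed.
# "app", "not_mine", and "origin" are illustrative names only.
curl -X PUT http://localhost:5984/test/_design/app \
     -H "Content-Type: application/json" \
     -d '{"filters": {"not_mine": "function(doc, req) { return doc.origin !== req.query.me; }"}}'

# Each server identifies itself via the "me" query parameter:
curl "http://localhost:5984/test/_changes?filter=app/not_mine&me=app1&feed=continuous&heartbeat=5000"
```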
This failure mode can also have unintended consequences for the app and database servers, and for the system as a whole, by creating feedback loops as the clients repeatedly reconnect, eventually fail, and reconnect again.

Obviously, I am describing a specific system implementation, which needs more built-in protections against this failure mode, but solving the heartbeat problem would prevent that mode in the first place :)

I'm eager to have this problem fixed and would be willing to take a shot at it if someone could provide a bit of guidance.

The problem is easily reproducible in a pathological scenario like this:

1. Create a database named "test".

2. Attach a filter to the database:

   {
     "language": "erlang",
     "views": {},
     "filters": {
       "test": "fun({Doc}, _) ->\n false end."
     }
   }

3. Open up a continuous feed using that filter:

   curl "localhost:5984/test/_changes?filter=test/test&feed=continuous&heartbeat=5000"

4. Write many docs into the database (one of my test scripts):

   #!/bin/sh
   COUNTER=0
   while [ $COUNTER -lt 3000 ]; do
     ITS=0
     while [ $ITS -lt 5 ]; do
       PAYLOAD="{\"docType\":\"CS_MANIFEST\", \"contentSetPath\":\"/some/path/${COUNTER}/${ITS}\", \"versionName\":\"$COUNTER\", \"linkDate\":123456789${COUNTER}}"
       curl -X POST -d "$PAYLOAD" -H "Content-Type: application/json" http://localhost:5984/test
       ITS=$((ITS+1))
     done
     COUNTER=$((COUNTER+1))
   done

5. Notice the heartbeat stops until the script has completed.

Matt
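P.S. A quick way to watch the stall from step 5 is to timestamp every line coming off the feed. With a 5000 ms heartbeat, a healthy feed should print a (blank) heartbeat line at least every five seconds, so the stall shows up as a gap in the timestamps. This is just an observation aid, not part of any fix:

```shell
# Timestamp each line of the continuous feed; heartbeats arrive as blank
# lines. -N disables curl's output buffering so lines appear as received.
curl -sN "localhost:5984/test/_changes?filter=test/test&feed=continuous&heartbeat=5000" |
while IFS= read -r line; do
  printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
done
```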