Date: Mon, 15 Nov 2010 05:45:59 -0500
From: Matthew Sinclair-Day
Subject: Change filters and heartbeats
To: dev@couchdb.apache.org

Hi,

I think this issue was raised a couple of months ago, but I am not sure what, if anything, has been done to address it, or even whether it was formally entered into JIRA. Given the push for a 1.0.2 release, I wanted to raise it for consideration.

The problem is basically this: when a change filter is busy processing documents, the heartbeat is not sent across the wire. This plays havoc with clients listening on those feeds, because their sockets can eventually time out.

This might seem like an unlikely, pathological situation, but consider a network of couch servers fronted by one or more application servers. The app servers listen on change notifications to pick up messages from other apps/couch servers. An app server does not want to "read" its own writes, so a filter is used.

In a busy system in which one server is producing most or all of the documents, the change feed can die quickly and often.
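To make the "don't read your own writes" setup concrete, here is a minimal sketch. The design doc name (`app`), filter name (`not_mine`), and the `origin` field are all hypothetical application conventions, not anything CouchDB defines; the filter only relies on CouchDB passing extra query string parameters through to the filter via `req.query`.

```shell
# Hypothetical sketch: each app server tags its writes with an "origin"
# field and filters those docs back out of its own feed.
# "app", "not_mine", and "origin" are illustrative names only.
curl -X PUT http://localhost:5984/test/_design/app \
     -H "Content-Type: application/json" \
     -d '{"filters": {"not_mine": "function(doc, req) { return doc.origin !== req.query.me; }"}}'

# Each server identifies itself via the "me" query parameter:
curl "http://localhost:5984/test/_changes?filter=app/not_mine&me=app1&feed=continuous&heartbeat=5000"
```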
This failure mode can also have unintended consequences for the app and database servers, and for the system as a whole, by creating feedback loops as the clients repeatedly reconnect, eventually fail, and reconnect again.

Obviously, I am describing a specific system implementation, which needs more built-in protections against this failure mode, but solving the heartbeat problem would prevent that mode in the first place :)

I'm eager to have this problem fixed and would be willing to take a shot at it if someone could provide a bit of guidance.

The problem is easily reproducible in a pathological scenario like this:

1. Create a database named "test".

2. Attach a filter to the database:

   {
     "language": "erlang",
     "views": {},
     "filters": {
       "test": "fun({Doc}, _) ->\n false end."
     }
   }

3. Open up a continuous feed using that filter:

   curl "localhost:5984/test/_changes?filter=test/test&feed=continuous&heartbeat=5000"

4. Write many docs into the database (one of my test scripts):

   #!/bin/sh
   COUNTER=0
   while [ $COUNTER -lt 3000 ]; do
     ITS=0
     while [ $ITS -lt 5 ]; do
       PAYLOAD="{\"docType\":\"CS_MANIFEST\", \"contentSetPath\":\"/some/path/${COUNTER}/${ITS}\", \"versionName\":\"$COUNTER\", \"linkDate\":123456789${COUNTER}}"
       curl -X POST -d "$PAYLOAD" -H "Content-Type: application/json" http://localhost:5984/test
       ITS=$((ITS+1))
     done
     COUNTER=$((COUNTER+1))
   done

5. Notice the heartbeat stops until the script has completed.

Matt
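P.S. A quick way to watch the stall from step 5 is to timestamp every line coming off the feed. With a 5000 ms heartbeat, a healthy feed should print a (blank) heartbeat line at least every five seconds, so the stall shows up as a gap in the timestamps. This is just an observation aid, not part of any fix:

```shell
# Timestamp each line of the continuous feed; heartbeats arrive as blank
# lines. -N disables curl's output buffering so lines appear as received.
curl -sN "localhost:5984/test/_changes?filter=test/test&feed=continuous&heartbeat=5000" |
while IFS= read -r line; do
  printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
done
```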