Date: Mon, 06 Dec 2010 23:18:28 -0700
From: Matt Adams
Organization: Radical Dynamic Inc.
To: user@couchdb.apache.org
Subject: Best practices for scaling (many small databases vs. a large one)

Hi folks,

I am writing about best practices for scaling, and the relative impact of choosing many small databases vs. one (or more) very large databases.

Given the scenario I am working with, my original intent was to use many small databases. Users either need access to an entire database or none at all, so the native CouchDB access permissions and/or a simple proxy would work quite well to secure data without the need for a more complicated authentication filter. This also means that replication is an all-or-nothing thing (I would not need to worry about partial replication of databases). There are other reasons why I lean towards many small databases, but these are probably the primary ones (i.e., many smaller databases are simpler for me to implement for the purposes of getting CouchDB into play).

In this scenario most of the databases would be quite small (in the <1GB range), so we're not dealing with large data sets, and the ratio of users to databases is also fairly low.
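
To make the many-small-databases option a bit more concrete, here is a minimal sketch of what the per-database bookkeeping might look like. It only builds the JSON bodies one would send to CouchDB (a database's `_security` object and a `POST /_replicate` request) and performs no network calls. The database names, user names, and standby URL are hypothetical, and the `members` key is an assumption that holds for more recent CouchDB releases (older releases called it `readers`).

```python
import json


def security_doc(member_names):
    """Build a per-database _security body limiting access to the named users."""
    return {
        "admins": {"names": [], "roles": []},
        # Assumption: "members" is the key in recent releases ("readers" in older ones).
        "members": {"names": sorted(member_names), "roles": []},
    }


def replication_requests(db_names, standby_url):
    """Build one _replicate request body per database for a fail-over copy."""
    base = standby_url.rstrip("/")
    return [
        {
            "source": name,
            "target": f"{base}/{name}",
            "continuous": True,     # keep the standby copy up to date
            "create_target": True,  # create the database on the standby if missing
        }
        for name in db_names
    ]


if __name__ == "__main__":
    # Hypothetical databases and users, one small database per group of users.
    owners = {"team_a": {"alice", "bob"}, "team_b": {"carol"}}
    for db, users in owners.items():
        print(db, json.dumps(security_doc(users)))
    for body in replication_requests(owners, "http://standby.example:5984"):
        print(json.dumps(body))
```

The point of the sketch is that whole-database access and whole-database replication each reduce to one small document per database, so the cost of the approach scales with the number of databases rather than with any filtering logic.
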
If users were instead to share one very large database (solely to make things easier to cluster), they would usually access only a very small portion of it (e.g., most of the data would belong to many small sets of other users and would not likely be of interest to the user in question), and I would not want them to have any access to the remainder.

Problems arise in my mind when I start thinking about many thousands of these small databases. What are the clustering implications? Am I going to be busier dealing with the reality of replicating thousands of smaller databases for fail-over than I would be if I simply bit the bullet now and planned for a somewhat more complex setup? Are things like BigCouch really more suited to clustering (fewer) very large databases, or do they thrive in environments where there are many small databases?

Hopefully this is enough information for anyone who wishes to chime in with thoughts or other things to consider. I am not looking for specific solutions at this point, but am instead trying to weigh the pros and cons of moving in a particular direction.

Thanks very much,

Matt