couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <kocol...@apache.org>
Subject Re: Best practices for scaling (many small databases vs. a large one)
Date Tue, 07 Dec 2010 15:36:20 GMT
On Dec 7, 2010, at 1:18 AM, Matt Adams wrote:

> Hi folks,
> 
> I am writing with regards to best practices for scaling and the relative impacts of choosing
to use many small databases vs. one (or more) very large databases.
> 
> Given the scenario with which I am working my original intent was to use many small databases.
 In this situation users either need access to an entire database or not at all so the native
CouchDB access permissions and/or a simple proxy would work quite well to secure data without
the need for a more complicated authentication filter.  This also means that replication is
an either/or thing (I would not need to worry about partial replication of databases).  There
are other reasons why I lean towards many small databases but these are probably the primary
ones (i.e., many smaller databases are simpler for me to implement for the purposes of getting
CouchDB into play).
> 
> In this scenario most of the databases would be quite small (in the <1GB range) so
we're not dealing with large data sets and the ratio of users to databases is also fairly
low.
> 
> If users were to instead share one very large database (solely for the purpose of making
things easier to cluster) they would usually only be accessing a very small portion of the
database (e.g., a lot of the data would really belong to many small sets of users and not
likely of interest to the user in question) and I would not want them to have any access to
the remainder.
> 
> Problems arise in my mind when I start thinking about many thousands of these small databases.
 What are the clustering implications?  Am I going to be busier dealing with the reality of
replicating thousands of smaller databases for fail-over than simply biting the bullet now
and planning for a somewhat more complex setup?   Are things like BigCouch really more suited
to clustering (fewer) very large databases or do they thrive in environments where there are
many small databases?
> 
> 
> Hopefully this will be enough information for anyone who wishes to chime in and give
me some thoughts or other things to consider.  I am not looking for specific solutions at
this point but instead trying to weigh the pros and cons of moving in a particular direction.
> 
> 
> Thanks very much,
> 
> Matt

Hi Matt, in general I would recommend the many small databases option for your situation.
 The only downside I see is that there's a limit to how many databases the server keeps open
at any given time (default = 100).  After that it begins closing the LRU one.  If all 100
open databases are in active use (i.e. in the middle of serving a request) CouchDB will respond
with an error.  Only an issue if you anticipate massive concurrency.

Triggering lots of individual replications can be a chore, but it's not too bad to loop over
_all_dbs and fire off a job for each one.

BigCouch can deal with lots of small databases.  For instance, you might decide that the databases
don't need to be sharded at all.  In that case you could create them with q=1 and BigCouch
will evenly distribute the individual database files across the nodes in the cluster.  Of
course, you can still have your data stored redundantly regardless of the value of q that
you choose.  Best,

Adam


Mime
View raw message