incubator-couchdb-user mailing list archives

From: Robert Newson <robert.new...@gmail.com>
Subject: Re: Best practices for scaling (many small databases vs. a large one)
Date: Tue, 07 Dec 2010 17:08:14 GMT
As another data point, I recently worked with a modest dataset (about
10 TiB) where the same choice could be made between many small
databases and a few very large ones; there were two fields that were
natural partitions, and they differed in scope by a few orders of
magnitude.

I much preferred the small-database variant, and I also used q=1 since
the small databases did not need sharding. The advantage here is that
databases that are rarely referenced get pushed entirely out of the
disk cache, leaving more room for hotter data. It also made it easier
to deal with the inherent issues of multi-tenancy: the scope chosen for
each database coincided with the scope of access for the application,
so granting read/write permission at the database level worked out very
well.
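
As a rough sketch of what those database-level grants can look like in practice (the host, database, and role names below are placeholders, and note that CouchDB 1.0.x calls the non-admin group "readers" while later releases rename it to "members"):

    # Hypothetical example: give one tenant's users access to their own
    # database by setting that database's _security object.
    import json
    from urllib.request import Request, urlopen

    DB_URL = "http://localhost:5984/tenant_a"   # placeholder host and database

    security = {
        "admins":  {"names": [], "roles": ["tenant_a_admin"]},
        # "readers" in CouchDB 1.0.x; renamed to "members" in later releases.
        "readers": {"names": [], "roles": ["tenant_a_user"]},
    }

    req = Request(DB_URL + "/_security",
                  data=json.dumps(security).encode("utf-8"),
                  headers={"Content-Type": "application/json"},
                  method="PUT")
    urlopen(req)   # needs server admin credentials in practice

Once the readers list is non-empty, only the listed users and roles (plus admins) can see the database at all, which matches the all-or-nothing access pattern described in the original message quoted below.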

Best,
B.

On Tue, Dec 7, 2010 at 3:36 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> On Dec 7, 2010, at 1:18 AM, Matt Adams wrote:
>
>> Hi folks,
>>
>> I am writing with regard to best practices for scaling and the relative impact of choosing many small databases vs. one (or more) very large databases.
>>
>> Given the scenario I am working with, my original intent was to use many small databases. In this situation users either need access to an entire database or none at all, so the native CouchDB access permissions and/or a simple proxy would work quite well to secure data without the need for a more complicated authentication filter. This also means that replication is an all-or-nothing affair (I would not need to worry about partial replication of databases). There are other reasons why I lean towards many small databases, but these are probably the primary ones (i.e., many smaller databases are simpler for me to implement for the purposes of getting CouchDB into play).
>>
>> In this scenario most of the databases would be quite small (in the <1GB range), so we're not dealing with large data sets, and the ratio of users to databases is also fairly low.
>>
>> If users were to instead share one very large database (solely for the purpose of making things easier to cluster), they would usually only be accessing a very small portion of it (a lot of the data would really belong to many small sets of users and not be of interest to the user in question), and I would not want them to have any access to the remainder.
>>
>> Problems arise in my mind when I start thinking about many thousands of these small databases. What are the clustering implications? Am I going to be busier dealing with the reality of replicating thousands of smaller databases for fail-over than if I simply bit the bullet now and planned for a somewhat more complex setup? Are things like BigCouch really more suited to clustering (fewer) very large databases, or do they thrive in environments with many small databases?
>>
>>
>> Hopefully this will be enough information for anyone who wishes to chime in and give me some thoughts or other things to consider. I am not looking for specific solutions at this point but instead trying to weigh the pros and cons of moving in a particular direction.
>>
>>
>> Thanks very much,
>>
>> Matt
>
> Hi Matt, in general I would recommend the many small databases option for your situation.
> The only downside I see is that there's a limit to how many databases the server keeps open
> at any given time (default = 100).  After that it begins closing the LRU one.  If all 100
> open databases are in active use (i.e. in the middle of serving a request) CouchDB will
> respond with an error.  Only an issue if you anticipate massive concurrency.
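
As an aside, the limit described here is the max_dbs_open setting in the [couchdb] section of the server's ini files. A minimal sketch of checking and raising it through the 1.x _config HTTP API, assuming a local node and admin credentials, could look like this:

    # Hypothetical example: inspect and raise the open-database limit at
    # runtime via the _config API (the same key lives in the ini files).
    import json
    from urllib.request import Request, urlopen

    SERVER = "http://localhost:5984"   # placeholder; admin rights required

    # GET /_config/couchdb/max_dbs_open returns the current value, "100" by default.
    print(json.load(urlopen(SERVER + "/_config/couchdb/max_dbs_open")))

    # PUT a JSON-encoded string to change it.
    req = Request(SERVER + "/_config/couchdb/max_dbs_open",
                  data=json.dumps("1000").encode("utf-8"),
                  headers={"Content-Type": "application/json"},
                  method="PUT")
    urlopen(req)
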
>
> Triggering lots of individual replications can be a chore, but it's not too bad to loop
> over _all_dbs and fire off a job for each one.
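
A minimal sketch of that loop, with placeholder hostnames and assuming plain HTTP access, might look something like this:

    # Hypothetical example: list every database on the source server and
    # kick off a one-shot replication to the target for each of them.
    import json
    from urllib.request import Request, urlopen

    SOURCE = "http://source-couch:5984"   # placeholder hosts
    TARGET = "http://target-couch:5984"

    dbs = json.load(urlopen(SOURCE + "/_all_dbs"))   # GET /_all_dbs -> ["db1", ...]

    for db in dbs:
        if db.startswith("_"):
            continue   # skip system databases such as _users
        job = {"source": db, "target": TARGET + "/" + db, "create_target": True}
        req = Request(SOURCE + "/_replicate",
                      data=json.dumps(job).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        urlopen(req)   # POST /_replicate starts the replication

If the servers require authentication, credentials can be embedded in the source and target URLs passed to _replicate.
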
>
> BigCouch can deal with lots of small databases.  For instance, you might decide that the
> databases don't need to be sharded at all.  In that case you could create them with q=1 and
> BigCouch will evenly distribute the individual database files across the nodes in the cluster.
> Of course, you can still have your data stored redundantly regardless of the value of q that
> you choose.
>
> Best,
>
> Adam
>
>
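
To make the q=1 suggestion above concrete: on a BigCouch cluster the shard count is passed as a query parameter when the database is created. A rough sketch with placeholder host and database names follows; the n value (how many copies of each shard the cluster keeps) is an assumption about the desired redundancy:

    # Hypothetical example: create one single-shard (q=1) database per
    # tenant, with three copies of that shard (n=3) for redundancy.
    from urllib.request import Request, urlopen

    NODE = "http://bigcouch-node:5984"   # clustered interface of any node

    def create_db(name, q=1, n=3):
        req = Request("%s/%s?q=%d&n=%d" % (NODE, name, q, n), method="PUT")
        return urlopen(req).read()

    for tenant in ("tenant_a", "tenant_b"):
        create_db(tenant)

With q=1 each database is a single shard whose n copies BigCouch places across the cluster, which is what lets rarely used databases fall out of the disk cache as described at the top of this message.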
