couchdb-user mailing list archives

From: Gary Smith <G...@primeexalia.com>
Subject: A couple of couchdb questions.
Date: Tue, 24 Mar 2009 21:14:23 GMT
Hello, 

I'm working to implement a document warehouse for about 15 million documents.  These documents
range from 10 KB to 500 KB (legacy archived PDFs).  Currently we do this by maintaining a MySQL
database, with the documents themselves stored across a variety of servers (about 6 TB in total).
Most of the problems we encounter are a) backups and b) physical access to the documents, as
they are on a private network.  Since not much changes, backups aren't really much of a problem
(but restoring is very slow).  We are now going to be adding documents regularly (about 10,000
per week), so we are looking to implement something new, or at least something more useful.

We thought about using Amazon S3 for storage, but these documents fall under HIPAA constraints,
so we have decided to do this in house.

Looking at CouchDB, it pretty much does what we are looking for.  We really only want to store
a document and maybe some very basic metadata (which we currently do by keeping both a PDF and
a metadata file).  Implementation doesn't look like a problem with the documented API.
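For what it's worth, this is roughly all we'd be doing per document (a quick sketch in Python
with the requests library; the hostname, database name, and metadata fields are just placeholders):

    import hashlib
    import requests

    COUCH = "http://couch-a.example.com:5984"   # placeholder host
    DB = "docstore"                             # placeholder database name

    def store_pdf(path, metadata):
        """Create a document holding the metadata, then attach the PDF."""
        pdf = open(path, "rb").read()
        doc_id = hashlib.md5(pdf).hexdigest()

        # 1. Create the document with the basic metadata.
        resp = requests.put("%s/%s/%s" % (COUCH, DB, doc_id), json=metadata)
        resp.raise_for_status()
        rev = resp.json()["rev"]

        # 2. Attach the PDF itself as a standalone attachment.
        resp = requests.put(
            "%s/%s/%s/document.pdf" % (COUCH, DB, doc_id),
            params={"rev": rev},
            data=pdf,
            headers={"Content-Type": "application/pdf"},
        )
        resp.raise_for_status()
        return doc_id

    store_pdf("archive/0001.pdf", {"source": "legacy-archive", "year": 1998})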

So, the questions.

I would like to break this down across multiple servers and incorporate replication at the same
time.  The documentation says that pull replication is recommended over push, but doesn't mention
why.  Does push replication require the slave (or other node) to accept the PUT/POST request
before the operation is considered complete?
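(As I read the docs, a push is just a POST to _replicate on the node that holds the data, with a
local source and a remote target.  A sketch, with placeholder hostnames:

    import requests

    # Local source, remote target = push replication.
    requests.post(
        "http://couch-a.example.com:5984/_replicate",  # placeholder host
        json={
            "source": "docstore",
            "target": "http://couch-b.example.com:5984/docstore",
        },
    ).raise_for_status()

Pull would simply swap the two.)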
 

If we choose pull replication instead of push, I assume this is something we will need to
schedule via crontab, or is there a background process that constantly syncs?  The API looks
like just a single HTTP request.
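If crontab is the answer, I picture the job being no more than this (sketch; pull = remote
source, local target; hostnames and the database name are placeholders):

    #!/usr/bin/env python
    # Run from cron on each node, e.g.: */15 * * * * /usr/local/bin/pull_replicate.py
    import requests

    LOCAL = "http://localhost:5984"
    PEER = "http://couch-b.example.com:5984"   # placeholder peer host
    DB = "docstore"                            # placeholder database name

    # Remote source, local target: this node pulls changes from its peer.
    resp = requests.post(
        "%s/_replicate" % LOCAL,
        json={"source": "%s/%s" % (PEER, DB), "target": DB},
    )
    resp.raise_for_status()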

Either way, here is what we are looking to do at this time.  At two separate locations we will
have multiple servers, set up in a master/master configuration.  We should not run into any
conflicts, as updates are not allowed.  IDs are unique (an MD5 checksum plus some other unique
information).
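For concreteness, an ID would be built something like this (the extra fields are only examples):

    import hashlib

    def make_doc_id(pdf_bytes, source_system, batch_no):
        """MD5 of the file contents plus some other unique information."""
        checksum = hashlib.md5(pdf_bytes).hexdigest()
        return "%s-%s-%06d" % (checksum, source_system, batch_no)

    # e.g. "9e107d9d372bb6826bd81d3542a419d6-scanner1-000042"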

We wanted to use 4 servers at each location, partly because each server has 4 TB of space
(actually 3 TB after RAID 5).  Each server will hold files based on the first hex digit of the
MD5 checksum (0-3 on server A, 4-7 on server B, 8-A on server C, and B-F on server D).  We were
thinking of using Apache's URL rewriting to proxy each request to the proper server.  This
should work for GET, PUT, and POST.
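The routing the rewrite rules would encode amounts to this (sketch; server names are placeholders):

    def shard_for(doc_id):
        """Pick the backend from the first hex digit of the MD5-based ID."""
        first = doc_id[0].upper()
        if first in "0123":
            return "couch-a.example.com"
        if first in "4567":
            return "couch-b.example.com"
        if first in "89A":
            return "couch-c.example.com"
        return "couch-d.example.com"   # B-F

On the Apache side I assume a RewriteCond on the first character of the document ID plus a
RewriteRule with the [P] flag would do the same thing.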

We will also have the backup servers at the second location (which will be the active servers
for that location), using the same scheme.

What would be most useful is a way to ensure that, before a commit is acknowledged on one server,
the document is guaranteed to have been replicated to a second box.
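In other words, something like this sequence, where the client only reports success once the
write exists on two boxes (sketch; placeholder hosts, and it assumes the _replicate call doesn't
return until the replication run finishes):

    import requests

    PRIMARY = "http://couch-a.example.com:5984"     # placeholder hosts
    SECONDARY = "http://couch-b.example.com:5984"
    DB = "docstore"

    def committed_put(doc_id, doc):
        """Only acknowledge a write once it has reached a second box."""
        # 1. Write to the primary.
        requests.put("%s/%s/%s" % (PRIMARY, DB, doc_id), json=doc).raise_for_status()

        # 2. Synchronously push to the secondary before acknowledging.
        requests.post(
            "%s/_replicate" % PRIMARY,
            json={"source": DB, "target": "%s/%s" % (SECONDARY, DB)},
        ).raise_for_status()
        return True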

Any ideas or suggestions on that?