couchdb-user mailing list archives

From Brian Candler <>
Subject Re: Managing synchronisation with an external data source
Date Wed, 25 Nov 2009 08:46:26 GMT
On Mon, Nov 23, 2009 at 01:10:26PM +1100, Patrick Barnes wrote:
> The external data is delivered as a file containing all the records for  
> each data type, daily.
> I need to keep the information in the database current:
>  - if there is a user defined in the current data that doesn't yet exist 
> in the database, a document needs to be created.
>  - if there is a feed-created user in the database that doesn't exist in  
> the current data, that user document needs to be set 'inactive'.
> Similar logic needs to exist for the groups and roles. (Whether roles  
> should be stored in the user document or separately, I'm not yet sure)
> Given that there are ~200k users, ~150k roles, and ~3k groups, how would  
> you suggest this update process be approached?

Sounds like you need a merge. Taking users as an example:

- have a couchdb view which emits users keyed by username
- sort the incoming feed so that it is also keyed by username
- take the first record from the view and the first record from the feed

Then repeat the following until both streams are exhausted:
- if they have identical usernames, advance to the next record in both
  the view and the feed
- if the view username < feed username, mark the view record 'inactive'
  and advance to the next view record
- if the view username > feed username, create a new user in the database
  and advance to the next feed record
When one stream runs out, any remaining view records are marked 'inactive'
and any remaining feed records become new users.
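The merge above can be sketched in Python. The callback names (mark_inactive,
create_user) are placeholders for whatever you do against CouchDB; both inputs
must already be sorted ascending by username:

```python
def merge(view_rows, feed_rows, mark_inactive, create_user):
    """Merge two username streams sorted in ascending order.

    view_rows: usernames currently in the database (from the view).
    feed_rows: usernames in today's feed file.
    """
    view_it = iter(view_rows)
    feed_it = iter(feed_rows)
    v = next(view_it, None)
    f = next(feed_it, None)
    while v is not None and f is not None:
        if v == f:                      # present in both: nothing to do
            v = next(view_it, None)
            f = next(feed_it, None)
        elif v < f:                     # in database but not in feed
            mark_inactive(v)
            v = next(view_it, None)
        else:                           # in feed but not in database
            create_user(f)
            f = next(feed_it, None)
    while v is not None:                # leftover database users
        mark_inactive(v)
        v = next(view_it, None)
    while f is not None:                # leftover feed users
        create_user(f)
        f = next(feed_it, None)
```

Because both inputs are iterators, only the current record from each side is
held in memory at any time.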

This solution uses constant RAM and scales indefinitely. Even though a
couchdb view generates a single JSON object, you can "stream" it easily
because each record within it is delimited by a newline.

OTOH, if your 200K users can be stored in an 'acceptable' amount of memory,
and you don't expect it to grow much larger, you could just read the whole
lot into RAM and process it there. At 1K per user you'd use 200MB of RAM,
which might be acceptable.
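In that case the whole merge collapses to two set differences (usernames
only shown here; a real run would carry the full records alongside):

```python
# Everything in RAM: compare the two username sets directly.
db_users = {"alice", "bob", "dave"}      # usernames from the view
feed_users = {"alice", "carol", "dave"}  # usernames from today's file

to_deactivate = db_users - feed_users    # in database, gone from feed
to_create = feed_users - db_users        # in feed, not yet in database
```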
