couchdb-user mailing list archives

From James Marca <jma...@translab.its.uci.edu>
Subject Re: Mass updates
Date Thu, 09 May 2013 05:18:12 GMT
On Wed, May 08, 2013 at 11:24:37PM -0400, Charles S. Koppelman-Milstein wrote:
> I am trying to understand whether Couch is the way to go to meet some of
> my organization's needs.  It seems pretty terrific.
> The main concern I have is maintaining a consistent state across code
> releases.  Presumably, our data model will change over the course of
> time, and when it does, we need to make the several million old
> documents conform to the new model.
> 
> Although I would love to pipe a view through an update handler and call
> it a day, I don't believe that option exists.  The two ways I
> understand to do this are:
> 
> 1. Query all documents, update each doc client-side, and PUT those
> changes in the _bulk_docs API (presumably this should be done in batches)
> 2. Query the ids for all docs, and one at a time, PUT them through an
> update handler

I don't see much difference between those two options, but what I
would do is something like the following, in node or perl or java or
whatever you like. (I have node.js code that does something very
similar, so I am cutting and pasting small bits of it below.)

(I apologize that this sort of rambles and may be less helpful than nothing.)

  Pick a batchsize; start with 100 or so and ramp up. Depending on how
  big your documents are, asking for too many docs at once could be a RAM issue.

  var batchsize = 100
  var querysize = batchsize + 1 // (I borrowed this trick from an old posting by jchris, I think)
  var query = { limit: querysize
              , include_docs: true
              }
  var state = get_state_from_couchdb() // use couchdb to store progress
  if (state) query.startkey = state    // first run has no saved state

  function get_docs (query, callback) { // generic boilerplate to send
                                        // a get request to couchdb
  }

  get_docs(query, function (err, resp) {
    if (err) throw err
    var rows = resp.rows
    // pop the "plus one" row just to get its key
    var last_fetched = rows.pop()
    // save that key to couchdb as the next startkey, for this or
    // some other process to resume from
    save_state_to_couchdb(last_fetched.key)
    process_rows(rows)
  })
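
For what it's worth, that boilerplate can be as small as this, using
node's bare http module (the host, port, and database name 'mydb' here
are assumptions, substitute your own):

  var http = require('http')

  function get_docs (query, callback) {
    var path = '/mydb/_all_docs?include_docs=true&limit=' + query.limit
    if (query.startkey) {
      // startkey has to be JSON-encoded in the query string
      path += '&startkey=' + encodeURIComponent(JSON.stringify(query.startkey))
    }
    http.get({ host: 'localhost', port: 5984, path: path }, function (res) {
      var body = ''
      res.on('data', function (chunk) { body += chunk })
      res.on('end', function () { callback(null, JSON.parse(body)) })
    }).on('error', callback)
  }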

Then as you said, when you are done you can use _bulk_docs to put the
new docs back into the old database, or probably better, write them to
a new database, so that you can keep your old database pristine in
case you break something along the way and want to start over.
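
The _bulk_docs call itself is just a POST of {"docs": [...]}. Something
like this (the target database 'mydb_v2' is hypothetical):

  function put_bulk_docs (docs, callback) {
    var req = http.request({ host: 'localhost', port: 5984
                           , path: '/mydb_v2/_bulk_docs' // hypothetical new db
                           , method: 'POST'
                           , headers: { 'Content-Type': 'application/json' }
                           }, function (res) {
      var body = ''
      res.on('data', function (chunk) { body += chunk })
      res.on('end', function () { callback(null, JSON.parse(body)) })
    })
    req.on('error', callback)
    req.end(JSON.stringify({ docs: docs }))
  }

One gotcha if you write to a fresh database: delete each doc's _rev
before the POST, or couch will reject every write as a conflict.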

This is slow.  No way around that.  Even if you could use some sort of
view to update your db in place, it would be dangerous, and it would
still be slow.

If the per-document processing time is high, you can speed things up
by running a second or third worker using the "couchdb as state
machine" trick, but probably the doc updating will be super quick and
the limiter will be disk I/O, so one worker is safest.  I'd still use
the state-machine trick so you can stop and restart without pulling
your hair out.
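
The state-machine trick is nothing fancy: a well-known doc in couch
holding the last key a worker finished, which you read back at
startup.  Roughly like this, glossing over the async plumbing (the
state db and doc id are made up):

  // read-modify-write a well-known state doc
  function save_state_to_couchdb (last_key) {
    http.get({ host: 'localhost', port: 5984, path: '/mydb_state/job_state' },
      function (res) {
        var body = ''
        res.on('data', function (chunk) { body += chunk })
        res.on('end', function () {
          // a 404 on the first run just means there is no state doc yet
          var doc = (res.statusCode === 200) ? JSON.parse(body) : { _id: 'job_state' }
          doc.last_key = last_key
          var req = http.request({ host: 'localhost', port: 5984
                                 , path: '/mydb_state/job_state'
                                 , method: 'PUT'
                                 , headers: { 'Content-Type': 'application/json' }
                                 })
          req.end(JSON.stringify(doc))
        })
      })
  }
  // get_state_from_couchdb is the mirror image: GET the doc and hand
  // back doc.last_key, or null on a 404

If two workers share a state doc, couch's MVCC does the locking for
you: the loser of a conflicting PUT gets a 409 and just re-reads and
retries.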

And keep in mind that once you update each doc, then all of your views
will need to get rebuilt against the entire db...there is no way for
the view to know that your change was trivial, etc.  Another reason to
keep the old db in place until your new version's views have all been
rebuilt.

Another option is to ignore the bulk update, and just store a version
tag in the documents.  If the document is version 1.1, and it should
be 4.3, then you know you have to update that document before you do
anything crazy, but it may be that you don't need to do anything...it
is application specific.  If I'm doing traffic counts and version 1.1
of a doc has fields a, b, and c, and 1.2 adds a new field 'd', I can't
go back and collect 'd' for the older counts, so I don't bother
changing the old docs.  Instead, if a view needs that 'd' field, I make
sure the version check for 1.2 passes inside of that view.
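
Concretely, that check is just a guard at the top of the map function.
A sketch with the field names from my example (the version comparison
is simplified):

  function (doc) {
    // only docs at version 1.2 or later actually carry the 'd' field
    if (doc.version && parseFloat(doc.version) >= 1.2) {
      emit(doc._id, doc.d)
    }
  }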

Hope that helps with your decision.

James Marca

> 
> Are these options reasonably performant?  If we have to do a mass
> update once per deployment, it's not terrible if it's not
> lightning-fast, but it shouldn't take terribly long.  Also, I have
> read that update handlers have indexes built against them.  If this
> is a fire-once option, is that worthwhile?
> 
> Which option is better?  Is there an even better way?
> 
> Thanks,
> Charles

