incubator-couchdb-user mailing list archives

From Brian Candler <B.Cand...@pobox.com>
Subject Re: Bulk CSV import?
Date Fri, 15 Jan 2010 08:45:54 GMT
On Thu, Jan 14, 2010 at 02:15:54PM -0800, Chris Anderson wrote:
> > More difficult would be to allow bulk *updates* via this mechanism, because
> > having parsed out the IDs you'd need to be able to fetch existing docs,
> > modify and write back.
> >
> 
> If the CSV source was responsible for tracking _revs then it could work easily.

What I mean is, couchdb itself can't parse out the _id and _rev as the
stream comes in (since the CSV parsing isn't built into couchdb), so it
can't pre-fetch the docs. The updater function would have to bounce the
doc fetch requests back to couchdb core, e.g.:


data over HTTP            opaque data
-------------> couchdb ----------------> updater function
                         _ids and _revs
                       <----------------
                          original docs
                       ---------------->
                          updated docs
                       <----------------

But if we allow streaming that's going to be awkward; the 'opaque data'
stream may have to be interleaved with the 'original docs' stream.
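
To make that concrete, here's a purely hypothetical sketch of such an
updater (couchjs has nothing like a fetchDocs() callback today, and
parseCsvLine() is imaginary):

// Hypothetical only: fetchDocs() and parseCsvLine() don't exist;
// this just shows the shape of the round trip.
function (csvChunk, fetchDocs) {
  var updates = csvChunk.split("\n").filter(Boolean).map(parseCsvLine);

  // The "bounce back": ask couchdb core for the current docs by _id.
  var originals = fetchDocs(updates.map(function (u) { return u._id; }));

  // Merge the CSV fields onto the fetched docs and hand them back
  // to couchdb for saving.
  return updates.map(function (u, i) {
    var doc = originals[i];
    for (var field in u) doc[field] = u[field];
    return doc;
  });
}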

Then after updating the docs, what is couchdb going to do with the results
of each save, i.e. success/fail and new _revs? It could send them back to
the client in JSON format like the result of a _bulk_save, but that won't
mean much to most users. So you probably also want:

                         save statuses
                       ---------------->
                        response stream or HTML status page
                       <----------------
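
For reference, the raw _bulk_save reply being passed back is an array
along these lines, which is fine for a program but not much use to
someone who has just uploaded a spreadsheet:

// Typical _bulk_docs reply: one entry per submitted doc
// (illustrative ids and revs).
[
  {"id": "widget-1", "rev": "2-7051cbe5c8faecd085a3fa619e6e6337"},
  {"id": "widget-2", "error": "conflict",
   "reason": "Document update conflict."}
]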

If you want to stream all this, and you don't want couchjs functions to be
able to make asynchronous callbacks to couchdb, you could run three separate
couchjs processes in parallel:

data over HTTP            opaque data
-------------> couchdb ----------------> parser function
                         _ids,_revs and updates
                       <----------------

                         JSON docs+updates
                       ----------------> updater function
                          updated docs
                       <----------------

                          doc statuses
                       ----------------> results list function
                          opaque data
                       <----------------

Maybe there's a way to do this multipass load using some sort of staging
docs in the database itself. Imagine saving '_bulk_docs' requests and
responses as docs themselves, then spooling them out using a list function.
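
As a rough sketch, assuming each staging doc kept the _bulk_docs reply
under a made-up 'results' field and a view emitted those docs as its
values, the list function could spool the statuses back out like this:

// Sketch only: the 'results' field and the view feeding this list
// function are assumptions, but start()/send()/getRow() are the
// standard list function API.
function (head, req) {
  start({"headers": {"Content-Type": "text/plain"}});
  send("id,status,rev_or_reason\n");
  var row;
  while ((row = getRow())) {
    row.value.results.forEach(function (r) {
      if (r.error) send([r.id, "error", r.reason].join(",") + "\n");
      else         send([r.id, "ok", r.rev].join(",") + "\n");
    });
  }
}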

It could be simpler without streaming:

                      -------> blob
                      <------- _all_docs request
                      -------> _all_docs response
                      <------- _bulk_save request
                      -------> _bulk_save response
                      <------- blob

That would let you import 10MB of data via a couchapp, but for 10GB you'd
need a custom app in front.
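
For the non-streaming case the client side is just two POSTs against
the stock HTTP API. Something like this browser-side sketch (naive CSV
splitting, synchronous XHR, made-up column names, no error handling)
would cover the 10MB case:

function post(path, body) {
  var xhr = new XMLHttpRequest();
  xhr.open("POST", path, false);   // synchronous, for brevity
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.send(JSON.stringify(body));
  return JSON.parse(xhr.responseText);
}

function importCsv(db, csv) {
  var lines = csv.split("\n").filter(Boolean);
  var header = lines.shift().split(",");   // e.g. "_id,price"
  var updates = lines.map(function (line) {
    var cells = line.split(","), doc = {};
    header.forEach(function (h, i) { doc[h] = cells[i]; });
    return doc;
  });

  // _all_docs request/response: fetch the current _revs for the CSV ids.
  var existing = post("/" + db + "/_all_docs?include_docs=true",
                      {keys: updates.map(function (u) { return u._id; })});
  existing.rows.forEach(function (row, i) {
    if (row.doc) updates[i]._rev = row.doc._rev;   // update, not create
  });

  // _bulk_save request/response: write everything back in one go.
  return post("/" + db + "/_bulk_docs", {docs: updates});
}

For 10GB you'd do the same thing but chunk the CSV and repeat the
_all_docs/_bulk_docs round trip per chunk, which is the sort of thing
the custom app in front would handle.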
