From: Lance Carlson
Date: Thu, 9 May 2013 08:31:38 -0400
Message-ID: <8471588272415020431@unknownmsgid>
Subject: Re: Mass updates
To: "user@couchdb.apache.org"

This is a very common use case. I've been banging my head against a wall with
it a bit too.

I think my most ideal and optimal setup would be to stream all of my relevant
docs into Redis (the key is the ID of the document, the value is some JSON
blob). A million docs should only use 150 MB or so if they are average-sized
docs. Then grab the updated data source, update the docs that need updating,
attach a _deleted flag to the docs that aren't in the new data set anymore,
and create new keys for new docs. (I always try to come up with an ID naming
convention for my docs. If your new docs don't require IDs and you just want
Couch to generate them, it might be a good idea to make a Redis key that is
prefixed with "new" and a UUID.) Then run another batch script that collects
some number of documents at a time and bulk saves them back into Couch.
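Roughly, that round trip might look something like the sketch below (untested;
the db name, the transform() step, and the batch size are placeholders for
whatever your migration actually needs):

    import json
    import requests
    import redis

    COUCH = "http://localhost:5984/mydb"   # placeholder database URL
    r = redis.Redis()

    # 1. Stream the current docs into Redis, keyed by _id.
    rows = requests.get(COUCH + "/_all_docs",
                        params={"include_docs": "true"}).json()["rows"]
    for row in rows:
        r.set(row["id"], json.dumps(row["doc"]))

    # 2. Reconcile against the new data source, entirely in Redis.
    #    transform() is a placeholder for whatever your migration does.
    for key in r.scan_iter():
        doc = json.loads(r.get(key))
        doc = transform(doc)           # update docs that need updating
        # doc["_deleted"] = True       # flag docs missing from the new data set
        r.set(key, json.dumps(doc))

    # 3. Bulk-save everything back to Couch in batches via _bulk_docs.
    docs = [json.loads(r.get(k)) for k in r.scan_iter()]
    for i in range(0, len(docs), 1000):
        requests.post(COUCH + "/_bulk_docs", json={"docs": docs[i:i + 1000]})

For a really big db you'd want to page through _all_docs rather than pull
everything in one request, but the shape is the same.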
Perhaps your use case doesn't require bulk deleting like mine does, but when
the proposal to start creating new databases came up, I figured I'd include my
alternative method, since I've gone down that path before too and it can be a
pain in the arse to have to track which database is the most up to date.

Sent from my iPhone

On May 9, 2013, at 7:19 AM, Robert Newson wrote:

> http://wiki.apache.org/couchdb/How_to_deploy_view_changes_in_a_live_environment
>
>
> On 9 May 2013 12:16, Andrey Kuprianov wrote:
>> Rebuilding the views mentioned by James is hell! And the more docs and
>> views you have, the longer it takes for your views to catch up with the
>> updates. We don't have the best of servers, but ours (dedicated) took
>> several hours to rebuild our views (not that many of them, either) after
>> we inserted ~150k documents (we use full-text search with Lucene as well,
>> so it also contributed to the overall server slowdown).
>>
>> So my suggestion is:
>>
>> 1. Once you want to migrate your stuff, make a copy of your db.
>> 2. Do the migration on the copy.
>> 3. Allow the views to rebuild (you need to query a single view in each
>> design document once to trigger the views to start catching up with the
>> updates). You'd probably ask whether it is possible to limit Couch's
>> resource usage while views are rebuilding, but I don't have an answer to
>> that question. Maybe someone else can help here...
>> 4. Switch the database pointer from one DB to the other.
>>
>>
>> On Thu, May 9, 2013 at 1:41 PM, Paul Davis wrote:
>>
>>> On Wed, May 8, 2013 at 10:24 PM, Charles S. Koppelman-Milstein wrote:
>>>> I am trying to understand whether Couch is the way to go to meet some
>>>> of my organization's needs. It seems pretty terrific.
>>>> The main concern I have is maintaining a consistent state across code
>>>> releases. Presumably, our data model will change over the course of
>>>> time, and when it does, we need to make the several million old
>>>> documents conform to the new model.
>>>>
>>>> Although I would love to pipe a view through an update handler and call
>>>> it a day, I don't believe that option exists. The two ways I understand
>>>> to do this are:
>>>>
>>>> 1. Query all documents, update each doc client-side, and send those
>>>> changes to the _bulk_docs API (presumably this should be done in
>>>> batches).
>>>> 2. Query the IDs of all docs and, one at a time, PUT them through an
>>>> update handler.
>>>
>>> You are correct that there's no server-side way to do a migration like
>>> the one you're asking for.
>>>
>>> The general pattern for these things is to write a view that only
>>> includes the documents that need to be changed and then write something
>>> that goes through and processes each doc in the view into the desired
>>> form (which removes it from the view). This way you can easily know when
>>> you're done working. It's definitely possible to write something that
>>> stores state and/or just brute-forces a db scan each time you run the
>>> migration.
>>>
>>> Performance-wise, your first suggestion would probably be the most
>>> performant, although depending on document sizes and latencies it may be
>>> possible to get better numbers using an update handler. I doubt it,
>>> though, unless you have huge docs and a super slow connection with high
>>> latencies.
>>>
>>>> Are these options reasonably performant? If we have to do a mass update
>>>> once a deployment, it's not terrible if it's not lightning-speed, but it
>>>> shouldn't take terribly long.
>>>> Also, I have read that update handlers have indexes built against
>>>> them. If this is a fire-once option, is that worthwhile?
>>>
>>> I'm not sure what you mean by update handlers having indexes built
>>> against them. That doesn't match anything that currently exists in
>>> CouchDB.
>>>
>>>> Which option is better? Is there an even better way?
>>>
>>> There's nothing better than the general ideas you've listed.
>>>
>>>> Thanks,
>>>> Charles
>>>
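For reference, a rough sketch of the view-driven migration pattern Paul
describes might look like this. It's hedged and untested: the
_design/needs_migration name, the schema_version field, and the migrate()
transform are all hypothetical stand-ins, and it assumes Python with the
requests library against a db called mydb.

    import requests

    DB = "http://localhost:5984/mydb"   # placeholder database URL

    # A view that only emits docs still in the old format. The
    # schema_version field is a hypothetical marker for "already migrated".
    design = {
        "_id": "_design/needs_migration",
        "views": {
            "todo": {
                "map": "function (doc) {"
                       "  if (!doc.schema_version) { emit(doc._id, null); }"
                       "}"
            }
        }
    }
    requests.post(DB, json=design)

    def migrate(doc):
        # Placeholder transform: whatever brings a doc up to the new model.
        doc["schema_version"] = 2
        return doc

    # Drain the view in batches. Migrated docs stop being emitted, so the
    # view going empty means the migration is finished.
    while True:
        rows = requests.get(DB + "/_design/needs_migration/_view/todo",
                            params={"limit": 500, "include_docs": "true"}
                            ).json()["rows"]
        if not rows:
            break
        docs = [migrate(row["doc"]) for row in rows]
        requests.post(DB + "/_bulk_docs", json={"docs": docs})

The nice property is the one Paul points out: migrated docs drop out of the
view, so the empty view is your "done" signal and doubles as the progress
tracker between runs.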
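And for Andrey's copy-and-switch suggestion, the copy (step 1) and the view
warm-up (step 3) can be scripted along these lines (again just a sketch;
mydb and mydb_v2 are placeholder database names):

    import requests

    SERVER = "http://localhost:5984"
    SRC, DST = "mydb", "mydb_v2"        # placeholder database names

    # 1. Copy the database with a one-off replication.
    requests.put(SERVER + "/" + DST)
    requests.post(SERVER + "/_replicate", json={"source": SRC, "target": DST})

    # 2. ... run the migration against mydb_v2 here ...

    # 3. Warm the views: query one view in each design document so the
    #    indexers start catching up; limit=0 avoids pulling rows back.
    ddocs = requests.get(SERVER + "/" + DST + "/_all_docs",
                         params={"startkey": '"_design/"',
                                 "endkey": '"_design0"'}).json()["rows"]
    for row in ddocs:
        ddoc = requests.get(SERVER + "/" + DST + "/" + row["id"]).json()
        views = sorted(ddoc.get("views", {}))
        if views:
            requests.get(SERVER + "/" + DST + "/" + row["id"]
                         + "/_view/" + views[0], params={"limit": 0})

    # 4. Point the application at mydb_v2 once the view queries return.

A one-shot POST to /_replicate blocks until the copy finishes, so for a very
large db the _replicator database may be more comfortable, but either way the
limit=0 view queries are what kick off the index builds.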