From: Lance Carlson
Date: Thu, 9 May 2013 08:31:38 -0400
Message-ID: <8471588272415020431@unknownmsgid>
Subject: Re: Mass updates
To: "user@couchdb.apache.org"

This is a very common use case. I've been banging my head against a wall with
it a bit too.

I think my most ideal and optimal setup would be to stream all of my relevant
docs into Redis (the key is the ID of the document, the value is some JSON
blob). A million docs should only use 150 MB or so if they are average-sized
docs. Then grab the updated data source, update the docs that need updating,
attach a _deleted flag to the docs that aren't in the new data set anymore,
and create new keys for new docs. (I always try to come up with an ID naming
convention for my docs. If your new docs don't require IDs and you just want
Couch to generate them, it might be a good idea to make a Redis key that is
prefixed with "new" and a UUID.) Then run another batch script that collects
some number of documents at a time and bulk saves them back into Couch.
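Roughly, that round trip might look something like the sketch below (untested;
the db name, the transform() step, and the batch size are placeholders for
whatever your migration actually needs):

    import json
    import requests
    import redis

    COUCH = "http://localhost:5984/mydb"   # placeholder database URL
    r = redis.Redis()

    # 1. Stream the current docs into Redis, keyed by _id.
    rows = requests.get(COUCH + "/_all_docs",
                        params={"include_docs": "true"}).json()["rows"]
    for row in rows:
        r.set(row["id"], json.dumps(row["doc"]))

    # 2. Reconcile against the new data source, entirely in Redis.
    #    transform() is a placeholder for whatever your migration does.
    for key in r.scan_iter():
        doc = json.loads(r.get(key))
        doc = transform(doc)           # update docs that need updating
        # doc["_deleted"] = True       # flag docs missing from the new data set
        r.set(key, json.dumps(doc))

    # 3. Bulk-save everything back to Couch in batches via _bulk_docs.
    docs = [json.loads(r.get(k)) for k in r.scan_iter()]
    for i in range(0, len(docs), 1000):
        requests.post(COUCH + "/_bulk_docs", json={"docs": docs[i:i + 1000]})

For a really big db you'd want to page through _all_docs rather than pull
everything in one request, but the shape is the same.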
Perhaps your use case doesn't require bulk deleting like mine does, but when
the proposal to start creating new databases came up, I figured I'd include my
alternative method, since I've gone down that path before too and it can be a
pain in the arse to have to track which database is the most up to date.

Sent from my iPhone

On May 9, 2013, at 7:19 AM, Robert Newson wrote:

> http://wiki.apache.org/couchdb/How_to_deploy_view_changes_in_a_live_environment
>
>
> On 9 May 2013 12:16, Andrey Kuprianov wrote:
>> Rebuilding the views mentioned by James is hell! And the more docs and
>> views you have, the longer it takes for your views to catch up with the
>> updates. We don't have the best of servers, but ours (dedicated) took
>> several hours to rebuild our views (not that many of them, either) after
>> we inserted ~150k documents (we use full-text search with Lucene as well,
>> so it also contributed to the overall server slowdown).
>>
>> So my suggestion is:
>>
>> 1. Once you want to migrate your stuff, make a copy of your db.
>> 2. Do the migration on the copy.
>> 3. Allow the views to rebuild (you need to query a single view in each
>> design document once to trigger the views to start catching up with the
>> updates). You'd probably ask whether it is possible to limit Couch's
>> resource usage while views are rebuilding, but I don't have an answer to
>> that question. Maybe someone else can help here...
>> 4. Switch the database pointer from one DB to the other.
>>
>>
>> On Thu, May 9, 2013 at 1:41 PM, Paul Davis wrote:
>>
>>> On Wed, May 8, 2013 at 10:24 PM, Charles S. Koppelman-Milstein wrote:
>>>> I am trying to understand whether Couch is the way to go to meet some
>>>> of my organization's needs. It seems pretty terrific.
>>>> The main concern I have is maintaining a consistent state across code
>>>> releases. Presumably, our data model will change over the course of
>>>> time, and when it does, we need to make the several million old
>>>> documents conform to the new model.
>>>>
>>>> Although I would love to pipe a view through an update handler and call
>>>> it a day, I don't believe that option exists. The two ways I understand
>>>> to do this are:
>>>>
>>>> 1. Query all documents, update each doc client-side, and send those
>>>> changes to the _bulk_docs API (presumably this should be done in
>>>> batches).
>>>> 2. Query the IDs of all docs and, one at a time, PUT them through an
>>>> update handler.
>>>
>>> You are correct that there's no server-side way to do a migration like
>>> the one you're asking for.
>>>
>>> The general pattern for these things is to write a view that only
>>> includes the documents that need to be changed and then write something
>>> that goes through and processes each doc in the view into the desired
>>> form (which removes it from the view). This way you can easily know when
>>> you're done working. It's definitely possible to write something that
>>> stores state and/or just brute-forces a db scan each time you run the
>>> migration.
>>>
>>> Performance-wise, your first suggestion would probably be the most
>>> performant, although depending on document sizes and latencies it may be
>>> possible to get better numbers using an update handler. I doubt it,
>>> though, unless you have huge docs and a super slow connection with high
>>> latencies.
>>>
>>>> Are these options reasonably performant? If we have to do a mass update
>>>> once a deployment, it's not terrible if it's not lightning-speed, but it
>>>> shouldn't take terribly long.
>>>> Also, I have read that update handlers have indexes built against
>>>> them. If this is a fire-once option, is that worthwhile?
>>>
>>> I'm not sure what you mean by update handlers having indexes built
>>> against them. That doesn't match anything that currently exists in
>>> CouchDB.
>>>
>>>> Which option is better? Is there an even better way?
>>>
>>> There's nothing better than the general ideas you've listed.
>>>
>>>> Thanks,
>>>> Charles
>>>
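For reference, a rough sketch of the view-driven migration pattern Paul
describes might look like this. It's hedged and untested: the
_design/needs_migration name, the schema_version field, and the migrate()
transform are all hypothetical stand-ins, and it assumes Python with the
requests library against a db called mydb.

    import requests

    DB = "http://localhost:5984/mydb"   # placeholder database URL

    # A view that only emits docs still in the old format. The
    # schema_version field is a hypothetical marker for "already migrated".
    design = {
        "_id": "_design/needs_migration",
        "views": {
            "todo": {
                "map": "function (doc) {"
                       "  if (!doc.schema_version) { emit(doc._id, null); }"
                       "}"
            }
        }
    }
    requests.post(DB, json=design)

    def migrate(doc):
        # Placeholder transform: whatever brings a doc up to the new model.
        doc["schema_version"] = 2
        return doc

    # Drain the view in batches. Migrated docs stop being emitted, so the
    # view going empty means the migration is finished.
    while True:
        rows = requests.get(DB + "/_design/needs_migration/_view/todo",
                            params={"limit": 500, "include_docs": "true"}
                            ).json()["rows"]
        if not rows:
            break
        docs = [migrate(row["doc"]) for row in rows]
        requests.post(DB + "/_bulk_docs", json={"docs": docs})

The nice property is the one Paul points out: migrated docs drop out of the
view, so the empty view is your "done" signal and doubles as the progress
tracker between runs.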
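And for Andrey's copy-and-switch suggestion, the copy (step 1) and the view
warm-up (step 3) can be scripted along these lines (again just a sketch;
mydb and mydb_v2 are placeholder database names):

    import requests

    SERVER = "http://localhost:5984"
    SRC, DST = "mydb", "mydb_v2"        # placeholder database names

    # 1. Copy the database with a one-off replication.
    requests.put(SERVER + "/" + DST)
    requests.post(SERVER + "/_replicate", json={"source": SRC, "target": DST})

    # 2. ... run the migration against mydb_v2 here ...

    # 3. Warm the views: query one view in each design document so the
    #    indexers start catching up; limit=0 avoids pulling rows back.
    ddocs = requests.get(SERVER + "/" + DST + "/_all_docs",
                         params={"startkey": '"_design/"',
                                 "endkey": '"_design0"'}).json()["rows"]
    for row in ddocs:
        ddoc = requests.get(SERVER + "/" + DST + "/" + row["id"]).json()
        views = sorted(ddoc.get("views", {}))
        if views:
            requests.get(SERVER + "/" + DST + "/" + row["id"]
                         + "/_view/" + views[0], params={"limit": 0})

    # 4. Point the application at mydb_v2 once the view queries return.

A one-shot POST to /_replicate blocks until the copy finishes, so for a very
large db the _replicator database may be more comfortable, but either way the
limit=0 view queries are what kick off the index builds.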