From: Lance Carlson
Date: Wed, 15 May 2013 02:26:09 -0400
Subject: Re: Mass updates
To: user@couchdb.apache.org

I use Redis to stick the docs in RAM. Once they're in RAM, I use node to
parse the docs into the shape I want, then purge the dataset. couchout
pulls them from CouchDB into Redis; couchin bulk_saves them from Redis back
into CouchDB. I tried to make the couchout/couchin tools language-agnostic.

Anyway, you can certainly use whatever language you want and load all of
the docs into memory. Typically, though, if you're working in a language
that isn't statically compiled, you'll run into situations where holding
the working set in Redis is more efficient.

On Wed, May 15, 2013 at 2:17 AM, James Marca wrote:

> On Mon, May 13, 2013 at 02:24:50AM -0400, Lance Carlson wrote:
> > Oops, urls:
> >
> > https://github.com/lancecarlson/couchin.go
> > https://github.com/lancecarlson/couchout.go
> >
> > Feedback appreciated!
>
> I don't understand the use case here, so I'd appreciate an example.
> If you can define a view or use _all_docs to pull docs from couch into
> redis, why use redis at all? Why not just use couch directly, load the
> docs into RAM, and process them?
>
> I feel like I'm missing something obvious.
>
> Also, I've never stressed Redis much. What happens when you bump up
> against RAM limits?
>
> James
>
> > On Mon, May 13, 2013 at 2:24 AM, Lance Carlson wrote:
> > > Made a lot of updates to my couchout project. It now includes a
> > > couchin project as well. I might create another project for updating,
> > > but it's pretty easy to write a node js script (or a script in any
> > > language, for that matter) that connects to redis and decodes and
> > > encodes base64.
> > >
> > > On Sat, May 11, 2013 at 2:27 AM, Andrey Kuprianov
> > > <andrey.kouprianov@gmail.com> wrote:
> > > > We do that, and we have a cron job to touch the views every 5
> > > > minutes. It's just that at that particular time we had to insert
> > > > those 150k docs in one go (we were migrating from mysql).
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On 11 May, 2013, at 1:02 PM, Benoit Chesneau wrote:
> > > > > On May 9, 2013 1:17 PM, "Andrey Kuprianov"
> > > > > <andrey.kouprianov@gmail.com> wrote:
> > > > > > Rebuilding the views mentioned by James is hell! And the more
> > > > > > docs and views you have, the longer your views will take to
> > > > > > catch up with the updates. We don't have the best of servers,
> > > > > > but ours (dedicated) took several hours to rebuild our views
> > > > > > (not too many of them, either) after we inserted ~150k
> > > > > > documents (we use full-text search with Lucene as well, so it
> > > > > > also contributed to the overall server slowdown).
> > > > > >
> > > > > > So my suggestion is:
> > > > > >
> > > > > > 1. Once you want to migrate your stuff, make a copy of your db.
> > > > > > 2. Do the migration on the copy.
> > > > > > 3. Allow the views to rebuild (you need to query one view in
> > > > > >    each design document once to trigger the views to start
> > > > > >    catching up with the updates). You'd probably ask whether
> > > > > >    it's possible to limit Couch's resource usage while views
> > > > > >    are rebuilding, but I don't have an answer to that question.
> > > > > >    Maybe someone else can help here...
> > > > > > 4. Switch the database pointer from one DB to the other.
> > > > >
> > > > > You don't need to wait until all the docs are there to trigger
> > > > > the view update, just trigger it more often, so the view
> > > > > calculation happens on a smaller set. You can even run it in
> > > > > parallel by using different ddocs.
> > > > >
> > > > > > On Thu, May 9, 2013 at 1:41 PM, Paul Davis
> > > > > > <paul.joseph.davis@gmail.com> wrote:
> > > > > > > On Wed, May 8, 2013 at 10:24 PM, Charles S. Koppelman-Milstein
> > > > > > > wrote:
> > > > > > > > I am trying to understand whether Couch is the way to go to
> > > > > > > > meet some of my organization's needs. It seems pretty
> > > > > > > > terrific. The main concern I have is maintaining a
> > > > > > > > consistent state across code releases. Presumably, our data
> > > > > > > > model will change over the course of time, and when it
> > > > > > > > does, we need to make the several million old documents
> > > > > > > > conform to the new model.
> > > > > > > >
> > > > > > > > Although I would love to pipe a view through an update
> > > > > > > > handler and call it a day, I don't believe that option
> > > > > > > > exists.
> > > > > > > > The two ways I understand how to do this are:
> > > > > > > >
> > > > > > > > 1. Query all documents, update each doc client-side, and
> > > > > > > >    send those changes through the _bulk_docs API (presumably
> > > > > > > >    this should be done in batches).
> > > > > > > > 2. Query the ids of all docs and, one at a time, PUT them
> > > > > > > >    through an update handler.
> > > > > > >
> > > > > > > You are correct that there's no server-side way to do the kind
> > > > > > > of migration you're asking for.
> > > > > > >
> > > > > > > The general pattern for these things is to write a view that
> > > > > > > only includes the documents that need to be changed, and then
> > > > > > > write something that goes through and processes each doc in
> > > > > > > the view into the desired form (which removes it from the
> > > > > > > view). That way you can easily know when you're done working.
> > > > > > > It's definitely possible to write something that stores state
> > > > > > > and/or just brute-forces a db scan each time you run the
> > > > > > > migration.
> > > > > > >
> > > > > > > Performance-wise, your first suggestion would probably be the
> > > > > > > most performant, although depending on document sizes and
> > > > > > > latencies it may be possible to get better numbers with an
> > > > > > > update handler. I doubt it, though, unless you have huge docs
> > > > > > > and a super slow connection with high latencies.
> > > > > > >
> > > > > > > > Are these options reasonably performant? If we have to do a
> > > > > > > > mass update once per deployment, it's not terrible if it's
> > > > > > > > not lightning-fast, but it shouldn't take terribly long.
> > > > > > > > Also, I have read that update handlers have indexes built
> > > > > > > > against them. If this is a fire-once option, is that
> > > > > > > > worthwhile?
> > > > > > >
> > > > > > > I'm not sure what you mean by update handlers having indexes
> > > > > > > built against them. That doesn't match anything that currently
> > > > > > > exists in CouchDB.
> > > > > > >
> > > > > > > > Which option is better? Is there an even better way?
> > > > > > >
> > > > > > > There's nothing better than the general ideas you listed.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charles
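
For anyone who, like James, wants a concrete example of the workflow Lance
describes above (pull docs out of CouchDB into Redis, transform them, then
bulk-save them back), here is a minimal sketch. It is not Lance's actual
couchout/couchin code; the database name, Redis key prefix, and the trivial
transform are invented for illustration, and it assumes Node 18+ (built-in
fetch) and the node-redis v4 client.

// Sketch only: stage CouchDB docs in Redis, transform, then _bulk_docs back.
// COUCH, DB, the "stage:" prefix, and the transform are illustrative.
const { createClient } = require('redis');

const COUCH = 'http://localhost:5984';
const DB = 'mydb';                       // hypothetical database name

async function stage(redis) {
  // Pull every doc into Redis, one key per doc (the couchout step).
  const res = await fetch(`${COUCH}/${DB}/_all_docs?include_docs=true`);
  const { rows } = await res.json();
  for (const row of rows) {
    // Base64-encode the JSON, as the thread mentions the tools do.
    const body = Buffer.from(JSON.stringify(row.doc)).toString('base64');
    await redis.set(`stage:${row.id}`, body);
  }
}

async function flush(redis) {
  // Read the staged docs back, transform them, and bulk-save (the couchin step).
  const keys = await redis.keys('stage:*');
  const docs = [];
  for (const key of keys) {
    const doc = JSON.parse(Buffer.from(await redis.get(key), 'base64').toString());
    doc.migrated = true;                 // stand-in for a real transformation
    docs.push(doc);
  }
  // _bulk_docs reuses each doc's _rev, so docs written by someone else
  // between stage() and flush() will come back as conflicts.
  await fetch(`${COUCH}/${DB}/_bulk_docs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ docs }),
  });
}

(async () => {
  const redis = createClient();
  await redis.connect();
  await stage(redis);
  await flush(redis);
  await redis.quit();
})();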
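
Paul's suggested pattern (a view that emits only the documents still needing
migration, drained in batches until it is empty) might look roughly like the
sketch below. The design-doc name, view name, and schema_version field are
placeholders; only the CouchDB endpoints (the view query and _bulk_docs) are
real, and Node 18+ is again assumed for the built-in fetch.

// Sketch of the "self-clearing view" migration Paul describes.
const COUCH = 'http://localhost:5984';
const DB = 'mydb';
const VIEW = `${COUCH}/${DB}/_design/migrations/_view/needs_v2`;
const BATCH = 500;

// The view's map function (stored in _design/migrations) would be roughly:
//   function (doc) {
//     if (!doc.schema_version || doc.schema_version < 2) emit(doc._id, null);
//   }
// Migrated docs stop being emitted, so an empty view means you're done.

async function migrateBatch() {
  const res = await fetch(`${VIEW}?limit=${BATCH}&include_docs=true`);
  const { rows } = await res.json();
  if (rows.length === 0) return false;    // nothing left to migrate

  const docs = rows.map(({ doc }) => ({
    ...doc,
    schema_version: 2,                    // stand-in for the real change
  }));

  const resp = await fetch(`${COUCH}/${DB}/_bulk_docs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ docs }),
  });
  // _bulk_docs reports per-doc conflicts; conflicted docs stay in the view
  // and are retried with a fresh _rev on the next pass.
  const results = await resp.json();
  console.log(`updated ${results.filter((r) => r.ok).length}/${docs.length}`);
  return true;
}

(async () => {
  while (await migrateBatch()) { /* keep going until the view is empty */ }
})();

Andrey's and Benoit's point about view rebuilds applies here as well:
querying a view is what triggers indexing, so hitting each design document's
view periodically (for example from cron) during a large migration lets the
indexes catch up on smaller sets instead of all at once at the end.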
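
For completeness, Charles's second option (pushing each document through an
update handler) would look something like the sketch below; the handler name
and the field it sets are again placeholders. An update function runs
server-side, receives the stored doc and the request, and returns the new doc
plus a response body, but it is still invoked once per document, which is why
Paul expects batched _bulk_docs to be faster.

// Update handler sketch (option 2 in Charles's message).
// In the design doc (e.g. _design/migrations):
//   "updates": {
//     "set-version": "function (doc, req) { if (!doc) return [null, 'missing']; doc.schema_version = 2; return [doc, 'ok']; }"
//   }
const COUCH = 'http://localhost:5984';
const DB = 'mydb';

async function updateOne(id) {
  // POST /db/_design/ddoc/_update/handler/docid runs the function server-side.
  const res = await fetch(
    `${COUCH}/${DB}/_design/migrations/_update/set-version/${encodeURIComponent(id)}`,
    { method: 'POST' }
  );
  return res.text();                      // 'ok' or 'missing'
}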