Date: Tue, 7 Apr 2009 16:18:06 +1100
Subject: Re: How to implement bulk loading with a "foreign key" involved?
From: David Mitchell
To: user@couchdb.apache.org

Thanks Paul,

Sounds like I was worrying about a non-problem - wouldn't be the first time!

I'll go ahead and give it a shot.

Thanks and regards

Dave M.

2009/4/7 Paul Davis

> On Mon, Apr 6, 2009 at 8:54 PM, David Mitchell wrote:
> > Hello everyone,
> >
> > I'm having trouble working out how to implement bulk loading of data in a
> > particular application I'm working on.
> >
> > Making it as simple to understand as possible, I've got two types of
> > document that need to be stored:
> > - "accounts", which includes all the information about a user account (e.g.
> >   name of the user, account ID, address, account creation date, ...)
> > - "transactions", which are tied to a specific account. Transaction
> >   information would include e.g. a transaction ID, account ID, transaction
> >   date, ... Importantly, I don't know which fields I'll receive in my
> >   "transaction" records, so I need a schema-less storage model
> >
> > If this was SQL, I'd have 2 separate tables, with a foreign key from each
> > "transactions" record pointing to a record in "accounts". Nice and simple
> > for my SQL-trained brain to work with, but the data I have to work with is
> > inherently schema-less so a RDBMS isn't going to work.
> >
> > With CouchDB, I've got this data going into a single database. This seems
> > to be the accepted best practice, and makes sense for this specific
> > application. This works fine now, but is taking too long - I can't keep up
> > with the rate of incoming data as long as I'm loading it in one record at a
> > time.
> >
> > Assume the following naming conventions:
> > - A1 is account number 1, A2 is account number 2, ...
> > - T1A1 is the first transaction against account number 1, T3A4 is the third
> >   transaction against account number 4
> >
> > The data I'm loading may come in the following sequence: A1, T1A1, A2, T2A1,
> > A3, T3A1, T1A2, T1A3, A4, T2A2, ... In other words, I'm receiving new
> > account data intermixed with new transaction data. I'll never receive a
> > transaction for an account that doesn't already exist. Again, nothing
> > unusual for a real life application.
> >
> > I'd really like to be bulk-loading in the data, as the need to load it
> > quickly overrides all other requirements at this point. However, as I
> > understand it, bulk loading the data will require that accounts already
> > exist for any transactions, and that's difficult given the intermixing of
> > account and transaction data coming in.
> >
>
> There is absolutely nothing in CouchDB that would prevent you from
> storing a transaction that referenced a non-existent account. CouchDB
> does absolutely no referential integrity checks.
>
> > One possibility is that I could conceivably force the end of a bulk load
> > "transaction" every time I see a new account number; doing that would ensure
> > that I'm never trying to generate a transaction against an account that
> > isn't already in the database. However, I'm wondering if this is the best
> > way of dealing with this situation, which is presumably fairly common.
> >
>
> If nothing in your logic requires editing of a previous record then
> there's nothing you need to worry about.
> You'd just put a buffer in
> front of couchdb that waits for N docs or M seconds and then throws the
> buffer at _bulk_docs.
>
> It sounds like you're going to want to look into the "reduce
> transactions against account" stuff which basically involves a map
> function like:
>
>     function(doc) {
>         if(doc.type == "account") emit([doc.account_id], null);
>         if(doc.type == "transaction") emit([doc.account_id, doc.transaction_id], null);
>     }
>
> Your reduce would include whatever domain specific stuff you need to
> generate a single view of the account. The general example for this is
> bank transactions in that the reduce gives you a current balance of
> the account.
>
> > Any thoughts/ideas/suggestions welcome.
> >
> > Thanks in advance
> >
> > Dave M.
> >
>
> HTH,
> Paul Davis
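
The "wait for N docs or M seconds" buffer suggested in the reply could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from the thread: createBulkBuffer, send, maxDocs and maxMs are hypothetical names, and send stands in for whatever HTTP client POSTs the body to the database's _bulk_docs URL. The only CouchDB-specific detail relied on is that _bulk_docs accepts a JSON body of the form {"docs": [...]}.

```javascript
// Hypothetical helper: collects docs and hands them to `send` in batches.
// `send` is an assumed, caller-supplied function that POSTs its argument
// as JSON to <database>/_bulk_docs; it is injected so no server is needed here.
function createBulkBuffer(send, maxDocs, maxMs) {
  let docs = [];
  let timer = null;

  // Send whatever is buffered and cancel any pending timed flush.
  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (docs.length === 0) return;
    const batch = docs;
    docs = [];
    send({ docs: batch });
  }

  // Buffer one doc; flush when the batch is full, otherwise make sure a
  // timer is running so the batch goes out after at most maxMs milliseconds.
  function add(doc) {
    docs.push(doc);
    if (docs.length >= maxDocs) {
      flush();
    } else if (timer === null) {
      timer = setTimeout(flush, maxMs);
    }
  }

  return { add, flush };
}

// Usage with a stand-in `send` that only records the request bodies.
const sent = [];
const buffer = createBulkBuffer((body) => sent.push(body), 3, 1000);
["A1", "T1A1", "A2", "T2A1"].forEach((id) => buffer.add({ _id: id }));
buffer.flush(); // push out the final partial batch
// sent now holds two request bodies: one with 3 docs, one with 1
```

Accounts and transactions go into the same batch in arrival order. As the reply points out, CouchDB performs no referential integrity checks, so a batch boundary does not need to fall at each new account.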