Date: Tue, 7 Apr 2009 11:54:23 +1100
Subject: How to implement bulk loading with a "foreign key" involved?
From: David Mitchell
To: user@couchdb.apache.org

Hello everyone,

I'm having trouble working out how to implement bulk loading of data in a particular application I'm working on.

To keep it as simple as possible, I've got two types of document that need to be stored:

- "accounts", which include all the information about a user account (e.g. name of the user, account ID, address, account creation date, ...)
- "transactions", which are tied to a specific account. Transaction information would include e.g. a transaction ID, account ID, transaction date, ... Importantly, I don't know which fields I'll receive in my "transaction" records, so I need a schema-less storage model.

If this were SQL, I'd have 2 separate tables, with a foreign key from each "transactions" record pointing to a record in "accounts". Nice and simple for my SQL-trained brain to work with, but the data I have to work with is inherently schema-less, so an RDBMS isn't going to work.

With CouchDB, I've got this data going into a single database. This seems to be the accepted best practice, and makes sense for this specific application. It works at the moment, but loading is too slow: I can't keep up with the rate of incoming data as long as I'm loading it one record at a time.

Assume the following naming conventions:

- A1 is account number 1, A2 is account number 2, ...
- T1A1 is the first transaction against account number 1, T3A4 is the third transaction against account number 4

The data I'm loading may come in the following sequence:

A1, T1A1, A2, T2A1, A3, T3A1, T1A2, T1A3, A4, T2A2, ...

In other words, I'm receiving new account data intermixed with new transaction data. I'll never receive a transaction for an account that doesn't already exist. Again, nothing unusual for a real-life application.

I'd really like to bulk-load the data, as the need to load it quickly overrides all other requirements at this point. However, as I understand it, bulk loading will require that accounts already exist for any transactions, and that's difficult given the intermixing of account and transaction data coming in.

One possibility is to force the end of a bulk load "transaction" every time I see a new account number; doing that would ensure that I'm never trying to generate a transaction against an account that isn't already in the database. However, I'm wondering if this is the best way of dealing with this situation, which is presumably fairly common.

Any thoughts/ideas/suggestions welcome.

Thanks in advance

Dave M.
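
For reference, here is a rough Python sketch of the "end the bulk load at each new account" idea described above. It is only an illustration under assumptions: the "type" and "account" field names, the parse() helper, the batch size limit and the database URL are all hypothetical, and the only CouchDB feature relied on is the _bulk_docs endpoint, which accepts a JSON body of the form {"docs": [...]}.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator, List


def parse(label: str) -> Dict[str, str]:
    """Turn a label such as 'A2' or 'T3A4' into a hypothetical document."""
    if label.startswith("T"):
        # 'T3A4' -> transaction T3A4 against account A4
        return {"_id": label, "type": "transaction",
                "account": label[label.index("A"):]}
    return {"_id": label, "type": "account"}


def batches(records: Iterable[dict], max_size: int = 1000) -> Iterator[List[dict]]:
    """Group records into batches, closing a batch right after each account."""
    batch: List[dict] = []
    for rec in records:
        batch.append(rec)
        # An account record ends the current batch, so the account is sent
        # before any later batch containing transactions that refer to it.
        if rec["type"] == "account" or len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_bulk(db_url: str, docs: List[dict]) -> None:
    """POST one batch to _bulk_docs (db_url is an assumed database URL)."""
    req = urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=json.dumps({"docs": docs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # per-document results are ignored in this sketch
```

With the sample sequence above, batches() would group the records as [A1], [T1A1, A2], [T2A1, A3], [T3A1, T1A2, T1A3, A4], [T2A2], and each group would then be passed to post_bulk() in order. The obvious cost is that batches get small whenever new accounts arrive frequently.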