Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: domain of monch1962@gmail.com
 designates 209.85.198.233 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <e2111bbb0904061814i2449823by28f214539e873f94@mail.gmail.com>
References: <f6508a860904061754q57826b5n7b262b05745edd90@mail.gmail.com>
	 <e2111bbb0904061814i2449823by28f214539e873f94@mail.gmail.com>
Date: Tue, 7 Apr 2009 16:18:06 +1100
Message-ID: <f6508a860904062218l75caed40sae873ac1ebd0da98@mail.gmail.com>
Subject: Re: How to implement bulk loading with a "foreign key" involved?
From: David Mitchell <monch1962@gmail.com>
To: user@couchdb.apache.org
Content-Type: multipart/alternative; boundary=001636e0b4f1dff3a40466f0235e

--001636e0b4f1dff3a40466f0235e
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Thanks Paul,
Sounds like I was worrying about a non-problem - wouldn't be the first time!

I'll go ahead and give it a shot.

Thanks and regards

Dave M.

2009/4/7 Paul Davis <paul.joseph.davis@gmail.com>

> On Mon, Apr 6, 2009 at 8:54 PM, David Mitchell <monch1962@gmail.com>
> wrote:
> > Hello everyone,
> > I'm having trouble working out how to implement bulk loading of data in a
> > particular application I'm working on.
> >
> > Making it as simple to understand as possible, I've got two types of
> > document that need to be stored:
> > - "accounts", which includes all the information about a user account
> > (e.g. name of the user, account ID, address, account creation date, ...)
> > - "transactions", which are tied to a specific account.  Transaction
> > information would include e.g. a transaction ID, account ID, transaction
> > date, ...  Importantly, I don't know which fields I'll receive in my
> > "transaction" records, so I need a schema-less storage model
> >
> > If this was SQL, I'd have 2 separate tables, with a foreign key from each
> > "transactions" record pointing to a record in "accounts".  Nice and simple
> > for my SQL-trained brain to work with, but the data I have to work with is
> > inherently schema-less so an RDBMS isn't going to work.
> >
> > With CouchDB, I've got this data going into a single database.  This seems
> > to be the accepted best practice, and makes sense for this specific
> > application.  This works fine now, but is taking too long - I can't keep up
> > with the rate of incoming data as long as I'm loading it in one record at a
> > time.
> >
> > Assume the following naming conventions:
> > - A1 is account number 1, A2 is account number 2, ...
> > - T1A1 is the first transaction against account number 1, T3A4 is the third
> > transaction against account number 4
> >
> > The data I'm loading may come in the following sequence: A1, T1A1, A2, T2A1,
> > A3, T3A1, T1A2, T1A3, A4, T2A2, ...  In other words, I'm receiving new
> > account data intermixed with new transaction data.  I'll never receive a
> > transaction for an account that doesn't already exist.  Again, nothing
> > unusual for a real life application.
> >
> > I'd really like to be bulk-loading in the data, as the need to load it
> > quickly overrides all other requirements at this point.  However, as I
> > understand it, bulk loading the data will require that accounts already
> > exist for any transactions, and that's difficult given the intermixing of
> > account and transaction data coming in.
> >
>
> There is absolutely nothing in CouchDB that would prevent you from
> storing a transaction that referenced a non-existent account. CouchDB
> does absolutely no referential integrity checks.
>
> > One possibility is that I could conceivably force the end of a bulk load
> > "transaction" every time I see a new account number; doing that would ensure
> > that I'm never trying to generate a transaction against an account that
> > isn't already in the database.  However, I'm wondering if this is the best
> > way of dealing with this situation, which is presumably fairly common.
> >
>
> If nothing in your logic requires editing of a previous record then
> there's nothing you need to worry about. You'd just put a buffer in
> front of CouchDB that waits for N docs or M seconds and then throws the
> buffer at _bulk_docs.
>
> It sounds like you're going to want to look into the "reduce
> transactions against account" stuff which basically involves a map
> function like:
>
> function(doc) {
>    if(doc.type == "account") emit([doc.account_id], null);
>    if(doc.type == "transaction") emit([doc.account_id, doc.transaction_id], null);
> }
>
> Your reduce would include whatever domain specific stuff you need to
> generate a single view of the account. The general example for this is
> bank transactions in that the reduce gives you a current balance of
> the account.
>
> > Any thoughts/ideas/suggestions welcome.
> >
> > Thanks in advance
> >
> > Dave M.
> >
>
> HTH,
> Paul Davis
>
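
For reference, the "wait for N docs or M seconds, then throw the buffer at
_bulk_docs" idea quoted above could be sketched roughly as follows. This is
only an illustration: BulkBuffer, couchSender, maxDocs and maxWaitMs are
made-up names, and couchSender assumes a runtime with a global fetch. The one
CouchDB-specific part is the POST of {"docs": [...]} to <db>/_bulk_docs.

```javascript
// Hypothetical helper: collects docs and hands them to `send` in batches.
class BulkBuffer {
  constructor(send, maxDocs = 1000, maxWaitMs = 5000) {
    this.send = send;           // function(batch) that delivers one batch
    this.maxDocs = maxDocs;     // "N docs"
    this.maxWaitMs = maxWaitMs; // "M seconds", in milliseconds
    this.docs = [];
    this.timer = null;
  }

  // Queue one document; flush immediately once N docs are waiting.
  add(doc) {
    this.docs.push(doc);
    if (this.docs.length >= this.maxDocs) {
      return this.flush();
    }
    // Start the timeout when the first doc of a new batch arrives.
    if (this.timer === null) {
      this.timer = setTimeout(
        () => this.flush().catch((err) => console.error(err)),
        this.maxWaitMs
      );
    }
    return Promise.resolve();
  }

  // Send whatever is buffered, preserving arrival order.
  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.docs.length === 0) return Promise.resolve();
    const batch = this.docs;
    this.docs = [];
    return Promise.resolve(this.send(batch));
  }
}

// Assumed sender: POSTs {"docs": [...]} to <dbUrl>/_bulk_docs.
function couchSender(dbUrl) {
  return (docs) =>
    fetch(dbUrl + "/_bulk_docs", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ docs }),
    }).then((res) => res.json());
}
```

Usage would be something like new BulkBuffer(couchSender(dbUrl), 500, 2000),
calling add(doc) for each incoming account or transaction and flush() once at
shutdown. Since CouchDB does no referential integrity checks, accounts and
transactions can share a batch in whatever order they arrive.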

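As for the "reduce transactions against account" suggestion in the quoted
reply, a bank-balance style view might look something like the sketch below.
It is not the map function from the thread: it assumes each transaction
carries a numeric "amount" field (not something mentioned above) and emits
that instead of null. The emit, sum and balances functions are local
stand-ins so the sketch runs outside CouchDB; inside CouchDB, emit and sum are
provided by the view server, and the grouping would come from querying the
view with group_level=1.

```javascript
// Map: key each transaction by [account_id, transaction_id].
// The value is the assumed `amount` field.
const map = function (doc) {
  if (doc.type == "transaction") {
    emit([doc.account_id, doc.transaction_id], doc.amount);
  }
};

// Reduce: add up the amounts. The same code works for rereduce,
// because a sum of partial sums is still the total.
const reduce = function (keys, values, rereduce) {
  return sum(values);
};

// --- Local stand-ins for what the CouchDB view server provides ---
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// Emulates a group_level=1 query: one reduced value per account_id.
function balances(docs) {
  rows.length = 0;
  docs.forEach((doc) => map(doc));
  const groups = new Map();
  for (const { key, value } of rows) {
    const accountId = key[0];
    if (!groups.has(accountId)) groups.set(accountId, []);
    groups.get(accountId).push(value);
  }
  const result = {};
  for (const [accountId, values] of groups) {
    result[accountId] = reduce(null, values, false);
  }
  return result;
}
```

Feeding balances() a mixed list of account and transaction documents returns
one total per account_id. Account documents are skipped by this map, so an
account with no transactions yet would not appear in the result.
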
--001636e0b4f1dff3a40466f0235e--