incubator-couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: How to implement bulk loading with a "foreign key" involved?
Date Tue, 07 Apr 2009 01:14:19 GMT
On Mon, Apr 6, 2009 at 8:54 PM, David Mitchell <monch1962@gmail.com> wrote:
> Hello everyone,
> I'm having trouble working out how to implement bulk loading of data in a
> particular application I'm working on.
>
> Making it as simple to understand as possible, I've got two types of
> document that need to be stored:
> - "accounts", which includes all the information about a user account (e.g.
> name of the user, account ID, address, account creation date, ...)
> - "transactions", which are tied to a specific account.  Transaction
> information would include e.g. a transaction ID, account ID, transaction
> date, ...  Importantly, I don't know which fields I'll receive in my
> "transaction" records, so I need a schema-less storage model
>
> If this was SQL, I'd have 2 separate tables, with a foreign key from each
> "transactions" record pointing to a record in "accounts".  Nice and simple
> for my SQL-trained brain to work with, but the data I have to work with is
> inherently schema-less so a RDBMS isn't going to work.
>
> With CouchDB, I've got this data going into a single database.  This seems
> to be the accepted best practice, and makes sense for this specific
> application.  This works fine now, but is taking too long - I can't keep up
> with the rate of incoming data as long as I'm loading it in one record at a
> time.
>
> Assume the following naming conventions:
> - A1 is account number 1, A2 is account number 2, ...
> - T1A1 is the first transaction against account number 1, T3A4 is the third
> transaction against account number 4
>
> The data I'm loading may come in the following sequence: A1, T1A1, A2, T2A1,
> A3, T3A1, T1A2, T1A3, A4, T2A2, ...  In other words, I'm receiving new
> account data intermixed with new transaction data.  I'll never receive a
> transaction for an account that doesn't already exist.  Again, nothing
> unusual for a real life application.
>
> I'd really like to be bulk-loading in the data, as the need to load it
> quickly overrides all other requirements at this point.  However, as I
> understand it, bulk loading the data will require that accounts already
> exist for any transactions, and that's difficult given the intermixing of
> account and transaction data coming in.
>

There is absolutely nothing in CouchDB that would prevent you from
storing a transaction that referenced a nonexistent account. CouchDB
does absolutely no referential integrity checks.

> One possibility is that I could conceivably force the end of a bulk load
> "transaction" every time I see a new account number; doing that would ensure
> that I'm never trying to generate a transaction against an account that
> isn't already in the database.  However, I'm wondering if this is the best
> way of dealing with this situation, which is presumably fairly common.
>

If nothing in your logic requires editing a previous record, then
there's nothing you need to worry about. You'd just put a buffer in
front of CouchDB that waits for N docs or M seconds and then throws
the buffer at _bulk_docs.
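
A minimal sketch of such a buffer, assuming a caller-supplied send
function that delivers one batch. The names BulkBuffer, bulkDocsBody,
maxDocs and maxWaitMs are hypothetical; the only CouchDB-specific part
is the request body shape, {"docs": [...]}, which is what _bulk_docs
accepts.

```javascript
// Hypothetical helper: collects docs and hands them to `send` in batches,
// either when maxDocs have accumulated or maxWaitMs after the first
// doc of a batch arrived, whichever comes first.
class BulkBuffer {
  constructor(send, maxDocs = 1000, maxWaitMs = 5000) {
    this.send = send;           // assumed: function(docs) that ships one batch
    this.maxDocs = maxDocs;     // "N docs"
    this.maxWaitMs = maxWaitMs; // "M seconds", in milliseconds
    this.docs = [];
    this.timer = null;
  }

  add(doc) {
    this.docs.push(doc);
    if (this.docs.length >= this.maxDocs) {
      this.flush();
    } else if (this.timer === null) {
      // Start the clock when the first doc of a new batch arrives.
      this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.docs.length === 0) return;
    const batch = this.docs;
    this.docs = [];
    this.send(batch);
  }
}

// Builds the JSON body for a POST to /<db>/_bulk_docs.
function bulkDocsBody(docs) {
  return JSON.stringify({ docs: docs });
}
```

In use, send would POST bulkDocsBody(batch) to the database's
_bulk_docs URL with Content-Type: application/json. Because accounts
and transactions can sit in the same batch in any order, the incoming
stream never needs to be split on account boundaries.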

It sounds like you'll want to look into the "reduce transactions
against account" pattern, which basically involves a map function
like:

function(doc) {
    if (doc.type == "account") emit([doc.account_id], null);
    if (doc.type == "transaction") emit([doc.account_id, doc.transaction_id], null);
}

Your reduce would include whatever domain specific stuff you need to
generate a single view of the account. The general example for this is
bank transactions in that the reduce gives you a current balance of
the account.
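
As a rough sketch of that bank-balance case: the map below differs
from the one above in that it emits a value for the reduce to add up.
The amount field is an assumption about your documents, and the
functions are named here (mapBalance, reduceBalance) only so the
sketch can be run outside CouchDB; in a design document they would be
anonymous, and emit is supplied by the view server.

```javascript
// Map: one row per account and one per transaction, keyed so that
// every row for an account shares the same first key element.
// doc.amount is a hypothetical field holding the signed amount.
function mapBalance(doc) {
  if (doc.type == "account") emit([doc.account_id], 0);
  if (doc.type == "transaction") {
    emit([doc.account_id, doc.transaction_id], doc.amount);
  }
}

// Reduce: values are raw amounts on the first pass and partial sums
// on rereduce, so adding them up is correct in both cases.
function reduceBalance(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}
```

Querying such a view with group_level=1 groups rows by account_id
alone, so each result row would carry one account's balance.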

> Any thoughts/ideas/suggestions welcome.
>
> Thanks in advance
>
> Dave M.
>

HTH,
Paul Davis
