From: "Chris Anderson" <jchris@gmail.com>
To: couchdb-user@incubator.apache.org
Subject: Re: Bulk Load
Date: Sun, 14 Sep 2008 18:34:25 -0400

I think the application-level versioning system is the way to go. For
starters, applications will have different needs.

In the financial example we're discussing, maybe it would be best to keep
the current (most recently saved) version of the bond as a document with
all fields present. For history, each time before saving updates to the
bond document, record a separate change-set doc. The change-set doc would
hold only the fields which changed between the previous and current
versions. This gives the ability to reconstruct the doc at any point in
time, as well as quick access to the current state.
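A rough sketch of the change-set scheme described here, using plain Python dicts in place of CouchDB documents; `make_changeset`, `reconstruct`, and the bond field names are illustrative, not CouchDB features:

```python
def make_changeset(prev, curr):
    """Record only the fields that changed between two versions of a doc.

    Fields deleted in `curr` are ignored for brevity; a real change-set
    doc would also carry a date so that replays can be ordered.
    """
    return {"type": "changeset",
            "doc_id": curr["_id"],
            "changes": {k: v for k, v in curr.items()
                        if prev.get(k) != v}}

def reconstruct(snapshot, changesets):
    """Replay change-sets (oldest first) on top of the nearest full snapshot."""
    doc = dict(snapshot)
    for cs in changesets:
        doc.update(cs["changes"])
    return doc

# Two saved versions of the same bond document:
v1 = {"_id": "bond-123", "price": 99.5, "rating": "AA"}
v2 = {"_id": "bond-123", "price": 99.7, "rating": "AA"}
cs = make_changeset(v1, v2)   # changes hold only {"price": 99.7}
assert reconstruct(v1, [cs]) == v2
```

With periodic full snapshots (as suggested below), `reconstruct` only ever has to replay the change-sets recorded since the most recent snapshot.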
Like git or mpeg, you could be pragmatic about storing a snapshot of the
full doc once every 50 saves or so, so that reconstructing a given version
would take a bounded number of loads.

CouchDB's current map/reduce functionality isn't quite suited to the
version-reconstruction query, but Damien has said on a few occasions that
there can and will be other view engines. I think that the "remap" we
discussed a few months ago -- essentially a reduce that never contemplates
large ranges of keys, only view rows that share the same key, e.g.
group=true -- would be a fine way to reconstruct the doc as of any date,
with minimal query-time data transfer and calculation.

Chris

On Sun, Sep 14, 2008 at 5:49 PM, Ronny Hanssen wrote:
> Thanks for your reply, Jan.
>
> I do remember the discussion on the mailing list, but at the time I
> didn't understand the argumentation, maybe because I really didn't have
> time to dive into the matter back then. But it has seriously puzzled me
> since. Then this post appeared and I jumped at the chance to get it
> cleared up (sorry for being slow - which makes me the opposite of
> arrogant, I guess :D).
>
> But I don't have a solution, so I guess you are right in that sense. I
> just fail to see that making new docs makes life easier. I believe it
> makes the single-node case worse, and probably equally difficult (or
> worse) for the distributed multiple-node architecture. Reading what you
> say, there is "evil" lurking in the replication process no matter which
> way we handle this. For multiple nodes, replication would probably be
> too slow to inform two users changing the same doc on different nodes
> before their saves return. This would result in multiple versions of the
> same doc being around, at least until replication, when CouchDB would
> find out that two competing versions exist. I might be wrong about this,
> but the users can't be left waiting for an "ok-saved" reply from CouchDB
> "forever", right?
> So, CouchDB would have to decide which version "wins" during
> replication, right?
>
> Considering the effects you are hinting at, I'd personally want a single
> CouchDB node for writes, with extra nodes for reading and serving
> views... maybe additional write-nodes for different doc-types (one
> write-node per doc-type)... just to "ensure" that there cannot be two or
> more docs updated at two or more nodes simultaneously. That is, in the
> beginning I'd really rather go for a single node with a replicated
> backup/failover. As (if) system stress increases, I'd opt for splitting
> writes and reads across nodes and/or creating write-nodes designated for
> different doc-types. This is still not perfect, but distributed never
> will be, really.
>
> Unless... if the CouchDB data was stored in a distributed file system
> (NAS or SAN), each copy of the CouchDB process would be operating on the
> same disk. This doesn't mean more data reliability, and it also imposes
> delays in reads and writes. But it would mean that CouchDB would be
> scalable (multiple "virtual" nodes working on the same physical disk).
> Other "physical" nodes could be created that would replicate as CouchDB
> is already set up to do. So allowing "virtual" nodes could work out as a
> nice addition, I think.
>
> But then again, my knowledge of distributed file systems (NAS or SAN) is
> really limited... and I might have missed out on a lot more than that -
> so all this might of course just be stupid :)
>
> Thanks for reading.
>
> ~Ronny
>
> 2008/9/14 Jan Lehnardt
>
>> Hi Ronny,
>>
>> On Sep 14, 2008, at 11:45, Ronny Hanssen wrote:
>>
>>> Or have I seriously missed out on some vital information? Because,
>>> based on the above, I still feel very confused about why we cannot
>>> use the built-in rev-control mechanism.
>> You correctly identify that adding revision control to a single-node
>> instance of CouchDB is not that hard (a quick search through the
>> archives would have told you that, too :-) Making all that work in a
>> distributed environment, with replication conflict detection and all,
>> is mighty hard. If you can come up with a nice and clean solution to
>> make proper revision control work with CouchDB's replication, including
>> all the weird edge cases I don't even know about (aren't I arrogant
>> this morning? :), we are happy to hear about it.
>>
>> Cheers,
>> Jan
>>
>>> ~Ronny
>>>
>>> 2008/9/14 Jeremy Wall
>>>
>>>> Two reasons:
>>>> * First, as I understand it, the revisions are not changes between
>>>> documents; they are actual full copies of the document.
>>>> * Second, revisions get blown away when doing a database compact --
>>>> something you will more than likely want to do, since it eats up
>>>> database space fairly quickly (see above for the reason why).
>>>>
>>>> That said, there is nothing preventing you from storing revisions in
>>>> CouchDB. You could store a changeset for each document revision in a
>>>> separate revision document that accompanies your main document, and
>>>> designing views to take advantage of them to show a revision history
>>>> for your document would be really easy.
>>>>
>>>> I suppose you could use the revisions that CouchDB stores, but that
>>>> wouldn't be very efficient, since each one is a complete copy of the
>>>> document. And you couldn't depend on that "feature" not changing
>>>> behaviour on you in later versions, since it's not intended for
>>>> revision history as a feature.
>>>>
>>>> On Sat, Sep 13, 2008 at 7:24 PM, Ronny Hanssen wrote:
>>>>
>>>>> Why is the revision control system in CouchDB inadequate for, well,
>>>>> revision control?
>>>>> I thought that this feature indeed was a feature, not just an
>>>>> internal mechanism for resolving conflicts?
>>>>>
>>>>> Ronny
>>>>>
>>>>> 2008/9/14 Calum Miller
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>> Many thanks for your prompt response.
>>>>>>
>>>>>> Storing a complete new version of each bond/instrument every day
>>>>>> seems a tad excessive. You can imagine how fast the database will
>>>>>> grow over time if a unique version of each instrument must be
>>>>>> saved, rather than just the individual changes. This must be a
>>>>>> common pattern, not confined to investment banking. Any ideas how
>>>>>> this pattern can be accommodated within CouchDB?
>>>>>>
>>>>>> Calum Miller
>>>>>>
>>>>>> Chris Anderson wrote:
>>>>>>
>>>>>>> Calum,
>>>>>>>
>>>>>>> CouchDB should easily be able to handle this load.
>>>>>>>
>>>>>>> Please note that the built-in revision system is not designed for
>>>>>>> document history. Its sole purpose is to manage conflicting
>>>>>>> documents that result from edits done in separate copies of the
>>>>>>> DB, which are subsequently replicated into a single DB.
>>>>>>>
>>>>>>> If you allow CouchDB to create a new document for each daily
>>>>>>> import of each security, and create a view which makes these
>>>>>>> documents available by security and date, you should be able to
>>>>>>> access securities history fairly simply.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> On Sat, Sep 13, 2008 at 12:31 PM, Calum Miller
>>>>>>> <calum_miller@yahoo.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to evaluate CouchDB for use within investment banking
>>>>>>>> -- yes, some of these banks still exist. I want to load 500,000
>>>>>>>> bonds into the database, with each bond containing around 100
>>>>>>>> fields.
>>>>>>>> I would be looking to bulk load a similar amount of these bonds
>>>>>>>> every day whilst maintaining a history via the revision feature.
>>>>>>>> Are there any bulk load features available for CouchDB, and any
>>>>>>>> tips on how to manage regular loads of this volume?
>>>>>>>>
>>>>>>>> Many thanks in advance, and best of luck with this project.
>>>>>>>>
>>>>>>>> Calum Miller

--
Chris Anderson
http://jchris.mfdz.com
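On the bulk load question that opened the thread: CouchDB does expose a bulk insert API, `POST /<db>/_bulk_docs`, which takes a JSON body of the form `{"docs": [...]}`. A minimal sketch of building such a request in Python; the database name, the ISIN-based `_id` scheme, and the bond fields are assumptions for illustration, not anything specified in the thread:

```python
import json

def bulk_docs_payload(bonds, load_date):
    """Build the JSON body for POST /<db>/_bulk_docs.

    Giving each bond a deterministic _id (here, hypothetically, ISIN plus
    load date) makes each daily load create a new document per security
    per day, along the lines Chris suggests in the thread.
    """
    docs = [dict(b, _id="%s-%s" % (b["isin"], load_date)) for b in bonds]
    return json.dumps({"docs": docs})

# Sending one batch would look roughly like (not executed here):
#
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:5984/bonds/_bulk_docs",
#       data=bulk_docs_payload(batch, "2008-09-14").encode("utf-8"),
#       headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```

For 500,000 bonds of ~100 fields each, you would chunk the list into batches of a few thousand docs per request rather than one giant POST.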