incubator-couchdb-user mailing list archives

From "Paul Davis" <paul.joseph.da...@gmail.com>
Subject Re: Bulk Load
Date Wed, 17 Sep 2008 15:34:39 GMT
Alternatively something like the following might work:

Get original
Edit copy of original
Set copy.previous_version = new_uuid()
Set original._id = copy.previous_version
Post both copies to _bulk_docs

In theory that should avoid race conditions for compaction as well as
make the whole process atomic in the face of possible simultaneous
writes to the original doc.
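
Those steps, sketched in Python (the helper name and sample data are illustrative; `previous_version` is the field from the steps above):

```python
import uuid

def prepare_bulk_docs(original, edits):
    """Build the two-document payload for a single atomic POST to
    /dbname/_bulk_docs, following the steps above."""
    # Edit a copy of the original; the copy keeps the live _id and _rev.
    copy = dict(original)
    copy.update(edits)
    # Link the edited copy to an archived snapshot under a fresh UUID.
    copy["previous_version"] = uuid.uuid4().hex
    # Re-file the unedited original under that UUID; drop _rev so
    # CouchDB treats it as a brand-new document.
    archived = dict(original)
    archived["_id"] = copy["previous_version"]
    archived.pop("_rev", None)
    return {"docs": [copy, archived]}

doc = {"_id": "bond-123", "_rev": "1-abc", "price": 99.5}
payload = prepare_bulk_docs(doc, {"price": 99.7})
# POST payload as JSON to http://localhost:5984/dbname/_bulk_docs
```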

Keep an eye on the specifics of _bulk_docs though. There have been
requests to make it non-atomic, but I think in the face of something
like this we might make non-atomic _bulk_docs a non-default or some
such.

HTH,
Paul

On Wed, Sep 17, 2008 at 11:11 AM, Jan Lehnardt <jan@apache.org> wrote:
> Hi Ronny,
>
> sorry, late reply.
>
> One way to re-introduce optimistic locking is saving the new revision
> over the latest one and then copying the previous revision of the doc
> into a new doc. You can't run compaction in between, but since you
> control it, JUST DON'T CALL IT ;-).
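
Jan's scheme, sketched in Python (the `archived_from` field and helper name are illustrative; the previous revision's body would be fetched with GET /db/docid?rev=<old_rev> right after the overwriting PUT, before any compaction):

```python
import uuid

def archive_previous(previous_body):
    """Copy the previous revision's body into a brand-new document so
    it survives compaction (assuming compaction hasn't run yet)."""
    archived = dict(previous_body)
    archived["archived_from"] = archived["_id"]  # remember the live doc
    archived["_id"] = uuid.uuid4().hex           # new doc, new id
    archived.pop("_rev", None)                   # let CouchDB assign one
    return archived
```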
>
> Cheers
> Jan
> --
> On Sep 14, 2008, at 23:49, Ronny Hanssen wrote:
>
>> Thanks for your reply, Jan.
>>
>> I do remember the discussion on the mailing list, but at the time I didn't
>> understand the argumentation. Maybe because I really didn't have time to
>> dive into the matter back then. But it has seriously puzzled me since.
>> Then this post appears and I jump at the chance to get this cleared up
>> (sorry for being slow, which makes me the opposite of arrogant I guess :D).
>>
>> But, I don't have a solution. I guess you are right in that sense. I just
>> fail to see how making new docs makes life easier? I believe it makes the
>> single-node case worse and probably equally difficult (or worse) for the
>> distributed multiple-node architecture. Reading from what you say, there is
>> "evil" lurking in the replication process no matter which way we handle
>> this. I mean, for multiple nodes the replication would probably be slower
>> than the time it takes for users changing the same doc on two different
>> nodes to be informed. This would result in multiple versions of the same
>> doc being around, at least until replication, when CouchDB would find out
>> that two competing versions exist. I might be wrong about this, but the
>> users can't be left waiting for an "ok-saved" reply from CouchDB "forever",
>> right? So, CouchDB would have to decide which version "wins" during
>> replication, right?
>>
>>
>> Considering the effects you are hinting at, I'd personally want a single
>> CouchDB node for writes, with extra nodes for reading and serving views...
>> Maybe additional write-nodes for different doc-types (one write-node per
>> doc-type)... Just to "ensure" that there cannot be two+ docs updated at
>> two+ nodes simultaneously. That is, in the beginning I'd really rather go
>> for a single node, with a replicated backup/failover. As (if) system
>> stress increases I'd opt for splitting writes and reads across nodes
>> and/or creating write-nodes designated for different doc-types. This is
>> still not perfect, but distributed never will be, really.
>>
>> Unless... If the CouchDB data was stored in a distributed file-system (NAS
>> or SAN), each copy of the CouchDB process would be operating on the same
>> disk. This doesn't mean more data-reliability, and it also imposes delays
>> in reads and writes. But, it would mean that CouchDB would be scalable
>> (multiple "virtual" nodes working on the same physical disk). Other
>> "physical" nodes could be created that would replicate as CouchDB is set
>> up to do already. So, allowing "virtual" nodes could work out as a nice
>> addition I think.
>>
>> But, then again, my knowledge of distributed file-systems (NAS or SAN) is
>> really limited... And I might have missed out on a lot more than that, so
>> all this might of course just be stupid :)
>>
>> Thanks for reading.
>>
>> ~Ronny
>>
>> 2008/9/14 Jan Lehnardt <jan@apache.org>
>>
>>> Hi Ronny,
>>> On Sep 14, 2008, at 11:45, Ronny Hanssen wrote:
>>>
>>>> Or have I seriously missed out on some vital information? Because, based
>>>> on the above I still feel very confused about why we cannot use the
>>>> built-in rev-control mechanism.
>>>>
>>>
>>> You correctly identify that adding revision control to a single-node
>>> instance of CouchDB is not that hard (a quick search through the archives
>>> would have told you, too :-) Making all that work in a distributed
>>> environment with replication conflict detection and all is mighty hard.
>>> If you can come up with a nice and clean solution to make proper revision
>>> control work with CouchDB's replication, including all the weird edge
>>> cases I don't even know about (aren't I arrogant this morning? :), we are
>>> happy to hear about it.
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> ~Ronny
>>>>
>>>> 2008/9/14 Jeremy Wall <jwall@google.com>
>>>>
>>>>> Two reasons:
>>>>>
>>>>> * First, as I understand it, the revisions are not changes between
>>>>> documents. They are actual full copies of the document.
>>>>> * Second, revisions get blown away when doing a database compact,
>>>>> something you will more than likely want to do since it eats up
>>>>> database space fairly quickly (see above for the reason why).
>>>>>
>>>>> That said, there is nothing preventing you from storing revisions in
>>>>> CouchDB. You could store a changeset for each document revision in a
>>>>> separate revision document that accompanies your main document. It
>>>>> would be really easy, and designing views to take advantage of them to
>>>>> show a revision history for your document would be really easy too.
>>>>>
>>>>> I suppose you could use the revisions that CouchDB stores, but that
>>>>> wouldn't be very efficient since each one is a complete copy of the
>>>>> document. And you couldn't depend on that feature not changing
>>>>> behaviour on you in later versions, since it's not intended for
>>>>> revision history as a feature.
>>>>>
>>>>> On Sat, Sep 13, 2008 at 7:24 PM, Ronny Hanssen <super.ronny@gmail.com> wrote:
>>>>>
>>>>>> Why is the revision control system in CouchDB inadequate for, well,
>>>>>> revision control? I thought that this feature indeed was a feature,
>>>>>> not just an internal mechanism for resolving conflicts?
>>>>>>
>>>>>> Ronny
>>>>>>
>>>>>> 2008/9/14 Calum Miller <calum_miller@yahoo.com>
>>>>>>
>>>>>> Hi Chris,
>>>>>>>
>>>>>>> Many thanks for your prompt response.
>>>>>>>
>>>>>>> Storing a complete new version of each bond/instrument every day
>>>>>>> seems a tad excessive. You can imagine how fast the database will
>>>>>>> grow over time if a unique version of each instrument must be saved,
>>>>>>> rather than just the individual changes. This must be a common
>>>>>>> pattern, not confined to investment banking. Any ideas how this
>>>>>>> pattern can be accommodated within CouchDB?
>>>>>>>
>>>>>>> Calum Miller
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Chris Anderson wrote:
>>>>>>>
>>>>>>> Calum,
>>>>>>>>
>>>>>>>> CouchDB should be easily able to handle this load.
>>>>>>>>
>>>>>>>> Please note that the built-in revision system is not designed for
>>>>>>>> document history. Its sole purpose is to manage conflicting
>>>>>>>> documents that result from edits done in separate copies of the DB,
>>>>>>>> which are subsequently replicated into a single DB.
>>>>>>>>
>>>>>>>> If you allow CouchDB to create a new document for each daily import
>>>>>>>> of each security, and create a view which makes these documents
>>>>>>>> available by security and date, you should be able to access
>>>>>>>> securities history fairly simply.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> On Sat, Sep 13, 2008 at 12:31 PM, Calum Miller <calum_miller@yahoo.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to evaluate CouchDB for use within investment banking
>>>>>>>>> (yes, some of these banks still exist). I want to load 500,000
>>>>>>>>> bonds into the database, with each bond containing around 100
>>>>>>>>> fields. I would be looking to bulk load a similar amount of these
>>>>>>>>> bonds every day whilst maintaining a history via the revision
>>>>>>>>> feature. Are there any bulk load features available for CouchDB,
>>>>>>>>> and any tips on how to manage regular loads of this volume?
>>>>>>>>>
>>>>>>>>> Many thanks in advance and best of luck with this project.
>>>>>>>>>
>>>>>>>>> Calum Miller
