incubator-couchdb-user mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: distributed map/reduce in BigCouch
Date Mon, 05 Aug 2013 22:18:55 GMT
I'm not a Couchbase expert, so take this with a grain of salt.  XDCR in Couchbase is more analogous
to running replication between BigCouch clusters than it is to the distributed Erlang goop
that happens inside a BigCouch cluster.

I believe one significant difference between XDCR and CouchDB replication is that XDCR has
no ability to preserve conflicting versions of documents that show up in the two clusters.
It deterministically chooses one version (i.e., both clusters will choose the same winner)
and discards the other.  When faced with this situation a BigCouch cluster will also
deterministically choose a "winner", but both edits are preserved and can be presented to
clients upon request.
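
For illustration, here's a minimal sketch of inspecting those preserved
conflicts over the HTTP API (the database "mydb", document "mydoc", and
host below are assumptions made up for the example):

    # Minimal sketch: ask CouchDB/BigCouch to include the losing
    # revisions of a document. Names and host are illustrative only.
    import requests

    resp = requests.get("http://localhost:5984/mydb/mydoc",
                        params={"conflicts": "true"})
    doc = resp.json()
    print("winning rev:", doc["_rev"])
    print("conflicting revs:", doc.get("_conflicts", []))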
 
Adam

On Aug 5, 2013, at 9:15 PM, Stanley Iriele <siriele@breaktimestudios.com> wrote:

> hey,
> 
> that...was tremendously helpful. I assumed that was how it all worked, but
> in my head it seemed "too good to be true," so I had to really make sure. I
> saw the news about CouchDB merging the BigCouch code into the codebase. So
> I take it it's safe for me to start my project using CouchDB 1.3.0 as I
> have been doing, and just stand up BigCouch when I need to.
> 
> I have to play with BigCouch a little more. While I am here... what is the
> difference between BigCouch's clustering techniques and Couchbase's XDCR? I
> won't be using Couchbase because it doesn't have the flexibility and
> functionality that CouchDB does, but I was generally curious what the
> difference was.
> 
> Again, many thanks... I can now go full steam ahead with my project.
> Regards,
> 
> Stanley
> 
> On Sun, Aug 4, 2013 at 10:29 AM, Adam Kocoloski <adam.kocoloski@gmail.com> wrote:
> 
>> The 'q' value is the number of unique shards that comprise the single
>> logical database.  BigCouch will keep 'n' replicas of each shard, so in
>> total a database will have q*n .couch files associated with it.
>> 
>> The number of nodes in the cluster is an independent concern; BigCouch
>> will happily store multiple shards on a single machine or leave some
>> machines in the cluster without any data from a particular clustered
>> database.  The only constraint is that the number of nodes must be greater
>> than or equal to the 'n' value for all databases (BigCouch will not store
>> multiple copies of a shard on the same node in the cluster).
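>> 
>> For illustration, here's a rough sketch in Python of creating a database
>> with explicit 'q' and 'n' (assuming a BigCouch node on localhost:5984
>> that accepts them as query parameters, and a made-up database name
>> "mydb"):
>> 
>>     # Rough sketch, not a recipe: create a database with 8 shards and
>>     # 3 replicas each. Host, db name, and values are assumptions.
>>     import requests
>> 
>>     q, n = 8, 3
>>     resp = requests.put("http://localhost:5984/mydb",
>>                         params={"q": q, "n": n})
>>     print(resp.status_code, resp.json())
>>     print("expected .couch shard files:", q * n)  # 8 * 3 = 24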
>> 
>> Adam
>> 
>> On Aug 4, 2013, at 4:32 PM, Stanley Iriele <siriele@breaktimestudios.com>
>> wrote:
>> 
>>> Thanks Joan,
>>> 
>>> I worded part of that poorly... when I said hard drives I really meant
>>> physical machines, but the bottleneck was disk space, so I said hard
>>> disks. That completely answered my question, actually... I appreciate
>>> that. What is this 'q' value? Is that the number of nodes? Or is that
>>> the r + w > n thing? Either way, that answers my questions. Thank you!
>>> On Aug 4, 2013 7:18 AM, "Joan Touzet" <wohali@apache.org> wrote:
>>> 
>>>> Hi Stanley,
>>>> 
>>>> Let me provide a simplistic explanation, and others can help refine it
>>>> as necessary.
>>>> 
>>>> On Sat, Aug 03, 2013 at 09:34:48AM -0700, Stanley Iriele wrote:
>>>>> How then does distributed map/reduce work?
>>>> 
>>>> Each BigCouch node with a shard of the database also keeps that shard of
>>>> the view. When a request is made for a view, enough nodes are queried to
>>>> cover every shard range, and the reduce step is applied as the partial
>>>> results are merged into the response.
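>>>> 
>>>> As a hedged sketch, a clustered view query looks no different from a
>>>> single-node one on the client side (the design doc "stats" and view
>>>> "by_type" below are made-up names):
>>>> 
>>>>     # Hedged sketch: query a hypothetical reduce view. The node that
>>>>     # receives this request gathers partial results from the shards
>>>>     # and re-reduces them before responding.
>>>>     import requests
>>>> 
>>>>     resp = requests.get(
>>>>         "http://localhost:5984/mydb/_design/stats/_view/by_type",
>>>>         params={"group": "true"})
>>>>     for row in resp.json()["rows"]:
>>>>         print(row["key"], row["value"])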
>>>> 
>>>>> if not all nodes have replications of all things how does that
>>>> coordination
>>>>> happen during view building?
>>>> 
>>>> Right, not all nodes have replications of all things. If
>>>> you ask a node for a view on a database it does not have at all, it will
>>>> use information in the partition map to ask that question of a node that
>>>> has at least one shard of the database in question, which will in turn
>>>> complete the scatter/gather request to other nodes participating in that
>>>> database.
>>>> 
>>>>> Also it's sharded, right? So certain nodes have a certain range of keys.
>>>> 
>>>> Right, see above.
>>>> 
>>>>> My problem is this. I need a solution that can incrementally scale
>>>>> across many hard disks... does BigCouch do this? With views and such?
>>>>> If so, how? And what are the drawbacks?
>>>> 
>>>> I wouldn't necessarily recommend running one BigCouch process per hard
>>>> disk on a single machine, but there's no reason it wouldn't work.
>>>> 
>>>> The real challenge is that a database's partition map is determined
>>>> statically at the time of database creation. If you choose to add more HDs
>>>> after this creation time, you will have to create a new database with
>>>> more shards, then replicate data to this new database to use the new
>>>> disks. The other option would be to use a very high number for 'q', then
>>>> rebalance the shard map onto the new disks and BigCouch processes. There
>>>> is a StackOverflow answer from Robert Newson that describes the process
>>>> for performing this operation.
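>>>> 
>>>> As a rough sketch of the first option (the database names and 'q' value
>>>> below are hypothetical; the copying itself is ordinary CouchDB
>>>> replication via _replicate):
>>>> 
>>>>     # Rough sketch: create a database with more shards, then replicate
>>>>     # the old one into it. Names and numbers are illustrative only.
>>>>     import requests
>>>> 
>>>>     requests.put("http://localhost:5984/mydb_v2",
>>>>                  params={"q": 16, "n": 3})
>>>>     requests.post("http://localhost:5984/_replicate",
>>>>                   json={"source": "mydb", "target": "mydb_v2"})
>>>>     # When the replication completes, point clients at mydb_v2.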
>>>> 
>>>> In short, neither is trivial nor automated. For a single-machine system,
>>>> you'd do far better to use some sort of logical volume manager to deal
>>>> with expanding storage over time, such as Linux's LVM, some hardware
>>>> RAID cards, ZFS, or similar features in OS X and Windows.
>>>> 
>>>>> Thanks for any kind of response.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Stanley
>>>> 
>>>> --
>>>> Joan Touzet | joant@atypical.net | wohali everywhere else
>>>> 
>> 
>> 

