Subject: Re: Database design question
From: Sean Copenhaver
Date: Wed, 28 Dec 2011 08:36:37 -0500
To: user@couchdb.apache.org

Paul, that was a great follow-up that made my little CouchDB lightbulb brighten.

- Sean Copenhaver

On Dec 28, 2011, at 6:42 AM, Steven Ringo wrote:

> Thanks for this comprehensive explanation. I have an even deeper understanding now. Good to see I am not the only one who had to start over with the mental models of how a system might work.
>
> You make some very interesting points about the planner (or the lack thereof in CouchDB).
>
> And I see you have also experienced what happens when an ORM abstraction is misunderstood in terms of its underlying execution. I see this a lot with Rails. As an aside, I have also noticed most CouchDB-to-Rails ORMs trying to make CouchDB perform like a SQL database. This is precisely what I was trying to avoid.
>
> You also make some great points around compromises. I am now on the road to having a much better idea of which kinds of data, and which exchanges between them, are best suited to SQL and to NoSQL respectively.
>
> The app I am referring to would have worked very well with a conventional SQL store if it were server-only. SQL's design and normalisation would be its strength, and would require much less code. However, as soon as you throw mobile replication and offline use into the mix, the SQL store becomes the weakness. It is here that NoSQL shines, and the minor tradeoffs of denormalisation pale in comparison to the benefits.
>
> You make an analogy between static and dynamic languages. I think there is another great analogy: centralised vs decentralised version control systems (or Git vs Subversion, if you like). I have often described CouchDB as the Git of the database world. Replication and peer-to-peer are as ingrained in Git as they are in Couch. SQL and Subversion are both strong single-server solutions. Git has tradeoffs, including a steeper learning curve and some conceptual issues to get one's head around at the beginning, but these are very nicely traded for cheap and easy branching, offline use and peer replication. Git is terrible at managing large binary attachments, whereas Subversion handles this with aplomb. However, try using Subversion at 30,000 feet in an aluminium tube over the Atlantic and you're stuck...
>
> I am glad this discussion has created some food for thought. Thanks again for your contribution.
>
> Paul Davis wrote:
>> I'll prefix this with the fact that Sean does an excellent job of responding to your core concerns. I've had a bit of this brewing as of late for various reasons, but your email provokes a bit of a response, so I think this is as good a place as any to write down some thoughts.
>>
>> Also, to preface: my background is that I was taught SQL and the relational model by a couple of guys who are deep into it. Before I moved to non-relational data stores I was deeply entrenched in a purist relational model, purist beyond even most DBAs, because I was taught from the practical application of theory. Things like "null != null" and "null values are a hack around Codd's rules" are relatively benign points, but ones I find few people have completely grokked in the real world. I was lucky to have these two teach me from the "this is proper, and this is why we ignore that" point of view.
>>
>> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo wrote:
>>> Hi all,
>>>
>>> (I originally posted this on the mobile-couchbase list, but was recommended to post it here instead, since this is more of a general CouchDB question.)
>>>
>>> I am struggling with something that I know would be quite trivial to do in SQL, but seems to be a lot more complex to do in CouchDB. Perhaps 20 years of experience with SQL is clouding my judgement, so I am trying to break away from the SQL/relational mindset and understand how this would be better designed in Couch.
>>
>> This is good. The first thing to realize when learning a new tool is that it's not the old tool. New tools behave differently, and the differences should be embraced rather than rejected outright.
>>
>>> I am developing an iOS art gallery app, and wish to display information about artworks, artists and galleries.
>>>
>>> The menu screen requires a list of all artworks together with the gallery each is currently housed at and the artist who created it:
>>>
>>> * artwork title
>>> * artwork medium
>>> * artwork dimensions
>>> * artist full_name
>>> * gallery name
>>> * gallery city
>>>
>>> for each artwork, respectively.
>>>
>>> Imagine the following objects in my database (for brevity I am only including the fields that matter):
>>>
>>> Artist
>>> ------
>>> _id
>>> full_name
>>> birth_year
>>> birth_place
>>>
>>> Gallery
>>> -------
>>> _id
>>> name
>>> city
>>>
>>> Artwork
>>> -------
>>> _id
>>> title
>>> medium
>>> dimensions
>>> artist_id (foreign key)
>>> gallery_id (foreign key)
>>>
>>> In SQL it would be as easy as one query with two joins:
>>>
>>>     SELECT
>>>         artwork.title,
>>>         artwork.medium,
>>>         artwork.dimensions,
>>>         artist.full_name,
>>>         gallery.name,
>>>         gallery.city
>>>     FROM
>>>         artwork
>>>         INNER JOIN artist ON artwork.artist_id = artist._id
>>>         INNER JOIN gallery ON artwork.gallery_id = gallery._id
>>
>> The fact that you took the time to specify INNER JOIN is a good tip that you have a thorough understanding of SQL.
>>
>>> From this I would be able to get the data I need in one query (i.e. one call to the database):
>>>
>>>     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>>     Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>>     David     | Michelangelo      | Uffizi Gallery, Florence, Italy
>>>
>>> Using CouchDB I am storing the data in a very similar way to the way I would in a relational database, represented as something along the lines of:
>>
>> Roughly speaking, this is your first mistake.
>>
>>>     {
>>>       "type"       : "Artwork",
>>>       "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>>       "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>>       "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>>       "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>>       "title"      : "Guernica",
>>>       "medium"     : "Oil on canvas",
>>>       "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>>>     }
>>>
>>>     {
>>>       "type"       : "Artist",
>>>       "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>>       "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>>       "full_name"  : "Pablo Picasso",
>>>       "birth_year" : "1881"
>>>     }
>>>
>>>     {
>>>       "type" : "Gallery",
>>>       "_id"  : "d665afaedde12b44c5159cf1782d44af",
>>>       "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f",
>>>       "name" : "Museo Reina Sofia"
>>>     }
>>>
>>> To get the same result in Couch, I would have to do one query to get all the artworks, e.g.
>>>
>>>     function(doc) {
>>>       if (doc.type && doc.type == 'Artwork') {
>>>         emit(doc._id, doc);
>>>       }
>>>     }
>>>
>>> and then I would need to do a GET for the _id of each Gallery and Artist returned per Artwork to fill in the missing information. This is usually known as the n+1 queries problem (in this case 2n+1).
>>>
>>> I have always worked on the assumption that the database should be leveraged to do as much of the heavy lifting as possible with as few network hits as possible (i.e. SQL is always faster than client manipulation). With a list of 120 artworks, this means 241 HTTP GETs to the database.
>>
>> This is your second mistake. And more to the point, this is the first mistake I had to overcome myself when switching from relational to non-relational data stores. Even more specifically, my stumbling block was exactly the same as yours: the assumption that the number of requests/queries is to be minimized.
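For concreteness, the 2n+1 round trips described above can typically collapse into a single request by combining the "linked documents" and view collation features that come up just below. What follows is only a minimal sketch, assuming the Artwork/Artist/Gallery documents from the example; the database, design document and view names are illustrative, not something from the thread.

    // Each Artwork emits three rows that collate next to each other:
    // the artwork itself, then a stub value whose _id points at the
    // artist, then one pointing at the gallery. Queried with
    // ?include_docs=true, CouchDB resolves those _id values and returns
    // the linked Artist and Gallery documents in the same response.
    function(doc) {
      if (doc.type && doc.type == 'Artwork') {
        emit([doc._id, 0], null);                     // the artwork row
        emit([doc._id, 1], { _id: doc.artist_id });   // linked Artist
        emit([doc._id, 2], { _id: doc.gallery_id });  // linked Gallery
      }
    }

    // One request for the whole menu screen (URL illustrative):
    //   GET /artdb/_design/artworks/_view/menu?include_docs=true
    // 120 artworks come back as 360 collated rows in 1 GET, not 241.

The client still has to stitch each group of three rows back into one menu entry, which is the kind of work the SQL planner would otherwise have done; that tradeoff is exactly what the rest of the thread is about.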
>>
>> I was taught quite specifically to minimize the number of queries per page load. Current ORM layers are still abhorrent to me because of the ease with which they translate directly into your worst case: each row in the list ends up costing three queries. Specifically, people must understand both the relational model and their ORM layer to keep it from doing the Wrong Thing™.
>>
>> The conceptual realization we need to make here is that SQL isn't magical. When we issue a query, it goes through the planner and ends up doing (seriously advanced) optimizations over the indexes and data-retrieval methods it has available. The "minimize the number of queries" mantra is a bit overly broad here; in practical terms it has the effect of making sure we make the best use of the indexes and lookup patterns available. As an analogy, it's akin to writing better C because we have some intuition about how the assembly translation might be generated.
>>
>>> I am aware of "linked documents" in views (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents), however this seems to only work with one related row.
>>>
>>> I can't really see how "view collation" (http://wiki.apache.org/couchdb/View_collation) would work here either. With view collation the correlated information ends up on different rows, and I think that would be better suited to the case where, say, a single Artist or Gallery is displayed with its associated Artworks. My case is the opposite, whereby each Artwork plus the corresponding information from Artists and Galleries is required.
>>
>> These are definitely tools in the arsenal, but they don't really speak to the core issue. I'll come back to this in a bit.
>>
>>> I am also aware that one could embed each gallery or artist in each artwork, or vice versa, but that seems to be a lot of duplication, and would make it difficult to change artist or gallery information if required. Maybe this is how it should be done?
>>
>> As Sean pointed out, the second thing to overcome is data duplication. Storage is cheaper than CPU and network costs. If you don't expect data to change, you should try to exploit that.
>>
>>> For this small app, this is not going to be a performance issue. I am trying rather to break away from a SQL/relational mindset and understand how this would be better designed in a database like Couch.
>>
>> Your honest questions are quite refreshing because they remind me of my own initial exploration. There are lots of people talking, but I've never found a really concise description of the difference between the approaches. There are lots of anecdotal blog posts about "scaling" and such, but nothing I've seen really addresses the core issues.
>>
>>> Thanks,
>>>
>>> Steve
>>
>> So, the bottom line is that the difference between relational and non-relational comes down to two things. The first issue people encounter when moving from relational to non-relational is that the query planner is gone. This seems benign at first, but it affects a huge number of moving pieces in any non-trivial application. If the query planner is gone, then every query is less than optimal. Think, as a DBA, what it would mean if *every* query were a full table scan. Most DBAs I know would scream at the thought, and also be highly confused about why anyone would give that up.
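Taking the embedding option raised earlier in the thread at face value: without a planner, the usual CouchDB move is to decide the access path up front by denormalizing and building views around it. Below is a hedged sketch of what a denormalized Artwork could look like and the single view that would then serve the menu screen. The copied field names (artist_name, gallery_name, gallery_city) are assumptions for illustration only; keeping the original _ids around makes it possible to refresh the copies if an Artist or Gallery document does change.

    {
      "type"         : "Artwork",
      "_id"          : "d665afaedde12b44c5159cf1782c5c1e",
      "title"        : "Guernica",
      "medium"       : "Oil on canvas",
      "dimensions"   : "349 cm × 776 cm (137.4 in × 305.5 in)",
      "artist_id"    : "d665afaedde12b44c5159cf1782c2d0b",
      "artist_name"  : "Pablo Picasso",
      "gallery_id"   : "d665afaedde12b44c5159cf1782d44af",
      "gallery_name" : "Museo Reina Sofia",
      "gallery_city" : "Madrid"
    }

    // The menu screen now reads nothing but Artwork documents:
    // one view, one request, no joins (view name illustrative).
    function(doc) {
      if (doc.type && doc.type == 'Artwork') {
        emit(doc.title, {
          medium  : doc.medium,
          artist  : doc.artist_name,
          gallery : doc.gallery_name + ', ' + doc.gallery_city
        });
      }
    }

The cost is the one Steven worries about: renaming a gallery now means updating every Artwork that embeds it, which is acceptable only if such changes are rare.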
>>
>> There are two answers here. One is about "horizontal scalability". Basically, SQL query planner optimizations don't generally work well when there's a network connection involved. Note that I'm not saying SQL won't work in a clustered environment; there are companies that have proven otherwise. The issue is that the difference isn't clear to most people. Sure, you can make SQL work on multiple machines, but the tradeoffs between scaling SQL horizontally vs vertically are complex and generally not well understood.
>>
>> The second issue is that non-relational stores have less information with which to do query optimizations. If you're in the crowd that says "I don't need scaling, I need to not be constrained by a schema", you're in this group. When you don't define a schema, you don't allow a lot of important information to be fed into some really awesome algorithms for data retrieval.
>>
>> There's a rough analogy here between statically and dynamically typed programming languages. Anyone who says one is categorically better than the other is someone you shouldn't listen to. There are extremes where each is obviously better, but in the middle it becomes a complicated mess that depends more on the observer's point of view than on any theoretical point. If you have well-defined data usage, then static might be best. A rule of thumb: if you can't print your data model on paper and have it stay relevant for more than a couple of months, you're probably forcing things a bit.
>>
>> Granted, there are some obvious optimizations that non-relational stores could make by introspecting data, but it's quite important to note the difference here. Relational stores, regardless of scaling, have the enormous benefit of having the data model defined up front.
>>
>> The best way I've learned to express this is that relational stores require you to invest in defining how you get data into the store; non-relational stores require you to invest in how you get data out.
>>
>> So, the end-of-the-day decision is whether the traditional relational stores give you the ability to meet your customers' demands. For a lot of people over the last thirty or more years, this has been true. It might be less than optimal, but it generally works. Some people have recently realized that some of the assumptions in the relational model are artificially constraining. NoSQL is basically a response to this: a sudden realization of "If I ignore this one thing, my problem is a lot easier." This is the question that drives people: "What constraint am I willing to give up?" The single piece of advice I would give any relational-to-non-relational convert is, "Analyze your requirements and ask of any new data store what it gives up in return for its promise."
>>
>> CouchDB gives up isolation and consistency (in the traditional sense of transactions) for distribution through replication. We prioritize being consistent over being fast. And we use a funky programming language to make it happen.
>>
>> Other data stores have different reasons for existing. Redis, which I'm quite fond of, is roughly our opposite in that it's geared towards RAM and being fast. There are problems I would solve with either, but no single problem for which I would use both interchangeably.