Subject: Re: Database design question
From: Sean Copenhaver
Date: Wed, 28 Dec 2011 08:36:37 -0500
To: user@couchdb.apache.org

Paul, that was a great follow-up that made my little CouchDB lightbulb brighten.

- Sean Copenhaver

On Dec 28, 2011, at 6:42 AM, Steven Ringo wrote:

> Thanks for this comprehensive explanation. I have an even deeper understanding now. Good to see I am not the only one who had to start over with the mental models of how a system might work.
>
> You make some very interesting points about the planner (or the lack thereof in CouchDB).
>
> And I see you have also experienced what happens when an ORM abstraction is misunderstood in terms of its underlying execution. I see this a lot with Rails. As an aside, I have also noticed most CouchDB-to-Rails ORMs trying to make CouchDB perform like a SQL database. This is precisely what I was trying to avoid.
>
> You also make some great points around compromises. I am now on the road to having a much better idea of which kinds of data, and which exchanges between them, are best suited to SQL and to NoSQL respectively.
>
> The app I am referring to would have worked very well with a conventional SQL store if it were server-only. SQL's design and normalisation would be its strength, and would require much less code. However, as soon as you throw mobile replication and offline use into the mix, the SQL store becomes the weakness. It is here that NoSQL shines, and the minor tradeoffs of denormalisation pale in comparison to the benefits.
>
> You make an analogy between static and dynamic languages. I think there is another great analogy: centralised vs decentralised version control systems (or Git vs Subversion, if you like). I have often described CouchDB as the Git of the database world. Replication and peer-to-peer are as ingrained in Git as they are in Couch. SQL and Subversion are both strong single-server solutions. Git has tradeoffs, including a steeper learning curve and some conceptual issues to get one's head around at the beginning, but these are very nicely traded for cheap and easy branching, offline use and peer replication. Git is terrible at managing large binary attachments, whereas Subversion handles this with aplomb. However, try using Subversion at 30,000 feet in an aluminium tube over the Atlantic and you're stuck...
>
> I am glad this discussion has created some food for thought. Thanks again for your contribution.
>
> Paul Davis wrote:
>> I'll prefix this with the fact that Sean does an excellent job of responding to your core concerns. I've had a bit of this brewing as of late for various reasons, but your email provokes a bit of a response, so I think this is as good a place as any to write down some thoughts.
>>
>> Also, to preface: my background is that I was taught SQL and the relational model by a couple of guys who are deep into it. Before I moved to non-relational data stores I was deeply entrenched in a purist relational model, purist beyond even most DBAs, because I was taught from the practical application of theory. Things like "null != null" and "null values are a hack around Codd's rules" are relatively benign points, but ones I find few people have completely grokked in the real world. I was lucky to have these two teach me from the "this is proper, and this is why we ignore that" point of view.
>>
>> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo wrote:
>>> Hi all,
>>>
>>> (I originally posted this on the mobile-couchbase list, but was recommended to post it here instead, since this is more of a general CouchDB question.)
>>>
>>> I am struggling with something that I know would be quite trivial to do in SQL, but seems to be a lot more complex to do in CouchDB. Perhaps 20 years of experience with SQL is clouding my judgement, so I am trying to break away from the SQL/relational mindset and understand how this would be better designed in Couch.
>>
>> This is good. The first thing to realize when learning a new tool is that it's not the old tool. New tools behave differently, and the differences should be embraced rather than rejected outright.
>>
>>> I am developing an iOS art gallery app, and wish to display information about artworks, artists and galleries.
>>>
>>> The menu screen requires a list of all artworks together with the gallery each is currently housed at and the artist who created it:
>>>
>>> * artwork title
>>> * artwork medium
>>> * artwork dimensions
>>> * artist full_name
>>> * gallery name
>>> * gallery city
>>>
>>> for each artwork, respectively.
>>>
>>> Imagine the following objects in my database (for brevity I am only including the fields that matter):
>>>
>>> Artist
>>> ------
>>> _id
>>> full_name
>>> birth_year
>>> birth_place
>>>
>>> Gallery
>>> -------
>>> _id
>>> name
>>> city
>>>
>>> Artwork
>>> -------
>>> _id
>>> title
>>> medium
>>> dimensions
>>> artist_id (foreign key)
>>> gallery_id (foreign key)
>>>
>>> In SQL it would be as easy as one query with two joins:
>>>
>>>     SELECT
>>>         artwork.title,
>>>         artwork.medium,
>>>         artwork.dimensions,
>>>         artist.full_name,
>>>         gallery.name,
>>>         gallery.city
>>>     FROM
>>>         artwork
>>>         INNER JOIN artist ON artwork.artist_id = artist._id
>>>         INNER JOIN gallery ON artwork.gallery_id = gallery._id
>>
>> The fact that you took the time to specify INNER JOIN is a good tip that you have a thorough understanding of SQL.
>>
>>> From this I would be able to get the data I need in one query (i.e. one call to the database):
>>>
>>>     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>>     Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>>     David     | Michelangelo      | Uffizi Gallery, Florence, Italy
>>>
>>> Using CouchDB I am storing the data in a very similar way to the way I would in a relational database, represented as something along the lines of:
>>
>> Roughly speaking, this is your first mistake.
>>
>>>     {
>>>       "type"       : "Artwork",
>>>       "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>>       "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>>       "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>>       "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>>       "title"      : "Guernica",
>>>       "medium"     : "Oil on canvas",
>>>       "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>>>     }
>>>
>>>     {
>>>       "type"       : "Artist",
>>>       "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>>       "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>>       "full_name"  : "Pablo Picasso",
>>>       "birth_year" : "1881"
>>>     }
>>>
>>>     {
>>>       "type" : "Gallery",
>>>       "_id"  : "d665afaedde12b44c5159cf1782d44af",
>>>       "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f",
>>>       "name" : "Museo Reina Sofia"
>>>     }
>>>
>>> To get the same result in Couch, I would have to do one query to get all the artworks, e.g.
>>>
>>>     function(doc) {
>>>       if (doc.type && doc.type == 'Artwork') {
>>>         emit(doc._id, doc);
>>>       }
>>>     }
>>>
>>> and then I would need to do a GET for the _id of each Gallery and Artist returned per Artwork to fill in the missing information. This is usually known as the n+1 queries problem (in this case 2n+1).
>>>
>>> I have always worked on the assumption that the database should be leveraged to do as much of the heavy lifting as possible with as few network hits as possible (i.e. SQL is always faster than client manipulation). With a list of 120 artworks, this means 241 HTTP GETs to the database.
>>
>> This is your second mistake. And more to the point, this is the first mistake I had to overcome myself when switching from relational to non-relational data stores. Even more specifically, my stumbling block was exactly the same as yours: the assumption that the number of requests/queries is to be minimized.
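For concreteness, the 2n+1 round trips described above can typically collapse into a single request by combining the "linked documents" and view collation features that come up just below. What follows is only a minimal sketch, assuming the Artwork/Artist/Gallery documents from the example; the database, design document and view names are illustrative, not something from the thread.

    // Each Artwork emits three rows that collate next to each other:
    // the artwork itself, then a stub value whose _id points at the
    // artist, then one pointing at the gallery. Queried with
    // ?include_docs=true, CouchDB resolves those _id values and returns
    // the linked Artist and Gallery documents in the same response.
    function(doc) {
      if (doc.type && doc.type == 'Artwork') {
        emit([doc._id, 0], null);                     // the artwork row
        emit([doc._id, 1], { _id: doc.artist_id });   // linked Artist
        emit([doc._id, 2], { _id: doc.gallery_id });  // linked Gallery
      }
    }

    // One request for the whole menu screen (URL illustrative):
    //   GET /artdb/_design/artworks/_view/menu?include_docs=true
    // 120 artworks come back as 360 collated rows in 1 GET, not 241.

The client still has to stitch each group of three rows back into one menu entry, which is the kind of work the SQL planner would otherwise have done; that tradeoff is exactly what the rest of the thread is about.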
>>
>> I was taught quite specifically to minimize the number of queries per page load. Current ORM layers are still abhorrent to me because of the ease with which they translate directly into your worst case: each row in the list ends up costing three queries. Specifically, people must understand both the relational model and their ORM layer to keep it from doing the Wrong Thing™.
>>
>> The conceptual realization we need to make here is that SQL isn't magical. When we issue a query, it goes through the planner and ends up doing (seriously advanced) optimizations over the indexes and data-retrieval methods it has available. The "minimize the number of queries" mantra is a bit overly broad here; in practical terms it has the effect of making sure we make the best use of the indexes and lookup patterns available. As an analogy, it's akin to writing better C because we have some intuition about how the assembly translation might be generated.
>>
>>> I am aware of "linked documents" in views (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents), however this seems to only work with one related row.
>>>
>>> I can't really see how "view collation" (http://wiki.apache.org/couchdb/View_collation) would work here either. With view collation the correlated information ends up on different rows, and I think that would be better suited to the case where, say, a single Artist or Gallery is displayed with its associated Artworks. My case is the opposite, whereby each Artwork plus the corresponding information from Artists and Galleries is required.
>>
>> These are definitely tools in the arsenal, but they don't really speak to the core issue. I'll come back to this in a bit.
>>
>>> I am also aware that one could embed each gallery or artist in each artwork, or vice versa, but that seems to be a lot of duplication, and would make it difficult to change artist or gallery information if required. Maybe this is how it should be done?
>>
>> As Sean pointed out, the second thing to overcome is data duplication. Storage is cheaper than CPU and network costs. If you don't expect data to change, you should try to exploit that.
>>
>>> For this small app, this is not going to be a performance issue. I am trying rather to break away from a SQL/relational mindset and understand how this would be better designed in a database like Couch.
>>
>> Your honest questions are quite refreshing because they remind me of my own initial exploration. There are lots of people talking, but I've never found a really concise description of the difference between the approaches. There are lots of anecdotal blog posts about "scaling" and such, but nothing I've seen really addresses the core issues.
>>
>>> Thanks,
>>>
>>> Steve
>>
>> So, the bottom line is that the difference between relational and non-relational comes down to two things. The first issue people encounter when moving from relational to non-relational is that the query planner is gone. This seems benign at first, but it affects a huge number of moving pieces in any non-trivial application. If the query planner is gone, then every query is less than optimal. Think, as a DBA, what it would mean if *every* query were a full table scan. Most DBAs I know would scream at the thought, and also be highly confused about why anyone would give that up.
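Taking the embedding option raised earlier in the thread at face value: without a planner, the usual CouchDB move is to decide the access path up front by denormalizing and building views around it. Below is a hedged sketch of what a denormalized Artwork could look like and the single view that would then serve the menu screen. The copied field names (artist_name, gallery_name, gallery_city) are assumptions for illustration only; keeping the original _ids around makes it possible to refresh the copies if an Artist or Gallery document does change.

    {
      "type"         : "Artwork",
      "_id"          : "d665afaedde12b44c5159cf1782c5c1e",
      "title"        : "Guernica",
      "medium"       : "Oil on canvas",
      "dimensions"   : "349 cm × 776 cm (137.4 in × 305.5 in)",
      "artist_id"    : "d665afaedde12b44c5159cf1782c2d0b",
      "artist_name"  : "Pablo Picasso",
      "gallery_id"   : "d665afaedde12b44c5159cf1782d44af",
      "gallery_name" : "Museo Reina Sofia",
      "gallery_city" : "Madrid"
    }

    // The menu screen now reads nothing but Artwork documents:
    // one view, one request, no joins (view name illustrative).
    function(doc) {
      if (doc.type && doc.type == 'Artwork') {
        emit(doc.title, {
          medium  : doc.medium,
          artist  : doc.artist_name,
          gallery : doc.gallery_name + ', ' + doc.gallery_city
        });
      }
    }

The cost is the one Steven worries about: renaming a gallery now means updating every Artwork that embeds it, which is acceptable only if such changes are rare.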
>>
>> There are two answers here. One is about "horizontal scalability". Basically, SQL query planner optimizations don't generally work well when there's a network connection involved. Note that I'm not saying SQL won't work in a clustered environment; there are companies that have proven otherwise. The issue is that the difference isn't clear to most people. Sure, you can make SQL work on multiple machines, but the tradeoffs between scaling SQL horizontally vs vertically are complex and generally not well understood.
>>
>> The second issue is that non-relational stores have less information with which to do query optimizations. If you're in the crowd that says "I don't need scaling, I need to not be constrained by a schema", you're in this group. When you don't define a schema, you don't allow a lot of important information to be fed into some really awesome algorithms for data retrieval.
>>
>> There's a rough analogy here between statically and dynamically typed programming languages. Anyone who says one is categorically better than the other is someone you shouldn't listen to. There are extremes where each is obviously better, but in the middle it becomes a complicated mess that depends more on the observer's point of view than on any theoretical point. If you have well-defined data usage, then static might be best. A rule of thumb: if you can't print your data model on paper and have it stay relevant for more than a couple of months, you're probably forcing things a bit.
>>
>> Granted, there are some obvious optimizations that non-relational stores could make by introspecting data, but it's quite important to note the difference here. Relational stores, regardless of scaling, have the enormous benefit of having the data model defined up front.
>>
>> The best way I've learned to express this is that relational stores require you to invest in defining how you get data into the store; non-relational stores require you to invest in how you get data out.
>>
>> So, the end-of-the-day decision is whether the traditional relational stores give you the ability to meet your customers' demands. For a lot of people over the last thirty or more years, this has been true. It might be less than optimal, but it generally works. Some people have recently realized that some of the assumptions in the relational model are artificially constraining. NoSQL is basically a response to this: a sudden realization of "If I ignore this one thing, my problem is a lot easier." This is the question that drives people: "What constraint am I willing to give up?" The single piece of advice I would give any relational-to-non-relational convert is, "Analyze your requirements and ask of any new data store what it gives up in return for its promise."
>>
>> CouchDB gives up isolation and consistency (in the traditional sense of transactions) for distribution through replication. We prioritize being consistent over being fast. And we use a funky programming language to make it happen.
>>
>> Other data stores have different reasons for existing. Redis, which I'm quite fond of, is roughly our opposite in that it's geared towards RAM and being fast. There are problems I would solve with either, but no single problem for which I would use both interchangeably.