Excellent post Paul!
On 28 Dec 2011, at 07:38, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> I'll prefix this with the fact that Sean does an excellent point of
> responding your core concerns. I've had a bit of this brewing as of
> late for a various number of reasons. But your email provokes a bit of
> a response so I think this is as good of a place as any to write down
> some thoughts.
>
> Also, to preface, my nascent background was to be taught SQL and the
> relational model by a couple guys that are deep into it. Before I
> moved to non-relational datastores I was deeply entrenched in a purist
> relational model. Purist beyond even most DBA's because I was taught
> from practical application of theory. Things like "null != null" and
> "null values are a hack around Codd's rules" are relatively benign but
> points that I find few people have completely groked in the real
> world. I was lucky to have these two guys teach me from the "this is
> proper and this is why we ignore that" point of view.
>
> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo <google@stevenringo.com> wrote:
>> Hi all,
>>
>> (I originally posted this on the mobile-couchbase list, but was recommended
>> to post it here instead, since this is more of a general couchdb question)
>>
>> I am struggling with something that I know would be quite trivial to do in
>> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
>> of experience with SQL is clouding my judgement and so I am trying rather to
>> break away from the SQL/relational mindset and understand how this would be
>> better designed in couch.
>>
>
> This is good. The first thing to realize in learning a new tool is
> that its not the old tool. New tools behave differently and
> differences should be embraced rather than rejected outright.
>
>> I am developing an iOS art gallery app, and wish to display information
>> about artworks, artists and galleries.
>>
>> The menu screen requires a list of all artworks together with the gallery
>> they are currently housed at and the artist that created the artwork:
>>
>> * artwork title
>> * artwork medium
>> * artwork dimensions
>> * artist full_name
>> * gallery name,
>> * gallery city
>>
>> for each artwork, respectively.
>>
>> Imagine the following objects in my database (I am not including all the
>> fields for sake of brevity. i.e. only the ones that are important).
>>
>> | Artist
>> ------
>> _id
>> full_name
>> birth_year
>> birth_place
>>
>> Gallery
>> -------
>> _id
>> name
>> city
>>
>> Artwork
>> -------
>> _id
>> title
>> medium
>> dimensions
>> artist_id_ (foreign key)
>> gallery_id_ (foreign key)
>> |
>>
>> In SQL it would be as easy as one query with two joins:
>>
>> |SELECT
>> artwork.title,
>> artwork.medium,
>> artwork.dimensions,
>> artist.full_name,
>> gallery.name,
>> gallery.city
>> FROM
>> artwork INNER JOIN artist
>> ON artwork.artist_id_ = artist._id INNER JOIN gallery
>> ON artwork.gallery_id_ = gallery._id
>> |
>>
>
> The fact that you took the time to specify INNER JOIN is a good tip
> you have a thorough understanding of SQL.
>
>> From this I would be able to get the data I need in one query (i.e. one call
>> to the database).
>>
>> | Guernica | Pablo Picasso | Museo Reina Sofia, Madrid, Spain
>> Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>> David | Michelangelo | Uffizi Gallery, Florence, Italy.
>> |
>>
>> Using couchdb I am storing the data in a very similar way to the way I would
>> in a relational database, represented as something along the lines of:
>>
>
> Roughly speaking, this is your first mistake.
>
>> |{
>> "type" : "Artwork",
>> "_id" : "d665afaedde12b44c5159cf1782c5c1e",
>> "_rev" : "1-446201af8a1ddfa3759164a01be899de",
>> "artist_id" : "d665afaedde12b44c5159cf1782c2d0b",
>> "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>> "title" : "Guernica",
>> "medium" : "Oil on canvas",
>> "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>> }
>>
>> {
>> "type" : "Artist",
>> "_id" : "d665afaedde12b44c5159cf1782c2d0b",
>> "_rev" : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>> "full_name" : "Pablo Picasso",
>> "birth_year" : "1881"
>> }
>>
>> {
>> "type" : "Gallery",
>> "_id" : "d665afaedde12b44c5159cf1782d44af",
>> "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f",
>> "name" : "Museo Reina Sofia"
>> }
>> |
>>
>> To get the same result in couch, I would have to do one query to get all the
>> artworks, e.g.
>>
>> |function(doc) {
>> if(doc.type&& doc.type == 'Artist') {
>> emit(doc._id,doc);
>> }
>> }
>> |
>>
>> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
>> returned per |Artwork| respectively to fill in the missing information. This
>> is usually known as the |n+1| queries problem (in this case |2n+1|).
>>
>> I have always worked on the assumption that the database should be leveraged
>> to do as much of the heavy lifting as possible with as few network hits as
>> possible (i.e. SQL is always faster than client manipulation). With a list
>> of 120 artworks, this means 241 HTTP GETs to the database.
>>
>
> This is your second mistake. And more to the point, this is the first
> mistake I had to overcome myself when switching from relational to
> non-relational data stores. Even more specifically, my stumbling block
> was exactly the same as yours. The assumption that the number of
> requests/queries is to be minimized.
>
> I was taught quite specifically to minimize the number of queries per
> page load. Current ORM layers are still abhorrent to me because of the
> ease in which they translate directly to your worst case, each row in
> the list ends up being three queries per row. Specifically, people
> must understand both the relational model and their ORM layer to avoid
> it from doing the Wrong Thing ™.
>
> The conceptual realization that we need to make here is that SQL isn't
> magical. When we issue a query, it goes through the planner and ends
> up doing (seriously advanced) optimizations over the indexes and data
> retrieval methods it has available. The "minimize number of queries"
> mantra is a bit overly broad here. In practical terms it has the
> effect of making sure that we're making the best use of indexes and
> look up patterns. As an analogy, its akin to writing better C because
> we might have some intuition on how the assembler translation might be
> generated.
>
>> I am aware of "linked documents" in views
>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>> however this seems to only work with one related row.
>>
>> I can't really see how "view collation"
>> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
>> view collation the correlated information ends up being on different rows,
>> and I think that would be better suited for where say a single Artists or
>> Gallery is displayed with associated Artworks. My case is the opposite,
>> whereby Artworks plus corresponding information from Artists and Galleries
>> is required.
>>
>
> These are definitely tools in the arsenal, but don't really speak to
> the core issue. I'll come back to this in a bit.
>
>> I am also aware that one could embed each gallery or artist in each artwork
>> or vice versa, but that seems to be a lot of duplication, and would make it
>> difficult to change artist or gallery information if required. Maybe this is
>> how it should be done?
>>
>
> As Sean pointed out, the second thing to overcome is data duplication.
> Storage is cheaper than CPU and network costs. If you don't expect
> data to change, you should try and exploit that.
>
>> For this small app, this is not going to be a performance issue. I am trying
>> rather to break away from a SQL/relational mindset and understand how this
>> would be better designed in a database like couch.
>>
>
> Your honest questions are quite refreshing because they remind me of
> my own initial exploration. There's lots of people talking but I've
> never found a really concise description of the difference between the
> approaches. There's lots of anecdotal blog posts about "scaling" and
> the such, but nothing I've seen really addresses the core issues.
>
>> Thanks,
>>
>> Steve
>>
>
> So, the bottom line is that the difference between relational and
> non-relational is two things. The first issue that people encounter
> when moving from relational to non is that the query planner is gone.
> This seems benign at fist, but affects a huge number of moving pieces
> in anything but non-trivial applications. If the query planner is
> gone, then all queries are less than optimal. Think as a DBA if
> *every* query was a full table scan. Most DBA's I know would scream at
> the thought. And also be highly confused why anyone would give that
> up.
>
> There are two answers here. One is about "horizontal scalability".
> Basically, SQL query planner optimizations don't generally work well
> when there's a network connection involved. Note that I'm not saying
> that SQL won't work in a clustered environment. There are companies
> that have proven otherwise. The issue is that the difference isn't
> clear to most people. Sure you can make SQL work on multiple machines,
> but the tradeoffs between scaling SQL horizontally vs vertically are
> complex and generally not well understood.
>
> The second issue is that non-relational stores have less information
> to do query optimizations. If you're in the crowd that says "I don't
> need scaling, I need to not be constrained by a schema" you're in this
> group. When you don't define a schema you're not allowing for lots of
> important information to be fed into some really awesome algorithms
> for data retrieval.
>
> There's a rough analogy here between statically and dynamically typed
> programming languages. Anyone that says one is better than the other
> is someone you shouldn't listen to. There are extremes where both are
> obviously better, but in the middle it gets into a complicated mess
> where its more important on the observer's point of view than any sort
> of theoretical points. If you have a well defined data usage, then
> static might be best. A rule of thumb is, if you can't print your data
> model on paper and have it be relevant for more than a couple months,
> then you're probably forcing things a bit.
>
> Granted, there are some obvious optimizations that non-relational
> stores could make by introspecting data, but its quite important to
> note the difference here. Relational stores regardless of scaling have
> the enormous benefit of having the data model defined up front.
>
> The best way I've learned to express this is that relational stores
> require you to invest in defining how you get data into the store.
> Non-relational means you have to invest in how to get data out.
>
> So, the end of the day decision is if the traditional relational
> stores provide you the ability to meet your customer's demands. For a
> lot of people over the last thirty or more years, this is true. It
> might be less than the optimal, but it generally works. Some people
> have recently realized that some of the assumptions in the relational
> model are artificially constraining. NOSQL is basically a response to
> this. A sudden realization of "If I ignore this one thing, my problem
> is a lot easier." This is the issue that drives people. "What
> constraint am I willing to give up?" The single piece of advice I
> would give any relational-to-non convert is, "Analyze you're
> requirements and ask any new data store what they give up in return
> for their promise."
>
> CouchDB gives up isolation and consistency (in the traditional sense
> of transactions), for distribution through replication. We prioritize
> being consistent over being fast. And we use a funky programming
> language to make it happen.
>
> Other data stores have different reasons for existing. Redis, which
> I'm quite fond of, is roughly our opposite in that its geared towards
> RAM and being fast. There are problems I would solve with either, but
> no single problem for which I would use both interchangeably.
|