incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven Ringo <goo...@stevenringo.com>
Subject Re: Database design question
Date Wed, 28 Dec 2011 11:42:16 GMT
Thanks for this comprehensive explanation. Have an even deeper 
understanding now. Good to see I am not the only one that had to start 
over with the mental models of how a system might work.

You make some very interesting points about the planner (or lack thereof 
in CouchDB).

And I see you have also experienced what happens when an ORM abstraction 
is misunderstood in terms of its underlying execution. I see this a lot 
with Rails. As an aside I have also noticed must CouchDB to Rails ORMs 
trying to make CouchDB perform like a SQL database. This is precisely 
what I was trying to avoid.

You also make some great points around compromises. I now am on the road 
to having a much better idea as to what types of data and their 
exchanges between one another are best suited to SQL and NoSQL respectively.

The app I am referring to would have worked very well with a 
conventional SQL store if it were server-only. SQL's design and 
normalisation would be its strength, and would require much less code. 
However as soon as you throw mobile replication and offline use into the 
mix, the SQL store becomes its weakness. Its here where NoSQL shines, 
and the minor tradeoffs of denormalisation pale in comparison to the 
benefits.

You make an analogy between static and dynamic languages. I think 
there's another great analogy: centralised vs decentralised version 
control systems (or Git vs Subversion if you like). I have often 
described CouchDB as the Git of the database world. Replication and 
peer-to-peer is ingrained into Git as it is to Couch. SQL and Subversion 
are both strong single server solutions. Git has tradeoffs including a 
steeper learning curve and some conceptual issues to get one's head 
around at the beginning. These are very nicely traded for cheap and easy 
branching, offline use and peer-replication. Git is terrible at managing 
large binary attachments, whereas Subversion handles this with aplomb. 
However try use Subversion at 30 000 feet in an aluminium tube over the 
Atlantic and you're stuck...

I am glad this discussion has created some food for thought. Thanks 
again for your contribution.

Paul Davis wrote:
> I'll prefix this with the fact that Sean does an excellent point of
> responding your core concerns. I've had a bit of this brewing as of
> late for a various number of reasons. But your email provokes a bit of
> a response so I think this is as good of a place as any to write down
> some thoughts.
>
> Also, to preface, my nascent background was to be taught SQL and the
> relational model by a couple guys that are deep into it. Before I
> moved to non-relational datastores I was deeply entrenched in a purist
> relational model. Purist beyond even most DBA's because I was taught
> from practical application of theory. Things like "null != null" and
> "null values are a hack around Codd's rules" are relatively benign but
> points that I find few people have completely groked in the real
> world. I was lucky to have these two guys teach me from the "this is
> proper and this is why we ignore that" point of view.
>
> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo<google@stevenringo.com>  wrote:
>> Hi all,
>>
>> (I originally posted this on the mobile-couchbase list, but was recommended
>> to post it here instead, since this is more of a general couchdb question)
>>
>> I am struggling with something that I know would be quite trivial to do in
>> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
>> of experience with SQL is clouding my judgement and so I am trying rather to
>> break away from the SQL/relational mindset and understand how this would be
>> better designed in couch.
>>
>
> This is good. The first thing to realize in learning a new tool is
> that its not the old tool. New tools behave differently and
> differences should be embraced rather than rejected outright.
>
>> I am developing an iOS art gallery app, and wish to display information
>> about artworks, artists and galleries.
>>
>> The menu screen requires a list of all artworks together with the gallery
>> they are currently housed at and the artist that created the artwork:
>>
>>   * artwork title
>>   * artwork medium
>>   * artwork dimensions
>>   * artist full_name
>>   * gallery name,
>>   * gallery city
>>
>> for each artwork, respectively.
>>
>> Imagine the following objects in my database (I am not including all the
>> fields for sake of brevity. i.e. only the ones that are important).
>>
>> |     Artist
>>     ------
>>     _id
>>     full_name
>>     birth_year
>>     birth_place
>>
>>     Gallery
>>     -------
>>     _id
>>     name
>>     city
>>
>>     Artwork
>>     -------
>>     _id
>>     title
>>     medium
>>     dimensions
>>     artist_id_ (foreign key)
>>     gallery_id_ (foreign key)
>> |
>>
>> In SQL it would be as easy as one query with two joins:
>>
>> |SELECT
>>     artwork.title,
>>     artwork.medium,
>>     artwork.dimensions,
>>     artist.full_name,
>>     gallery.name,
>>     gallery.city
>> FROM
>>     artwork INNER JOIN artist
>>     ON artwork.artist_id_ = artist._id INNER JOIN gallery
>>     ON artwork.gallery_id_ = gallery._id
>> |
>>
>
> The fact that you took the time to specify INNER JOIN is a good tip
> you have a thorough understanding of SQL.
>
>>  From this I would be able to get the data I need in one query (i.e. one call
>> to the database).
>>
>> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>     Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>     David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
>> |
>>
>> Using couchdb I am storing the data in a very similar way to the way I would
>> in a relational database, represented as something along the lines of:
>>
>
> Roughly speaking, this is your first mistake.
>
>> |{
>>    "type"       : "Artwork",
>>    "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>    "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>    "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>    "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>    "title"      : "Guernica",
>>    "medium"     : "Oil on canvas",
>>    "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>> }
>>
>> {
>>    "type"       : "Artist",
>>    "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>    "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>    "full_name"  : "Pablo Picasso",
>>    "birth_year" : "1881"
>> }
>>
>> {
>>    "type"       : "Gallery",
>>    "_id"        : "d665afaedde12b44c5159cf1782d44af",
>>    "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>>    "name"       : "Museo Reina Sofia"
>> }
>> |
>>
>> To get the same result in couch, I would have to do one query to get all the
>> artworks, e.g.
>>
>> |function(doc) {
>>   if(doc.type&&    doc.type == 'Artist') {
>>     emit(doc._id,doc);
>>   }
>> }
>> |
>>
>> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
>> returned per |Artwork| respectively to fill in the missing information. This
>> is usually known as the |n+1| queries problem (in this case |2n+1|).
>>
>> I have always worked on the assumption that the database should be leveraged
>> to do as much of the heavy lifting as possible with as few network hits as
>> possible (i.e. SQL is always faster than client manipulation). With a list
>> of 120 artworks, this  means 241 HTTP GETs to the database.
>>
>
> This is your second mistake. And more to the point, this is the first
> mistake I had to overcome myself when switching from relational to
> non-relational data stores. Even more specifically, my stumbling block
> was exactly the same as yours. The assumption that the number of
> requests/queries is to be minimized.
>
> I was taught quite specifically to minimize the number of queries per
> page load. Current ORM layers are still abhorrent to me because of the
> ease in which they translate directly to your worst case, each row in
> the list ends up being three queries per row. Specifically, people
> must understand both the relational model and their ORM layer to avoid
> it from doing the Wrong Thing&trade;.
>
> The conceptual realization that we need to make here is that SQL isn't
> magical. When we issue a query, it goes through the planner and ends
> up doing (seriously advanced) optimizations over the indexes and data
> retrieval methods it has available. The "minimize number of queries"
> mantra is a bit overly broad here. In practical terms it has the
> effect of making sure that we're making the best use of indexes and
> look up patterns. As an analogy, its akin to writing better C because
> we might have some intuition on how the assembler translation might be
> generated.
>
>> I am aware of "linked documents" in views
>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>> however this seems to only work with one related row.
>>
>> I can't really see how "view collation"
>> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
>> view collation the correlated information ends up being on different rows,
>> and I think that would be better suited for where say a single Artists or
>> Gallery is displayed with associated Artworks. My case is the opposite,
>> whereby Artworks plus corresponding information from Artists and Galleries
>> is required.
>>
>
> These are definitely tools in the arsenal, but don't really speak to
> the core issue. I'll come back to this in a bit.
>
>> I am also aware that one could embed each gallery or artist in each artwork
>> or vice versa, but that seems to be a lot of duplication, and would make it
>> difficult to change artist or gallery information if required. Maybe this is
>> how it should be done?
>>
>
> As Sean pointed out, the second thing to overcome is data duplication.
> Storage is cheaper than CPU and network costs. If you don't expect
> data to change, you should try and exploit that.
>
>> For this small app, this is not going to be a performance issue. I am trying
>> rather to break away from a SQL/relational mindset and understand how this
>> would be better designed in a database like couch.
>>
>
> Your honest questions are quite refreshing because they remind me of
> my own initial exploration. There's lots of people talking but I've
> never found a really concise description of the difference between the
> approaches. There's lots of anecdotal blog posts about "scaling" and
> the such, but nothing I've seen really addresses the core issues.
>
>> Thanks,
>>
>> Steve
>>
>
> So, the bottom line is that the difference between relational and
> non-relational is two things. The first issue that people encounter
> when moving from relational to non is that the query planner is gone.
> This seems benign at fist, but affects a huge number of moving pieces
> in anything but non-trivial applications. If the query planner is
> gone, then all queries are less than optimal. Think as a DBA if
> *every* query was a full table scan. Most DBA's I know would scream at
> the thought. And also be highly confused why anyone would give that
> up.
>
> There are two answers here. One is about "horizontal scalability".
> Basically, SQL query planner optimizations don't generally work well
> when there's a network connection involved. Note that I'm not saying
> that SQL won't work in a clustered environment. There are companies
> that have proven otherwise. The issue is that the difference isn't
> clear to most people. Sure you can make SQL work on multiple machines,
> but the tradeoffs between scaling SQL horizontally vs vertically are
> complex and generally not well understood.
>
> The second issue is that non-relational stores have less information
> to do query optimizations. If you're in the crowd that says "I don't
> need scaling, I need to not be constrained by a schema" you're in this
> group. When you don't define a schema you're not allowing for lots of
> important information to be fed into some really awesome algorithms
> for data retrieval.
>
> There's a rough analogy here between statically and dynamically typed
> programming languages. Anyone that says one is better than the other
> is someone you shouldn't listen to. There are extremes where both are
> obviously better, but in the middle it gets into a complicated mess
> where its more important on the observer's point of view than any sort
> of theoretical points. If you have a well defined data usage, then
> static might be best. A rule of thumb is, if you can't print your data
> model on paper and have it be relevant for more than a couple months,
> then you're probably forcing things a bit.
>
> Granted, there are some obvious optimizations that non-relational
> stores could make by introspecting data, but its quite important to
> note the difference here. Relational stores regardless of scaling have
> the enormous benefit of having the data model defined up front.
>
> The best way I've learned to express this is that relational stores
> require you to invest in defining how you get data into the store.
> Non-relational means you have to invest in how to get data out.
>
> So, the end of the day decision is if the traditional relational
> stores provide you the ability to meet your customer's demands. For a
> lot of people over the last thirty or more years, this is true. It
> might be less than the optimal, but it generally works. Some people
> have recently realized that some of the assumptions in the relational
> model are artificially constraining. NOSQL is basically a response to
> this. A sudden realization of "If I ignore this one thing, my problem
> is a  lot easier." This is the issue that drives people. "What
> constraint am I willing to give up?" The single piece of advice I
> would give any relational-to-non convert is, "Analyze you're
> requirements and ask any new data store what they give up in return
> for their promise."
>
> CouchDB gives up isolation and consistency (in the traditional sense
> of transactions), for distribution through replication. We prioritize
> being consistent over being fast. And we use a funky programming
> language to make it happen.
>
> Other data stores have different reasons for existing. Redis, which
> I'm quite fond of, is roughly our opposite in that its geared towards
> RAM and being fast. There are problems I would solve with either, but
> no single problem for which I would use both interchangeably.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message