couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Database design question
Date Wed, 28 Dec 2011 07:38:12 GMT
I'll prefix this with the fact that Sean does an excellent point of
responding your core concerns. I've had a bit of this brewing as of
late for a various number of reasons. But your email provokes a bit of
a response so I think this is as good of a place as any to write down
some thoughts.

Also, to preface, my nascent background was to be taught SQL and the
relational model by a couple guys that are deep into it. Before I
moved to non-relational datastores I was deeply entrenched in a purist
relational model. Purist beyond even most DBA's because I was taught
from practical application of theory. Things like "null != null" and
"null values are a hack around Codd's rules" are relatively benign but
points that I find few people have completely groked in the real
world. I was lucky to have these two guys teach me from the "this is
proper and this is why we ignore that" point of view.

On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo <google@stevenringo.com> wrote:
> Hi all,
>
> (I originally posted this on the mobile-couchbase list, but was recommended
> to post it here instead, since this is more of a general couchdb question)
>
> I am struggling with something that I know would be quite trivial to do in
> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
> of experience with SQL is clouding my judgement and so I am trying rather to
> break away from the SQL/relational mindset and understand how this would be
> better designed in couch.
>

This is good. The first thing to realize in learning a new tool is
that its not the old tool. New tools behave differently and
differences should be embraced rather than rejected outright.

> I am developing an iOS art gallery app, and wish to display information
> about artworks, artists and galleries.
>
> The menu screen requires a list of all artworks together with the gallery
> they are currently housed at and the artist that created the artwork:
>
>  * artwork title
>  * artwork medium
>  * artwork dimensions
>  * artist full_name
>  * gallery name,
>  * gallery city
>
> for each artwork, respectively.
>
> Imagine the following objects in my database (I am not including all the
> fields for sake of brevity. i.e. only the ones that are important).
>
> |     Artist
>    ------
>    _id
>    full_name
>    birth_year
>    birth_place
>
>    Gallery
>    -------
>    _id
>    name
>    city
>
>    Artwork
>    -------
>    _id
>    title
>    medium
>    dimensions
>    artist_id_ (foreign key)
>    gallery_id_ (foreign key)
> |
>
> In SQL it would be as easy as one query with two joins:
>
> |SELECT
>    artwork.title,
>    artwork.medium,
>    artwork.dimensions,
>    artist.full_name,
>    gallery.name,
>    gallery.city
> FROM
>    artwork INNER JOIN artist
>    ON artwork.artist_id_ = artist._id INNER JOIN gallery
>    ON artwork.gallery_id_ = gallery._id
> |
>

The fact that you took the time to specify INNER JOIN is a good tip
you have a thorough understanding of SQL.

> From this I would be able to get the data I need in one query (i.e. one call
> to the database).
>
> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>    Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>    David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
> |
>
> Using couchdb I am storing the data in a very similar way to the way I would
> in a relational database, represented as something along the lines of:
>

Roughly speaking, this is your first mistake.

> |{
>   "type"       : "Artwork",
>   "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>   "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>   "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>   "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>   "title"      : "Guernica",
>   "medium"     : "Oil on canvas",
>   "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
> }
>
> {
>   "type"       : "Artist",
>   "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>   "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>   "full_name"  : "Pablo Picasso",
>   "birth_year" : "1881"
> }
>
> {
>   "type"       : "Gallery",
>   "_id"        : "d665afaedde12b44c5159cf1782d44af",
>   "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>   "name"       : "Museo Reina Sofia"
> }
> |
>
> To get the same result in couch, I would have to do one query to get all the
> artworks, e.g.
>
> |function(doc) {
>  if(doc.type&&  doc.type == 'Artist') {
>    emit(doc._id,doc);
>  }
> }
> |
>
> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
> returned per |Artwork| respectively to fill in the missing information. This
> is usually known as the |n+1| queries problem (in this case |2n+1|).
>
> I have always worked on the assumption that the database should be leveraged
> to do as much of the heavy lifting as possible with as few network hits as
> possible (i.e. SQL is always faster than client manipulation). With a list
> of 120 artworks, this  means 241 HTTP GETs to the database.
>

This is your second mistake. And more to the point, this is the first
mistake I had to overcome myself when switching from relational to
non-relational data stores. Even more specifically, my stumbling block
was exactly the same as yours. The assumption that the number of
requests/queries is to be minimized.

I was taught quite specifically to minimize the number of queries per
page load. Current ORM layers are still abhorrent to me because of the
ease in which they translate directly to your worst case, each row in
the list ends up being three queries per row. Specifically, people
must understand both the relational model and their ORM layer to avoid
it from doing the Wrong Thing &trade;.

The conceptual realization that we need to make here is that SQL isn't
magical. When we issue a query, it goes through the planner and ends
up doing (seriously advanced) optimizations over the indexes and data
retrieval methods it has available. The "minimize number of queries"
mantra is a bit overly broad here. In practical terms it has the
effect of making sure that we're making the best use of indexes and
look up patterns. As an analogy, its akin to writing better C because
we might have some intuition on how the assembler translation might be
generated.

> I am aware of "linked documents" in views
> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
> however this seems to only work with one related row.
>
> I can't really see how "view collation"
> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
> view collation the correlated information ends up being on different rows,
> and I think that would be better suited for where say a single Artists or
> Gallery is displayed with associated Artworks. My case is the opposite,
> whereby Artworks plus corresponding information from Artists and Galleries
> is required.
>

These are definitely tools in the arsenal, but don't really speak to
the core issue. I'll come back to this in a bit.

> I am also aware that one could embed each gallery or artist in each artwork
> or vice versa, but that seems to be a lot of duplication, and would make it
> difficult to change artist or gallery information if required. Maybe this is
> how it should be done?
>

As Sean pointed out, the second thing to overcome is data duplication.
Storage is cheaper than CPU and network costs. If you don't expect
data to change, you should try and exploit that.

> For this small app, this is not going to be a performance issue. I am trying
> rather to break away from a SQL/relational mindset and understand how this
> would be better designed in a database like couch.
>

Your honest questions are quite refreshing because they remind me of
my own initial exploration. There's lots of people talking but I've
never found a really concise description of the difference between the
approaches. There's lots of anecdotal blog posts about "scaling" and
the such, but nothing I've seen really addresses the core issues.

> Thanks,
>
> Steve
>

So, the bottom line is that the difference between relational and
non-relational is two things. The first issue that people encounter
when moving from relational to non is that the query planner is gone.
This seems benign at fist, but affects a huge number of moving pieces
in anything but non-trivial applications. If the query planner is
gone, then all queries are less than optimal. Think as a DBA if
*every* query was a full table scan. Most DBA's I know would scream at
the thought. And also be highly confused why anyone would give that
up.

There are two answers here. One is about "horizontal scalability".
Basically, SQL query planner optimizations don't generally work well
when there's a network connection involved. Note that I'm not saying
that SQL won't work in a clustered environment. There are companies
that have proven otherwise. The issue is that the difference isn't
clear to most people. Sure you can make SQL work on multiple machines,
but the tradeoffs between scaling SQL horizontally vs vertically are
complex and generally not well understood.

The second issue is that non-relational stores have less information
to do query optimizations. If you're in the crowd that says "I don't
need scaling, I need to not be constrained by a schema" you're in this
group. When you don't define a schema you're not allowing for lots of
important information to be fed into some really awesome algorithms
for data retrieval.

There's a rough analogy here between statically and dynamically typed
programming languages. Anyone that says one is better than the other
is someone you shouldn't listen to. There are extremes where both are
obviously better, but in the middle it gets into a complicated mess
where its more important on the observer's point of view than any sort
of theoretical points. If you have a well defined data usage, then
static might be best. A rule of thumb is, if you can't print your data
model on paper and have it be relevant for more than a couple months,
then you're probably forcing things a bit.

Granted, there are some obvious optimizations that non-relational
stores could make by introspecting data, but its quite important to
note the difference here. Relational stores regardless of scaling have
the enormous benefit of having the data model defined up front.

The best way I've learned to express this is that relational stores
require you to invest in defining how you get data into the store.
Non-relational means you have to invest in how to get data out.

So, the end of the day decision is if the traditional relational
stores provide you the ability to meet your customer's demands. For a
lot of people over the last thirty or more years, this is true. It
might be less than the optimal, but it generally works. Some people
have recently realized that some of the assumptions in the relational
model are artificially constraining. NOSQL is basically a response to
this. A sudden realization of "If I ignore this one thing, my problem
is a  lot easier." This is the issue that drives people. "What
constraint am I willing to give up?" The single piece of advice I
would give any relational-to-non convert is, "Analyze you're
requirements and ask any new data store what they give up in return
for their promise."

CouchDB gives up isolation and consistency (in the traditional sense
of transactions), for distribution through replication. We prioritize
being consistent over being fast. And we use a funky programming
language to make it happen.

Other data stores have different reasons for existing. Redis, which
I'm quite fond of, is roughly our opposite in that its geared towards
RAM and being fast. There are problems I would solve with either, but
no single problem for which I would use both interchangeably.

Mime
View raw message