couchdb-erlang mailing list archives

From Binbin Wang <binbinwang...@gmail.com>
Subject Re: code to handle the procedure - between POST doc and save data in filesystem
Date Fri, 02 Nov 2012 01:10:54 GMT
Hi Jan,

That's a great post, and it will be a big help to those of us who want to
dive into the CouchDB source. Thank you for sharing!

Regards & Thanks!
Binbin

2012/11/1 Jan Lehnardt <jan@apache.org>

> Heya David,
>
> On Nov 1, 2012, at 08:39 , 高大为 <fovecifer@gmail.com> wrote:
>
> > Hi Erlang/CouchDB,
> >
> > Recently I have been reading the CouchDB source code and have learned
> > a bit about how CouchDB boots up.
> >
> > Now I want to learn what happens when a request is sent, for example a
> > POST of a document: which parts of the CouchDB code are involved, from
> > handling the request to saving the data to disk / filesystem?
> >
> > In other words, POST doc ===> save data to disk / filesystem: which parts
> > of the code do the work for the whole procedure?
> >
> > Regards & Thanks!
> > David
>
>
> I’ve been waiting for an excuse to do a deep-dive like this, thanks! :)
>
> Given that you already dug around some yourself, I’ll omit the “how to get
> to the code” section. I am on current master 1a9143e.
>
> Let’s start at the HTTP API: src/couchdb/couch_httpd.erl
>
> `couch_httpd` is the main entry point for all request handling in
> CouchDB. Its responsibilities are:
>
>  - Read the CouchDB configuration to configure itself with all
>    settings a user wishes to have for handling requests.
>  - Set up a socket to listen on for incoming requests.
>  - Set up a list of request handlers that map API actions to
>    internal module calls that do actual work.
>  - Start Mochiweb to handle everything related to HTTP.
>  - Export a number of functions that the request handler sub
>    modules can use to handle requests.
>
> The sub-modules are all in src/couchdb/:
>
>  - couch_httpd.erl
>  - couch_httpd_auth.erl
>  - couch_httpd_db.erl
>  - couch_httpd_external.erl
>  - couch_httpd_misc_handlers.erl
>  - couch_httpd_oauth.erl
>  - couch_httpd_proxy.erl
>  - couch_httpd_rewrite.erl
>  - couch_httpd_stats_handlers.erl
>  - couch_httpd_vhost.erl
>
> The mapping of request handlers to URLs happens in the CouchDB
> configuration. The defaults are set in etc/couchdb/default.ini,
> which in source form is called etc/couchdb/default.ini.tpl.in,
> meaning that there are two layers of replacing variables going
> on until we get a final default.ini. For the request handlers,
> we can look at default.ini.tpl.in.
>
> The mapping of URLs to request handlers happens on three layers:
>
>  - Global handlers for things like `/`, `/_utils`, `/_config` etc.
>  - Database handlers like `/db/_all_docs` or `/db/_compact`.
>  - Design document handlers like `/db/_design/docid/_view`
>
>
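The handler mapping described above can be modelled as a lookup table from a URL segment to a `{Module, Function}` pair, with a default handler as the fallback. This is only a minimal sketch of the idea: the module name `handler_sketch`, the single table entry and the return values are made up for illustration and are not CouchDB's actual code.

```erlang
-module(handler_sketch).
-export([handlers/0, dispatch/2, all_docs/1, default_handler/1]).

%% Hypothetical stand-in for a parsed handler section of the config:
%% a path segment mapped to {Module, Function}.
handlers() ->
    [{<<"_all_docs">>, {?MODULE, all_docs}}].

%% Look up the segment; fall back to the default handler if nothing matches.
dispatch(Segment, Req) ->
    Default = {?MODULE, default_handler},
    {Mod, Fun} = proplists:get_value(Segment, handlers(), Default),
    Mod:Fun(Req).

%% Placeholder handlers that only report which one was chosen.
all_docs(Req) -> {all_docs, Req}.
default_handler(Req) -> {default, Req}.
```

Calling `handler_sketch:dispatch(<<"docid">>, Req)` ends up in `default_handler/1` because `docid` is not a registered special segment.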
> With this knowledge, let’s trace this HTTP request:
>
>     PUT /db/docid
>     ...
>     {"a":1}
>
> Or in `curl`:
>
>     $ curl -X PUT http://127.0.0.1:5984/db/docid -d '{"a":1}'
>
>
> The request goes to a `/db` URL, so we’ll have a look at the
> `[httpd_db_handlers]` section of default.ini.tpl.in:
>
>
>     [httpd_db_handlers]
>     _all_docs = {couch_mrview_http, handle_all_docs_req}
>     _changes = {couch_httpd_db, handle_changes_req}
>     _compact = {couch_httpd_db, handle_compact_req}
>     _design = {couch_httpd_db, handle_design_req}
>     _temp_view = {couch_mrview_http, handle_temp_view_req}
>     _view_cleanup = {couch_mrview_http, handle_cleanup_req}
>
> Hm, nothing that looks like a handler for creating documents.
>
> Let’s go back to couch_httpd.erl. In line 138 we see that we
> start Mochiweb with a list of handlers, first of all the
> `DefaultFun`, maybe we need to look at that. We are tracking
> it back to line 102. There’s a bit of gibberish about “arity”,
> we’ll ignore that for now. Then we see that we *do* rely on
> the config system:
>
>     couch_config:get("httpd", "default_handler"…).
>
> So let’s look at the `[httpd]` section of default.ini.tpl.in:
>
>     default_handler = {couch_httpd_db, handle_request}
>
> That looks promising, let’s find that in code, at
> src/couchdb/couch_httpd_db.erl, line 36.
>
> `handle_request()` first checks whether we want to create or
> delete a database, but when it sees we don’t, it passes our
> request along to `do_db_req()` (line 230), which turns out
> just to be a wrapper that opens a database and calls a callback,
> so back to where `do_db_req()` is called, we see `db_req/2` is
> passed as a callback.
>
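The "open a database, then call a callback" wrapper pattern described above can be sketched as follows. The `open/1` and `close/1` helpers and the `{db, Name}` handle are hypothetical placeholders for this sketch, not the real `couch_db` API.

```erlang
-module(with_db_sketch).
-export([with_db/3]).

%% Hypothetical open/close helpers; a real handle would wrap a file.
open(Name) -> {ok, {db, Name}}.
close({db, _Name}) -> ok.

%% Open the database, run the callback with the request and the handle,
%% and close the handle again even if the callback throws.
with_db(Name, Req, Fun) when is_function(Fun, 2) ->
    {ok, Db} = open(Name),
    try
        Fun(Req, Db)
    after
        close(Db)
    end.
```

The caller only supplies the part that differs per request (the callback), while opening and closing stay in one place.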
> Now `db_req()` has various clauses to differentiate the different
> HTTP request methods it is called with and to allow for all sorts
> of special URLs to be called. We are interested in PUT, but we
> don’t find that PUT is handled anywhere in particular. We do see,
> however, that all the clauses before the last-but-one handle
> something that is *not* our PUT, so we know that our clause is on
> line 464:
>
>     db_req(#httpd{path_parts=[_, DocId]}=Req, Db) ->
>         db_doc_req(Req, Db, DocId);
>
> This turns out to be yet another indirection, so let’s go with it.
> `db_doc_req` again has a number of clauses to deal with various
> request types. Lucky for us, there is a PUT clause on line 563:
>
>     db_doc_req(#httpd{method='PUT'}=Req, Db, DocId) ->
>
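Choosing a clause by HTTP method and path, as `db_req` and `db_doc_req` do, is ordinary Erlang pattern matching on a record in the function head. A minimal sketch, assuming a cut-down `#httpd{}` record with only two fields (the real record has more) and made-up return values:

```erlang
-module(clause_sketch).
-export([new_req/2, doc_req/1]).

%% Cut-down stand-in for the request record, for illustration only.
-record(httpd, {method, path_parts = []}).

%% Helper so callers do not need the record definition.
new_req(Method, PathParts) ->
    #httpd{method = Method, path_parts = PathParts}.

%% Clauses are tried top to bottom; the first matching head wins.
doc_req(#httpd{method = 'GET', path_parts = [_Db, DocId]}) ->
    {read, DocId};
doc_req(#httpd{method = 'PUT', path_parts = [_Db, DocId]}) ->
    {write, DocId};
doc_req(#httpd{}) ->
    method_not_allowed.
```

A PUT to `[<<"db">>, <<"docid">>]` skips the GET clause and lands in the second one.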
> First, this clause checks whether we have a valid `DocId`.
> Assuming we do, it checks whether the request is a HTTP multipart
> request or a regular one. We have a regular one and are lucky
> again, our part of code here is rather small:
>
>     Body = couch_httpd:json_body(Req),
>     Doc = couch_doc_from_req(Req, DocId, Body),
>     update_doc(Req, Db, DocId, Doc)
>
> The first line fetches the JSON document body from the `Req`
> variable and decodes it. At this point, `Body` should equal
> `{[{<<"a">>,1}]}`, the Erlang term that the JSON body we passed
> in with the request decodes to.
>
> The second line turns the JSON, together with the `DocId` into
> a CouchDB document.
>
> Finally, we pass all we have now to the `update_doc` function, which
> we check out later.
>
> `couch_doc_from_req()` figures out whether we are trying to update
> an existing doc with our PUT request, or whether we want to create
> a new one. In our case, not much is done. In the update case, we
> need to pass in a `rev=` query parameter, and that is checked here.
>
> In either case though, this function returns a value of the type
> `#doc{}`, which is a record that is defined in src/couchdb/couch_db.hrl,
> line 99, if you are curious.
>
> With all that in place, we can finally visit `update_doc()`. It again
> has a few clauses starting in line 716 (we are still in couch_httpd_db.erl).
>
> `update_doc` deals with a number of query parameters again until it finally
> calls `couch_db:update_doc()`.
>
> This is our entry into the innards of CouchDB.
>
> Enter `couch_db` in src/couchdb/couch_db.erl. Our function
> `update_doc()` is defined in line 422, and it ultimately seems to
> be a wrapper around `update_docs()` (plural) in the lines starting
> at 688. Update docs has two independent clauses:
>
>     update_docs(Db, Docs, Options, replicated_changes) ->
>
> and
>
>     update_docs(Db, Docs, Options, interactive_edit) ->
>
> The first one handles replications that can create conflicts in
> document revision lists. The second one deals with regular
> database operations. So that one is for us.
>
> Our `update_docs()` does a number of things:
>
>  - prepare for yet more request parameters.
>  - separate our `_local` docs and regular docs (ours is a regular one).
>  - validate our document against `validate_doc_update` functions, if they
>    exist.
>  - check whether we provided the correct `rev` in case of updates.
>
> And finally:
>
>     {ok, CommitResults} = write_and_commit(Db, DocBuckets4, NonRepDocs, Options2),
>
> Let’s jump there, line 831:
>
> After doing some more preparations that I will gloss over, we see
> that CouchDB keeps around an `UpdatePid` in the `#db{}` record that
> is passed down with us so far. This `UpdatePid` is the process ID of
> a process that deals with database updates.
>
> In CouchDB, each database has a single process handling writes to the
> database, to ensure a consistent database file.
>
> In `write_and_commit()` we send a message to that process with the message
> `update_docs` (in line 839):
>
>     UpdatePid ! {update_docs, self(), DocBuckets, NonRepDocs, MergeConflicts, FullCommit},
>
> Let’s see where that message is handled.
>
> We need to know that the process behind `UpdatePid` runs the
> `couch_db_updater` module. We would have found that out in
> `couch_db:init()`.
>
> The `update_docs` message is handled in src/couchdb/couch_db_updater.erl
> in line 223.
>
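The single-writer idea (one process per database that receives `update_docs` messages and handles them one at a time) can be sketched in a few lines. This is a toy model, not `couch_db_updater`: the state is just a list, and the `Ref` used to match the reply is an addition of this sketch.

```erlang
-module(writer_sketch).
-export([start/0, update_docs/2]).

%% Spawn the single writer process; its state is the list of docs so far.
start() ->
    spawn(fun() -> loop([]) end).

%% Client side: send the request, then block until the writer replies.
update_docs(Writer, Docs) ->
    Ref = make_ref(),
    Writer ! {update_docs, self(), Ref, Docs},
    receive
        {Ref, Result} -> Result
    after 5000 ->
        {error, timeout}
    end.

%% Writer side: messages are taken from the mailbox one at a time,
%% so two writes can never interleave.
loop(Written) ->
    receive
        {update_docs, From, Ref, Docs} ->
            NewWritten = Written ++ Docs,
            From ! {Ref, {ok, length(NewWritten)}},
            loop(NewWritten);
        stop ->
            ok
    end.
```

Any number of client processes can call `update_docs/2` concurrently; the writer's mailbox serialises them.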
> After the whole message has been received, all docs (in our case, a list
> with just our document) are passed to `update_docs_int()` (line 672).
>
> `update_docs_int()` handles access to CouchDB’s main database data structure,
> the B+-tree. In fact, there are two B+-trees in each database at the same
> time: the fulldocinfo_by_id_btree and the docinfo_by_seq_btree. The first
> one contains all document data indexed by document id. The second one
> includes pointers to the fulldocinfo btree indexed by update sequence. The
> by_seq btree is what drives CouchDB’s /_changes feature which in turn
> powers replication, compaction and view creation.
>
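To make the two-index layout concrete, here is a small in-memory model that uses `gb_trees` from the Erlang standard library in place of the on-disk B+-trees. The module name, the tuple layout and the `changes_since/2` helper are assumptions of this sketch; the real trees store richer records.

```erlang
-module(index_sketch).
-export([new/0, save/3, changes_since/2]).

%% State: {ById, BySeq, LastSeq}.
new() ->
    {gb_trees:empty(), gb_trees:empty(), 0}.

%% Store a document body under Id and record the write in the by-seq index.
save(Id, Body, {ById, BySeq, LastSeq}) ->
    Seq = LastSeq + 1,
    %% An updated doc gets a new seq, so its old by-seq entry is removed.
    BySeq1 = case gb_trees:lookup(Id, ById) of
        {value, {OldSeq, _OldBody}} -> gb_trees:delete(OldSeq, BySeq);
        none -> BySeq
    end,
    ById2 = gb_trees:enter(Id, {Seq, Body}, ById),
    BySeq2 = gb_trees:insert(Seq, Id, BySeq1),
    {ById2, BySeq2, Seq}.

%% Ids of all documents changed after Since, oldest change first.
changes_since(Since, {_ById, BySeq, _LastSeq}) ->
    [Id || {Seq, Id} <- gb_trees:to_list(BySeq), Seq > Since].
```

Walking the by-seq tree from a given sequence number is what makes a changes-style feed cheap: a reader only needs to remember the last sequence it saw.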
> A new document is inserted in both indexes in lines 705 and 706:
>
>     {ok, DocInfoByIdBTree2} = couch_btree:add_remove(DocInfoByIdBTree, IndexFullDocInfos, []),
>     {ok, DocInfoBySeqBTree2} = couch_btree:add_remove(DocInfoBySeqBTree, IndexDocInfos, RemoveSeqs),
>
> At this point, our doc lives in the database structure and has been
> assigned a new `rev`, but it has not yet been written to disk. The
> last operation in `update_docs_int()` is `commit_data()`, which
> sounds promising. Let’s jump down.
>
> The definition starts in line 781, the relevant bit for us in line 785.
> The way CouchDB writes changes to disk is as follows:
>
>  1. write all changes to the data and index trees to the disk.
>  3. write a header to disk that has the current pointers to the index
>     trees that we wrote in 1.
>
> Writing to disk does not yet mean that the data actually arrived on
> disk. It might, but we only know for sure after we call the `fsync`
> system call. From Erlang, we call `couch_file:sync()`.
>
> Now there are different classes of behaviour possible in the list above.
> Notice how I left out 2.
>
> Writing a CouchDB file (which can be either a database file or a view
> index) can give different storage guarantees. The options are to fsync
> before the header is written, or after, or both. An fsync is a potentially
> expensive operation, so we have fine-grained control over this here.
>
> The full list is:
>
>  1. write all changes to the data and index trees to the disk.
>  2. fsync.
>  3. write a header to disk that has the current pointers to the index
>     trees that we wrote in 1.
>  4. fsync.
>
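The four steps above can be sketched with the standard `file` module. This is a simplified model under stated assumptions: it appends two binaries to a plain file, and the option atoms `before_header` and `after_header` are names chosen for this sketch to switch each fsync on or off. It does not reproduce CouchDB's file format.

```erlang
-module(commit_sketch).
-export([commit/3]).

%% Append Data, then Header, to the file at Path. Opts may contain
%% before_header and/or after_header to request an fsync at that point.
commit(Path, {Data, Header}, Opts) ->
    {ok, Fd} = file:open(Path, [append, raw, binary]),
    try
        ok = file:write(Fd, Data),                 % 1. data and index changes
        ok = maybe_sync(Fd, before_header, Opts),  % 2. optional fsync
        ok = file:write(Fd, Header),               % 3. header pointing at 1.
        ok = maybe_sync(Fd, after_header, Opts)    % 4. optional fsync
    after
        file:close(Fd)
    end.

%% Only call file:sync/1 when the caller asked for it at this stage.
maybe_sync(Fd, When, Opts) ->
    case lists:member(When, Opts) of
        true -> file:sync(Fd);
        false -> ok
    end.
```

With both options set, a header can only reach disk after the data it points to, which is the strongest of the guarantees described above; leaving options out trades safety for fewer fsync calls.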
> 2.-4. happen in `commit_data()`, but wait, where did 1. happen?
>
> For that, we need to jump back to `update_docs_int()`, line 697:
>
>     % Write out the document summaries (the bodies are stored in the nodes
>     % of the trees, the attachments are already written to disk)
>     {ok, FlushedFullDocInfos} = flush_trees(Db2, NewFullDocInfos, []),
>
> `flush_trees()` is defined in line 519. It iterates over the new data
> in the database and recursively writes it to disk in line 547:
>
>     {ok, NewSummaryPointer, SummarySize} =
>         couch_file:append_raw_chunk(Fd, Summary),
>
> Finally, we drop into `couch_file`, the lowest level of CouchDB.
> `append_raw_chunk()` is defined in line 111 and it is just a small
> wrapper that sends the `append_bin` message to the process that
> manages the file descriptor for our database file.
>
> `append_bin` is handled in line 373. It takes the data to be
> written and pads it out to make it a multiple of `?SIZE_BLOCK`
> (which is 4096 bytes).
>
> In line 376 our data is finally written to disk:
>
>     file:write(Fd, Blocks)
>
> From here, we go back up into `couch_db_updater` and deal with the
> header business we looked at earlier. From there, control returns to
> `couch_db`, which waits for confirmation that the data was written.
> When that shows up, it hands the result back to `couch_httpd_db`,
> which uses `couch_httpd` to report the successful write of the
> document in an HTTP response.
>
> This concludes our little tour.
>
> I hope this was helpful! Let us know if there are any questions.
>
> Jan
> --
>
>
>


-- 
Wang.bupt
