couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Filipe David Manana <>
Subject Re: About possibly reverting COUCHDB-767
Date Sun, 07 Nov 2010 19:52:09 GMT
On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <> wrote:
> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>> Hi,
>> Regarding the change introduced by the ticket:
>> (opening the same file in a different process and call fsync on the
>> new file descriptor through the new process)
>> I found out that it's not a recomended practice. I posted the
>> following question to the ext4 development mailing list:
> Reading that thread it seems quite unlikely that we're getting into any trouble with
this patch.  But it does seem like we should be considering alternatives.

Right, but unlikely does not mean never, unfortunately.

>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>> 1' command) that when running a writes only test with relaximation
>> (200 write processes), disk write activity is not continuous. Without
>> this patch, there's continuous (every 1 second) write activity.
> I'm confused by this statement. You must be talking about relaximation runs with delayed_commits
= true, right?  Why do you think you see larger intervals between write activity with the
optimization from COUCHDB-767?  Have you measured the time it takes to open the extra FD?
 In my tests that was a sub-millisecond operation, but maybe you've uncovered something else.

No, it happens for tests with delayed_commits = false. The only
possible explanation I see for the variance might be related to the
Erlang VM scheduler decisions about when to start/run that process.
Nevertheless, I dont know the exact cause, but the fsync run frequency
varies a lot.

>> This also makes performance comparison tests with relaximation much harder
>> to analyse, as the peak variation is much higher and not periodic.
> Still confused.  I'm guessing the increased variance only shows up on the writes graph
- reads certainly ought to have decreased variance because the fsync occurs out of band.  Is
the issue that the fsync takes longer when it requires a new FD, or am I reading this all

Yes, for writes.

> <rant>Any variance-related statements based on relaximation results are purely
qualitative, since relaximation does not report measurement variance.  Similarly, any claims
about relative improvements in response times have an unknown statistical significance.</rant>

Right. I didn't meant to say that relaximation is always right or that
we should be blindily guided by relaximation reports.

>> For the goal of not having readers getting blocked by fsync calls (and
>> write calls), I would propose using a separate couch_file process just
>> for read operations. I have a branch in my github for this (with
>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>> tests are very positive, both reads and writes get better response
>> times and throughput:
> I'd like to propose an alternative optimization, which is to keep a dedicated file descriptor
open in the couch_db_updater process and use that file descriptor for _all_ IO initiated by
the db_updater.  The advantage is that the db_updater does not need to do any message passing
for disk IO, and thus does not slow down when the incoming message queue is large.  A message
queue much much larger than the number of concurrent writers can occur if a user writes with
batch=ok, and it can also happen rather easily in a BigCouch cluster.

I don't see how that will improve things, since all write operations
will still be done in a serialized manner. Since only couch_db_updater
writes to the DB file, and since access to the couch_db_updater is
serialized, to me it only seems that you're solution avoids one level
of indirection (the couch_file process). I don't see how, when using a
couch_file only for writes, you get the message queue for that
couc_file process full of write messages.

Also, what I did on that branch is a bit more generic, as it works for
view index files as well, and doesn't introduce significant changes
elsewhere except in couch_file.erl. Of course your solution might be
extended to the view updater process as well easily, I don't have
anything against it.

Anyway, +1.

>>  (relaximation test)
>> The test does a direct comparsion with trunk (also with COUCHDB-767
>> reverted) and was run like this:
>> $ node tests/compare_write_and_read.js --wclients 300 --rclients 150 \
>>    -name1 2_couch_files_no_batch_reads -name2 trunk \
>>    -url1 http://localhost:5984/ -url2 http://localhost:5985/ \
>>    --duration 120
> Is the graph you posted a comparison with trunk, or with trunk minus COUCHDB-767?  I
couldn't parse your statement.

With trunk minus COUCHDB-767.
> Side note: I linked to a relaximation test in COUCHDB-767, but that link (to
went dead. Is the graph still available somewhere?

No idea.

> That patch looks like it probably improves the requests per second for both reads and
writes, as it should if you have a multicore server and/or you've turned on the async thread
pool.  Pretty difficult to say if the change is statistically significant, though.

Well, yes. But at least write operations (started by couch_db_updater
for e.g.) no longer wait for read operations to be processed (issued
by couch_db.erl for e.g.) and the other way around. This was tested
with the +A flag set to 4 (default on trunk) and set to 16. For both
cases the read and write performance was better (according to the
relaximation tests).

>> This approach, of using a file descriptor just for reads and another
>> one just for writes (but both referring to the same file), seems to be
>> safe:
>> Race conditions shouldn't happen, as each read call needs an offset,
>> and the offset is only known after a previous write calls finished.
> I'm definitely not worried about race conditions, our DB updating logic seems to prevent
that several times over.
>> Thoughts on this? Should we revert COUCHDB-767? Integrate the 2
>> couch_files strategy into trunk? (only after 1.1 has its own branch)
> I'm against reverting COUCHDB-767, but I guess it deserves more research to see if any
of our supported platforms actually implement an fsync() that does not sync writes from another
file descriptor for the same file.  If we do find a platform like that we should certainly

The fact that the behaviour is not guaranteed by POSIX or the Single
Unix Specification worries me (not to mention Windows).

> I'm probably in favor of a two file descriptors per DB approach, but I think attaching
the second fd directly to the db_updater will be the winning strategy.  Best,
> Adam
>> cheers
>> --
>> Filipe David Manana,
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."

Filipe David Manana,,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

View raw message