couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: About possibly reverting COUCHDB-767
Date Sun, 07 Nov 2010 19:20:00 GMT
On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:

> Hi,
> 
> Regarding the change introduced by the ticket:
> 
> https://issues.apache.org/jira/browse/COUCHDB-767
> 
> (opening the same file in a different process and calling fsync on
> the new file descriptor through the new process)
> 
> I found out that it's not a recommended practice. I posted the
> following question to the ext4 development mailing list:
> 
> http://www.spinics.net/lists/linux-ext4/msg21388.html

Reading that thread it seems quite unlikely that we're getting into any trouble with this
patch.  But it does seem like we should be considering alternatives.
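For concreteness, the pattern COUCHDB-767 introduced can be sketched in a few lines. This is a minimal POSIX illustration (in Python, not CouchDB's actual Erlang code); the path and payload are made up. On the platforms where this is safe, fsync() flushes the *file*, not the descriptor, so data written through one fd is durable after an fsync on a second fd opened on the same file:

```python
# Sketch of the COUCHDB-767 pattern: one descriptor does the writes,
# a second descriptor opened on the same file does the fsync.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "db.couch")  # hypothetical db file

fd_w = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)  # "writer" process's fd
fd_s = os.open(path, os.O_WRONLY)                      # "syncer" process's fd

os.write(fd_w, b"some document body")  # write through the first fd
os.fsync(fd_s)                         # sync through the second fd

# The written data is visible (and, on well-behaved filesystems,
# durable) even though fsync ran on a different descriptor.
with open(path, "rb") as f:
    assert f.read() == b"some document body"

os.close(fd_w)
os.close(fd_s)
```

Whether every supported platform actually honors this cross-descriptor fsync semantic is exactly the open question in the ext4 thread above.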

> Also, with this patch I verified (on Solaris, with the 'zpool iostat
> 1' command) that when running a writes only test with relaximation
> (200 write processes), disk write activity is not continuous. Without
> this patch, there's continuous (every 1 second) write activity.

I'm confused by this statement.  You must be talking about relaximation runs with
delayed_commits = true, right?  Why do you think you see larger intervals between write
activity with the optimization from COUCHDB-767?  Have you measured the time it takes to
open the extra FD?  In my tests that was a sub-millisecond operation, but maybe you've
uncovered something else.

> This also makes performance comparison tests with relaximation much harder
> to analyse, as the peak variation is much higher and not periodic.

Still confused.  I'm guessing the increased variance only shows up on the writes graph - reads
certainly ought to have decreased variance because the fsync occurs out of band.  Is the issue
that the fsync takes longer when it requires a new FD, or am I reading this all wrong?

<rant>Any variance-related statements based on relaximation results are purely qualitative,
since relaximation does not report measurement variance.  Similarly, any claims about relative
improvements in response times have an unknown statistical significance.</rant>
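To illustrate the point in the rant: the kind of summary statistics that would make these comparisons quantitative is straightforward to compute from raw response times. The samples below are invented for illustration, not relaximation output:

```python
# Hypothetical response-time samples in milliseconds -- made-up data,
# only here to show the computation relaximation doesn't report.
import math
import statistics

samples_a = [4.1, 3.9, 4.4, 5.0, 4.2, 3.8, 4.6, 4.3]  # e.g. branch A
samples_b = [4.8, 5.1, 4.7, 5.6, 4.9, 5.3, 5.0, 5.2]  # e.g. branch B

def summary(xs):
    mean = statistics.mean(xs)
    var = statistics.variance(xs)      # sample variance
    sem = math.sqrt(var / len(xs))     # standard error of the mean
    # crude 95% interval assuming approximate normality (z ~= 1.96)
    return mean, var, (mean - 1.96 * sem, mean + 1.96 * sem)

mean_a, var_a, ci_a = summary(samples_a)
mean_b, var_b, ci_b = summary(samples_b)

# Non-overlapping confidence intervals suggest (informally) that the
# difference in means is real rather than noise.
print("A:", ci_a, "B:", ci_b)
```

With only means plotted, as relaximation does, none of this can be read off the graphs.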

> For the goal of not having readers getting blocked by fsync calls (and
> write calls), I would propose using a separate couch_file process just
> for read operations. I have a branch in my github for this (with
> COUCHDB-767 reverted). It needs to be polished, but the relaximation
> tests are very positive, both reads and writes get better response
> times and throughput:
> 
> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads

I'd like to propose an alternative optimization, which is to keep a dedicated file descriptor
open in the couch_db_updater process and use that file descriptor for _all_ IO initiated by
the db_updater.  The advantage is that the db_updater does not need to do any message passing
for disk IO, and thus does not slow down when the incoming message queue is large.  A message
queue much, much larger than the number of concurrent writers can occur if a user writes with
batch=ok, and it can also happen rather easily in a BigCouch cluster.
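The shape of that proposal can be sketched as follows. This is an illustrative Python sketch, not CouchDB's Erlang; the class and method names are hypothetical. The key property is that the updater writes directly on a descriptor it owns, so its write path never queues behind a shared IO process, while readers use a second descriptor on the same file:

```python
# Sketch: the db_updater owns a private write descriptor; readers
# share a separate read descriptor on the same file.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "db.couch")  # hypothetical db file

class DbUpdater:
    """Hypothetical updater with its own append-only descriptor."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.size = 0
    def append(self, term: bytes) -> int:
        """Write directly on our own fd -- no message passing -- and
        return the offset at which the term was written."""
        offset = self.size
        os.write(self.fd, term)
        self.size += len(term)
        return offset

class Reader:
    """Readers use a second descriptor; pread needs no shared seek state."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
    def read_at(self, offset: int, length: int) -> bytes:
        return os.pread(self.fd, length, offset)

updater = DbUpdater(path)
off = updater.append(b"doc-1 body")          # updater writes directly
reader = Reader(path)
assert reader.read_at(off, 10) == b"doc-1 body"  # reader sees it via its own fd
```

Note how the reader only ever consults an offset returned by a completed append, which is also why the race-condition concern discussed below doesn't arise.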

> http://graphs.mikeal.couchone.com/#/graph/62b286fbb7aa55a4b0c4cc913c00e659
>  (relaximation test)
> 
> The test does a direct comparison with trunk (also with COUCHDB-767
> reverted) and was run like this:
> 
> $ node tests/compare_write_and_read.js --wclients 300 --rclients 150 \
>    -name1 2_couch_files_no_batch_reads -name2 trunk \
>    -url1 http://localhost:5984/ -url2 http://localhost:5985/ \
>    --duration 120

Is the graph you posted a comparison with trunk, or with trunk minus COUCHDB-767?  I couldn't
parse your statement.

Side note: I linked to a relaximation test in COUCHDB-767, but that link (to mikeal.couchone.com)
has since gone dead.  Is the graph still available somewhere?

That patch looks like it probably improves the requests per second for both reads and writes,
as it should if you have a multicore server and/or you've turned on the async thread pool.
Pretty difficult to say if the change is statistically significant, though.

> This approach, of using a file descriptor just for reads and another
> one just for writes (but both referring to the same file), seems to be
> safe:
> 
> http://www.spinics.net/lists/linux-ext4/msg21429.html
> 
> Race conditions shouldn't happen, as each read call needs an offset,
> and the offset is only known after the corresponding write call has finished.

I'm definitely not worried about race conditions, our DB updating logic seems to prevent that
several times over.
 
> Thoughts on this? Should we revert COUCHDB-767? Integrate the 2
> couch_files strategy into trunk? (only after 1.1 has its own branch)

I'm against reverting COUCHDB-767, but I guess it deserves more research to see if any of
our supported platforms actually implement an fsync() that does not sync writes from another
file descriptor for the same file.  If we do find a platform like that we should certainly
revert.

I'm probably in favor of a two-file-descriptors-per-DB approach, but I think attaching the
second fd directly to the db_updater will be the winning strategy.

Best,

Adam

> 
> cheers
> 
> -- 
> Filipe David Manana,
> fdmanana@gmail.com, fdmanana@apache.org
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

