couchdb-dev mailing list archives

From Bob Dionne <dio...@dionne-associates.com>
Subject Re: [VOTE] Apache CouchDB 1.2.0 release, second round
Date Tue, 28 Feb 2012 11:33:32 GMT
Filipe,

This additional patch looks good, though I haven't tested it. Your comment about R15B is
interesting: I did notice a difference with BigCouch in terms of some of the internal race
conditions we see at times, so perhaps there are some performance changes relating to that.
I also recently upgraded from a MacBook Pro to an MBA, so who knows.

I ran Jason's and Bob's scripts a bit last night and saw similar slowdowns between 1.1 and
1.2, though as reported elsewhere it's less of an issue with larger docs. In this patch[1]
there's clearly a saving in avoiding the decode call, but I wonder how often that case occurs
compared to the others. If {cmd, CMD} dominates, then there is additional overhead incurred,
however small it might be. Perhaps this explains why the benefits appear for larger docs only.

Anyway, just speculation from the code.

Regards,

Bob

[1] https://github.com/fdmanana/couchdb/commit/cce325378723c863f05cca21

On Feb 27, 2012, at 11:33 AM, Filipe David Manana wrote:

> I just tried Jason's script (modified it to use 500 000 docs instead
> of 50 000) against 1.2.x and 1.1.1, using OTP R14B03. Here are my
> results:
> 
> 1.2.x:
> 
> $ port=5984 ./test.sh
> "none"
> Filling db.
> done
> HTTP/1.1 200 OK
> Server: CouchDB/1.2.0 (Erlang OTP/R14B03)
> Date: Mon, 27 Feb 2012 16:08:43 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: 252
> Cache-Control: must-revalidate
> 
> {"db_name":"db1","doc_count":500001,"doc_del_count":0,"update_seq":500001,"purge_seq":0,"compact_running":false,"disk_size":130494577,"data_size":130490673,"instance_start_time":"1330358830830086","disk_format_version":6,"committed_update_seq":500001}
> Building view.
> 
> real	1m5.725s
> user	0m0.006s
> sys	0m0.005s
> done
> 
> 
> 1.1.1:
> 
> $ port=5984 ./test.sh
> ""
> Filling db.
> done
> HTTP/1.1 200 OK
> Server: CouchDB/1.1.2a785d32f-git (Erlang OTP/R14B03)
> Date: Mon, 27 Feb 2012 16:15:33 GMT
> Content-Type: text/plain;charset=utf-8
> Content-Length: 230
> Cache-Control: must-revalidate
> 
> {"db_name":"db1","doc_count":500001,"doc_del_count":0,"update_seq":500001,"purge_seq":0,"compact_running":false,"disk_size":122142818,"instance_start_time":"1330359233327316","disk_format_version":5,"committed_update_seq":500001}
> Building view.
> 
> real	1m4.249s
> user	0m0.006s
> sys	0m0.005s
> done
> 
> 
> I don't see any significant difference there.
> 
> Regarding COUCHDB-1186, the only thing that might cause some
> nondeterminism and affect performance is the queueing/dequeueing. Depending
> on timings, it's possible the writer is dequeueing fewer items per
> dequeue operation and therefore inserting smaller batches into the
> btree. The following small change ensures larger batches (while still
> respecting the queue max size/item count):
> 
> http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w
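
The batching idea described above can be sketched as follows. This is only an illustration in Python, not the Erlang code from the linked patch: the names `dequeue_batch`, `max_items`, `max_size` and `size_of` are hypothetical, and it assumes the goal is simply to take as many queued items as both limits allow in one dequeue call.

```python
from collections import deque


def dequeue_batch(queue, max_items, max_size, size_of=len):
    """Remove and return as many queued items as fit within both limits.

    Hypothetical helper: `max_items` caps the number of items in a batch,
    `max_size` caps the sum of `size_of(item)` over the batch.
    """
    batch = []
    total = 0
    while queue and len(batch) < max_items:
        item_size = size_of(queue[0])
        # Always take the first item, so one oversized entry cannot stall the writer.
        if batch and total + item_size > max_size:
            break
        batch.append(queue.popleft())
        total += item_size
    return batch


if __name__ == "__main__":
    pending = deque(b"x" * 10 for _ in range(5))
    # The item-count limit is reached first here, so three items are taken.
    print(len(dequeue_batch(pending, max_items=3, max_size=100)))
```

Taking one large batch per call, instead of whatever happens to be queued at that instant, means fewer and larger inserts into the btree for the same number of documents.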
> 
> Running the test with this change:
> 
> $ port=5984 ./test.sh
> "none"
> Filling db.
> done
> HTTP/1.1 200 OK
> Server: CouchDB/1.2.0 (Erlang OTP/R14B03)
> Date: Mon, 27 Feb 2012 16:23:20 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: 252
> Cache-Control: must-revalidate
> 
> {"db_name":"db1","doc_count":500001,"doc_del_count":0,"update_seq":500001,"purge_seq":0,"compact_running":false,"disk_size":130494577,"data_size":130490673,"instance_start_time":"1330359706846104","disk_format_version":6,"committed_update_seq":500001}
> Building view.
> 
> real	0m49.762s
> user	0m0.006s
> sys	0m0.005s
> done
> 
> 
> If there's no objection, I'll push that patch.
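
For reference, the three `real` timings quoted above can be compared with a few lines of Python. The helper names `parse_real` and `percent_change` are made up for this sketch, and each figure comes from a single run, so small differences should be read with caution.

```python
def parse_real(text):
    """Convert a `time` value such as '1m5.725s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative means faster)."""
    return (after - before) / before * 100.0


# View build times copied from the test output quoted above.
runs = {
    "1.1.1": parse_real("1m4.249s"),
    "1.2.x": parse_real("1m5.725s"),
    "1.2.x + batching change": parse_real("0m49.762s"),
}

if __name__ == "__main__":
    print(f"1.1.1 -> 1.2.x: {percent_change(runs['1.1.1'], runs['1.2.x']):+.1f}%")
    print(
        "1.2.x -> 1.2.x + batching change: "
        f"{percent_change(runs['1.2.x'], runs['1.2.x + batching change']):+.1f}%"
    )
```

On these numbers, 1.2.x is about 2% slower than 1.1.1 without the change and about 24% faster with it.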
> 
> Also, another note, I noticed some time ago that with master, using OTP
> R15B I got a performance drop of 10% to 15% compared to using master
> with OTP R14B04. Maybe it applies to 1.2.x as well.
> 
> 
> On Mon, Feb 27, 2012 at 5:33 AM, Robert Newson <rnewson@apache.org> wrote:
>> Bob D, can you give more details on the data set you're testing?
>> Number of docs, size/complexity of docs, etc? Basically, enough info
>> that I could write a script to automate building an equivalent
>> database.
>> 
>> I wrote a quick bash script to make a database and time a view build
>> here: http://friendpaste.com/7kBiKJn3uX1KiGJAFPv4nK
>> 
>> B.
>> 
>> On 27 February 2012 13:15, Jan Lehnardt <jan@apache.org> wrote:
>>> 
>>> On Feb 27, 2012, at 12:58 , Bob Dionne wrote:
>>> 
>>>> Thanks for the clarification. I hope I'm not conflating things by continuing the discussion here, I thought that's what you requested?
>>> 
>>> The discussion we had on IRC was regarding collecting more data items for the performance regression before we start to draw conclusions.
>>> 
>>> My intention here is to understand what needs doing before we can release 1.2.0.
>>> 
>>> I'll reply inline for the other issues.
>>> 
>>>> I just downloaded the release candidate again to start fresh. "make distcheck" hangs on this step:
>>>> 
>>>> /Users/bitdiddle/Downloads/apache-couchdb-1.2.0/apache-couchdb-1.2.0/_build/../test/etap/150-invalid-view-seq.t ......... 6/?
>>>> 
>>>> Just stops completely. This is on R15B which has been rebuilt to use the recommended older SSL version. I haven't looked into this crashing too closely but I'm suspicious that I only see it with couchdb and never with bigcouch and never using the 1.2.x branch from source or any branch for that matter
>>> 
>>> From the release you should run `make check`, not make distcheck. But I assume you see a hang there too, as I have and others (yet not everybody), too. I can't comment on BigCouch and what is different there. It is interesting that 1.2.x won't hang. For me, `make check` in 1.2.x on R15B hangs sometimes, in different places. I'm currently trying to gather more information about this.
>>> 
>>> The question here is whether `make check` passing in R15B is a release requirement. In my vote I considered no, but I am happy to go with a community decision if it emerges. What is your take here?
>>> 
>>> In addition, this just shouldn't be a question, so we should investigate why this happens at all and address the issue, hence COUCHDB-1424. Any insight here would be appreciated as well.
>>> 
>>> 
>>>> In the command line tests, 2, 7, 27, and 32 fail, but it differs from run to run.
>>> 
>>> I assume you mean the JS tests. Again, this isn't supposed to work in 1.2.x. I'm happy to backport my changes from master to 1.2.x to make that work, but I refrained from that because I didn't want to bring too much change to a release branch. I'm happy to reconsider, but I don't think a release vote is a good place to discuss feature backports.
>>> 
>>> 
>>>> On Chrome attachment_ranges fails and it hangs on replicator_db
>>> 
>>> This one is an "explaining away", but I think it is warranted. Chrome is broken for attachment_ranges. I don't know if we reported this upstream (Robert N?), but this isn't a release blocker. For the replicator_db test, can you try running that in other browsers? I understand it is not the best of situations (hence the move to the cli test suite for master), but if you get this test to pass in at least one other browser, this isn't a problem that holds 1.2.x.
>>> 
>>> 
>>>> With respect to performance I think comparisons with 1.1.x are important. I think almost any use case, contrived or otherwise should not be dismissed as a pathological or edge case. Bob's script is as simple as it gets and to me is a great smoke test. We need to figure out the reason 1.2 is clearly slower in this case. If there are specific scenarios that 1.2.x is optimized for then we should document that and provide reasons for the trade-offs
>>> 
>>> I want to make absolutely clear that I take any report of performance regression very seriously. But I'm rather annoyed that no information about this ends up on dev@. I understand that on IRC there's some shared understanding of a few scenarios where performance regressions can be shown. I asked three times now that these be posted to this mailing list. I'm not asking for a comprehensive report, but anything really. I found Robert Newson's simple test script on IRC and ran that to test a suspicion of mine which I posted in an earlier mail (tiny docs -> slower, bigger docs -> faster). Nobody else bothered to post this here. I see no discussion about what is observed, what is expected, what would be acceptable for a release of 1.2.0 as is and what not.
>>> 
>>> As far as this list is concerned, we know that a few people claimed that things are slower and it's very real and that we should hold the 1.2.0 release for it. I'm more than happy to hold the release until we have figured out the things I asked for above and help out figuring it all out. But we need something to work with here.
>>> 
>>> I also understand that this is a voluntary project and people don't have infinite time to spend, but at least a message of "we're collecting things, will report when done", would be *great* to start. So far we only have a "hold the horses, there might be something going on".
>>> 
>>> Please let me know if this request is unreasonable or whether I am overreacting.
>>> 
>>> Sorry for the rant.
>>> 
>>> To anyone who has been looking into performance regression, can you please send to this list any info you have? If you have a comprehensive analysis, awesome, if you just ran some script on a machine, just send us that, let's collect all the data to get this situation solved! We need your help.
>>> 
>>> 
>>> tl;dr:
>>> 
>>> There's three issues at hand:
>>> 
>>>  - Robert D -1'd a release artefact. We want to understand what needs to happen to make a release. This includes assessing the issues he raises and squaring them against the release vote.
>>> 
>>>  - There's a vague (as far as dev@ is concerned) report about a performance regression. We need to get behind that.
>>> 
>>>  - There's been a non-dev@ discussion about the performance regression and that is referenced to influence a dev@ decision. We need that discussion's information on dev@ to proceed.
>>> 
>>> 
>>> And to make it absolutely clear again. The performance regression *is* an issue and I am very grateful for the people, including Robert Newson, Robert Dionne and Jason Smith, who look into it. It's just that we need to treat this as an issue and get all this info onto dev@ or into JIRA.
>>> 
>>> 
>>> Cheers
>>> Jan
>>> --
>>> 
>>> 
>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Bob
>>>> 
>>>> 
>>>> On Feb 26, 2012, at 4:07 PM, Jan Lehnardt wrote:
>>>> 
>>>>> Bob,
>>>>> 
>>>>> thanks for your reply
>>>>> 
>>>>> I wasn't implying we should try to explain anything away. All of these are valid concerns, I just wanted to get a better understanding on where the bit flips from +0 to -1 and subsequently, how to address that boundary.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> Ideally we can just fix all of the things you mention, but I think it is important to understand them in detail, that's why I was going into them. Ultimately, I want to understand what we need to do to ship 1.2.0.
>>>>> 
>>>>> On Feb 26, 2012, at 21:22 , Bob Dionne wrote:
>>>>> 
>>>>>> Jan,
>>>>>> 
>>>>>> I'm -1 based on all of my evaluation. I've spent a few hours on this release now yesterday and today. It doesn't really pass what I would call the "smoke test". Almost everything I've run into has an explanation:
>>>>>> 
>>>>>> 1. crashes out of the box - that's R15B, you need to recompile SSL and Erlang (we'll note on release notes)
>>>>> 
>>>>> Have we spent any time on figuring out what the trouble here is?
>>>>> 
>>>>> 
>>>>>> 2. etaps hang running make check. Known issue. Our etap code is out of date, recent versions of etap don't even run their own unit tests
>>>>> 
>>>>> I have seen the etap hang as well, and I wasn't diligent enough to report it in JIRA, I have done so now (COUCHDB-1424).
>>>>> 
>>>>> 
>>>>>> 3. Futon tests fail. Some are known bugs (attachment ranges in Chrome). Both Chrome and Safari also hang
>>>>> 
>>>>> Do you have more details on where Chrome and Safari hang? Can you try their private browsing features, double/triple check that caches are empty? Can you get to a situation where you get all tests succeeding across all browsers, even if individual ones fail on one or two others?
>>>>> 
>>>>> 
>>>>>> 4. standalone JS tests fail. Again most of these run when run by themselves
>>>>> 
>>>>> Which ones?
>>>>> 
>>>>> 
>>>>>> 5. performance. I used real production data *because* Stefan on user reported performance degradation on his data set. Any numbers are meaningless for a single test. I also ran scripts that BobN and Jason Smith posted that show a difference between 1.1.x and 1.2.x
>>>>> 
>>>>> You are conflating an IRC discussion we've had into this thread. The performance regression reported is a good reason to look into other scenarios where we can show slowdowns. But we need to understand what's happening. Just from looking at dev@ all I see is some handwaving about some reports some people have done (Not to discourage any work that has been done on IRC and user@, but for the sake of a release vote thread, this related information needs to be on this mailing list).
>>>>> 
>>>>> As I said on IRC, I'm happy to get my hands dirty to understand the regression at hand. But we need to know where we'd draw a line and say this isn't acceptable for a 1.2.0.
>>>>> 
>>>>> 
>>>>>> 6. Reviewed patch pointed to by Jason that may be the cause but it's hard to say without knowing the code analysis that went into the changes. You can see obvious local optimizations that make good sense but those are often the ones that get you, without knowing the call counts.
>>>>> 
>>>>> That is a point that wasn't included in your previous mail. It's great that there is progress, thanks for looking into this!
>>>>> 
>>>>> 
>>>>>> Many of these issues can be explained away, but I think end users will be less forgiving. I think we already struggle with view performance. I'm interested to see how others evaluate this regression.
>>>>>> I'll try this seatoncouch tool you mention later to see if I can construct some more definitive tests.
>>>>> 
>>>>> Again, I'm not trying to explain anything away. I want to get a shared understanding of the issues you raised and where we stand on solving them squared against the ongoing 1.2.0 release.
>>>>> 
>>>>> And again: Thanks for doing this thorough review and looking into the performance issue. I hope with your help we can understand all these things a lot better very soon :)
>>>>> 
>>>>> Cheers
>>>>> Jan
>>>>> --
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Bob
>>>>>> On Feb 26, 2012, at 2:29 PM, Jan Lehnardt wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Feb 26, 2012, at 13:58 , Bob Dionne wrote:
>>>>>>> 
>>>>>>>> -1
>>>>>>>> 
>>>>>>>> R15B on OS X Lion
>>>>>>>> 
>>>>>>>> I rebuilt OTP with an older SSL and that gets past all the crashes (thanks Filipe). I still see hangs when running make check, though any particular etap that hangs will run ok by itself. The Futon tests never run to completion in Chrome without hanging and the standalone JS tests also have failures.
>>>>>>> 
>>>>>>> What part of this do you consider the -1? Can you try running the JS tests in Firefox and/or Safari? Can you get all tests to pass at least once across all browsers? The cli JS suite isn't supposed to work, so that isn't a criterion. I've seen the hang in make check for R15B while individual tests run as well, but I don't consider this blocking. While I understand and support the notion that tests shouldn't fail, period, we gotta work with what we have and master already has significant improvements. What would you like to see changed to not -1 this release?
>>>>>>> 
>>>>>>>> I tested the performance of view indexing, using a modest 200K doc db with a large complex view and there's a clear regression between 1.1.x and 1.2.x. Others report similar results
>>>>>>> 
>>>>>>> What is a large complex view? The complexity of the map/reduce functions is rarely an indicator of performance, it's usually input doc size and output/emit()/reduce data size. How big are the docs in your test and how big is the returned data? I understand the changes for 1.2.x will improve larger-data scenarios more significantly.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Jan
>>>>>>> --
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 23, 2012, at 5:25 PM, Bob Dionne wrote:
>>>>>>>> 
>>>>>>>>> sorry Noah, I'm in debug mode today so I don't care to start mucking with my stack, recompiling erlang, etc...
>>>>>>>>> 
>>>>>>>>> I did try using that build repeatedly and it crashes all the time. I find it very odd and I had seen those before as I said on my older macbook.
>>>>>>>>> 
>>>>>>>>> I do see the hangs Jan describes in the etaps, they have been there right along, so I'm confident this is just the SSL issue. Why it only happens on the build is puzzling, any source build of any branch works just peachy.
>>>>>>>>> 
>>>>>>>>> So I'd say I'm +1 based on my use of the 1.2.x branch but I'd like to hear from Stefan, who reported the severe performance regression. BobN seems to think we can ignore that, it's something flaky in that fellow's environment. I tend to agree but I'm conservative
>>>>>>>>> 
>>>>>>>>> On Feb 23, 2012, at 1:23 PM, Noah Slater wrote:
>>>>>>>>> 
>>>>>>>>>> Can someone convince me this bus error stuff and segfaults is not a blocking issue?
>>>>>>>>>> 
>>>>>>>>>> Bob tells me that he's followed the steps above and he's still experiencing the issues.
>>>>>>>>>> 
>>>>>>>>>> Bob, you did follow the steps to install your own SSL, right?
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 23, 2012 at 5:09 PM, Jan Lehnardt <jan@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Feb 23, 2012, at 00:28 , Noah Slater wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> I would like to call a vote for the Apache CouchDB 1.2.0 release, second round.
>>>>>>>>>>>> 
>>>>>>>>>>>> We encourage the whole community to download and test these release artifacts so that any critical issues can be resolved before the release is made. Everyone is free to vote on this release, so get stuck in!
>>>>>>>>>>>> 
>>>>>>>>>>>> We are voting on the following release artifacts:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://people.apache.org/~nslater/dist/1.2.0/
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> These artifacts have been built from the following tree-ish in Git:
>>>>>>>>>>>> 
>>>>>>>>>>>> 4cd60f3d1683a3445c3248f48ae064fb573db2a1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Please follow the test procedure before voting:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://wiki.apache.org/couchdb/Test_procedure
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>> 
>>>>>>>>>>>> Happy voting,
>>>>>>>>>>> 
>>>>>>>>>>> Signature and hashes check out.
>>>>>>>>>>> 
>>>>>>>>>>> Mac OS X 10.7.3, 64bit, SpiderMonkey 1.8.0, Erlang R14B04: make check works fine, browser tests in Safari work fine.
>>>>>>>>>>> 
>>>>>>>>>>> Mac OS X 10.7.3, 64bit, SpiderMonkey 1.8.5, Erlang R14B04: make check works fine, browser tests in Safari work fine.
>>>>>>>>>>> 
>>>>>>>>>>> FreeBSD 9.0, 64bit, SpiderMonkey 1.7.0, Erlang R14B04: make check works fine, browser tests in Safari work fine.
>>>>>>>>>>> 
>>>>>>>>>>> CentOS 6.2, 64bit, SpiderMonkey 1.8.5, Erlang R14B04: make check works fine, browser tests in Firefox work fine.
>>>>>>>>>>> 
>>>>>>>>>>> Ubuntu 11.4, 64bit, SpiderMonkey 1.8.5, Erlang R14B02: make check works fine, browser tests in Firefox work fine.
>>>>>>>>>>> 
>>>>>>>>>>> Ubuntu 10.4, 32bit, SpiderMonkey 1.8.0, Erlang R13B03: make check fails in
>>>>>>>>>>> - 076-file-compression.t: https://gist.github.com/1893373
>>>>>>>>>>> - 220-compaction-daemon.t: https://gist.github.com/1893387
>>>>>>>>>>> This one runs in a VM and is 32bit, so I don't know if there's anything in the tests that relies on 64bittyness or the R14B03. Filipe, I think you worked on both features, do you have an idea?
>>>>>>>>>>> 
>>>>>>>>>>> I tried running it all through Erlang R15B on Mac OS X 10.7.3, but a good way into `make check` the tests would just stop and hang. The last time, repeatedly in 160-vhosts.t, but when run alone, that test finished in under five seconds. I'm not sure what the issue is here.
>>>>>>>>>>> 
>>>>>>>>>>> Despite the things above, I'm happy to give this a +1 if we put a warning about R15B on the download page.
>>>>>>>>>>> 
>>>>>>>>>>> Great work all!
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> Jan
>>>>>>>>>>> --
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

