river-commits mailing list archives

From "Robert Resendes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (RIVER-206) Change default load factors from 3 to 1
Date Fri, 07 Dec 2007 13:09:43 GMT

     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes updated RIVER-206:

    Fix Version/s: AR2

> Change default load factors from 3 to 1
> ---------------------------------------
>                 Key: RIVER-206
>                 URL: https://issues.apache.org/jira/browse/RIVER-206
>             Project: River
>          Issue Type: Improvement
>            Reporter: Ron Mann
>            Priority: Minor
>             Fix For: AR2
> Bugtraq ID [6355743|http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6355743]
> Taken from jini-users mailing list [http://archives.java.sun.com/cgi-bin/wa?A2=ind0511&L=jini-users&F=&S=&P=25095]:
> This is a sad horror story about a default value for a load factor in
> Mahalo that turned out to halt our software system at regular intervals,
> but never in a deterministic way, leading to many lost development
> hours, loss of faith and even worse.
> In short what we experienced was that some operations in our software
> system (includes JavaSpaces and various services that perform operations
> under a distributed transaction) that should take place in parallel
> took place in a serialized manner. We noticed this behavior only
> occurred under some (at that time unknown) conditions. Not only was
> throughput harmed, but our assumptions about the maximum time in which
> operations should complete no longer held and things started to fail.
> One can argue that this is what distributed systems are all about, but
> nevertheless it is something you try to avoid,
> especially when all parts seem to function properly.
> We were not able to find dead-locks in our code or some other problem
> that could cause this behavior. Given the large number of services,
> their interaction and associated thousands of threads over multiple JVMs
> and that you can't freeze-frame time for your system, this appeared as a
> tricky problem to tackle. It was one of those moments when you really
> regret having started to develop a distributed application in the
> first place.
> However, a little voice told me that Mahalo must be involved in all this
> trouble. This was in line with my feeling about Mahalo, as I knew the
> code a bit (from fitting it into Seven), and with Jim Hurley's remark
> at the 7th JCM, "Mahalo is the weakest child of the contributed
> services", or similar wording.
> So I decided to assume there was a bug in Mahalo and the only way to
> find out was to develop a scenario that could make that bug obvious and
> to improve logging a lot (proper tracking of transactions and
> participants involved). So recently I started to develop some scenarios,
> and none of them could reproduce a bug or explain what we saw. That was
> until I experimented with transaction participants that are able
> to 'take their time' in the prepare method [1]. When using random
> prepare times of 3 - 10 seconds I noticed that the parallelism of Mahalo
> and the throughput of a transaction (time from client commit to
> completion) varied and was no direct function of the prepare time. The
> behavior I experienced could only be explained if the scheduler of the
> various internal tasks was constrained by something. Knowing the code, I
> suddenly realized there must have been a 'load factor' applied to the
> thread pool that was used for the commit-related tasks. I was rather
> shocked to find out that the default was 3.0, and suddenly the mystery
> became completely clear to me. Out of the box, Mahalo has a built-in
> constraint that can make the system serialize transaction-related
> operations when participants really take their time to return.
> So it turned out that Mahalo is a fine service after all, but that one
> 'freak' ;-) chose a very unfortunate default value for the load factor [2].
> Load factors for thread pools (and max limits to a lesser degree) are so
> tricky to get right [3] that IMHO high load factors should only
> be used when you know for sure you are dealing with bursts of tasks
> with a guaranteed short duration, and I think that is really something
> people should tune themselves.
> Maybe it was stupid of me and I should have read and understood the
> Mahalo documentation better. But I would expect any system to use
> out-of-the-box load factors of 1.0 for tasks in a thread pool that
> are potentially long-running [3], especially for something as
> delicate as a transaction manager, which seems to operate as the
> so-called spider in the web. It is better to have a system consume too
> many threads than to constrain it in a way that leads to problems that
> are very hard to track down.
> I hope this mail is seen as an RFE for a default load factor of 1.0, to
> prevent people from running into problems similar to ours, and as a
> lesson/warning for those working with Mahalo about the risk of using
> load factors in general.
> [1] in our system some services have to consult external systems when
> prepare is called on them, and under some conditions it can take a long
> time to return from the prepare method. We are aware this is something
> you want to prevent, but we have requirements that mandate this.
> [2] the one that gave us problems in production was Mahalo from JTSK
> 2.0, which didn't have the ability to specify a task pool through the
> configuration. The load factor of 3.0 was hardcoded (with a TODO) and not
> documented at that time if I recall correctly (I don't have a 2.0
> distribution at hand).
> [3] more and more I'm starting to believe that each task in a thread
> pool should have a deadline by which it should be assigned to a
> worker thread. For this purpose we support in our thread pools a
> priority constraint to attach to Runnables, see
> http://www.cheiron.org/utils/nightly/api/org/cheiron/util/thread/PriorityConstraints.html.
> In a discussion on the Porter mailing list Bob Scheifler once
> said "I have in a past life been a fan of deadline scheduling." I'm
> very interested to know whether he still is a fan.
> Evaluation:
> Given a low priority since in 2.1 the task pool objects are user
> configurable. This request is to change the default setting for those
> objects.
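
The serialization described in the quoted report can be illustrated with a
small Java sketch. It assumes the thread-creation rule described for the Jini
TaskManager: a new worker thread is started only when the number of runnable
tasks exceeds the current thread count times the load factor, up to a maximum.
The class and method names (LoadFactorSketch, threadsFor, estimatedSeconds)
are hypothetical and not part of Mahalo or River, and the time estimate
further assumes that all tasks arrive at once and take equally long.

```java
class LoadFactorSketch {

    /**
     * Number of worker threads created for a given number of runnable tasks,
     * ASSUMING a thread is added only while tasks > threads * loadFactor
     * and the maximum has not been reached.
     */
    static int threadsFor(int tasks, float loadFactor, int maxThreads) {
        int threads = 0;
        while (threads < maxThreads && tasks > threads * loadFactor) {
            threads++;
        }
        return threads;
    }

    /**
     * Rough completion time if every task blocks for taskSeconds
     * (for example, a slow prepare call) and all tasks arrive together.
     */
    static int estimatedSeconds(int tasks, float loadFactor,
                                int maxThreads, int taskSeconds) {
        int threads = threadsFor(tasks, loadFactor, maxThreads);
        if (threads == 0) {
            return 0; // no tasks, nothing to wait for
        }
        int rounds = (tasks + threads - 1) / threads; // ceiling division
        return rounds * taskSeconds;
    }

    public static void main(String[] args) {
        int tasks = 3;        // three participants preparing in parallel
        int maxThreads = 50;  // illustrative limit, not a documented default
        int prepareSecs = 10; // upper end of the 3 - 10 second range above

        for (float loadFactor : new float[] {3.0f, 1.0f}) {
            System.out.println("loadFactor=" + loadFactor
                + " threads=" + threadsFor(tasks, loadFactor, maxThreads)
                + " estimatedSeconds="
                + estimatedSeconds(tasks, loadFactor, maxThreads, prepareSecs));
        }
    }
}
```

Under these assumptions, three participants that each take 10 seconds to
prepare share a single thread at a load factor of 3.0 (about 30 seconds in
total), while a load factor of 1.0 gives each its own thread (about 10
seconds).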

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
