river-commits mailing list archives

From "Robert Resendes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (RIVER-206) Change default load factors from 3 to 1
Date Fri, 07 Dec 2007 13:09:43 GMT

     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes updated RIVER-206:
----------------------------------

    Fix Version/s: AR2

> Change default load factors from 3 to 1
> ---------------------------------------
>
>                 Key: RIVER-206
>                 URL: https://issues.apache.org/jira/browse/RIVER-206
>             Project: River
>          Issue Type: Improvement
>            Reporter: Ron Mann
>            Priority: Minor
>             Fix For: AR2
>
>
> Bugtraq ID [6355743|http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6355743]
> Taken from jini-users mailing list [http://archives.java.sun.com/cgi-bin/wa?A2=ind0511&L=jini-users&F=&S=&P=25095]:
> This is a sad horror story about a default value for a load factor in
> Mahalo that turned out to halt our software system at regular intervals,
> but never in a deterministic way, leading to many lost development
> hours, loss of faith and even worse.
> In short what we experienced was that some operations in our software
> system (includes JavaSpaces and various services that perform operations
> under a distributed transaction) that should take place in parallel
> took place in a serialized manner. We noticed this behavior only
> occurred under some (at that time unknown) conditions. Not only was
> throughput harmed, but our assumptions about the maximum time in which
> operations should complete no longer held and things started to fail.
> One can argue that this is what distributed systems are all about, but
> nevertheless it is something you try to avoid, especially when all
> parts seem to function properly.
> We were not able to find deadlocks in our code or some other problem
> that could cause this behavior. Given the large number of services,
> their interaction and associated thousands of threads over multiple JVMs
> and that you can't freeze-frame time for your system, this appeared to be a
> tricky problem to tackle. It was one of those moments when you really regret
> having started to develop a distributed application in the first place.
> However a little voice told me that Mahalo must be involved in all this
> trouble. This was in line with my feeling with respect to Mahalo, as I
> knew the code a bit (due to fitting it in Seven), and with Jim Hurley's remark
> at the 7th JCM that "Mahalo is the weakest child of the contributed services"
> or similar wording.
> So I decided to assume there was a bug in Mahalo and the only way to
> find out was to develop a scenario that could make that bug obvious and
> to improve logging a lot (proper tracking of transactions and
> participants involved). So recently I started to develop some scenarios,
> and none of them could reproduce a bug or explain what we saw. That
> changed when I experimented with transaction participants that are able
> to 'take their time' in the prepare method [1]. When using random
> prepare times from 3 - 10 seconds I noticed that the parallelism of Mahalo
> and the throughput of a transaction (time from client commit to
> completion) varied and was no direct function of the prepare time. The
> behavior I experienced could only be explained if the scheduler of the
> various internal tasks was constrained by something. Knowing the code I
> suddenly realized there must have been a 'load factor' applied to the
> thread pool that was used for the commit related tasks. I was rather
> shocked to find out that the default was 3.0 and suddenly the mystery
> became completely clear to me. Mahalo has out-of-the-box a built-in
> constraint that can make the system serialize transaction related
> operations in case participants really take their time to return.
> So it turned out that Mahalo is a fine service after all, but that one
> 'freak' ;-) chose a very unfortunate default value for the load factor [2].
> Load-factors for thread pools (and max limits to a lesser degree) are so
> tricky to get right [3] and therefore IMHO high load factors should only
> be used in case you know for sure you are dealing with bursts of tasks
> with a guaranteed short duration and I think that is really something
> people should tune themselves.
> Maybe it was stupid of me and I should have read and understood the
> Mahalo documentation better. But I would expect any system to use
> out-of-the-box load-factors of 1.0 for tasks in a thread pool that
> are potentially long running tasks [3], especially for something as
> delicate as a transaction manager that seems to operate as the so-called
> spider in the web. It is better to have a system consume too many
> threads than to constrain it in a way that leads to problems that are
> very hard to track down.
> I hope this mail is seen as an RFE for a default load factor of 1.0 to
> prevent people from running into similar problems as we had, and as a
> lesson/warning for those working with Mahalo about the risk of using
> load-factors in general.
> [1] in our system some services have to consult external systems when
> prepare is called on them and under some conditions it can take a long
> time to return from the prepare method. We are aware this is something
> you want to prevent but we have requirements that mandate this.
> [2] the one that gave us problems in production was Mahalo from JTSK
> 2.0 that didn't have the ability to specify a taskpool through the
> configuration. The loadfactor of 3.0 was hardcoded (with a TODO) and not
> documented at that time if I recall correctly (don't have a 2.0
> distribution at hand).
> [3] more and more I'm starting to believe that each task in a thread
> pool should have a deadline by which it should be assigned to a
> worker thread. For this purpose we support in our thread pools a
> priority constraint to attach to Runnables, see
> http://www.cheiron.org/utils/nightly/api/org/cheiron/util/thread/PriorityConstraints.html.
> In a discussion on the Porter mailing list Bob Scheifler once
> said "I have in a past life been a fan of deadline scheduling." I'm
> very interested to know whether he still is a fan.
> Evaluation:
> Given a low priority since in 2.1 the task pool objects are user configurable. This request
> is to change the default setting for those objects.
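
The serialization effect described in the report can be illustrated with a small model. This is a minimal sketch, not Mahalo's code: it assumes that the pool starts a new worker thread only when the number of outstanding tasks exceeds threads * loadFactor, and that no task finishes while the others are being submitted (as with participants that are slow to return from prepare). The class name LoadFactorSketch and the methods threadsFor and roundsFor are hypothetical names introduced for illustration.

```java
public class LoadFactorSketch {

    // Worker threads started for `tasks` long-running tasks submitted back to back.
    // Assumed rule: a new thread is started only when the outstanding tasks
    // exceed threads * loadFactor and the maximum has not been reached.
    static int threadsFor(int tasks, int maxThreads, float loadFactor) {
        int threads = 0;
        for (int outstanding = 1; outstanding <= tasks; outstanding++) {
            if (threads < maxThreads && outstanding > threads * loadFactor) {
                threads++;
            }
        }
        return threads;
    }

    // Number of sequential rounds needed to run all tasks: ceil(tasks / threads).
    static int roundsFor(int tasks, int threads) {
        if (threads == 0) {
            return 0; // nothing was submitted, so nothing has to run
        }
        return (tasks + threads - 1) / threads;
    }

    public static void main(String[] args) {
        int tasks = 6;       // e.g. six participants being prepared
        int maxThreads = 10; // illustrative pool limit
        for (float loadFactor : new float[] {3.0f, 1.0f}) {
            int threads = threadsFor(tasks, maxThreads, loadFactor);
            System.out.println("loadFactor=" + loadFactor
                    + " threads=" + threads
                    + " rounds=" + roundsFor(tasks, threads));
        }
    }
}
```

Under these assumptions, six slow tasks get two threads and run in three rounds with a load factor of 3.0, but get six threads and run in a single round with a load factor of 1.0. If each prepare call took 10 seconds, that would be roughly 30 seconds versus 10 seconds until completion.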

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

