From: "Rana Dasgupta" <rdasgupt@gmail.com>
To: dev@harmony.apache.org
Date: Wed, 10 Jan 2007 20:42:12 -0700
Subject: Re: [drlvm] stress.Mix / MegaSpawn threading bug

On 1/10/07, Geir Magnusson Jr. wrote:
>
> On Jan 10, 2007, at 2:13 PM, Weldon Washburn wrote:
>
>> 1)
>> In some earlier posting, it was mentioned that somehow the virtual
>> memory address space is impacted by how much physical memory is in a
>> given computer. Actually this is not true. The virtual address space
>> available to the JVM is fixed by the OS. A machine with less physical
>> memory will do more disk I/O. In other words, C malloc() hard limits
>> are set by the OS version, not by the RAM chips.
>
> Talking about VM vs RAM vs whatever is a red herring - we may be
> ported to a machine w/o virtual memory. What matters is that when
> malloc() returns null, we do something smart. At least, do nothing
> harmful.

There can be no machine without virtual memory on any of the OSes of
interest to us. VM is not a type of memory technology.
What Weldon, Gregory, and several others have pointed out is that if one
keeps consuming virtual address space by allocating space for thread
stacks, the address space will eventually run out, and the process will be
in a fatal state independent of how much physical memory the machine has.

>> 2)
>> Why not simply hard code DRLVM to throw an OOME whenever there are
>> more than 1K threads running? I think Rana first suggested this
>> approach. My guess is that 1K threads is good enough to run lots of
>> interesting workloads. My guess is that common versions of WinXP and
>> Linux will handle the C malloc() load of 1K threads successfully. If
>> not, how about trying 512 threads?
>
> Because this is picking up the rug and sweeping all the dirt
> underneath it. The core problem isn't that we try too many threads,
> but that the code wasn't written defensively. Putting an artificial
> limit on # of threads just means that we'll hit it somewhere else, in
> some other resource usage.
>
> I think we should fix it.

Sure. The way to fix a fatal error is to leave room for the process to
recover from it or handle it. Another example of a fatal error is a stack
overflow or a TerminateProcess signal. In the case of stack overflow, we
handle it by trying to raise the exception while some room is left on the
stack, so that there is a fair chance of handling it. Similarly, an
approach could be to set a limit on the maximum number of threads we
create. Based on the memory we give each thread stack, we can choose a
limit which we estimate will leave us room to handle the error.

>> There seem to be some basic things we can do, like reduce the stack
>> size on Windows from the terabyte or whatever it is now, to the
>> number that our dear, esteemed colleague from IBM claims is perfectly
>> suitable for production use.
>
> That too doesn't solve the problem, but it certainly fixes a problem
> we are now aware of - our stack size is too big....
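The thread-cap approach described above could look something like the following in C. Everything here is hypothetical: the names (vm_try_reserve_thread, MAX_VM_THREADS) are made up for illustration, and the cap value would have to come from the per-stack memory estimate, not be hard-coded at 1024.

```c
#include <pthread.h>

/* Hypothetical cap; the real value would be derived from the per-stack
   reservation size, leaving headroom to construct and deliver an OOME. */
#define MAX_VM_THREADS 1024

static int live_threads = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if a new thread may be created, -1 if the cap is reached.
   The caller would turn -1 into an OutOfMemoryError while the process
   still has resources left to handle it. */
int vm_try_reserve_thread(void) {
    int ok = -1;
    pthread_mutex_lock(&count_lock);
    if (live_threads < MAX_VM_THREADS) {
        live_threads++;
        ok = 0;
    }
    pthread_mutex_unlock(&count_lock);
    return ok;
}

/* Called when a thread exits, freeing a slot under the cap. */
void vm_release_thread(void) {
    pthread_mutex_lock(&count_lock);
    live_threads--;
    pthread_mutex_unlock(&count_lock);
}
```

The point is not the specific number but that the failure is reported at a well-defined place, before the process is wedged.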
:)
The best size to set for the thread stack is a valid issue, and it is
useful to know what the IBM VM sets. Google searches also seem to show
that thread stack size on J9 is user-configurable. But even with smaller
stack sizes, if one ran MegaSpawn for a sufficiently long time, we would
get the same error. So either we can't have unbounded stress tests like
this, or the VM needs to bound the resources consumable by such a test.
Also, we cannot just emulate what the IBM VM does in one specific area
without understanding their entire design. For example, a small stack
size will cause stack overflow exceptions to happen early. We need to
tune these sizes based on our own experiments.

>> 3)
>> The above does not deal with the general architecture question of
>> handling C malloc failures. This is far harder to solve. Note that
>> solving the big question will also require far more extensive
>> regression tests than MegaSpawn. However, it does fix DRLVM so that
>> it does not crash/burn on threads overload. This, in turn, gives us
>> time to fix the real underlying problem(s) with C malloc.

I think that we should defer this part. It is a difficult problem, and
there are several potential approaches based on what kind of reliable
computing contracts we want to expose. For example, one can think of a
contract that no fatal failures (OOME, stack overflow, thread abort)
ever happen in marked regions of code. I don't think that we need to
solve this hard problem right now.