Return-Path: Delivered-To: apmail-incubator-stdcxx-dev-archive@www.apache.org Received: (qmail 61526 invoked from network); 21 Sep 2006 16:59:03 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 21 Sep 2006 16:59:03 -0000 Received: (qmail 50239 invoked by uid 500); 21 Sep 2006 16:59:03 -0000 Delivered-To: apmail-incubator-stdcxx-dev-archive@incubator.apache.org Received: (qmail 50221 invoked by uid 500); 21 Sep 2006 16:59:02 -0000 Mailing-List: contact stdcxx-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: stdcxx-dev@incubator.apache.org Delivered-To: mailing list stdcxx-dev@incubator.apache.org Received: (qmail 50210 invoked by uid 99); 21 Sep 2006 16:59:02 -0000 Received: from idunn.apache.osuosl.org (HELO idunn.apache.osuosl.org) (140.211.166.84) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Sep 2006 09:59:02 -0700 Authentication-Results: idunn.apache.osuosl.org smtp.mail=ablack@roguewave.com; spf=permerror X-ASF-Spam-Status: No, hits=0.0 required=5.0 tests= Received-SPF: error (idunn.apache.osuosl.org: domain roguewave.com from 208.30.140.160 cause and error) Received: from [208.30.140.160] ([208.30.140.160:34323] helo=moroha.quovadx.com) by idunn.apache.osuosl.org (ecelerity 2.1.1.8 r(12930)) with ESMTP id 83/00-03440-3D4C2154 for ; Thu, 21 Sep 2006 09:59:01 -0700 Received: from [10.70.3.48] ([10.70.3.48]) by moroha.quovadx.com (8.13.6/8.13.6) with ESMTP id k8LGwfLZ003322 for ; Thu, 21 Sep 2006 16:58:41 GMT Message-ID: <4512C4CF.2040802@roguewave.com> Date: Thu, 21 Sep 2006 10:58:55 -0600 From: Andrew Black User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060910 SeaMonkey/1.0.5 MIME-Version: 1.0 To: stdcxx-dev@incubator.apache.org Subject: Re: incorrect runtimes References: <450B1652.10205@roguewave.com> In-Reply-To: <450B1652.10205@roguewave.com> Content-Type: multipart/mixed; boundary="------------040004090706080708040501" X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N --------------040004090706080708040501 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Greetings all. During an offline brainstorming session, Martin and I determined that the cause of this observed problem is a race condition when a child (monitored) process creates grandchild processes. If a child process terminates prior to the grandchild process, it is possible for the call to getrusage() to occur prior to the termination of the grandchild process, causing the time consumed by that process to be deferred to the next call to getrusage(), and the time to be merged with that of the following process. Attached are some possible solutions I've put together for this issue. With solution A, I receive a usable status code on Linux, along with an account of the amount of user/system time spent in the child process (but not the grandchild processes). On Solaris, I receive an ERROR status code, which may be caused by an inability of the Solaris waitpid syscall to wait on process groups. With solution B, the status code on Linux is inaccurate, though the user and system times appear to reflect that of the child process (again excluding the grandchild processes). On Solaris, I observe the same behavior as in solution A, for the same probable reason. With solution C, the status code on Linux is accurate, though the user and system times appear to be inaccurate. On Solaris, the status and user/system times appear to be correct. At this point, I believe that solution C is the best choice. If one of these patches is to be applied, the following change log could be used. Log: * exec.cpp [!_WIN32 && !_WIN64] (wait_for_child): Resolve race condition resulting in incorrect child process times. --Andrew Black Martin Sebor wrote: > Yo Andrew, > > I have another issue with the reported times (this one seems real). > The times reported for a killed process are all zero and the time > seems to be added to the next process that runs after it. I noticed > it while running the locale tests: > > $ make run > NAME STATUS ASSERTS FAILED PERCNT USER SYS > sanity_test.sh 0 46 0 100% 0.380 0.880 > af_ZA.ISO-8859-1.sh 0 7 0 100% 39.570 1.110 > ar_AE.ISO-8859-6.sh 0 7 0 100% 39.000 1.250 > ar_BH.ISO-8859-6.sh 0 7 0 100% 39.010 1.300 > ar_DZ.ISO-8859-6.sh 0 7 0 100% 38.970 1.220 > ar_EG.ISO-8859-6.sh 0 7 0 100% 39.010 1.350 > ar_IN.UTF-8.sh HUP 0.000 0.010 > ar_IQ.ISO-8859-6.sh 0 7 0 100% 217.890 2.200 > > Could you see what's going on? > > Thanks > Martin --------------040004090706080708040501 Content-Type: text/x-patch; name="racefixa.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="racefixa.diff" Index: exec.cpp =================================================================== --- exec.cpp (revision 447484) +++ exec.cpp (working copy) @@ -419,7 +419,7 @@ unsigned siginx = 0; - int stopped = 0; + int status, stopped = 0; #ifdef WCONTINUED int waitopts = WUNTRACED | WCONTINUED; @@ -442,14 +442,14 @@ alarm (timeout); while (1) { - const pid_t wait_pid = waitpid (child_pid, &state.status, waitopts); + const pid_t wait_pid = waitpid (-child_pid, &status, waitopts); if (child_pid == wait_pid) { - if (WIFEXITED (state.status) || WIFSIGNALED (state.status)) - break; /*we've got an exit state, so let's bail*/ - else if (WIFSTOPPED (state.status)) - stopped = state.status; + if (WIFEXITED (status) || WIFSIGNALED (status)) + state.status = status; + else if (WIFSTOPPED (status)) + stopped = status; #ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this guard also */ - else if (WIFCONTINUED (state.status)) + else if (WIFCONTINUED (status)) stopped = 0; #endif else @@ -460,15 +460,34 @@ /* timeout elapsed, so send a signal to the child */ if (stopped) { /* If the child process is stopped, it is incapable of - recieving signals. Therefore, we'll record this - and break out of the loop. + recieving signals. Therefore, we'll record this. */ state.status = stopped; - break; } - /* ignore kill errors (perhaps should record them)*/ - (void)kill (-child_pid, signals [siginx]); + if(0 != kill (-child_pid, signals [siginx])) { + if (EPERM == errno){ + /* EPERM means 'No permissions to signal any recieving + process. Since we can't send the signal, we'll + just move on. + */ + state.status = -1; + state.killed = -1; + break; + } + if (ESRCH == errno) + /* ESRCH means 'No process (group) found'. Since + there aren't any processes in the process group, + we'll continue so we can collect the return value + if needed. + */ + continue; + /* The remaining error value is EINVAL, which should never + happen. If one of our signals were unsupported, we + should get a compile error. Regardess, we tried to + send it, and will move on. + */ + } /* Record the signal used*/ state.killed = signals [siginx]; @@ -501,9 +520,8 @@ ; /* Now what? */ } else if (ECHILD == errno) { - /* should not happen */ - warn ("waitpid (%d) error: %s\n", (int)child_pid, - strerror (errno)); + /* There are no children left in the process group, so exit */ + break; } else { /* waitpid () error */ @@ -511,12 +529,10 @@ strerror (errno)); } } - else if ((pid_t)0 == wait_pid) { - /* should not happen */ - } - else { - /* what the heck? */ - } + /* Other possible values for wait_pid are 0, denoting no valid status + when polling the process group, and the pid of a grandchild process. + The former should never happen, and we don't care about the later. + */ } /* Clear alarm */ --------------040004090706080708040501 Content-Type: text/x-patch; name="racefixb.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="racefixb.diff" Index: exec.cpp =================================================================== --- exec.cpp (revision 447484) +++ exec.cpp (working copy) @@ -419,8 +419,6 @@ unsigned siginx = 0; - int stopped = 0; - #ifdef WCONTINUED int waitopts = WUNTRACED | WCONTINUED; #else @@ -442,34 +440,34 @@ alarm (timeout); while (1) { - const pid_t wait_pid = waitpid (child_pid, &state.status, waitopts); - if (child_pid == wait_pid) { - if (WIFEXITED (state.status) || WIFSIGNALED (state.status)) - break; /*we've got an exit state, so let's bail*/ - else if (WIFSTOPPED (state.status)) - stopped = state.status; -#ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this guard also */ - else if (WIFCONTINUED (state.status)) - stopped = 0; -#endif - else - ; /* huh? */ - } - else if ((pid_t)-1 == wait_pid) { + const pid_t wait_pid = waitpid (-child_pid, 0, waitopts); + if ((pid_t)-1 == wait_pid) { if (EINTR == errno && alarm_timeout) { /* timeout elapsed, so send a signal to the child */ - if (stopped) { - /* If the child process is stopped, it is incapable of - recieving signals. Therefore, we'll record this - and break out of the loop. + if(0 != kill (-child_pid, signals [siginx])) { + if (EPERM == errno){ + /* EPERM means 'No permissions to signal any recieving + process. Since we can't send the signal, we'll + just move on. + */ + state.status = -1; + state.killed = -1; + break; + } + if (ESRCH == errno) + /* ESRCH means 'No process (group) found'. Since + there aren't any processes in the process group, + we'll continue so we can collect the return value + if needed. + */ + continue; + /* The remaining error value is EINVAL, which should never + happen. If one of our signals were unsupported, we + should get a compile error. Regardess, we tried to + send it, and will move on. */ - state.status = stopped; - break; } - /* ignore kill errors (perhaps should record them)*/ - (void)kill (-child_pid, signals [siginx]); - /* Record the signal used*/ state.killed = signals [siginx]; @@ -501,9 +499,8 @@ ; /* Now what? */ } else if (ECHILD == errno) { - /* should not happen */ - warn ("waitpid (%d) error: %s\n", (int)child_pid, - strerror (errno)); + /* There are no children left in the process group, so exit */ + break; } else { /* waitpid () error */ @@ -511,14 +508,14 @@ strerror (errno)); } } - else if ((pid_t)0 == wait_pid) { - /* should not happen */ - } - else { - /* what the heck? */ - } + /* Other possible values for wait_pid are 0, denoting no valid status + when polling the process group, and the pid of a grandchild process. + The former should never happen, and we don't care about the later. + */ } + if (-1 != state.killed) + (void)waitpid (-child_pid, &state.status, waitopts); /* Clear alarm */ alarm (0); --------------040004090706080708040501 Content-Type: text/x-patch; name="racefixc.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="racefixc.diff" Index: exec.cpp =================================================================== --- exec.cpp (revision 447484) +++ exec.cpp (working copy) @@ -467,8 +467,29 @@ break; } - /* ignore kill errors (perhaps should record them)*/ - (void)kill (-child_pid, signals [siginx]); + if(0 != kill (-child_pid, signals [siginx])) { + if (EPERM == errno){ + /* EPERM means 'No permissions to signal any recieving + process. Since we can't send the signal, we'll + just move on. + */ + state.status = -1; + state.killed = -1; + break; + } + if (ESRCH == errno) + /* ESRCH means 'No process (group) found'. Since + there aren't any processes in the process group, + we'll continue so we can collect the return value + if needed. + */ + continue; + /* The remaining error value is EINVAL, which should never + happen. If one of our signals were unsupported, we + should get a compile error. Regardess, we tried to + send it, and will move on. + */ + } /* Record the signal used*/ state.killed = signals [siginx]; @@ -522,6 +543,12 @@ /* Clear alarm */ alarm (0); + /*Kill/cleanup any grandchildren.*/ + while (siginx <= sigcount && 0 == kill (-child_pid, signals [siginx])) { + ++siginx; + sleep(1); + } + return state; } --------------040004090706080708040501--