stdcxx-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Martin Sebor <se...@roguewave.com>
Subject Re: incorrect runtimes
Date Thu, 21 Sep 2006 20:02:28 GMT
I think it would be helpful for us to start by deciding what the
desired behavior of the utility WRT reporting process times should
be. I think the ground rule should be to report only accurate
results and no indeterminate values (e.g., usage times that may
or may not include those attributable to the grandchildren of the
target process). With that as a basic requirement, here are some
possibilities that I think would be useful, listed in the order
of decreasing preference:

   (1) the times for the target process and all process spawned by
   it and its children should be reported, regardless of whether
   they exited normally or not

   (2) the times for the target process alone (but none of its
   children or their descendants) should be reported, regardless
   of whether it exited normally or not

Some questions we might want to ask ourselves:

   Is it (at least in theory) possible to portably implement the
   same behavior across all POSIX platforms? If so, let's do it :)
   Any deviation from the expected behavior should be treated as
   bugs in the operating system.

   Otherwise, is it possible to implement the same behavior using
   documented platform-specific extensions? If so, let's do it.

   Otherwise, if it isn't possible to implement one of the two types
   of behavior across all the main UNIX platforms, is it at least
   possible to implement one or the other type of behavior on some
   subset of the platforms (i.e., is there at least a few platforms
   where the behavior is well-specified)? If so, let's implement it
   on that subset and document that the feature is not available on
   the rest.

   Otherwise, what can we do?

Martin

PS I wrote a little test program to illustrate the behavior of UNIX
systems -- it's in the attachment. I haven't played with it enough
to make sense out of the output but I see similar behavior on AIX,
HP-UX, IRIX and Linux, and something different on Solaris, and
something different still on Tru64. It should be self-documenting.

Andrew Black wrote:
> Greetings all.
> 
> During an offline brainstorming session, Martin and I determined that 
> the cause of this observed problem is a race condition when a child 
> (monitored) process creates grandchild processes.  If a child process 
> terminates prior to the grandchild process, it is possible for the call 
> to getrusage() to occur prior to the termination of the grandchild 
> process, causing the time consumed by that process to be deferred to the 
> next call to getrusage(), and the time to be merged with that of the 
> following process.
> 
> Attached are some possible solutions I've put together for this issue.
> With solution A, I receive a usable status code on Linux, along with an 
> account of the amount of user/system time spent in the child process 
> (but not the grandchild processes).  On Solaris, I receive an ERROR 
> status code, which may be caused by an inability of the Solaris waitpid 
> syscall to wait on process groups.
> 
> With solution B, the status code on Linux is inaccurate, though the user 
> and system times appear to reflect that of the child process (again 
> excluding the grandchild processes).  On Solaris, I observe the same 
> behavior as in solution A, for the same probable reason.
> 
> With solution C, the status code on Linux is accurate, though the user 
> and system times appear to be inaccurate.  On Solaris, the status and 
> user/system times appear to be correct.
> 
> At this point, I believe that solution C is the best choice.  If one of 
> these patches is to be applied, the following change log could be used.
> 
> Log:
>     * exec.cpp [!_WIN32 && !_WIN64] (wait_for_child): Resolve race 
> condition resulting in incorrect child process times.
> 
> --Andrew Black
> 
> Martin Sebor wrote:
> 
>> Yo Andrew,
>>
>> I have another issue with the reported times (this one seems real).
>> The times reported for a killed process are all zero and the time
>> seems to be added to the next process that runs after it. I noticed
>> it while running the locale tests:
>>
>> $ make run
>> NAME                      STATUS ASSERTS FAILED PERCNT    USER     SYS
>> sanity_test.sh                 0      46      0   100%   0.380   0.880
>> af_ZA.ISO-8859-1.sh            0       7      0   100%  39.570   1.110
>> ar_AE.ISO-8859-6.sh            0       7      0   100%  39.000   1.250
>> ar_BH.ISO-8859-6.sh            0       7      0   100%  39.010   1.300
>> ar_DZ.ISO-8859-6.sh            0       7      0   100%  38.970   1.220
>> ar_EG.ISO-8859-6.sh            0       7      0   100%  39.010   1.350
>> ar_IN.UTF-8.sh               HUP                         0.000   0.010
>> ar_IQ.ISO-8859-6.sh            0       7      0   100% 217.890   2.200
>>
>> Could you see what's going on?
>>
>> Thanks
>> Martin
> 
> 
> ------------------------------------------------------------------------
> 
> Index: exec.cpp
> ===================================================================
> --- exec.cpp	(revision 447484)
> +++ exec.cpp	(working copy)
> @@ -419,7 +419,7 @@
>  
>      unsigned siginx = 0;
>  
> -    int stopped = 0;
> +    int status, stopped = 0;
>  
>  #ifdef WCONTINUED
>      int waitopts = WUNTRACED | WCONTINUED;
> @@ -442,14 +442,14 @@
>          alarm (timeout);
>  
>      while (1) {
> -        const pid_t wait_pid = waitpid (child_pid, &state.status, waitopts);
> +        const pid_t wait_pid = waitpid (-child_pid, &status, waitopts);
>          if (child_pid == wait_pid) {
> -            if (WIFEXITED (state.status) || WIFSIGNALED (state.status))
> -                break; /*we've got an exit state, so let's bail*/
> -            else if (WIFSTOPPED (state.status))
> -                stopped = state.status;
> +            if (WIFEXITED (status) || WIFSIGNALED (status))
> +                state.status = status;
> +            else if (WIFSTOPPED (status))
> +                stopped = status;
>  #ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this guard also */
> -            else if (WIFCONTINUED (state.status))
> +            else if (WIFCONTINUED (status))
>                  stopped = 0;
>  #endif
>              else
> @@ -460,15 +460,34 @@
>                  /* timeout elapsed, so send a signal to the child */
>                  if (stopped) {
>                      /* If the child process is stopped, it is incapable of
> -                       recieving signals.  Therefore, we'll record this
> -                       and break out of the loop.
> +                       recieving signals.  Therefore, we'll record this.
>                      */
>                      state.status = stopped;
> -                    break;
>                  }
>  
> -                /* ignore kill errors (perhaps should record them)*/
> -                (void)kill (-child_pid, signals [siginx]);
> +                if(0 != kill (-child_pid, signals [siginx])) {
> +                    if (EPERM == errno){
> +                        /* EPERM means 'No permissions to signal any recieving 
> +                           process.  Since we can't send the signal, we'll
> +                           just move on.
> +                        */
> +                        state.status = -1;
> +                        state.killed = -1;
> +                        break;
> +                    }
> +                    if (ESRCH == errno)
> +                        /* ESRCH means 'No process (group) found'.  Since 
> +                           there aren't any processes in the process group, 
> +                           we'll continue so we can collect the return value
> +                           if needed.
> +                        */
> +                        continue;
> +                    /* The remaining error value is EINVAL, which should never 
> +                       happen.  If one of our signals were unsupported, we 
> +                       should get a compile error.  Regardess, we tried to 
> +                       send it, and will move on.
> +                    */
> +                }
>  
>                  /* Record the signal used*/
>                  state.killed = signals [siginx];
> @@ -501,9 +520,8 @@
>                      ; /* Now what? */
>              }
>              else if (ECHILD == errno) {
> -                /* should not happen */
> -                warn ("waitpid (%d) error: %s\n", (int)child_pid, 
> -                      strerror (errno));
> +                /* There are no children left in the process group, so exit */
> +                break;
>              }
>              else {
>                  /* waitpid () error */
> @@ -511,12 +529,10 @@
>                        strerror (errno));
>              }
>          }
> -        else if ((pid_t)0 == wait_pid) {
> -            /* should not happen */
> -        }
> -        else {
> -            /* what the heck? */
> -        }
> +        /* Other possible values for wait_pid are 0, denoting no valid status
> +           when polling the process group, and the pid of a grandchild process.
> +           The former should never happen, and we don't care about the later.
> +         */
>      }
>  
>      /* Clear alarm */
> 
> 
> ------------------------------------------------------------------------
> 
> Index: exec.cpp
> ===================================================================
> --- exec.cpp	(revision 447484)
> +++ exec.cpp	(working copy)
> @@ -419,8 +419,6 @@
>  
>      unsigned siginx = 0;
>  
> -    int stopped = 0;
> -
>  #ifdef WCONTINUED
>      int waitopts = WUNTRACED | WCONTINUED;
>  #else
> @@ -442,34 +440,34 @@
>          alarm (timeout);
>  
>      while (1) {
> -        const pid_t wait_pid = waitpid (child_pid, &state.status, waitopts);
> -        if (child_pid == wait_pid) {
> -            if (WIFEXITED (state.status) || WIFSIGNALED (state.status))
> -                break; /*we've got an exit state, so let's bail*/
> -            else if (WIFSTOPPED (state.status))
> -                stopped = state.status;
> -#ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this guard also */
> -            else if (WIFCONTINUED (state.status))
> -                stopped = 0;
> -#endif
> -            else
> -                ;   /* huh? */
> -        }
> -        else if ((pid_t)-1 == wait_pid) {
> +        const pid_t wait_pid = waitpid (-child_pid, 0, waitopts);
> +        if ((pid_t)-1 == wait_pid) {
>              if (EINTR == errno && alarm_timeout) {
>                  /* timeout elapsed, so send a signal to the child */
> -                if (stopped) {
> -                    /* If the child process is stopped, it is incapable of
> -                       recieving signals.  Therefore, we'll record this
> -                       and break out of the loop.
> +                if(0 != kill (-child_pid, signals [siginx])) {
> +                    if (EPERM == errno){
> +                        /* EPERM means 'No permissions to signal any recieving 
> +                           process.  Since we can't send the signal, we'll
> +                           just move on.
> +                        */
> +                        state.status = -1;
> +                        state.killed = -1;
> +                        break;
> +                    }
> +                    if (ESRCH == errno)
> +                        /* ESRCH means 'No process (group) found'.  Since 
> +                           there aren't any processes in the process group, 
> +                           we'll continue so we can collect the return value
> +                           if needed.
> +                        */
> +                        continue;
> +                    /* The remaining error value is EINVAL, which should never 
> +                       happen.  If one of our signals were unsupported, we 
> +                       should get a compile error.  Regardess, we tried to 
> +                       send it, and will move on.
>                      */
> -                    state.status = stopped;
> -                    break;
>                  }
>  
> -                /* ignore kill errors (perhaps should record them)*/
> -                (void)kill (-child_pid, signals [siginx]);
> -
>                  /* Record the signal used*/
>                  state.killed = signals [siginx];
>  
> @@ -501,9 +499,8 @@
>                      ; /* Now what? */
>              }
>              else if (ECHILD == errno) {
> -                /* should not happen */
> -                warn ("waitpid (%d) error: %s\n", (int)child_pid, 
> -                      strerror (errno));
> +                /* There are no children left in the process group, so exit */
> +                break;
>              }
>              else {
>                  /* waitpid () error */
> @@ -511,14 +508,14 @@
>                        strerror (errno));
>              }
>          }
> -        else if ((pid_t)0 == wait_pid) {
> -            /* should not happen */
> -        }
> -        else {
> -            /* what the heck? */
> -        }
> +        /* Other possible values for wait_pid are 0, denoting no valid status
> +           when polling the process group, and the pid of a grandchild process.
> +           The former should never happen, and we don't care about the later.
> +         */
>      }
>  
> +    if (-1 != state.killed)
> +        (void)waitpid (-child_pid, &state.status, waitopts);
>      /* Clear alarm */
>      alarm (0);
>  
> 
> 
> ------------------------------------------------------------------------
> 
> Index: exec.cpp
> ===================================================================
> --- exec.cpp	(revision 447484)
> +++ exec.cpp	(working copy)
> @@ -467,8 +467,29 @@
>                      break;
>                  }
>  
> -                /* ignore kill errors (perhaps should record them)*/
> -                (void)kill (-child_pid, signals [siginx]);
> +                if(0 != kill (-child_pid, signals [siginx])) {
> +                    if (EPERM == errno){
> +                        /* EPERM means 'No permissions to signal any recieving 
> +                           process.  Since we can't send the signal, we'll
> +                           just move on.
> +                        */
> +                        state.status = -1;
> +                        state.killed = -1;
> +                        break;
> +                    }
> +                    if (ESRCH == errno)
> +                        /* ESRCH means 'No process (group) found'.  Since 
> +                           there aren't any processes in the process group, 
> +                           we'll continue so we can collect the return value
> +                           if needed.
> +                        */
> +                        continue;
> +                    /* The remaining error value is EINVAL, which should never 
> +                       happen.  If one of our signals were unsupported, we 
> +                       should get a compile error.  Regardess, we tried to 
> +                       send it, and will move on.
> +                    */
> +                }
>  
>                  /* Record the signal used*/
>                  state.killed = signals [siginx];
> @@ -522,6 +543,12 @@
>      /* Clear alarm */
>      alarm (0);
>  
> +    /*Kill/cleanup any grandchildren.*/
> +    while (siginx <= sigcount && 0 == kill (-child_pid, signals [siginx]))
{
> +        ++siginx;
> +        sleep(1);
> +    }
> +
>      return state;
>  }
>  


Mime
View raw message