From Jay Soffian <...@cimedia.com>
Subject os-solaris/2834: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for, reason: stat: Stale NFS file handle (errno = 151) in error log
Date Wed, 12 Aug 1998 07:50:03 GMT

>Number:         2834
>Category:       os-solaris
>Synopsis:       Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html
failed for, reason: stat: Stale NFS file handle (errno = 151) in error log
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    apache
>State:          open
>Class:          change-request
>Submitter-Id:   apache
>Arrival-Date:   Wed Aug 12 02:00:01 PDT 1998
>Originator:     jay@cimedia.com
>Release:        1.3.1
SunOS web22 5.5.1 Generic_103640-19 sun4u sparc SUNW,Ultra-2

We have a large number of web-servers that all share a common NFS file system 
which contains all of our html documents. Recently, we began using staging
servers to develop our content, and then using a package such as rsync
to synchronize the content from our staging servers to our web servers.

As soon as we started doing this, we began seeing large numbers of
'Stale NFS file handle' error messages in our logs.

We have tracked the problem down to Solaris's rnode cache. The rnode cache
maintains a mapping between filenames and NFS file handles on the client
side. Although I cannot confirm this, I believe that the cache uses LRU
replacement. Empirical testing seems to indicate that items in the cache do
not have an explicit expiration time. Rather, an item is expired from the
rnode cache only after the server returns a Stale NFS file handle error for it.

Create an index.html document on an NFS file system that includes another
document (say test.html). The NFS client in this case should be Solaris 2.5.1,
although other OSes may experience the same issue. Then, from either
another NFS client or the NFS server, repeatedly run the following while
also repeatedly loading the web page in a browser:

mv test.html test.html~ && cp test.html~ test.html && rm test.html~

Eventually, you should get a '[an error occured while processing this directive]'
where the test.html file should have been included, and a 'Stale NFS file handle'
error message in the error log. Note that it is probably not necessary
to involve an SSI document in this test, but that is where we see the problem
occur most often.
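
For watching the failure from the client side without a browser, a small
probe along the following lines might help. This is only a sketch:
count_stale_stats() is a hypothetical helper, not part of Apache, and it is
an assumption that calling stat() directly on the file hits the same rnode
cache path that the server does.

```c
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical probe, not part of Apache: call stat() on `path`
 * `attempts` times and return how many calls failed with ESTALE.
 * Other errors (ENOENT, EACCES, ...) are deliberately not counted.
 */
static int count_stale_stats(const char *path, int attempts)
{
    struct stat sb;
    int stale = 0;
    int i;

    for (i = 0; i < attempts; i++) {
        if (stat(path, &sb) < 0 && errno == ESTALE)
            stale++;
    }
    return stale;
}
```

Running it on the NFS client against test.html while the mv/cp/rm sequence
above loops elsewhere should, if the assumption holds, return a non-zero
count. On a local file system it should simply return 0.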

The following patch has fixed the problem for us:

*** http_request.c.orig Wed Aug 12 03:28:38 1998
--- http_request.c      Wed Aug 12 03:28:42 1998
*** 211,216 ****
--- 211,218 ----
          errno = 0;
          rv = stat(path, &r->finfo);
+       if (rv < 0 && errno == ESTALE)  /* workaround for Stale NFS Filehandle problem */
+             rv = stat(path, &r->finfo); /* with Solaris's rnode cache */
          if (cp != end)
              *cp = '/';

It seems that the first stat call, which fails, also expires the file from
the rnode cache. The second stat then succeeds. I am not familiar enough
with the NFS implementations on other OSes to know if this patch is relevant
to more than Solaris 2.5.1. However, since the retry only runs after an
ESTALE failure, it seems unlikely to hurt anything.

Also, while there are numerous other calls to stat() in the Apache code, this
seems to be the only one that is generating any errors in our error log.
Admittedly, this instance of stat() gets called much more frequently
than the others in the code. However, perhaps it would be prudent to wrap
stat() either in a macro or in a function such as ap_os_stat(), so that the
handling of ESTALE can be applied throughout the code.
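
A wrapper of the kind suggested here might look roughly like the sketch
below. The name stat_retry_estale() is hypothetical (ap_os_stat() does not
exist in 1.3.1), and the single retry rests on the observation above that
the failed call expires the stale entry. That observation has only been
made on Solaris 2.5.1.

```c
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not an existing Apache function: behaves like
 * stat(), but retries exactly once if the first call fails with ESTALE.
 * Assumption: the failed call flushes the stale client-side cache
 * entry, so the second lookup fetches a fresh file handle.
 */
static int stat_retry_estale(const char *path, struct stat *sb)
{
    int rv = stat(path, sb);

    if (rv < 0 && errno == ESTALE)
        rv = stat(path, sb);    /* one retry only; never loops */

    return rv;
}
```

Callers would use it exactly as they use stat() today, for example
rv = stat_retry_estale(path, &r->finfo). All other errors, including a
second ESTALE, are returned to the caller unchanged.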
