Date: Tue, 28 Apr 1998 21:08:10 -0700 (PDT)
From: Dean Gaudet <dgaudet@arctic.org>
To: new-httpd@apache.org
Subject: Re: NSPR (was Re: rewritelog inefficiency)
In-Reply-To: <3546578F.2CB32B71@netscape.com>
Message-ID: <Pine.LNX.3.96dg4.980428165026.498F-100000@twinlark.arctic.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: new-httpd-owner@apache.org
Precedence: bulk
Reply-To: new-httpd@apache.org

On Tue, 28 Apr 1998, Wan-Teh Chang wrote:

> In NSPR, every fd that represents a socket or pipe on Unix is put in
> nonblocking mode (O_NONBLOCK flag is on).

Oh wow.  Even when you're using pthreads only?  This seems inefficient.
When you're using pthreads supplied by the kernel/libc then this sort of
multiplexing is all done for you behind the scenes.  Granted, that too
can be slow because of too many contexts -- and that's why the MxN model
is so interesting.  But I'm confused why you'd be doing this in the 1-1
model where each NSPR thread is a kernel thread.  (I'm not claiming to
be an expert here :)

> If a PRFileDesc is in blocking mode, which is the default (note: this blocking
> mode is at the NSPR level, not at the Unix level.  The Unix fd is always
> nonblocking.), PR_Write() does not return until all the data is written.  So
> it may have to make several write() system calls, as follows:
>     write()    /* get EAGAIN, or a byte count less than you requested */
>     poll()
>     write()    /* get EAGAIN, or a byte count less than you requested */
>     poll()
>     ....
>     write()   /* finally the entire user buffer has been transmitted */
>     return
> 
> You can see this write-poll loop in FileWrite() in prio.c and SocketWrite() in
> prsocket.c and pt_Write() and pt_write_cont in ptio.c.

Ok I see the loop in FileWrite() (prfile.c) -- but I don't see the
poll()... If I dig down into _MD_write in src/md/unix/unix.c I see that
when it's a native thread it will use select(). 

I'm guessing that it's done this way so that you can share as much code as
possible between implementations.  So far I haven't seen anything which
requires it to be this way... which is cool, 'cause it means there's room
for improvement :)  Something I happen to enjoy!

> So, even if the write() system call is atomic (are you sure this
> is true, by the way?), PR_Write() may not be atomic if a write() call
> only writes part of the buffer.

The Single UNIX Specification guarantees writes on pipes/FIFOs (blocking
or non-blocking) of size <= PIPE_BUF are atomic.  Larger writes can be
broken on arbitrary boundaries.  It makes a similar guarantee on STREAMS,
but the actual size depends on the STREAM.  I can't find the reference
right now, but there's an ambiguity somewhere which made it seem to
me that all writes on files are atomic (which is absurd).  It makes no
explicit guarantees for sockets... but it actually depends on how you
interpret the semantics of send().

In practice, I'm confident that all unixes make the atomicity guarantee
for writes <= PIPE_BUF.  PIPE_BUF is 512 at a minimum, but is more like
4096 on real systems.  For example, the current apache assumes this is
the case for write()s to the file system, and it's been that way for
years, and nobody complains of messed up logs.  I can't give a specific
example for sockets though which would help prove my claim... but for
logs I'm not interested in atomicity of sockets.
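
If you want to check the limit rather than assume it, something like the
following sketch should work on a POSIX system.  pipe_buf_for() is a
made-up name; fpathconf() and _PC_PIPE_BUF are the standard interface.

```c
#include <unistd.h>

/* Hypothetical helper: ask the system for the atomic-write limit that
 * applies to the pipe or FIFO behind `fd`.  Returns -1 if the limit
 * cannot be determined for this descriptor. */
long pipe_buf_for(int fd)
{
    return fpathconf(fd, _PC_PIPE_BUF);
}
```

The value is only specified for pipes and FIFOs, so calling this on a
regular log file does not prove anything about file writes.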

> > It'll be really unfortunate if we have to add locking for logs at the
> > application layer.  In a pure pthreads setting, for example, we could
> > just write() directly and it would all be taken care of by the kernel --
> > the kernel has to lock anyhow, so it's not needed at the user level.
> 
> You are assuming that every write() will write the entire buffer.  Suppose
> you want to write 64K to a nonblocking socket, write() may return 32K.  Then
> this write() is not atomic.

Nope I don't make that assumption, I just wasn't clear enough in my
description of the problem.

In the case of log writes Apache doesn't actually buffer anything
by default -- it builds each log entry in a buffer on the stack, and
issues a single write() call for it.  I did implement a form of atomic
buffered logs, it's a compile time option.  In this case I use a buffer
of size PIPE_BUF, and delay write()s until we come across a log entry
which won't fit into the buffer.  Then the buffer is flushed (without
the new log entry), and the new log entry is put into the empty buffer
(or written directly if it's larger than PIPE_BUF).  This gives atomic
logs with buffering.
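
A minimal sketch of that buffering scheme, for illustration only.  This is
not the Apache source: struct log_buf, log_flush() and log_append() are
invented names, and error handling is reduced to a return code.

```c
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512            /* fall back to the POSIX minimum */
#endif

/* Hypothetical per-log buffer holding whole entries only. */
struct log_buf {
    int    fd;
    size_t used;
    char   data[PIPE_BUF];
};

/* Flush whatever is buffered with a single write() of <= PIPE_BUF bytes. */
int log_flush(struct log_buf *lb)
{
    if (lb->used > 0) {
        ssize_t n = write(lb->fd, lb->data, lb->used);
        lb->used = 0;
        if (n < 0)
            return -1;
    }
    return 0;
}

/* Append one complete log entry, never splitting it across write()s. */
int log_append(struct log_buf *lb, const char *entry, size_t len)
{
    /* Entry won't fit: flush the buffer first, without the new entry. */
    if (len > PIPE_BUF - lb->used && log_flush(lb) < 0)
        return -1;

    /* Oversized entry: write it directly (no atomicity expected here). */
    if (len > PIPE_BUF)
        return write(lb->fd, entry, len) < 0 ? -1 : 0;

    memcpy(lb->data + lb->used, entry, len);
    lb->used += len;
    return 0;
}
```

The caller still has to call log_flush() at shutdown, or the last few
entries stay in the buffer.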

> In the case of logging, you can probably assume that the data
> buffer is less than 16K or 32K so that each write() will be able
> to write the entire buffer when the fd is nonblocking.  If this
> assumption is valid, then PR_Write() will only make one write()
> system call, and therefore PR_Write() is also atomic.  But I don't
> know if you are able to guarantee that all your write() calls are
> atomic.

Right, it sounds like it's all there on unix.  I just want to nail
the corner cases of the semantics of things.  I'm sure I'll have a
bunch more questions/comments along the way :)

Dean