apr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From cmpil...@collab.net
Subject Re: RA-DAV tests acting *very* strangely.
Date Mon, 10 Mar 2003 06:21:44 GMT
(Cc:ing this to the APR Development List in hopes that someone there
 reads my problem description and it triggers some distant memory of a
 long-ago pain, or that perhaps they can otherwise shed some light on
 the situation.  For reference, I am running Red Hat 7.2, and the APR
 from HTTP 2.0.44.)

cmpilato@collab.net writes:

> Hm.  Well debugging led me to believe that 5202 didn't actually break
> something, but perhaps shed light on a previously existing other
> problem.  I went over that change with a fine-toothed comb, and could
> find absolutely no fault with it.

After spending all of Friday night and Saturday morning hunting this
puppy, I was all set to put it out of mind until Monday morning.  But
you know how that goes.  The wife wanted to do a little reading before
going to sleep tonight, so I figured I'd use that time to take a fresh
look at this nasty bug.

Wow, does mental clarity help when debugging.

I think I've finally narrowed the bug down to a small enough problem
area, and let me tell you -- it ain't where I expected it.  Let me
recite the facts I had as of 7:00am Saturday:

   - Subversion as of revision 5202 was not passing 'make check' over
     DAV.  The initial import that happens at the beginning of the
     test run was failing because the directory "A" (in the greek
     tree) was being imported more than once.  In case you don't know,
     5202 was the revision in which I changed 'svn import' back to an
     "infix textdelta" commit process instead of a "postfix textdelta"
     one.  This means that the contents of imported files are now
     transmitted as the files are created in the repository instead of
     having all the files created as part of the directory walk, and
     then all the content for all the files being transmitted at the
     very end of the import.

   - The bug seemed to only occur when 'svn import' was run inside the
     Python test suite, but never when run on the cmdline.  I could
     trigger the bug by running 'basic_tests.py 1 --url
     http://localhost' or by calling actions.guarantee_greek_repository()
     from inside of a Python interpreter.

   - svn_io_dir_read() was definitely, without-a-doubt, returning the
     following ordered list of dirents:  { '.', '..', 'iota', 'A', 'A' } 
     (And the second 'A' triggered the failure.)

   - Reverting the 5202 change caused the bug to go away.

   - Keeping the 5202 change, but commenting out the call to 
     editor->apply_textdelta() that happens during the import would
     also cause the bug to go away (though of course, now the tests
     would fail because no file contents were being transmitted).

I began my search tonight with a focus on that messed-up
svn_io_dir_read() situation.  Recall that svn_io_dir_read() is just a
wrapper around apr_dir_read().  

First, I modified the import_dir() function to simply do the loop over
dirents, but nothing else.  I gdb'd the client during its 'svn import'
from a test run and examined the apr_dir_t structure used in that
call.  The 'dirstruct' member of that structure was particularly
interesting to me, and looked like this after each call (the paths
after the '###' below are the current dirent just read by
svn_io_dir_read()):

   ### "."
   (gdb) p *dir->dirstruct
   $4 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### ".."
   (gdb) p *dir->dirstruct
   $6 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "iota"
   (gdb) p *dir->dirstruct
   $8 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A"
   (gdb) p *dir->dirstruct
   $10 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   
Then, I reverted that temporary change and debugged my client again to
see what happened when my weird bug showed up:

   ### "."
   (gdb) p *dir->dirstruct
   $1 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### ".."
   (gdb) p *dir->dirstruct
   $3 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "iota"
   (gdb) p *dir->dirstruct
   $5 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A"
   (gdb) p *dir->dirstruct
   $7 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A" (again)
   (gdb) p *dir->dirstruct
   $10 = {fd = 4, data = 0x80896a0 "èh\b", allocation = 4096, size = 16, 
     offset = 16, filepos = 5, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}

Whoa!  I noticed almost immediately that on that second "A" that is
returned, the *dir->dirstruct->data member had suddenly changed (and
the offset member was no longer incrementing by 16 like it had been).
This got me thinking that something down in APR was causing illness.
So I set a watchpoint on *dir->dirstruct->data, and ran the test
again.  The watchpoint was first triggered when the field was first
initialized with directory data, and then *POW* it was triggered again
on a call to readdir_r() (from apr_dir_read()).

So, readdir_r() was actually fouling up somehow, losing track of where
it was in its directory reading or something.  At this point, I
mis-read the documentation for readdir_r().  I recalled that we were
creating a temporary file in the top-level import directory each time
file contents were sent (in the editor->apply_textdelta() call, which
is commit_apply_txdelta in ra_dav).  My misunderstanding of the
readdir_r() manpage caused me to hack a change in ra.c so that instead
of making the temporary file in the top-level import directory, it
always made the file in "/tmp".  *POOF*.  The bug disappeared.

At this point, I called Karl.

We talked through the situation and by the end had decided on a couple
of things:

   - First, that I had misread the readdir_r() manpage, but that
     regardless, something seemed to be quite wrong with how that
     function (or APR's use of that function) was behaving on my
     system.

   - Secondly, the reason that 5202 appeared the cause of the bug was
     because when we were doing postfix textdeltas, we had finished
     doing all the directory reading for the whole import before we
     created our first temporary file, thus our tmpfile creation
     wasn't screwing up dir reads like it is for me now.

Now that I've read the readdir_r() manpage correctly, it would appear
that it should be completely safe to have this kind of processing:

   opendir(DIR)
   while (readdir(DIR))
     {
        create_file(DIR/tmpfile)
        do_stuff()
        delete_file(DIR/tmpfile)
     }

But apparently this isn't working out too well on my machine.  

Does anyone have any reason to believe that this would not be safe,
and have any ideas why running Subversion via a Python popen3() call
would cause some problem to occur which does not run Subversion is run
directly from the shell?  Inquiring (and admittedly boggled) minds
wanna know.

Mime
View raw message