httpd-dev mailing list archives

From Andrew Wilson <and...@www.elsevier.co.uk>
Subject LogFile Formats
Date Wed, 19 Apr 1995 09:40:26 GMT

> This might be bad form to complain about this functionality this late in 
> the game, but conceptually I have a hard time justifying the 
> two-web-log-hits effect of error response redirects.  I.e., when I access 
> a protected area under a bogus username/password:
> 
> fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 -
> fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703
> 
> The problem is that the second one, when not in the context of the first, 
> looks like a valid user "asdfsaf" accessed a page under authentication.
> I'd have to tell my scripts "no, no, toss out all accesses to 401.html 
> before doing any user-based analysis".

This is not a bad thing, if it brings you closer to the truth.

> What do people think?

The new behaviour came about as a result of Rob H's fix.  I *want* to see
a complete record of the results that redirects produce, so I do want to see
both entries...

However I think the current solution to logging is flawed 'cuz you don't
get told explicitly that the second log entry is a result of the first.
Rob H and I have bickered about augmenting the logfile format so's it records
a unique identifier for each 'transaction'.  A normal GET / would result in:

fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456

where '123456' is the unique id.

A hit that generated a redirect would produce:

fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456
fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703 123456

and so we know that the two log entries are related.
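
To illustrate how a stats script could use the id, here is a minimal Python
sketch that groups entries by their trailing field.  It assumes the unique id
is always the last whitespace-separated token on the line; the function name
group_by_transaction and the third sample line are made up for illustration.

```python
from collections import defaultdict


def group_by_transaction(lines):
    """Group log lines by their trailing unique id.

    Assumption: the id is the last whitespace-separated field of each line.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off the last field; everything before it is the usual entry.
        entry, _, uid = line.rpartition(" ")
        groups[uid].append(entry)
    return dict(groups)


log = [
    'fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456',
    'fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703 123456',
    # Invented entry, only here to show an unrelated transaction.
    'fully - - [19/Apr/1995:01:03:09 -0700] "GET / HTTP/1.0" 200 1024 123457',
]

for uid, entries in group_by_transaction(log).items():
    print(uid, len(entries))
```

A script doing user-based analysis could then look at each group as a whole,
e.g. notice that the 200 for /401.html shares its id with a 401.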

The drawback is that we now have a non-common log format, and that a lot of
existing log munging scripts will croak accordingly.

I'd like to propose that we do 3 things:

1)	Log everything, absolutely everything, and nothing but everything.
	As a rule of thumb, if the action results in some text being sent out
	of the server then that transmission should be logged.  Even if it's
	a 204 No Content or whatever.

2)	Use unique ids to tie related log entries together.  The ids can just
	be strings; I think that using the lower-order clock timing bytes
	is common practice.  Either that or some guaranteed non-repeating
	sequence.

	[it needs to be a solution that works for the non-forking model too]

3)	Provide a support/apache2common script that sucks up Apache log
	files and spits out Common Format logfiles.  This means that
	Joe.Webster's end-of-day stats programs can get something useful to
	read.

	[This would come running to the aid of Brian's "no, no, toss out all
	accesses to 401.html before doing any user-based analysis" cries]
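
For point 2, one possible way to build such an id is sketched below in
Python.  This is only an assumption about what "lower-order clock timing
bytes" might look like in practice: it combines the low 16 bits of the clock
with the process id and a per-process counter, so that a single long-running
(non-forking) process still hands out distinct ids within the same second.
The name next_transaction_id and the 16-bit widths are hypothetical choices,
and the result is not guaranteed to be globally unique.

```python
import itertools
import os
import time

# Per-process sequence; keeps ids apart when many hits land in one second.
_counter = itertools.count()


def next_transaction_id():
    """Return a 12-hex-digit id: low clock bits + pid bits + sequence bits."""
    low_clock = int(time.time()) & 0xFFFF  # low-order 16 bits of the clock
    pid = os.getpid() & 0xFFFF             # tells forked children apart
    seq = next(_counter) & 0xFFFF          # tells hits in one process apart
    return "%04x%04x%04x" % (low_clock, pid, seq)


if __name__ == "__main__":
    print(next_transaction_id())
    print(next_transaction_id())
```

Both log entries produced by one request would be written with the same id,
so the id has to be generated once per transaction and passed along to the
redirect handling, not once per log line.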

The apache format logfile behaviour could be a .conf setting 'LogFileFormat'
with values either 'Common' or 'Apache'.  As a further enhancement the
format of the logfile could be specified in a .conf file, a single line of
the form:

ApacheLogFileFormat HOST REMOTENAME USERID TIME ACCESS STATUS SIZE UNIQUE
 
This same entry could be read by support/apache2common when deciphering the
present state of the real logfile and converting it to the Common form.
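
A rough Python sketch of what support/apache2common might do with that line
follows.  Nothing here is an existing script: the function names, the
tokenising regex, and the choice of which seven fields make up the Common
form are assumptions for illustration.  It also assumes the TIME field is
wrapped in [...] and the ACCESS field in "..." with no embedded quotes.

```python
import re

# Assumed field order of a Common Format entry, using the names above.
COMMON_FIELDS = ["HOST", "REMOTENAME", "USERID", "TIME", "ACCESS", "STATUS", "SIZE"]

# One token is a [bracketed] run, a "quoted" run, or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse_format(directive):
    """Return the list of field names from an ApacheLogFileFormat line."""
    name, *fields = directive.split()
    if name != "ApacheLogFileFormat":
        raise ValueError("not an ApacheLogFileFormat line: %r" % directive)
    return fields


def to_common(line, fields):
    """Rewrite one Apache-format log line as a Common Format line."""
    tokens = TOKEN.findall(line)
    if len(tokens) != len(fields):
        raise ValueError("line does not match the declared format: %r" % line)
    record = dict(zip(fields, tokens))
    # Fields missing from the Apache format are written as '-'.
    return " ".join(record.get(f, "-") for f in COMMON_FIELDS)


fields = parse_format(
    "ApacheLogFileFormat HOST REMOTENAME USERID TIME ACCESS STATUS SIZE UNIQUE"
)
print(to_common(
    'fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703 123456',
    fields,
))
```

Because the converter looks fields up by name, it keeps working if the
Apache-format line reorders fields or adds new ones.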

This approach also lets you drop fields you don't care about, or add new ones
(CGI-VARS perhaps), if you're running your own stats programs.  If these
field names become a standard then mebbies people will write better stats
programs that don't even need support/apache2common.

> 	Brian

Cheers,
Ay.

     Andrew Wilson	     URL: http://www.cm.cf.ac.uk/User/Andrew.Wilson/
Elsevier Science, Oxford   Office: +44 01865 843155    Mobile: +44 0589 616144

