httpd-dev mailing list archives

From Niklas Edmundsson
Subject Re: mod_disk_cache summarization
Date Sat, 28 Oct 2006 18:50:45 GMT
On Fri, 27 Oct 2006, Graham Leggett wrote:

> Niklas Edmundsson wrote:
>> Different VHosts meaning different URLs/directories, pointing to the same 
>> files...
> Hmm... Two thoughts come into my head over this one.
> One way to approach this is to treat this as a general problem of how do we 
> stop people who download the same file from multiple places (say different 
> mirrors via proxy, or different URLs to the backend like you have) from 
> downloading multiple copies of the same file hosted at different URLs.
> Here you might have some kind of regex-like expression, like *.iso, that says 
> "all files whose names match this regex, are considered the same file". A 
> mechanism might have a small cache of filenames that have matched the regex 
> in the past, and that link to actual cached entries in the cache.
> This would need to be abstracted out into an existing hook (or new one if 
> necessary).
> A second approach could involve the use of the Etags associated with file 
> responses, which in the case of files served off disk (as I understand it) 
> are generated based on inode number and various other uniquely file specific 
> information.
> Therefore in theory two responses with the same Etag are actually the same 
> file, and if you've already cached a file with that Etag, then the same Etag 
> quick cache scenario described above could provide a shortcut to the same 
> file cached at a different URL.

For our use, the following solves the "multiple URLs point to the 
same file" problem: when caching a file larger than $threshold (we use 
64k), write only an "alias header" for the URL, saying "this URL 
equals r->filename". Then hash on r->filename and cache the file under 
that key. A read follows the alias header and opens the cached file.
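The alias-header scheme can be sketched roughly as follows. This is an illustrative Python sketch, not the actual mod_disk_cache code: the function names, the SHA-1 key hashing and the `.alias`/`.body` file suffixes are all assumptions made for the example; only the 64k threshold and the "URL equals r->filename" indirection come from the description above.

```python
import hashlib
import os
import tempfile

THRESHOLD = 64 * 1024  # the 64k threshold mentioned in the message


def _key(name: str) -> str:
    """Hash a URL or filename into a cache entry name (hypothetical scheme)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def store(cache_dir: str, url: str, filename: str, body: bytes) -> None:
    """Cache body for url; large bodies are stored once, keyed on filename."""
    if len(body) > THRESHOLD:
        # The body is keyed on the backend filename, so all URLs share it.
        with open(os.path.join(cache_dir, _key(filename) + ".body"), "wb") as f:
            f.write(body)
        # The URL entry is only an alias header: "this URL equals filename".
        alias = os.path.join(cache_dir, _key(url) + ".alias")
        with open(alias, "w", encoding="utf-8") as f:
            f.write(filename)
    else:
        # Small bodies are cached directly under the URL.
        with open(os.path.join(cache_dir, _key(url) + ".body"), "wb") as f:
            f.write(body)


def fetch(cache_dir: str, url: str):
    """Return the cached body for url, following an alias header if present."""
    alias = os.path.join(cache_dir, _key(url) + ".alias")
    if os.path.exists(alias):
        with open(alias, encoding="utf-8") as f:
            body_name = _key(f.read())
    else:
        body_name = _key(url)
    try:
        with open(os.path.join(cache_dir, body_name + ".body"), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None  # cache miss
```

With this layout, two vhost URLs that map to the same r->filename each get a tiny alias entry, while the large body is written to the cache only once.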

This only works with a filesystem backend, and it does not solve the 
real problem of multiple symlinks pointing to the same file. The 
symlink problem is a significant source of data duplication in the 
cache for us, but I suspect that there must be a relatively clean 
solution to it. I'm not particularly fond of the "stat each component 
of the path" solution though, even though caching would reduce the 
stat hammering on the backend.
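For reference, the "stat each component of the path" approach with caching might look something like the sketch below. It is only an illustration in Python: `canonical_path` is a hypothetical helper, the cache size is arbitrary, and a real implementation would also need some way to expire entries when a symlink changes.

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=4096)
def canonical_path(path: str) -> str:
    """Resolve symlinks in every path component; results are memoized.

    os.path.realpath walks the path component by component, which is the
    expensive part on a busy backend. The LRU cache means repeated lookups
    of the same path do not touch the filesystem again, at the cost of
    returning stale answers if a symlink is re-pointed.
    """
    return os.path.realpath(path)


if __name__ == "__main__":
    # Two names for one file resolve to the same canonical cache key source.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "file.iso")
        open(target, "wb").close()
        link = os.path.join(d, "link.iso")
        os.symlink(target, link)
        print(canonical_path(link) == canonical_path(target))
```

Hashing on the canonical path instead of the raw r->filename would make symlinked names share one cache entry, which is the deduplication the paragraph above is after.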

After reading Henrik's post, I suspect that the only way to do this 
for a non-file backend is to use Content-MD5, and that sounds way too 
expensive to be really usable...
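To make the cost concern concrete, here is a minimal Python sketch of computing a Content-MD5 style digest for a response body. The function name and chunk size are assumptions for illustration; the point is that every byte of the body has to be read and hashed before the digest, and therefore the cache key, is known.

```python
import base64
import hashlib
import io


def content_md5(stream, chunk_size: int = 64 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a body read from stream.

    This is the encoding used by the Content-MD5 header. The whole body
    must pass through the hash, so the work grows linearly with file size.
    """
    digest = hashlib.md5()
    # Read in fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

Two responses with identical bodies yield the same digest regardless of URL, but for multi-gigabyte files the full read is exactly the expense mentioned above.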

  Niklas Edmundsson, Admin @ {acc,hpc2n}      |
  Confucious say too damn much!
