--On Wednesday, February 2, 2005 11:38 PM +0200 Graham Leggett
<minfrin@sharp.fm> wrote:
> If mod_cache was taught to serve a being-cached URL directly from the
> cache (shadowing the real download), there would be no need for parallel
> connections to the backend server while the file is being cached, and no
> load spike.
I don't see any way to implement that cleanly and without lots of undue
complexity. Many dragons lie in that direction.
How do we know when another worker has already started to fetch a page?
How do we know whether the response is even cacheable at all?
How do we know when the content is completed?
For example, if the response is chunked, there is no way to know what the
final length is ahead of time.
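
To make the chunked case concrete, here is a rough sketch (not mod_cache
code; read_chunked is a made-up name) of decoding a chunked body. The
running total is all a second reader could see, and the final length is
only known once the zero-size chunk arrives.

```python
import io


def read_chunked(stream):
    """Yield (chunk, bytes_so_far) for each chunk of a chunked body.

    Hypothetical helper for illustration: the total length only becomes
    known when the terminating zero-size chunk is read.
    """
    total = 0
    while True:
        size_line = stream.readline()
        # The chunk size is hexadecimal; ignore any ";extension" suffix.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            # Skip optional trailer fields up to the final blank line.
            while stream.readline() not in (b"\r\n", b""):
                pass
            return
        data = stream.read(size)
        stream.read(2)  # consume the CRLF that follows the chunk data
        total += len(data)
        yield data, total


body = io.BytesIO(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
for chunk, so_far in read_chunked(body):
    print(chunk, so_far)
```

A worker shadowing this download would see 4 bytes, then 9, with no way
to tell whether more is coming until the stream ends.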
If we're still waiting for the initial response (i.e. request has already
been issued but no data received back yet), then we don't know if the
origin server will tack on a Cache-Control: no-store or a Vary header, or
whether there is some other server-driven reason that it won't be cached or acceptable to
this client.
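
A simplified sketch of the kind of check I mean (reusable_for is a
hypothetical name, and this is far from the full set of HTTP caching
rules). The point is that it needs the response headers, which we don't
have while the first request is still in flight.

```python
def reusable_for(first_req, second_req, response):
    """Return True if a response fetched for first_req could also be
    served to second_req, under deliberately simplified rules.

    All three arguments are plain dicts of header name -> value.
    """
    resp = {k.lower(): v for k, v in response.items()}

    # Directive names only; "max-age=60" becomes "max-age".
    directives = {
        d.strip().split("=", 1)[0].lower()
        for d in resp.get("cache-control", "").split(",")
        if d.strip()
    }
    # A shared cache must not reuse no-store or private responses.
    if directives & {"no-store", "private"}:
        return False

    vary = [v.strip().lower() for v in resp.get("vary", "").split(",") if v.strip()]
    if "*" in vary:
        return False  # Vary: * never matches a later request

    # Every request header named in Vary must match between the two requests.
    a = {k.lower(): v for k, v in first_req.items()}
    b = {k.lower(): v for k, v in second_req.items()}
    return all(a.get(name) == b.get(name) for name in vary)
```

Until the origin server answers, every input this function depends on is
unknown, so a second client would be waiting on a guess.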
Additionally, with this strategy, if the first client to request a page is
on a slow link, then other clients who are on faster links will be stalled
while the cached content is stored and then served.
The downside of stalling in the hope that we'll be able to actually serve
from our cache because another process has made the same request seems much
worse to me than our current approach. We could end up making the client
wait an indefinite amount of time for little advantage.
The current approach makes the opposite trade-off: we introduce no
performance penalty to the users, at the expense of additional bandwidth
towards the origin server. We essentially act as if there were no cache
present at all.
I'm also unsure that this strategy would mesh well with mod_disk_cache. I
think an entirely new and different provider would have to be written
(assuming we could surmount the above challenges, which I believe are much
harder than they look). mod_disk_cache deliberately doesn't use shared
memory because it introduces unnecessary complexity to the code.
mod_disk_cache also delays any indication that it has started to fetch the
page until content has been received. In fact, the way mod_disk_cache
works right now is we have an acceptable race condition in that the last
one to finish will store the data and overwrite all the instances that came
before.
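
As an illustration of that "last one to finish wins" behaviour (a sketch
only, not the actual mod_disk_cache code; store and the file layout are
made up): each worker writes to its own temporary file and renames it
over the final name, so no coordination between workers is needed.

```python
import os
import tempfile


def store(cache_dir, key, body):
    """Write body to a private temp file, then rename it into place.

    Hypothetical helper: concurrent callers never see a partial file,
    and whichever caller renames last determines the stored content.
    """
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(body)
        # os.replace overwrites any existing entry with the same name.
        os.replace(tmp, os.path.join(cache_dir, key))
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


with tempfile.TemporaryDirectory() as d:
    store(d, "page", b"copy from worker 1")
    store(d, "page", b"copy from worker 2")
    with open(os.path.join(d, "page"), "rb") as f:
        print(f.read())
```

Both workers fetched the page from the origin; the second copy simply
replaces the first.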
I would rather focus on getting mod_cache reliable than rewriting it all
over again to minimize a relatively rare issue. If it's that much of a
problem, many pre-caching/priming strategies are also available. -- justin