trafficserver-dev mailing list archives

From Jack Bates <6ny...@nottheoilrig.com>
Subject Download mirrors, plugin, GSoC
Date Sat, 12 May 2012 14:51:39 GMT
Hi, I would like files that are distributed from multiple mirrors to 
work better with caching proxies, and I hope to write a Traffic Server 
plugin to help with this

I would love any input or feedback on how mirrors can work better with 
Traffic Server

The approach that I am taking for my initial attempt is to use RFC 6249, 
Metalink/HTTP: Mirrors and Hashes. I listen for responses that are an 
HTTP redirect and have "Link: <...>; rel=duplicate" headers, then I scan 
the URLs for one that already exists in the cache. If found then I 
transform the response, replacing the "Location: ..." header with the 
URL that already exists in the cache

Later, I would also like to use RFC 3230, Instance Digests in HTTP, and 
find a way to look up URLs in the Traffic Server cache by content digest. 
I gather that ATS does create checksums of content stored in the cache, 
but doesn't support looking up content by digest. Some possibilities: 
extend the core with new APIs to accomplish this, or have a plugin add 
additional entries to the ATS cache for content digests. Alternatively a 
separate store could be used, e.g. Kyoto Cabinet or memcached

Some further ideas for download mirrors and Traffic Server include:

   * Remember lists of mirrors so future requests for any of these URLs 
use the same cache key. A problem is how to prevent a malicious domain 
from distributing false information about URLs it doesn't control. This 
could be addressed with a whitelist of domains

   * Make decisions about the best mirror to choose, e.g. the one that 
is cheapest, fastest, or most local

   * Use content digest to detect or repair download errors

A first attempt at a plugin is up on GitHub: https://github.com/jablko/dedup

I would love any feedback on this code

   1. I assume I want to minimize cache lookups, so I first check that a 
response has both a "Location: ..." header and a "Link: <...>; 
rel=duplicate" header

   2. Then I check if the "Location: ..." URL already exists in the 
cache. If so then I just reenable the response

   3. Otherwise I check if the "Link: <...>; rel=duplicate" URL already 
exists in the cache. If so then I rewrite the "Location: ..." header and 
reenable the response

   4. I continue to scan "Link: <...>; rel=duplicate" headers until a 
URL is found that already exists in the cache. If none is found then I 
just reenable the response without any changes

I use TS_HTTP_SEND_RESPONSE_HDR_HOOK to work on responses sent from the 
cache to clients, rather than on responses sent from the origin to the 
cache, because when the redirect is first received it is likely that no 
mirror URLs are cached yet, so the "Location: ..." header would be 
unchanged. If a mirror URL is later added to the cache, subsequent 
responses of the redirect to clients should be transformed accordingly. 
If a redirect can't be cached, it makes no difference whether it's 
transformed before or after the cache

I use TSCacheKeyDigestFromUrlSet() and TSCacheRead() to check if a URL 
already exists in the cache, thanks to sample code from Leif. This works 
well so far

I use TSmalloc() to allocate a struct to pass variables to TSCacheRead() 
callbacks. Leif mentioned in his sample code that this is suboptimal and 
suggested building with jemalloc (via configure) instead; I will do so

The parsing of "Link: <...>; rel=duplicate" is rough; I would most 
appreciate feedback on this part. I call TSUrlParse() on the span from 
the second character of the field value up to the first ">" character. 
I think that according to RFC 3986 a URI-reference can't contain a ">" 
character, so I think this logic is okay? I use memchr() to find the 
">" character because "string values returned from marshall buffers are 
not null-terminated ... cannot be passed into the common str*() 
routines"

I'm not sure how best to test whether Link headers have a 
"rel=duplicate" parameter. Traffic Server has some private code, 
HttpCompat::lookup_param_in_semicolon_string(), to parse e.g. 
"Content-Type: ...; charset=UTF-8", but nothing in the public API. I 
can probably cobble something together from scratch with memchr(), 
etc., but I'm nervous about getting it right, e.g. all the RFC rules 
about whitespace. Is conforming to the RFC good enough, or are there 
nonconformant implementations to consider? Finally, are there any 
libraries I should consider using?

Unfortunately I don't have enough experience to know which approach to 
try first. If anyone can point me in the right direction, or offer 
advice, I would be very grateful

We run Traffic Server here at a rural village in Rwanda. Getting 
download mirrors to work well with Traffic Server is important because 
many download sites have a download button that doesn't always send 
users to the same mirror, so users can't predict whether a download will 
take seconds or hours, which is frustrating

I am working on this as part of the Google Summer of Code
