From: Joshua Marantz <jmarantz@google.com>
Date: Fri, 19 Oct 2012 09:25:13 -0400
Subject: Re: apr_memcache operation timeouts
To: "modules-dev@httpd.apache.org"
Cc: dev

Following up: I tried doing what I suggested above, patching that change
into my own copy of apr_memcache.c. First of all, it was a bad idea to
pull in only part of apr_memcache.c, because that file changed slightly
between 2.2 and 2.4 and our module works in both.

I was successful making my own version of apr_memcache (renaming
entry-points apr_memcache2*) that I could hack. But when I changed the
socket timeout from -1 to 10 seconds, the system behaved very poorly
under load test (though it worked fine in our unit-tests and
system-tests). In other words, I think the proposed patch that Jeff
pointed to above is not really working (as he predicted). This test was
done without SIGSTOPing the memcached; it would time out under our load
anyway and thereafter behave poorly.

I'm going to follow up on that bugzilla entry, but for now I'm going to
pursue my own complicated mechanism of timing out the calls from my side.

-Josh

On Thu, Oct 18, 2012 at 10:46 AM, Joshua Marantz <jmarantz@google.com> wrote:

> Thanks Jeff, that is very helpful. We are considering a course of action,
> and before doing any work toward this, I'd like to understand the pitfalls
> from people who understand Apache better than we do.
>
> Here's our reality: we believe we need to incorporate memcached for
> mod_pagespeed to scale effectively for very large sites & hosting
> providers. We are fairly close (we think) to releasing this functionality
> as beta. However, in such large sites, stuff goes wrong: machines crash,
> power failure, fiber cut, etc. When it does, we want to fall back to
> serving partially unoptimized sites rather than hanging the servers.
>
> I understand the realities of backward-compatible APIs. My expectation is
> that this would take years to make it into an APR distribution we could
> depend on. We want to deploy this functionality in weeks. The workarounds
> we have tried (backgrounding the apr_memcache calls in a thread and timing
> out in mainline) are complex, and even once they work 100% they will be
> very unsatisfactory (resource leaks; Apache refusing to exit cleanly on
> 'apachectl stop') if this happens more than (say) once a month.
>
> Our plan is to copy the patched implementation of
> apr_memcache_server_connect and the static methods it calls into a new .c
> file we will link into our module, naming the new entry-point something
> else (apr_memcache_server_connect_with_timeout seems good). From a CS/SE
> perspective this is offensive and we admit it, but from a product quality
> perspective we believe this beats freezes and complicated/imperfect
> workarounds with threads.
>
> So I have two questions for the Apache community:
>
>    1. What are the practical problems with this approach? Note that in
>    any case a new APR rev would require editing/ifdefing our code anyway,
>    so I think immunity from APR updates such as this patch being applied
>    is not a distinguishing drawback.
>    2. Is there an example of the correct solution to the technical
>    problem Jeff highlighted: "it is apparently missing a call to adjust
>    the socket timeout and to discard the connection if the timeout is
>    reached"? That sounds like a pattern that might be found elsewhere in
>    the Apache HTTPD code base.
>
> Thanks in advance for your help!
> -Josh
>
>
> On Wed, Oct 17, 2012 at 8:16 PM, Jeff Trawick <trawick@gmail.com> wrote:
>
>> On Wed, Oct 17, 2012 at 3:36 PM, Joshua Marantz <jmarantz@google.com>
>> wrote:
>> > Is there a mechanism to time out individual operations?
>>
>> No, the socket connect timeout is hard-coded at 1 second and the
>> socket I/O timeout is disabled.
>>
>> Bugzilla bug https://issues.apache.org/bugzilla/show_bug.cgi?id=51065
>> has a patch, though it is apparently missing a call to adjust the
>> socket timeout and to discard the connection if the timeout is
>> reached. More importantly, the API can only be changed in future APR
>> 2.0; alternate, backwards-compatible API(s) could be added in future
>> APR-Util 1.6.
>>
>> >
>> > If memcached freezes, then it appears my calls to 'get' will block
>> > until memcached wakes up. Is there any way to set a timeout for that
>> > call?
>> >
>> > I can repro this in my unit tests by sending a SIGSTOP to memcached
>> > before doing a 'get'.
>> >
>> > Here are my observations:
>> >
>> > apr_memcache_multgetp seems to time out in bounded time if I SIGSTOP
>> > the memcached process. Yes!
>> >
>> > apr_memcache_getp seems to hang indefinitely if I SIGSTOP the
>> > memcached process.
>> >
>> > apr_memcache_set seems to hang indefinitely if I SIGSTOP the
>> > memcached process.
>> >
>> > apr_memcache_delete seems to hang indefinitely if I SIGSTOP the
>> > memcached process.
>> >
>> > apr_memcache_stats seems to hang indefinitely if I SIGSTOP the
>> > memcached process.
>> >
>> > That last one really sucks, as I am using that to print the status
>> > of all my cache shards to the log file if I detect a problem :(
>> >
>> >
>> > Why does apr_memcache_multgetp do what I want and not the others?
>> > Can I induce the others to have reasonable timeout behavior?
>> >
>> > When I SIGSTOP memcached this makes it hard to even restart Apache,
>> > at least with graceful-stop.
>> >
>> >
>> > On a related note, the apr_memcache documentation
>> > <http://apr.apache.org/docs/apr-util/1.4/group___a_p_r___util___m_c.html>
>> > is very thin. I'd be happy to augment it with my observations on its
>> > usage and the meaning of some of the arguments if that was desired.
>> > How would I go about that?
>>
>> Check out APR trunk from Subversion, adjust the doxygen docs in the
>> include files, build (make dox) and inspect the results, submit a
>> patch to dev@apr.apache.org.
>>
>> >
>> > -Josh
>>
>>
>>
>> --
>> Born in Roswell... married an alien...
>> http://emptyhammock.com/ >> > > --14dae934067355733404cc6970f1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Following up: I tried doing what I suggested above: patching that change in= to my own copy of apr_memcache.c =A0It was first of all a bad idea to pull = in only part of apr_memcache.c because that file changed slightly between 2= .2 and 2.4 and our module works in both.

I was successful making my own version of apr_memcache (renaming entry-= points apr_memcache2*) that I could hack. =A0But if I changed the socket ti= meout from -1 to 10 seconds, then the system behaved very poorly under load= test (though it worked fine in our unit-tests and system-tests). =A0In oth= er words, I think the proposed patch that Jeff pointed to above is not real= ly working (as he predicted). =A0This test was done without SIGSTOPing the = memcached; it would timeout under our load anyway and thereafter behave poo= rly.

I'm going to follow up on that bugzilla entry, but = for now I'm going to pursue my own complicated mechanism of timing out = the calls from my side.

-Josh


On Thu, Oct 18, 2012 at 10:46 AM, Joshua= Marantz <jmarantz@google.com> wrote:
Thanks Jeff, that is very helpful. =A0We are considering a course of action= and before doing any work toward this, I'd like to understand the pitf= alls from people that understand Apache better than us.

Here's our reality: we believe we need to incorporate memcached for=A0<= a href=3D"http://modpagespeed.com" target=3D"_blank">mod_pagespeed=A0to= scale effectively for very large sites & hosting providers. =A0We are = fairly close (we think) to releasing this functionality as beta. =A0However= , in such large sites, stuff goes wrong: machines crash, power failure, fib= er cut, etc. =A0When it does we want to fall back to serving partially unop= timized sites rather than hanging the servers.

I understand the realities of backward-compatible APIs.= =A0My expectation is that this would take years to make it into an APR dis= tribution we could depend on. =A0We want to deploy this functionality in we= eks. =A0The workarounds we have tried backgrounding the apr_memcache calls = in a thread and timing out in mainline are complex and even once they work = 100% will be very unsatisfactory (resource leaks; Apache refusing to exit c= leanly on 'apachectl stop') if this happens more than (say) once a = month.

Our plan is to copy the patched implementation of apr_m= emcache_server_connect and the static methods it calls into a new .c file w= e will link into our module, naming the new entry-point something else (apr_memcache_server_connect_with_timeout= seems good). =A0From a CS/SE perspective this is offensive and we a= dmit it, but from a product quality perspective we believe this beats freez= es and complicated/imperfect workarounds with threads.

So I have two questions for the Apache community:
=
  1. What are the practical problems with this approach? =A0Note th= at in any case a new APR rev would require editing/ifdefing our code anyway= , so I think immunity from APR updates such as this patch being applied is = not a distinguishing drawback.
  2. Is there an example of the correct solution to the technical probl= em Jeff highlighted: "it is apparently missing a call to adjust the=A0socket timeout and to d= iscard the connection if the timeout is=A0reached". =A0That sounds like a patt= ern that might be found elsewhere in the Apache HTTPD code base.
Thanks in advance for your help!
-Josh


On Wed, Oct 17, 2012 at 8:16 PM, Jeff Tr= awick <trawick@gmail.com> wrote:
On Wed, Oct 17, 2012 at 3:36 PM, Joshua Marantz <jmarantz@google.com> wrote: > Is there a mechanism to time out individual operations?

No, the socket connect timeout is hard-coded at 1 second and the
socket I/O timeout is disabled.

Bugzilla bug https://issues.apache.org/bugzilla/show_bug.cgi= ?id=3D51065
has a patch, though it is apparently missing a call to adjust the
socket timeout and to discard the connection if the timeout is
reached. =A0More importantly, the API can only be changed in future APR
2.0; alternate, backwards-compatible API(s) could be added in future
APR-Util 1.6.

>
> If memcached freezes, then it appears my calls to 'get' will b= lock until
> memcached wakes up. =A0Is there any way to set a timeout for that call= ?
>
> I can repro this in my unit tests by sending a SIGSTOP to memcached be= fore
> doing a 'get'?
>
> Here are my observations:
>
> apr_memcache_multgetp seems to time out in bounded time if I SIGSTOP t= he
> memcached process. Yes!
>
> apr_memcache_getp seems to hang indefinitely if I SIGSTOP the memcache= d
> process.
>
> apr_memcache_set seems to hang indefinitely if I SIGSTOP the memcached=
> process.
>
> apr_memcache_delete seems to hang indefinitely if I SIGSTOP the memcac= hed
> process.
>
> apr_memcache_stats seems to hang indefinitely if I SIGSTOP the memcach= ed
> process.
>
> That last one really sucks as I am using that to print the status of a= ll my
> cache shards to the log file if I detected a problem :(
>
>
> Why does apr_memcache_multgetp do what I want and not the others? =A0C= an I
> induce the others to have reasonable timeout behavior?
>
> When I SIGSTOP memcached this makes it hard to even restart Apache, at=
> least with graceful-stop.
>
>
> On a related note, the apr_memcache
> documentation<http://apr.apache.or= g/docs/apr-util/1.4/group___a_p_r___util___m_c.html>is
> very thin. =A0I'd be happy to augment it with my observations= on its
> usage
> and the meaning of some of the arguments if that was desired. =A0How w= ould I
> go about that?

Check out APR trunk from Subversion, adjust the doxygen docs in the include files, build (make dox) and inspect the results, submit a
patch to dev@apr.ap= ache.org.

>
> -Josh



--
Born in Roswell... married an alien...
http://emptyhammock.= com/


--14dae934067355733404cc6970f1--
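
The pattern Jeff describes in the thread (adjust the socket I/O timeout, and discard the connection if the timeout is reached) can be illustrated with a minimal sketch. It uses plain POSIX sockets rather than APR, so it is not the apr_memcache code or the Bugzilla patch. The names io_result, set_recv_timeout, recv_or_discard and demo_stalled_peer are hypothetical and exist only for this example. One plausible reason to discard rather than reuse the connection: a reply that arrives late on a reused connection could be read as the answer to the next request. Whether that explains the poor behavior under load reported above is not established here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; these are not APR status values. */
enum io_result { IO_OK = 0, IO_TIMEOUT = 1, IO_ERROR = 2 };

/* Bound every subsequent recv() on fd to timeout_ms milliseconds. */
static int set_recv_timeout(int fd, long timeout_ms)
{
    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/*
 * Read up to len bytes from *fd. On timeout or error the connection is
 * closed and *fd is set to -1, so the caller cannot hand a connection with
 * a possibly pending reply back to a pool. *nread == 0 with IO_OK means the
 * peer closed the connection; the caller should treat that as end of stream.
 */
static enum io_result recv_or_discard(int *fd, void *buf, size_t len,
                                      ssize_t *nread)
{
    enum io_result result;

    assert(fd != NULL && buf != NULL && nread != NULL);

    *nread = recv(*fd, buf, len, 0);
    if (*nread >= 0) {
        return IO_OK;
    }

    /* Classify the failure before close(), which may overwrite errno. */
    result = (errno == EAGAIN || errno == EWOULDBLOCK) ? IO_TIMEOUT : IO_ERROR;
    close(*fd);
    *fd = -1;
    return result;
}

/*
 * Simulate a stalled server (comparable to a SIGSTOPped memcached): the
 * peer end of a socketpair stays open but never writes. Returns 1 if the
 * read timed out and the connection was discarded, 0 if not, -1 on setup
 * failure.
 */
int demo_stalled_peer(void)
{
    int sv[2];
    char buf[64];
    ssize_t n;
    enum io_result result;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    if (set_recv_timeout(sv[0], 100) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    result = recv_or_discard(&sv[0], buf, sizeof(buf), &n);
    close(sv[1]);
    return (result == IO_TIMEOUT && sv[0] == -1) ? 1 : 0;
}
```

Calling demo_stalled_peer() from a small main() exercises the timeout path without a real memcached. In APR terms the analogous calls would presumably be apr_socket_timeout_set() on the connection's socket and invalidating the pooled connection instead of releasing it, but that mapping is an assumption and has not been checked against apr_memcache.c.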