From: Chad Metcalf
To: infrastructure-dev@apache.org
Subject: Thoughts on the jira outage
Date: Sun, 12 Aug 2012 15:41:46 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b343db634e58404c71948b8

--047d7b343db634e58404c71948b8
Content-Type: text/plain;
 charset=ISO-8859-1

tl;dr More information, in a public place, thank you for your hard work.

I'd like to start by thanking members of the infrastructure team. I realize that you are volunteers. I also realize that you all have day jobs and lives outside infra. So when things like jira go down I understand that there are other commitments, priorities, and things happening. I also understand that the infra team is acutely aware of the size and importance of the Apache jira to users and companies around the Apache projects.

My first contact with the outage was trying to load a JIRA that I am impacted by. I got the system maintenance page and I clicked through to the status page.

http://monitoring.apache.org/status/

At first glance Issues - JIRA - General was ok. This check is misleading, since of course the system maintenance page is returning 200 OK. I think the expectation for a "status" page entry for JIRA is status on the actual JIRA service, not an HTTP page. At the time I first landed there wasn't any additional information about Jira (that I could see). Later there was a message added by Daniel and the additional "comment" and "scheduled downtime" icons to direct people where to look. Had there been a message to the effect of "Jira is down for a host migration, we expect this to take 24-72 hours" I think that would have been sufficient.

The second place I looked was Twitter. In the past I've appreciated the quick updates and tidbits by @infrabot. Sure enough I see there are some issues on Aug 7th. And over the next 12 hours:

> http://id.apache.org and jira are currently down. We are having problems
> with VMs. -- 1:34 AM - 10 Aug 12 via infrabot
>
> JIRA is still down, sorry, we are working to bing it back ASAP.
> -- <pctony> 7:41 AM - 10 Aug 12 via infrabot
>
> @TheASF please don't ask @infrabot when jira is back --- we know it's down
> and we're working on it -- 12:31 PM - 10 Aug 12 via infrabot

> JIRA is still down, and we expect it to be down for the next 6-8 hours at
> least. -- <pctony> 12:32 PM - 10 Aug 12 via infrabot

This marks the point where I think the response could start to use some improvement. One of the worst things a sysadmin can do is that second to last tweet. We've all been there. The Nth person has asked you if you know that XYZ is down. You're working the issue, you're trying to get things online, things may or may not be going well. It's frustrating and the users won't leave you alone. The reason people are still asking is because a) there are a lot of them and b) clearly the word isn't out.

The status page doesn't have any information at this point. As best I can tell, the most information anyone outside of infra has at this point is on committers@apache.org: that jira has overwhelmed its vm host and has to be migrated back to a physical host. I'm not sure why that wasn't communicated on status and twitter to begin with. That information alone communicates a longer time-frame. "Oh they have to migrate a host. That might take a while." Instead the lack of information has the opposite effect. "Why is this still down? What is taking so long?"

I would also add that committers@apache.org is not the only place to take communication. There are a lot of contributors and "simple" users. Sending information to a closed list and then saying "we're sending updates to a closed list" is also counterproductive. I'm a contributor, not a committer, but my job needs jira just as much as anyone else's.

For me, the point where this outage switched from being a simple outage to a problem was 6-8 hours after Tony's last tweet. If you're going to publicly make an estimate for getting a service online, great. If you miss that estimate, fine.
Make a new estimate or say XYZ happened and you can't make an estimate.

> The JIRA move is still in full swing, we will update as soon as we know
> more. -- <pctony> 12:30 AM - 12 Aug 12 via infrabot

That next direct update came 36 hours later. Jira is a pretty major service. People's day to day jobs were impacted. In the middle of that 36 hour period there were also two tweets directed to me about the http://monitoring.apache.org/status/ page. Which was great. It was the first time there was public mention of a host migration. But by this point it's been almost 2 days.

I know it's a volunteer organization and I know that infra knows that people's jobs are impacted. But there really needs to be more communication. I understand you may not be able to put a time-frame on it. Perhaps mention that a degraded vm host is causing the migration to take forever. The response might be more "oh that sucks" than "wtf is going on".

So to sum up, here is one possible reality that would have been more helpful:

Jira fails. The Issues - JIRA - General entry on http://monitoring.apache.org/status/ goes red because the service is down. Someone adds a comment about the same time the system maintenance page goes up. Maybe that comment points to the twitter feed for more timely updates. Or maybe updates go here as well.

> http://id.apache.org and jira are currently down. We are having problems
> with VMs.
>
> JIRA is still down, sorry, it looks like we need to migrate hosts. We are
> working to bring it back but it might take awhile.
>
> JIRA is still down, we're in the process of migrating hosts and we expect
> it to be down for the next 6-8 hours at least.
>
> Unfortunately, the host migration is going slowly due to a degraded vm
> host. Sorry but we don't have an estimate.
>
> The JIRA move is still in full swing, we will update as soon as we know
> more.

> jira is available again, thanks for your patience.
> It might be slow while
> folks catch up

Word-smithing aside, a little information goes a long way. Whatever additional traffic happens on closed lists continues to happen on those closed lists, but I can't really comment there.

I don't think anyone is upset that it took ~3 days. No one is making technical commentary on how the outage was recovered from. The problem is around communication and managing expectations. I appreciate the infra team's efforts and hope next time things go easier.

Thanks
Chad

--047d7b343db634e58404c71948b8--