www-infrastructure-dev mailing list archives

From Chad Metcalf <metca...@gmail.com>
Subject Thoughts on the jira outage
Date Sun, 12 Aug 2012 22:41:46 GMT
tl;dr More information, in a public place, thank you for your hard work.

I'd like to start by thanking members of the infrastructure team. I realize
that you are volunteers. I also realize that you all have day jobs and
lives outside infra. So when things like jira go down I understand that
there are other commitments, priorities, and things happening. I also
understand that the infra team is acutely aware of the size and importance
of the Apache jira to users and companies around the Apache projects.

My first contact with the outage was trying to load a JIRA that I am
impacted by. I got the system maintenance page and I clicked through to the
status page, http://monitoring.apache.org/status/. At first glance "Issues -
JIRA - General" was OK. This check is misleading, since of course the system
maintenance page returns 200 OK. I think the expectation for a "status"
page entry for JIRA is the status of the actual JIRA service, not of an HTTP
page. At the time I first landed there wasn't any additional information about
Jira (that I could see). Later there was a message added by Daniel and the
additional "comment" and "scheduled downtime" icons to direct people where
to look. Had there been a message to the effect of "Jira is down for a host
migration; we expect this to take 24-72 hours" I think that would have
been sufficient.
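
To make the point about the check concrete, here is a minimal sketch of a
probe that does not treat a 200 from the maintenance page as "ok". I don't
know how the monitoring checks are actually implemented, so everything here
is an assumption: the function name, the three result strings, and the
marker text are all hypothetical and would need to match the real
maintenance page.

```python
# Hypothetical marker strings; the real maintenance page wording is an
# assumption and would need to be adjusted.
MAINTENANCE_MARKERS = ("system maintenance", "scheduled downtime")


def classify_check(status_code: int, body: str) -> str:
    """Classify one HTTP probe result as 'ok', 'maintenance', or 'down'."""
    # Any non-200 response means the service did not answer normally.
    if status_code != 200:
        return "down"
    # A 200 can still be the maintenance page, so inspect the body too.
    text = body.lower()
    if any(marker in text for marker in MAINTENANCE_MARKERS):
        return "maintenance"
    return "ok"
```

With something like this, the maintenance page would show up on the status
page as "maintenance" instead of green.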

The second place I looked was Twitter. In the past I've appreciated the
quick updates and tidbits by @infrabot. Sure enough I see there are some
issues on Aug 7th. And over the next 12 hours.

> http://id.apache.org and jira are currently down. We are having problems
> with VMs. -- <danielsh> 1:34 AM - 10 Aug 12 via infrabot

> JIRA is still down, sorry, we are working to bing it back ASAP. --
> <pctony> 7:41 AM - 10 Aug 12 via infrabot

> @TheASF please don't ask @infrabot when jira is back --- we know it's down
> and we're working on it -- <danielsh> 12:31 PM - 10 Aug 12 via infrabot

> JIRA is still down, and we expect it to be down for the next 6-8 hours at
> least. -- <pctony> 12:32 PM - 10 Aug 12 via infrabot

This marks the point where I think the response could start to use some
improvement. One of the worst things a sysadmin can do is that
second to last tweet. We've all been there. The Nth person has asked you if
you know that XYZ is down. You're working the issue, you're trying to get
things online, things may or may not be going well. It's frustrating and the
users won't leave you alone. The reason people are still asking is that
a) there are a lot of them and b) clearly the word isn't out. The status
page doesn't have any information at this point. As best I can tell, the
most information anyone outside of infra has at this point is a message on
committers@apache.org saying that jira has overwhelmed its VM host and has
to be migrated back to a physical host.

I'm not sure why that wasn't communicated on status and twitter to begin
with. That information alone communicates a longer time-frame. "Oh they
have to migrate a host. That might take awhile." Instead the lack of
information has the opposite effect. "Why is this still down? What is
taking so long?"  I would also add that committers@apache.org is not the
only place to take communication. There are a lot of contributors and
"simple" users. Sending information to a closed list and then saying "we're
sending updates to a closed list" is also counterproductive. I'm a
contributor not a committer but my job needs jira just as much as anyone

For me, the point at which this switched from being a simple outage to a
problem was 6-8 hours after Tony's last tweet. If you're going to publicly
make an estimate for getting a service online, great. If you miss that
estimate, fine. Make a new estimate or say XYZ happened and you can't make
an estimate.

> The JIRA move is still in full swing, we will update as soon as we know
> more. -- <pctony> 12:30 AM - 12 Aug 12 via infrabot

The next direct update was 36 hours later. Jira is a pretty major service.
People's day to day jobs were impacted. In the middle of that 36 hour
period there were also two tweets directed to me about the
http://monitoring.apache.org/status/ page, which was great. It was the
first time there was any public mention of a host migration. But by this
point it's been almost 2 days. I know it's a volunteer organization and I
know that infra knows that people's jobs are impacted. But there really
needs to be more communication. I understand you may not be able to put
a time-frame on it. Perhaps mention that a degraded VM host is causing the
migration to take forever. The response might be more "oh that sucks" than
"wtf is going on".

So to sum up here is one possible reality that would have been a more

Jira fails. The Issues - JIRA - General on
http://monitoring.apache.org/status/ goes red because the service is down.
Someone adds a comment about the same time the system maintenance page goes
up. Maybe that comment points to the twitter feed for more timely updates.
Or maybe updates go here as well.

> http://id.apache.org and jira are currently down. We are having problems
> with VMs.
> JIRA is still down, sorry, it looks like we need to migrate hosts. We are
> working to bring it back but it might take awhile.
> JIRA is still down, we're in the process of migrating hosts and we expect
> it to be down for the next 6-8 hours at least.
> Unfortunately, the host migration is going slowly due to a degraded vm
> host. Sorry but we don't have an estimate.
> The JIRA move is still in full swing, we will update as soon as we know
> more.

> jira is available again, thanks for your patience. It might be slow while
> folks catch up

Word-smithing aside, a little information goes a long way. Whatever
additional traffic happens on closed lists continues to happen on those
closed lists but I can't really comment there.

I don't think anyone is upset that it took ~3 days. No one is making
technical commentary on how the outage was recovered from. The problem is
around communication and managing expectations. I appreciate the infra
team's efforts and hope next time things go easier.

