Mailing-List: contact repo-maintainers-help@maven.apache.org; run by ezmlm
Precedence: bulk
Reply-To: repo-maintainers@maven.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Message-ID: <4A51ABC9.80904@contegix.com>
Date: Mon, 06 Jul 2009 02:46:17 -0500
From: Contegix Notifications <notifications@contegix.com>
User-Agent: Thunderbird 2.0.0.22 (Macintosh/20090605)
MIME-Version: 1.0
To: notifications@contegix.com
Subject: Contegix Network Incident Report
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit

Contegix Customer:

Please do not reply to this email.  If you have any questions, please submit a support request to support@contegix.com.

At approximately 11:39 AM on July 2nd, our NOC engineers began receiving several monitoring alarms alerting us to a potential network issue. 
We found our core switches were dropping packets for both internal and external traffic.

We began to investigate and found abnormal traffic lights on one of our intrusion prevention systems.  At that time, we believed this to be 
the cause and physically bypassed the units. The problem persisted, so we quickly determined that this was not the root cause. We then 
began to troubleshoot our core switching.

At approximately 11:59 AM, we determined there was a multicast packet storm on our network. Due to the high number of packets, the CPUs in 
both core switches reached maximum capacity, which caused packet loss. After further debugging, we found that the storm was from a routing 
protocol (VRRP-E) multicast IP and originated from a specific customer core switch port. The customer connected to this port had experienced a 
switch malfunction a few minutes prior to the network issue, and we determined this could be the cause. At approximately 12:05 PM, we 
disabled the customer port and the CPUs on our core switches began to stabilize.

Network availability to internal and external destinations was restored, but we found that we still could not reach a few external 
destinations. Also, traffic was increasing on our network but had not reached normal utilization. After further troubleshooting, we found that we 
could not route out through Level(3)'s network. Based on our observations and data, we could not determine the reason for the Level(3) issues. At 
approximately 12:19 PM, we disabled BGP with Level(3). Once this was disabled, our network returned to normal and traffic flowed through to 
outbound routes correctly.

While the issue started when a customer replaced a switch, we do not believe this is the direct cause. We do suspect that it triggered a bug 
in our core switch software despite all engineered precautions.  We are working closely with the hardware manufacturer to determine the 
exact cause. We will forward any new information on this issue and its long-term resolution. In the interim, we have placed a moratorium on 
connecting new customer switching equipment to our core switches.  In addition, we restored our BGP session with Level(3) once it was 
determined to be safe.

We apologize for any inconvenience this may have created for you or your customers. Our reliable network is one of our great assets, and we 
place a great deal of emphasis on making sure it is working optimally. As mentioned before, we are working closely with the switch 
manufacturer to identify and fix this bug to make sure this does not occur again.


Sincerely,
Contegix Support

---
Contegix
900 Walnut Street
Suite 700
Saint Louis, MO  63102
Phone: 314.622.6200 ext. 3
Toll Free: 877.4.CONTEGIX ext. 3
Fax: 314.621.4422
E-mail: support@contegix.com
Beyond Managed Hosting(r) for Your Enterprise
Favorite Linux-Friendly Hosting Company - Linux Journal
http://www.contegix.com/linuxjournal