Monthly Archives: March 2009

The push was done with no incident.

We are pushing changes to the ey01, ey02, ey03 and ey05 gateways right now.

Dear Engine Yard Customers,

As many of you know, we experienced a severe outage at our west coast data center yesterday; many of our customers were affected and experienced several hours of downtime. Our engineers became aware of the problem as soon as it occurred, and began the relevant data center escalation procedures.

Engine Yard customers rely on us to run and support their business-critical applications, and that includes relying on our selection of vendors. In this case, we have failed to meet our service level agreements with our west coast customers, and we will, of course, be providing customers with the appropriate service credits.

In the attached report, I have detailed yesterday’s issues, as well as the swift steps we are taking to ensure that this does not happen again. We sincerely apologize for this outage, but are more committed than ever to providing the level of service Engine Yard customers have come to expect.

If you have any additional inquiries, members of our technical teams are available to answer any and all questions; emails can be sent to info@engineyard.com.

Here at Engine Yard we are major supporters of Ruby and Rails. We understand that in order to grow, our ecosystem needs a network of reliable and professional service providers, and we intend to deliver.

John Dillon
CEO

What Happened

Yesterday, March 30th, at 9:00 a.m. (PST), our west coast data center experienced a loss of internet connectivity. Our support engineers detected the outage immediately and began investigating the cause. Once we confirmed that the cause was connectivity, we posted the first update to our status blog (9:19 a.m.). We continued to post updates for customers as we received new information from Herakles.

We were in touch with Herakles senior management for updates at 15-minute intervals. Connectivity began to be restored at approximately 1:30 p.m., and all customers were fully restored by 3:45 p.m. The outage affected about two-thirds of our customer base.

Why Did It Happen

Our data center provider, Herakles, maintains redundant internet uplinks with redundant equipment. Normally the failure of a single internet uplink or switch will prompt a failover event, with minimal loss of connectivity. In this case, however, the route processor of one of the redundant switches (a Cisco 6509) malfunctioned. As part of the malfunction, the device stopped seeing its BGP peers as active and therefore determined that they had failed. As a result, the device incorrectly promoted itself to master switch and stopped passing traffic inbound or outbound. Complicating the matter, the alerts from the malfunctioning switch that should have notified Herakles' monitoring systems of the failure were themselves not routed past the switch.
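The monitoring blind spot described above can be illustrated with a minimal sketch. It assumes a hypothetical watchdog on a separate network path that treats a missing heartbeat as the failure signal, instead of waiting for an alert that the failed device must deliver itself. The class name `HeartbeatWatchdog`, the 30-second timeout, and the device labels are illustrative assumptions and do not describe Herakles' actual monitoring.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatWatchdog:
    """Flags a device as silent when its heartbeats stop arriving.

    Hypothetical sketch: timestamps are plain seconds supplied by the caller.
    """
    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, device: str, now: float) -> None:
        # Remember the most recent time we heard from this device.
        self.last_seen[device] = now

    def silent_devices(self, now: float) -> list:
        # A device is silent if its last heartbeat is older than the timeout.
        return sorted(
            device
            for device, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


watchdog = HeartbeatWatchdog(timeout_s=30.0)
watchdog.record_heartbeat("switch-a", now=0.0)
watchdog.record_heartbeat("switch-b", now=0.0)

# switch-b keeps reporting; switch-a stops and sends no alert of its own.
watchdog.record_heartbeat("switch-b", now=40.0)

silent = watchdog.silent_devices(now=45.0)
print(silent)
```

Because the check is driven by the absence of a signal, it does not depend on the failing device being able to route its own alerts.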

How It Was Repaired

Herakles data center network engineers worked with Cisco on-site engineers and began debugging the failed switch immediately. The first attempt to repair the switch, by replacing its route processor, failed. After additional troubleshooting steps, the support engineers physically disconnected the malfunctioning switch, forcing the redundant switch to take over as master. This fully restored traffic, but has now left the internet uplink without switch-level redundancy.

Next Steps

Herakles is currently testing a new redundant switch in its test lab, and will install this during a scheduled maintenance window as soon as possible. When we receive notice of the scheduled maintenance window from Herakles, we will immediately communicate this to customers.

Engine Yard Plans

In September 2008, we began the process of adding an alternative connectivity provider to our west coast data center. We chose the provider that already serves our east coast data center.

Since the new provider did not yet have a presence in Herakles, this process has taken several months to implement. By April 15th, we will be able to offer this provider as an alternative. At that time we will coordinate with customers who wish to move to the new provider.

We have an official update from Herakles.

1) Service has been restored and is once again stable. However, the network connection is not redundant at this point.
2) They have assured us that there will not be any changes made during production hours.
3) Changes will be made only after they have scheduled and announced a maintenance window.
4) The initial report indicates that the outage was due to a failed hardware card in a pair of redundant access switches.

Additional information will be forthcoming after we have had time to work out a root cause and preventative measures with Herakles. Thank you for your patience.

We are still seeing some restored connectivity to the data center. Herakles is still working to completely resolve this issue and there is still the chance that you will experience additional service interruptions at this time.

We have heard from the data center that some connectivity has been restored. However, they are still working to completely resolve this issue and there is still the chance that you will experience additional service interruptions at this time.

At this time we are starting to see connectivity into the data center again. However, we do not have an official resolved notification from the data center. We will continue to keep you updated.

Cisco is still on site at Herakles helping them resolve the core switching issues at the data center. We will continue to keep you updated every hour or as we learn new information.

We have just received another update from Herakles. The issue has been escalated to Cisco, who now has engineers on-site assisting with it.

We have been in constant contact with our Herakles data center throughout this service interruption. Unfortunately, Herakles has not yet determined the root cause of the ISP outage. Herakles started troubleshooting at the core switching, then moved to the routers coming into the facility, and is now focused on the distribution switches inside the data center. Herakles does not have an ETA for resolution of this problem. We will continue to keep you updated as we hear information from the Herakles data center operations team.