Monthly Archives: October 2008

All slices on ey01 affected by the issue have been restarted and seem to be working.  

Our engineers are continuing to investigate.

We’ve had to reboot the affected environments and the slices are coming back up. We’ll continue investigating what caused the issue.

We’re currently experiencing an issue on ey01.  Some sites are affected. 

Our engineers are currently investigating.

After researching today’s service interruption on EY02, we determined that one of our clients had their traffic fail over to the maintenance page served by the load balancer. This happened during a particularly heavy traffic spike for this client. The amount of traffic was sufficient to cause problems with the nginx instance on the gateway, which in turn caused the nginx logs to fill up the disk. The full disk on the gateway led to routing issues, which were the root cause of the service interruption.

While we do not believe this was a DoS attack, the amount of traffic that hit the gateway was close to what we would expect to see during one. Until now, we had never seen a single site receive enough traffic to trigger this issue. We are tuning the nginx configuration on the gateways to prevent this from occurring again.

We currently have all sites back online. The cause is related to an issue we are seeing on one of the gateways on EY02, and we have failed over to our backup gateway. We’re continuing to investigate and will post another update once the root cause has been determined.

Several customers on ey02 are currently experiencing slow requests or outages. We are working on diagnosing and resolving the problem.

Dear Engine Yard Customer,

On October 28, 2008, at approximately 8:00 PM PDT, cluster ey00 experienced a hardware failure that resulted in the loss of power to much of the cluster. Four of five power distribution units (PDUs) were found to have logs indicating a reset at 8:00 PM, due to what appears to have been a power disruption on the feeds from the data center.

Working with our on-site technician and our remote Support and Engineering staff, we replaced a PDU that appeared to be continually resetting itself. After verifying the other PDUs, our staff restored power to the connected hardware, which was online at approximately 9:52 PM. Our staff had all slices online at approximately 11:00 PM.

Senior staff have requested that the data center further investigate any power issues that may have occurred during this time. To date, the data center has no knowledge or reports of power issues. We will continue following up with the data center, since this is the second time this issue has presented itself.

Finally, we are working to determine if we can change our procedures related to PDU replacement and racking so that we can decrease the time necessary to replace failed hardware during emergency situations.

Thanks,

Engine Yard

All environments and sites have been verified operational by an Engineer.

This is the final post for tonight. A follow-up post will be made tomorrow, at which time an email will be sent to all affected customers.

Almost all slices are online, and 85 to 95% of environments have been checked by a Support Engineer.  We will continue working to bring the remaining environments fully online.  Once we are sure everything is back in place we’ll post a final notice for the night here.

Our Engineering and Support staff will work on a notice to be posted here and sent to customers with our root cause analysis and the steps we are taking to prevent and correct this issue.

The new PDU is in, the nodes are all back online, and Engineers are verifying clustering.  Slices are coming online now and Engineers will be checking each environment when all slices are back online.

MySQL databases are unaffected by this outage; however, we have DBAs online and ready to assist with any MySQL issue. Two Postgres databases are affected, and those instances will be checked before they are started.