Google explains Google Docs outage

James Delahunty
10 Sep 2011 3:26

Not the best week for cloud services.
Users of Google Inc.'s popular Google Docs suite of web-based applications were unable to use the service for some time on Wednesday. Alan Warren, Engineering Director at Google, posted an explanation on the official Google Docs blog.
The outage was the result of an update that should have made the service better. The change was designed to improve real-time collaboration within document lists, but a lurking memory management bug reared its ugly head when the servers came under heavy usage.
"Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, causing them to eventually run out of memory and restart," Warren writes.
"While they restarted, their load was picked up by the remaining lookup machines - making them run out of memory even faster. This meant that eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday."
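The feedback loop Warren describes can be illustrated with a toy simulation. This is only a sketch under made-up assumptions: the function name `simulate`, the machine count, load, leak size, memory capacity and restart time are all hypothetical and have nothing to do with Google's actual systems or figures. It simply models a fixed request load split evenly across whichever lookup machines are currently up, with each lookup leaking a little memory.

```python
def simulate(num_machines=4, total_load=120, leak_per_lookup=1,
             capacity=1200, restart_ticks=5, ticks=40):
    """Return the number of live lookup machines at each tick.

    All parameters are illustrative assumptions, not real figures.
    """
    # Stagger the starting memory use so machines do not all fail at once.
    leaked = [float(i * 120) for i in range(num_machines)]
    down_for = [0] * num_machines  # ticks of restart time remaining
    history = []

    for _ in range(ticks):
        live = []
        for i in range(num_machines):
            if down_for[i] > 0:
                down_for[i] -= 1  # still restarting during this tick
            else:
                live.append(i)
        history.append(len(live))
        if not live:
            continue  # nothing is up, so every request this tick fails

        # The full load is shared by the machines that are still up,
        # so fewer live machines means more lookups (and leaks) each.
        per_machine = total_load / len(live)
        for i in live:
            leaked[i] += per_machine * leak_per_lookup
            if leaked[i] >= capacity:
                leaked[i] = 0.0  # a restart frees the leaked memory
                down_for[i] = restart_ticks

    return history


if __name__ == "__main__":
    print("with leak:   ", simulate())
    print("without leak:", simulate(leak_per_lookup=0))
```

With these illustrative defaults, the first machine takes 28 ticks to run out of memory, but the next one follows only three ticks later, because the survivors are each handling a larger share of the same load. With the leak set to zero, all four machines stay up for the whole run.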
Google's automated monitoring alerted the team to a spike in the failure rate of attempts to access Google Docs. The engineering team started rolling back the change 23 minutes after the first automated alert. They then doubled the capacity of the lookup service to mitigate the impact of the memory management bug, and the rollback completed 24 minutes later. Within five minutes the service was restored and everything returned to normal.
Read more: http://googledocs.blogspot.com/...
