Friday, January 6, 2012

Google: 'At scale, everything breaks'

Jack Clark has an interesting interview with Urs Hölzle, Google's first vice president of engineering, on ZDNet, in which Hölzle acknowledges the difficulties of maintaining massively scaled systems. The full interview is available here. Some excerpts:
Automation is key, but it's also dangerous. You can shut down all machines automatically if you have a bug. It's one of the things that is very challenging to do because you want uniformity and automation, but at the same time you can't really automate everything without lots of safeguards or you get into cascading failures.
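
As a rough illustration of the kind of safeguard Hölzle is talking about (this is not from the interview, and certainly not Google's code), here is a minimal Python sketch of a cap on how much of the fleet a single automated action may touch. The names `plan_shutdown`, `SafeguardError`, and `max_percent`, and the 5% default, are all hypothetical choices made for the example.

```python
class SafeguardError(Exception):
    """Raised when an automated action would touch too many machines."""


def plan_shutdown(targets, fleet_size, max_percent=5):
    """Return the shutdown list only if it stays under the safety cap.

    Hypothetical safeguard: refuse any single automated action that would
    affect more than `max_percent` percent of the fleet, so that a bug in
    the automation cannot take every machine down at once.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    limit = fleet_size * max_percent // 100
    if len(targets) > limit:
        raise SafeguardError(
            f"refusing to shut down {len(targets)} of {fleet_size} machines "
            f"(limit is {limit})"
        )
    return list(targets)
```

With a fleet of 100 machines and the default cap, a request to shut down two machines goes through, while a request to shut down all 100 raises `SafeguardError` instead of running. A real system would need many more checks than this one, which is exactly the point of the quote.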

Complexity is evil in the grand scheme of things because it makes it possible for these bugs to lurk that you see only once every two or three years, but when you see them it's a big story because it had a large, cascading effect.

Keeping things simple and yet scalable is actually the biggest challenge. It's really, really hard. Most things don't work that well at scale, so you need to introduce some complexity, but you have to keep it down.
