Software Engineering at Google

What software engineering practices does a company of 30,000 engineers need to develop quality software? Each day over 60,000 commits get made within Google by engineers and automated systems. Software development practices are quantifiable rules, not strong opinions.


Software engineering at Google has been an extraordinary experiment in how to develop and maintain a large and evolving code base - Asim Husain, VP Engineering, Google

Hey, They Wrote a Book About It

Software Engineering at Google - Lessons Learned from Programming Over Time describes how Google approaches software engineering. It’s organised into 3 sections, each containing chapters by different authors within Google. The chapters are more like essays and are written in a conversational style which is a joy to read.

  • Culture - How to Work Well on Teams, Knowledge Sharing, Engineering for Equity, How to Lead a Team, Leading at Scale, Measuring Engineering Productivity.
  • Processes - Style Guides & Rules, Code Review, Documentation, Testing (covered in depth), Deprecation.
  • Tools - VCS & Branch Management, Code Search, Build Systems & Philosophy, Static Analysis, Dependency Management, Large Scale Changes, Continuous Integration, Continuous Deployment, Compute as a Service.

The chapters cover both objective information about these topics and case studies from inside Google where these topics came into play. In the chapter on testing we learn about Google Web Server (GWS). I’m certain that most of us can relate to a similar codebase in a company we’ve worked at.

GWS is the web server responsible for serving Google Search queries… In 2005, as the project swelled in size and complexity, productivity had slowed dramatically. Releases were becoming buggier, and it was taking longer and longer to push them out. Team members had little confidence when making changes to the service, and often found out something was wrong only when features stopped working in production. At one point more than 80% of production pushes contained user-affecting bugs that had to be rolled back

To address these problems, the Tech Lead of GWS decided to institute a policy of engineer-driven, automated testing. As part of this policy, all new code changes were required to include tests, and those tests would be run continuously. Within a year of instituting this policy, the number of emergency pushes dropped by half. This drop occurred despite the fact that the project was seeing a record number of new changes every quarter.

Who Should Read This Book?

This book has a wide appeal and I would recommend it to anyone interested in Software Development or adjacent fields. New developers are going to learn a lot from the breadth of topics. Likewise experienced developers will appreciate some of the deeper dives in chapters such as Testing and Measuring Engineering Productivity (not as bad as it sounds, more on this below). Even if you’re not writing code every day it’s interesting to see the processes and methodologies used at Google.

I used to work at a startup and am now at a large product company. Both companies follow the typical microservices approach: GitHub/GitLab for source control and reviews, code packaged into containers, and deployment to AWS. You know, the stack that’s talked about on the internet to the point of absurdity.

Contrast this with Google’s stack. Throughout the book we learn about their use of a monorepo, as well as the tools they use for code review, code search, and static analysis. These were all written in-house and have evolved over years of use. Reading about the details caused me to reflect on the set of tools my company uses. It’s sort of a given that you use Git, and something like GitHub for code reviews and search. A fun thought experiment is to take the software development stack described in the book and apply it to the company you work at.

While each chapter dives into an individual topic, the book as a whole makes no attempt to sell the value of engineering. I’m assuming it’s an intentional choice, so I would not recommend this book to those who see software development as purely a means to a product or commercial end.

Favourite Chapters

Teams and Leadership

Chapter 2 - How to Work Well on Teams, Chapter 5 - How to Lead a Team, Chapter 6 - Leading at Scale

I learned from these chapters in 2 different areas: how the topics relate to me as an individual, and what a ‘good team’ looks like. ‘How to Lead a Team’ describes the different leadership paths one can take. At Google ‘Managers’ are leaders of people, whereas ‘Tech Leads’ lead development and technology efforts. The book covers the responsibilities of both, and talks about why and how an individual contributor might start down a leadership path. I found it helpful for locating specific areas I need to develop to further my career.

You, as a member of your development team, are familiar with how your team works. Over the months any annoyances in the agile process, decision making frameworks, technical considerations, or meetings tend to get ironed out. This results in the team settling into a comfortable way of working. It’s often difficult to see how other teams work, making it hard to learn from them. These chapters sketch a model of what I think is a high performing team: the role of the leaders, the role of members, how decisions should get made, and the considerations team members should have. I’ve found it valuable to contrast my team at work with this model of a high performing team. This sparks ideas and improvements I can bring to my team.

Hypothesis and Decision Making Frameworks

Chapter 7 - Measuring Engineering Productivity

This chapter is better titled ‘Hypothesis and Decision Making Frameworks’. It describes a decision making framework which could be used for making both technical decisions, as well as product and feature decisions. To make a decision we might want to measure some metric. But is the metric worth measuring? Answer these questions first.

  • What result are you expecting, and why?
  • If the data supports your expected result, what action will be taken?
  • If we get a negative result, will appropriate action be taken?
  • Who is going to decide to take action on the result, and when would they do it?

The idea is that if the result of this metric is not actionable then it’s a waste of time to measure it. Assuming we do want to measure it, how do we go about it? Google uses the Goals/Signals/Metrics (GSM) framework.

  • A Goal is a desired end result. It’s phrased in terms of what you want to understand at a high level, and should not contain references on how to measure it.
  • A Signal is how you might know you’ve achieved the end result. Signals are things we would like to measure, but may not be measurable themselves.
  • A Metric is a proxy for a signal. It’s something we can actually measure.
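
The three levels above nest naturally, so they can be sketched as a small data structure. This is only an illustration: the class names, the `unmeasured_signals` helper, and the example goal are my own assumptions, not something taken from the book or from any Google tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Something we can actually measure; a proxy for a signal."""
    name: str


@dataclass
class Signal:
    """How we might know the goal is achieved; may not be measurable itself."""
    description: str
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class Goal:
    """A desired end result, phrased without reference to how it is measured."""
    description: str
    signals: list[Signal] = field(default_factory=list)

    def unmeasured_signals(self) -> list[Signal]:
        """Return the signals that have no metric attached yet."""
        return [s for s in self.signals if not s.metrics]


# Hypothetical example, invented for illustration.
goal = Goal(
    description="Engineers write higher-quality code",
    signals=[
        Signal("Fewer defects reach production", [Metric("rollbacks per 100 pushes")]),
        Signal("Engineers find the code easier to change"),
    ],
)

for signal in goal.unmeasured_signals():
    print(f"No metric yet for signal: {signal.description}")
```

Keeping signals separate from metrics makes it visible when something we care about has no measurable proxy yet, which seems to be the point of starting from goals rather than from whatever data happens to be available.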

Any goal we have is going to have tradeoffs. For example, we could require all teams to implement end-to-end testing in their CI/CD pipelines. This is going to have a short-term impact on the rate at which teams develop new features. It would likely hit teams who own legacy code a lot harder than those on newer projects. To look at these tradeoffs we should assess the goal in terms of QUANTS.

  • Quality of the code - Covering tests, architecture, ease of change, maintainability.
  • Attention from engineers - How an engineer works. Would they be distracted by this? Is what is proposed going to cause the engineer to context switch?
  • Intellectual complexity - Is this going to unnecessarily or unreasonably add to the cognitive load of engineers?
  • Tempo and velocity - How quickly can the engineers write and release code and products?
  • Satisfaction - How happy are engineers with their tools and the products they are developing? Are engineers feeling burned out?

The book applies this framework to a case study within Google. I found this chapter valuable because the cost of an ill-considered product or architectural decision is typically weeks or months of engineering time. It’s silly to think that we could always make the ‘right’ decision because circumstances change all the time. But applying a framework to how we approach decisions gives us some avenue for improving our decision making process.
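
As a rough sketch of how the QUANTS checklist could be used in practice, the snippet below records an impact note per component and reports the ones nobody has thought about yet. The `Quant` enum, the `unassessed` helper, and the example notes are hypothetical names I made up for illustration; the book describes the checklist, not this code.

```python
from enum import Enum


class Quant(Enum):
    QUALITY = "Quality of the code"
    ATTENTION = "Attention from engineers"
    INTELLECTUAL = "Intellectual complexity"
    TEMPO = "Tempo and velocity"
    SATISFACTION = "Satisfaction"


def unassessed(impact_notes: dict[Quant, str]) -> list[Quant]:
    """Return the QUANTS components that have no non-empty impact note."""
    return [q for q in Quant if not impact_notes.get(q, "").strip()]


# Hypothetical assessment of "require end-to-end tests in every CI/CD pipeline".
notes = {
    Quant.QUALITY: "Regressions should be caught before release.",
    Quant.TEMPO: "Slower feature delivery in the short term, especially on legacy code.",
}

for q in unassessed(notes):
    print(f"Not yet assessed: {q.value}")
```

The value is less in the code than in the habit: a goal only gets signed off once every component has at least been considered, even if the note is "no expected impact".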

Thoughts

Let’s start with a couple of assumptions

  1. Companies write more code than their developers can maintain.
  2. Code which is unmaintained for years will rust (a.k.a. bit rot).

Given these, what is a high level software development strategy a company could follow?

Variation in software development practices over a long period of time leads to an increase in complexity. This complexity builds up and starts making itself known through the unhappy faces of the development team at the pub on a Friday night. Flaky tests, unsupported libraries, unmaintained deployment pipelines, long deployment times, lack of documentation. By the time things get to that stage the damage is done, and product delivery timelines stretch out as teams now spend time battling the years of rust.

My reading of the book is that these are problems Google has faced since the early 2000s and still faces today. By standardising the ways of developing software they look to reduce the variation in how software is developed between teams. Though there will still be unmaintained code rusting, if it’s rusting in the same way as other code, in theory it will be easier to one day pick up and bring up to scratch. Projects implemented in a variety of software stacks, libraries, and processes will be more difficult to revive as specialist skills will be needed to bring them back.

What I Would Have Liked To See

More emphasis on product. The book describes processes to achieve an optimal engineering outcome. For a lot of us what pays the bills is product development. Deadlines are set to important product or marketing dates and don’t reflect the time it takes to do quality engineering work. Yes, APIs are customer facing products, but this book has little on the customer and product side of things.

Stakeholder management. Software development teams face the infinite problem of trying to balance an implementation against time and resource constraints. So called clever implementations never get a team far, and more weight is put on communication and give and take with project stakeholders. A chapter on this is sorely missing.

There are no chapters on Agile development methodologies and this is a good thing.

Quotable Quotes

  • The Beyoncé Rule of testing, the answer to “what parts of my code should I actually test?”: “If you liked it then you shoulda put a test on it”.
  • What is the company standard way of implementing X? “There are 2 ways of doing things: the one that’s deprecated and the one that’s not ready yet”.
  • On documentation, “At Google, our more successful efforts have been when documentation is treated like code and incorporated into the traditional engineering workflow”. If code is a liability to the company, does this mean documentation is as well?
  • “Hope is not a strategy”. For engineering, certainly.

Conclusion

Sweet, let’s go and cargo cult all of Google’s software engineering practices. Yeah, nah. The book takes care to describe why the practices are in place at Google. It describes the pain Google was facing, their attempts to fix it, and the policies and processes they put in place. In this respect I found it much more valuable than, say, Accelerate, which proposes a shortlist of policies and processes that are correlated with high performing companies, but offers very little nuance or background on them.

For me the value in the book was seeing Google’s perspective and approach to software engineering. I have a huge list of personal learnings, and I’m sure you would learn a lot from the book as well.

