At Invisible we have a small but focused team of engineers who want to ship quickly. We are working toward building a company and technology platform that can employ people everywhere in the world to do knowledge work at a fair wage. We manage complex project and operations teams with our custom tools, so we have a lot of software components to develop and maintain.
We have adopted a monorepo structure for our platform so that everyone on the team stays in tight sync on what is happening across the different parts of the company. It also lets us maintain some coupling between front-end and back-end services without creating multiple PRs on different repos. The unanticipated cost of the monorepo for us was how long our CI pipeline took to run.
We were on CircleCI and had configured our CI test jobs to run every service's tests in parallel on every push to a Pull Request. This configuration is perfect if you need to make sure that one service won't fail when another service or package changes. We treat our entire platform as one product, even though certain members of the team focus on just one part of it most of the time. But our PR test runs were very slow, often over 20 minutes. Several early configuration choices caused this, including running yarn install with lerna across every service and package up front and then caching the result, which produced a 900 MB workspace cache file that had to be downloaded by every subsequent step in the CI workflow.
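To make the anti-pattern concrete, the old setup looked roughly like this (a simplified sketch in CircleCI syntax; job names, images, and cache keys are illustrative, not our actual config):

```yaml
version: 2.1
jobs:
  bootstrap:
    docker:
      - image: cimg/node:16.20
    steps:
      - checkout
      # One up-front install across every service and package in the monorepo
      - run: yarn install && yarn lerna bootstrap
      - save_cache:
          key: workspace-v1-{{ checksum "yarn.lock" }}
          paths:
            - node_modules   # grows to ~900 MB with every service's dependencies
  test-service-a:
    docker:
      - image: cimg/node:16.20
    steps:
      - checkout
      # Every downstream job pays the full download-and-decompress cost
      - restore_cache:
          key: workspace-v1-{{ checksum "yarn.lock" }}
      - run: yarn workspace service-a test
```

The single shared cache couples every job to the slowest part of the pipeline: no test can start until that 900 MB artifact has been fetched and unpacked.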
A slow test run is annoying, but it became unbearable when we started requiring a linear git history for our master branch. With this setting, every PR has to be rebased whenever someone else merges another PR. Since we often have a dozen or more active PRs in one repository, the timer would restart many times each day. A developer would push a change, watch the test results come back, and discover that master had already moved on because someone else merged first. The owner of the PR would then have to rebase, force push, and start waiting for tests all over again. Getting a feature to staging often took an hour of sitting, waiting, and coordinating with other devs. Deploys to production often required manual intervention by a developer using a command-line tool.
The core problem here is “tight coupling”: one team's ability to ship is impacted by another person or team's actions (merging first). Developers were blocked from finishing their work by circumstances outside their control. With a structure like this, each developer we add slows down every other person and team: inverse efficiency scaling. Adding more people slows everyone down. To get faster, every instance of tight coupling between teams and services had to be eliminated.
The result was a very slow “cycle time”: the time it takes to get from PR open to deployed in production. Every context switch spent waiting on an async process is a massive inefficiency.
We outlined what we wanted to change about our CI configuration and set a goal: every CI test run finishes in under 5 minutes, and every deployment script finishes in under 5 minutes. Here’s how we configured our workflows to do that.
- We use path filtering in GitHub Actions so that only the services that have changed run their own tests.
- We parallelized our runs so that multiple kinds of tests run at once.
- We use small caches, unique to each service, instead of one big cache, so each can be downloaded and decompressed in seconds.
- The only “global” tests are very short linting tests that don’t require as much installation and setup time.
- We added a check to our back-end service that catches breaking changes to our GraphQL schema and stops developers from shipping them, which eliminates another point of tight coupling between services.
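The first three points can be sketched as a per-service workflow (a hypothetical example; service names, paths, and versions are illustrative, not our actual config). Because each service gets its own workflow with its own path filter, only the affected workflows trigger on a PR, they all run in parallel, and each restores only its own small cache:

```yaml
name: service-a-tests
on:
  pull_request:
    paths:
      - "services/service-a/**"   # run only when this service changes...
      - "packages/shared/**"      # ...or a package it depends on
jobs:
  test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: services/service-a
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          # A small, service-scoped cache instead of one 900 MB workspace cache
          path: services/service-a/node_modules
          key: service-a-${{ hashFiles('services/service-a/yarn.lock') }}
      - run: yarn install --frozen-lockfile
      - run: yarn test
```

Separate `unit`, `lint`, and `integration` jobs inside the same workflow would likewise run concurrently, which is how the different kinds of tests parallelize.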
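The schema check can be sketched as its own workflow (again hypothetical; we assume GraphQL Inspector as the diffing tool here, and the schema path is illustrative). The `diff` command compares the PR's schema against the copy on master and exits non-zero when it detects breaking changes, which fails the check:

```yaml
name: schema-check
on:
  pull_request:
    paths:
      - "services/api/schema.graphql"
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so origin/master's schema is available
      - run: >
          npx @graphql-inspector/cli diff
          'git:origin/master:services/api/schema.graphql'
          services/api/schema.graphql
```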
After these changes, our PR test suite runs in 1 to 4 minutes. All of this has significantly improved our cycle time. Everyone on the team can feel how much faster they can move with the shorter waits. We can also deploy more confidently, because we know changes and fixes will be relatively quick and cheap to ship.
The switch from CircleCI to GitHub Actions will also help our bottom line: we will use fewer minutes of compute time, and we can use lower-cost compute resources since each workflow is doing less work. We can also take advantage of the 3,000 minutes included with the GitHub Team plan we were already paying for.
We are going to add a few more lint checks in the near future and improve our automatic syncing of issue status in JIRA using these actions. I’m very excited about unblocking our engineering team to move faster. Instead of slowing down as we add more engineers, we can ensure each developer on the team is able to ship more often and more confidently.