A Software Architect Blog

Fast moving teams

March 01, 2020

We have been working on removing bottlenecks so that our teams can move fast, which basically means reducing the time it takes a feature to get from idea to production. I want to review the journey from one large codebase released 5-10 times a week to over a dozen smaller codebases releasing ~250 times a week. This post looks at the practical side of how this was achieved and also at the cultural change it brought to the company.

The Monolith

When we started this process we had a large monolith codebase of several million lines. There was one code repository, deployed as two applications powering the mobile and desktop websites. The websites were released, usually daily, after a short manual regression test. The regression testing focused on core areas and on changes highlighted by an automated change-log. After each release the site was monitored for any increase in errors.

Good

Usually only 24 hours from a code change to a release.
Easy to roll back a bad release
Larger features could be shielded and released to only a percentage of users
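
Releasing a shielded feature to only a percentage of users is often done by hashing each user into a stable bucket. The sketch below is a minimal illustration of that idea, not the code we ran: the FNV-1a hash and the names bucketFor and isFeatureEnabled are assumptions made for the example.

```typescript
// Hypothetical sketch: deterministic percentage rollout for a shielded feature.
// The hash choice (FNV-1a) and all names are assumptions for illustration.

/** Hash a string to an unsigned 32-bit integer with FNV-1a over UTF-16 code units. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32-bit space
  }
  return hash >>> 0; // convert to unsigned
}

/** Map a user and feature to a stable bucket in the range 0..99. */
function bucketFor(userId: string, feature: string): number {
  // Including the feature name means different features slice users differently.
  return fnv1a(`${feature}:${userId}`) % 100;
}

/** True when the user falls inside the rollout percentage (0 to 100). */
function isFeatureEnabled(userId: string, feature: string, rolloutPercent: number): boolean {
  return bucketFor(userId, feature) < rolloutPercent;
}
```

Because the bucket depends only on the user and the feature name, a user who sees the feature at 10% keeps seeing it when the rollout is raised to 50%.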

Bad

Slow build - lots of tests and code to build
Flaky tests causing lots of red builds - lots of Slack conversations about reverting changes
Slow to release, with the release process taking up to 1 hour - QA time was used to regression-test code and to coordinate and process releases.
Manual steps required to communicate and process a release
Lots of squads in one codebase - too many owners - no-one taking responsibility for maintaining overall codebase quality

A new architecture

A natural step seemed to be to move toward a micro-frontend architecture. This would allow squads to work faster and more independently on smaller, more specific codebases.

To enable this move we had to add a few pieces to our architecture.

Edge proxy - to control routing of paths to applications
Data Services - to serve data to different applications

It was also important to centralise some parts so they could be improved and evolve over time.

Component library - to allow for reuse of components between different applications
Platform servers - a cookie cutter image/server to host the smaller applications
Centralised build tools
Utility libraries

Smaller applications

Each application is a small codebase specific to one area of the website. They are all based on Create React App and use the same build/deployment workflow, with an edge proxy routing traffic to each application.
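
As a rough illustration of how an edge proxy can route paths to applications, here is a minimal longest-prefix match. The route table, the application names and the resolveApp function are hypothetical and do not reflect our real proxy configuration.

```typescript
// Hypothetical sketch: choose the application that owns a request path.
// Route prefixes and application names are invented for illustration.

interface Route {
  prefix: string; // path prefix owned by the application
  app: string;    // name of the application that serves it
}

const routes: Route[] = [
  { prefix: "/", app: "home-app" },
  { prefix: "/search", app: "search-app" },
  { prefix: "/account", app: "account-app" },
];

/** True when the path equals the prefix or sits underneath it as a sub-path. */
function matchesPrefix(pathname: string, prefix: string): boolean {
  if (prefix === "/") return true; // the root route is the catch-all
  return pathname === prefix || pathname.indexOf(prefix + "/") === 0;
}

/** Return the app with the longest matching prefix, or undefined if none match. */
function resolveApp(path: string, table: Route[]): string | undefined {
  // Ignore any query string when matching.
  const queryStart = path.indexOf("?");
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);

  let best: Route | undefined = undefined;
  for (const route of table) {
    const isLonger = best === undefined || route.prefix.length > best.prefix.length;
    if (isLonger && matchesPrefix(pathname, route.prefix)) {
      best = route;
    }
  }
  return best ? best.app : undefined;
}
```

Matching on whole path segments means "/searching" falls through to the catch-all route rather than being sent to the search application by accident.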

A new approach to regression testing

We built a new centralised testing framework that supports various testing tools: Cypress, BackstopJS, Jest integration tests, Pact, Lighthouse and autocannon. Tests are easy to add to a codebase by convention, and they run during build pipelines or after deployment to environments. There is no manual regression testing.
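
To give a feel for what "by convention" can mean, the sketch below picks the tools to run from the directories present in a codebase. The directory names, the split between build and post-deploy stages, and the planTests function are all assumptions for the example, not the framework's actual layout.

```typescript
// Hypothetical sketch: decide which test tools to run from directory conventions.
// Directory names and stage assignments are assumptions for illustration.

type Stage = "build" | "post-deploy";

interface Convention {
  tool: string;      // test tool to invoke
  directory: string; // directory whose presence opts the codebase in
  stage: Stage;      // pipeline stage at which the tool runs
}

const conventions: Convention[] = [
  { tool: "jest", directory: "tests/integration", stage: "build" },
  { tool: "pact", directory: "tests/contract", stage: "build" },
  { tool: "cypress", directory: "tests/e2e", stage: "post-deploy" },
  { tool: "backstop", directory: "tests/visual", stage: "post-deploy" },
  { tool: "lighthouse", directory: "tests/performance", stage: "post-deploy" },
  { tool: "autocannon", directory: "tests/load", stage: "post-deploy" },
];

/** List the tools to run at a stage, based on which conventional directories contain files. */
function planTests(repoFiles: string[], stage: Stage, table: Convention[] = conventions): string[] {
  return table
    .filter((c) => c.stage === stage)
    .filter((c) => repoFiles.some((file) => file.indexOf(c.directory + "/") === 0))
    .map((c) => c.tool);
}
```

With this shape, a squad opts in to a tool simply by adding files under the matching directory, and the pipeline needs no per-codebase configuration.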

Monitoring and visibility

More monitoring and application health visibility was required, as it is much harder to monitor 20 applications than two. We achieved this by creating overview dashboards in Kibana that show the overall health of each application at a high level, followed by more detailed and templated views for each application. We also created tools to show the versions of modules in use.
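
Kibana builds these views from the logs themselves, but the idea behind a high-level overview can be sketched as a simple roll-up of error rates into a traffic-light status. The thresholds, type names and functions below are assumptions chosen for the example.

```typescript
// Hypothetical sketch: roll per-application error rates up into an overview.
// The 1% and 5% thresholds are illustrative assumptions, not real alerting rules.

type Health = "green" | "amber" | "red";

interface AppStats {
  app: string;      // application name
  requests: number; // requests seen in the time window
  errors: number;   // failed requests in the same window
}

/** Classify one application by its error rate. */
function classify(stats: AppStats, amberAt: number = 0.01, redAt: number = 0.05): Health {
  if (stats.requests === 0) return "amber"; // no traffic at all deserves a look
  const rate = stats.errors / stats.requests;
  if (rate >= redAt) return "red";
  if (rate >= amberAt) return "amber";
  return "green";
}

/** Build the overview, listing the unhealthiest applications first. */
function overview(all: AppStats[]): { app: string; health: Health }[] {
  const order: Record<Health, number> = { red: 0, amber: 1, green: 2 };
  return all
    .map((s) => ({ app: s.app, health: classify(s) }))
    .sort((a, b) => order[a.health] - order[b.health] || a.app.localeCompare(b.app));
}
```

Sorting the unhealthy applications to the top keeps a single screen useful even as the number of applications grows.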

Time to live

To improve the time to live for a code change we introduced continuous deployment for new codebases. This was made possible by a good automated regression test suite. Centralised libraries were also auto-upgraded during the build process, so that we were always deploying the latest versions.
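
The auto-upgrade step can be pictured as rewriting a codebase's dependency versions before the build runs. This is only a sketch under assumptions: the @acme package names in the usage note are invented, versions are assumed to be plain x.y.z strings, and the real build tooling is not shown here.

```typescript
// Hypothetical sketch: bump centralised libraries to their latest versions at build time.
// Assumes plain "major.minor.patch" versions with no ranges or pre-release tags.

type Dependencies = Record<string, string>;

/** Split "1.2.3" into [1, 2, 3]. */
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => parseInt(part, 10));
}

/** True when candidate is a strictly newer version than current. */
function isNewer(candidate: string, current: string): boolean {
  const a = parseVersion(candidate);
  const b = parseVersion(current);
  for (let i = 0; i < 3; i++) {
    const x = a[i] || 0; // missing or unparsable parts count as 0
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }
  return false;
}

/** Upgrade only libraries the codebase already uses; never add new ones or downgrade. */
function upgradeCentralLibraries(
  deps: Dependencies,
  latest: Dependencies
): { dependencies: Dependencies; upgraded: string[] } {
  const dependencies: Dependencies = { ...deps };
  const upgraded: string[] = [];
  for (const name of Object.keys(latest)) {
    const current = deps[name];
    const candidate = latest[name];
    if (current !== undefined && candidate !== undefined && isNewer(candidate, current)) {
      dependencies[name] = candidate;
      upgraded.push(name);
    }
  }
  return { dependencies, upgraded };
}
```

For example, a codebase on a hypothetical @acme/components 1.2.0 would be built against 1.3.1 if that were the latest release, while its other dependencies stay untouched. An approach like this relies on the automated regression suite to catch a bad library release before it reaches live.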

Conclusions

The main change was that code could be released faster and with less risk. This changed people's approach to committing and releasing code. It is now normal to commit and release small changes directly to live several times a day (usually 5-6 changes per application per day). These smaller changes have reduced the number of major incidents. In general it feels like we have more incidents, but with significantly less impact overall than before. These incidents are a lot faster to resolve, and most issues are found in non-live environments and never get to production.

Good

Faster builds.
Fewer red builds as the whole test suite is fast to run locally.
Usually only 10 minutes from a code change to a release.
Fully automated release.
Easy to roll back a bad release.
Larger features could be tested in more detail in non-live environments by turning off CD to live for a short period of time.
Mainly one squad per codebase.
Fewer major incidents.
Squads are more likely to try new features and to experiment.

Bad

Harder to release widely used components everywhere at one time, e.g. page headers or footers.
Harder to upgrade all codebases, but easier to upgrade individual codebases.

Follow me on Twitter @andyianriley
or see andyianriley on LinkedIn.