Traditional thinking in application development holds that there should always be at least three environments - dev, stage and prod(uction). Some teams and organisations use different labels, but the purpose of each environment is the same.
For over five years, my teams have used two environments to run over 100 microservices. These environments are strictly segregated so dev can’t access production data or services and vice versa. If a service integrates with an external service, it connects to a dedicated instance. This strict separation avoids the inevitable “whoops! [Name] just messed up [important data] in prod”.
You’re probably wondering how we test with only two environments.
Keeping things small is a good place to start. We build small microservices and make small incremental changes. These two factors reduce the scope of testing. Engineers write and test their code locally. When they’re confident that their changes work, they deploy to AWS using an automated pipeline. This deployment runs off a dev branch. The test suite still needs to pass for the deployment to complete.
Rather than being fed and nurtured like a pet, the dev environment is disposable. At any time an engineer can reset it by running a few git commands (git fetch && git reset --hard origin/main && git push origin dev --force). This means any drift is short lived.
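Those commands can be wrapped in a small helper so every engineer performs the same reset. A minimal sketch - the function name and the --dry-run flag are my own additions, and it assumes it is run from a local checkout of the dev branch:

```shell
# reset_dev: force the shared dev branch back to match main.
# Pass --dry-run to print the commands instead of executing them.
# Assumes a local checkout of the dev branch.
reset_dev() {
  if [ "${1:-}" = "--dry-run" ]; then
    run() { echo "$*"; }     # print each command, touch nothing
  else
    run() { "$@"; }          # actually execute it
  fi
  run git fetch &&
  run git reset --hard origin/main &&
  run git push origin dev --force
}

# Preview the reset without touching anything:
# reset_dev --dry-run
```

The dry-run mode is a cheap safety valve: force-pushing a shared branch is exactly the kind of command worth previewing first.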
Teams need to understand that a development environment will never be a perfect replica of production. Dev is a rough approximation: “close enough is good enough”. Past this point, chasing parity becomes a futile exercise with diminishing returns. It is impossible to catch every issue before releasing to production; the dev environment helps catch the low hanging fruit before users do. The harder bugs are always found when the code lands in production, whether you have one non production environment or ten.
This approach allows engineers to get fast feedback on their work as they go.
Adding a third environment increases complexity dramatically. Rather than having one stable environment - production - and a throwaway environment for testing - “dev” - you now need to shepherd and coordinate changes through a series of environments. This impacts everything - spending, pipelines, communication, maintenance, etc.
Adding a staging environment means replicating everything, including the pipelines, infrastructure, backing services and third party integrations. All the extra pieces need to be maintained. That gets expensive fast.
A two environment model allows all changes to move independently of each other. If there is an issue with a change, it can be backed out and another change can still move to production. When using three (or more) environments, changes move through a pipeline. The changes must move sequentially through the various environments. A problem with one piece of work in an environment can delay all other pending changes. This results in less frequent deployments and releases.
With three or more environments, branching and merging are no longer trivial operations. The branch that drives production is the only one that contains approved code, so all new branches should be cut from it. Over time, fixed environment branches diverge from each other, and engineers spend more and more time resolving merge conflicts in each one.
Work items abandoned before they make it to production are a significant contributor to drift. This drift devalues the results of any testing. Abandoned changes linger in lower environments. In a two environment model, resetting dev and removing cruft is a trivial operation. With more environments, resetting ceases to be something that happens on demand: multiple environments need to be reset, and any valid pending work then has to be merged back into the correct environment branches. Work slows down and communication overhead increases. Tracking down the change that breaks an environment is time consuming with several change sets in flight. Finding a suitable reset window becomes challenging when users are testing changes across environments. This overhead means resets are performed less frequently, and as the drift delta grows, so does the risk of changes failing in production.
Ephemeral environments are a great alternative to fixed non production environments. When an engineer starts work on a new branch, a new cloud hosted environment is spun up for them to build and test their work.
Supporting ephemeral environments requires upfront effort. Services don’t run in an isolated jail. They need to talk to data stores, third party APIs and other backing services, while emitting and receiving events. Building out tooling to provision and manage access to backing services and configuration takes effort. Infrastructure as code ensures environments can be spun up and torn down, but that’s often the easiest piece of the puzzle.
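To make the infrastructure-as-code piece concrete, here is one illustrative way to key an environment off a branch. Everything below is an assumption rather than a description of any particular setup: it presumes Terraform with one workspace per branch and a single env_name variable that the configuration uses to namespace resources.

```shell
# Illustrative sketch only: one ephemeral environment per branch,
# driven by Terraform workspaces. Names and variables are assumptions.

# Turn a branch name into a safe environment name:
# lower-case, alphanumerics and dashes only, capped at 20 characters.
env_name_for_branch() {
  echo "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g' | cut -c1-20
}

# Spin up (or update) the environment for a branch.
provision_env() {
  env="$(env_name_for_branch "$1")"
  terraform workspace select "$env" 2>/dev/null || terraform workspace new "$env"
  terraform apply -auto-approve -var "env_name=$env"
}

# Tear the environment down again when the branch is done.
teardown_env() {
  env="$(env_name_for_branch "$1")"
  terraform workspace select "$env" &&
  terraform destroy -auto-approve -var "env_name=$env" &&
  terraform workspace select default &&
  terraform workspace delete "$env"
}
```

The name-sanitising step matters more than it looks: branch names flow into DNS records, bucket names and tags, all of which have stricter character rules than git does.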
When done right, ephemeral environments can scale almost infinitely. Any manual configuration effort hampers such scaling.
Make sure your team isn’t getting comfortable in their ephemeral environments. Branches and their associated environments must be short lived. Include logic in your environment management tooling that tears down an environment when no code has been pushed to its branch for more than x days. Not only does this save on costs, it reduces the chance that a stale environment becomes a security risk.
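The teardown check itself is small. A sketch of the idle test, assuming the threshold (the “x days” above, a placeholder 7 here) and using a branch’s last commit timestamp as the signal:

```shell
# An environment is stale when its branch has had no pushes for
# MAX_IDLE_DAYS. The 7-day default is a placeholder for "x days".
MAX_IDLE_DAYS="${MAX_IDLE_DAYS:-7}"

# is_stale LAST_COMMIT_EPOCH NOW_EPOCH
# Succeeds (exit 0) when the branch has been idle past the threshold.
is_stale() {
  idle_secs=$(( $2 - $1 ))
  [ "$idle_secs" -gt $(( MAX_IDLE_DAYS * 86400 )) ]
}

# A scheduled job could then walk every branch that owns an environment:
#   last=$(git log -1 --format=%ct "origin/$branch")
#   if is_stale "$last" "$(date +%s)"; then
#     # tear down the branch's environment here
#   fi
```

Running this from a scheduler rather than on demand is what keeps the guarantee honest: nobody has to remember to clean up.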
In an ideal world, a collection of microservices would be very loosely coupled. In reality there is always some coupling between services, which creates dependencies between environments too.
The more environments you have, the more backing services you need. The more connections that need to be maintained, the higher the communication overhead. With more than two environments, it is too much to expect each team member to do their bit to help maintain the environments. Instead you end up with at least one team member whose full time job is managing environments. This is busy work. Reduce the number of environments and increase the flow of work. 🌊