Simulation Driven Development
How simulation can drive development, product discovery, and operations planning, and how discrete event simulation supports simulation best practices.
It is vital to build iteratively when developing software products and systems, getting real feedback as quickly as possible. But when your system has significant real-world moving parts, it can be complex to reason about your change's impact and challenging to get timely feedback.
Long feedback cycle
The impact of software defects on real-world systems can be expensive, for example, causing physical resources to be used or wasted. A common mitigation is a "wave and bake" rollout process that limits the blast radius. But when real-world processes are involved, building confidence in a change might require a long bake period, limiting the speed at which you can gather feedback on your change's impact across different contexts and slowing down learning and iteration.
Low discrimination
The real world is messy, and your change is unlikely to be the only thing that has changed. An extended rollout process may force you to roll out multiple changes concurrently, and real-world variables will inevitably shift during the rollout and evaluation period. You are also unlikely to have a consistent baseline across changes. These factors limit your ability to isolate a single change and to discriminate between the impact caused by your changes and the impact of the changing environment. When the expected impact of a change is small, as it usually is when making small iterative improvements, framing your change as an experiment with sufficient statistical power and significance becomes especially challenging.
Simulation Driven Development is an approach to working around these challenges, bringing isolation and repeatability for validating your changes, increasing your change delivery rate, and improving your work prioritisation.
Introducing Simulation-Driven Development
A simulation runner combines an executable simulation of a real-world system, sometimes called a Digital Twin, with the logic code used in your production systems to enable Simulation-Driven Development.
A simulation runner allows you to simulate the interactions of your software system with the real world without actually interacting with the real world. In particular, it can provide indications of how the software system will impact the real world and what changes in the real world will result from changes in the software system. And it can do so across many possible scenarios in accelerated time: the fact that your real-world business process takes days no longer means you need to wait days for feedback.
In this way, you can drastically shorten the feedback cycle by putting your changes through the simulation runner. You can even run the simulation against your changes before they are complete, using simulation as a tool to prototype solutions and to guide your development, just like Test-Driven Development. This rapid feedback lets you make micro-pivots in your approach, reducing wasteful implementation that takes you down dead-ends.
When changes are ready for deployment, you can assess the likely impact before deployment. Defects can be detected using scenario tests running thousands of simulated hours and fixed before impacting operations.
Standardised, repeatable simulation scenarios allow you to benchmark your system in a way that is independent of changing real-world situations (though ensure you keep up-to-date scenarios to avoid optimising for a context that doesn't exist anymore). These scenarios let you separate improvements to your software system over time from improvements to real-world metrics caused by operational changes outside the software system.
Simulation-driven development goes beyond creating models of the real-world system and of proposed changes: by plugging in your production code, it gives a much higher level of fidelity and resolution while speeding up your development and reducing defects.
Best practices for Simulation-Driven Development
Deterministic and reproducible
Simulations should be deterministic and reproducible: they can be run repeatedly with a given input, and the simulation will generate the same output. Reproducibility allows you to reason about the impact of changes in inputs or logic.
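As a minimal sketch (in Python, with illustrative names), a simulation component can own an explicitly seeded pseudo-random generator so that the same input always produces the same trace:

```python
import random

def simulate_arrivals(seed: int, n: int) -> list:
    """Generate n pseudo-random inter-arrival times from an explicit seed."""
    rng = random.Random(seed)  # isolated, seeded generator: no shared global state
    return [round(rng.expovariate(1.0), 4) for _ in range(n)]

# The same seed always yields the same trace, so runs are reproducible,
# and changing the seed gives an independent scenario.
assert simulate_arrivals(seed=42, n=5) == simulate_arrivals(seed=42, n=5)
```

Giving each component its own generator, rather than sharing a global one, keeps runs reproducible even when components are added or reordered.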
Versioning
All simulation components, including the inputs, the simulation runner, and the core logic modules, should be versioned. The version of the outputs/results is then a combination of all the other versions. Versioning provides reproducibility and the ability to trace changes in output back to specific changes in inputs.
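One way to implement this, sketched here with hypothetical version labels, is to derive the result version deterministically from the versions of every other component:

```python
import hashlib
import json

def result_version(input_version: str, runner_version: str, logic_versions: dict) -> str:
    """Derive a deterministic output version from the versions of all components."""
    payload = json.dumps(
        {"inputs": input_version, "runner": runner_version, "logic": logic_versions},
        sort_keys=True,  # stable key ordering so the hash is reproducible
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Any component change yields a new result version.
v1 = result_version("inputs-v3", "runner-1.4.0", {"optimiser": "2.1.0"})
v2 = result_version("inputs-v3", "runner-1.4.0", {"optimiser": "2.1.1"})
assert v1 != v2
```

Stamping each stored result with such a version makes it trivial to answer "which inputs and which logic produced this output?" later.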
Suitable fidelity models of the real world
Models of the real-world systems outside your software system need to be sufficiently accurate for the simulation to give realistic and useful results. Evaluating the fidelity of your overall simulation against reality is vital for building confidence that your simulation isn't misleading your efforts.
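A lightweight way to evaluate fidelity, sketched below with made-up numbers, is to replay a historical period through the simulation and compare key metrics against what really happened:

```python
def fidelity_error(simulated: list, observed: list) -> float:
    """Mean absolute percentage gap between simulated and real-world outcomes."""
    gaps = [abs(s - o) / o for s, o in zip(simulated, observed)]
    return sum(gaps) / len(gaps)

# Replay a historical day through the simulation, then compare a key metric
# (here, a hypothetical deliveries-per-route figure) against reality.
sim_deliveries_per_route = [11.8, 12.4, 10.9]
real_deliveries_per_route = [12.0, 12.0, 11.0]
error = fidelity_error(sim_deliveries_per_route, real_deliveries_per_route)
assert error < 0.05  # e.g. trust the model while drift stays under 5%
```

Tracking this error over time also warns you when the real world has drifted away from the model.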
Discrete event simulation
Discrete event simulation (DES) is particularly well-suited to creating deterministic and reproducible behaviour for simulation-driven development. DES models actions as a discrete sequence of events in time: a scheduler determines the specific points in time at which actions will occur, and a time provider coordinates the different components in the simulation. No actions occur in the system between consecutive events, so the system jumps from one event to the next without waiting in real time. Hence discrete event simulations can also run faster than real time.
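The mechanics can be sketched in a few lines of Python: a clock plus a priority queue of timestamped actions, with the clock jumping straight to each event's time (the names here are illustrative, not from any particular framework):

```python
import heapq

class Simulator:
    """Minimal discrete event simulator: a clock plus an ordered event queue."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so simultaneous events stay deterministic

    def schedule(self, delay: float, action):
        """Schedule an action to fire `delay` simulated time units from now."""
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        # Jump straight from one event to the next: no real-time waiting.
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)

log = []
sim = Simulator()
sim.schedule(10.0, lambda s: log.append(("depart", s.now)))
sim.schedule(2.5, lambda s: log.append(("arrive", s.now)))
sim.run()
# Events fire in simulated-time order, regardless of scheduling order.
```

The sequence counter matters: without it, two events at the same timestamp could be popped in an arbitrary order, breaking determinism.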
When using DES, we must use the scheduler and time provider throughout the code, including the core logic modules, and avoid non-deterministic operations such as calls to unseeded pseudo-random number generators (use a static seed) and asynchronous tasks. A side benefit of avoiding non-deterministic operations is that tests also become reproducible.
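For example, core logic can take a time provider as a parameter instead of reading the wall clock; the sketch below uses hypothetical names:

```python
import datetime

class SimClock:
    """Time provider controlled by the simulation scheduler, not the wall clock."""

    def __init__(self, start: datetime.datetime):
        self.current = start

    def now(self) -> datetime.datetime:
        return self.current

    def advance(self, seconds: float):
        self.current += datetime.timedelta(seconds=seconds)

def is_delivery_late(promised_at: datetime.datetime, clock: SimClock) -> bool:
    """Core logic asks the injected clock for the time, never datetime.now()."""
    return clock.now() > promised_at

clock = SimClock(datetime.datetime(2024, 1, 1, 9, 0))
promised = datetime.datetime(2024, 1, 1, 9, 30)
assert not is_delivery_late(promised, clock)
clock.advance(3600)  # the simulation jumps an hour forward instantly
assert is_delivery_late(promised, clock)
```

In production, the same logic receives a clock backed by real time; in the simulation, the scheduler drives it, so simulated hours pass in milliseconds.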
Execution delays can be modelled using seeded pseudo-random number generators. Varying the seed allows you to perform Monte Carlo scans to discover edge cases, test for concurrency issues, and estimate the statistical distribution and variability of outputs with respect to inputs.
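A Monte Carlo scan over seeds might look like the following toy sketch, where each seed gives one reproducible run and the collection of runs estimates the output distribution:

```python
import random
import statistics

def simulated_duration(seed: int) -> float:
    """One simulated run: execution delays drawn from a seeded generator."""
    rng = random.Random(seed)
    # Stand-in workload: 20 steps with normally distributed delays (mean 10, sd 2).
    return sum(rng.gauss(10.0, 2.0) for _ in range(20))

# Scan over seeds to estimate the output distribution of the same scenario.
durations = [simulated_duration(seed) for seed in range(100)]
mean = statistics.mean(durations)
spread = statistics.stdev(durations)
```

Because each run is keyed by its seed, any outlier found in the scan can be replayed exactly for debugging.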
We can also use memoised versions of relevant modules to speed up execution.
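In Python this can be as simple as `functools.lru_cache` on a pure, deterministic module function (the function below is a stand-in, not real prediction logic):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def drive_time(origin: str, destination: str) -> float:
    """Expensive, pure prediction; memoised so repeated queries are instant.

    Caching is only safe because the function is deterministic: the same
    arguments always produce the same result."""
    return 5.0 + 2.0 * (len(origin) + len(destination))  # stand-in for a costly model

first = drive_time("depot", "customer-17")
second = drive_time("depot", "customer-17")  # served from the cache
assert first == second
```

Determinism is what makes this optimisation safe: memoising a non-deterministic function would silently change simulation behaviour.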
Disadvantages of Discrete Event Simulation
A downside of discrete event simulation is that production code must follow certain standards to be incorporated into the simulation, including avoiding non-deterministic operations and using the scheduler and time provider. The benefits typically outweigh these disadvantages.
Another downside is that the simulation doesn't cover the whole codebase, and many 3rd-party libraries cannot be used in the core logic code. Some production system logic lives outside the core logic of applications, and some even exists between applications, in the non-deterministic ways distributed systems can interact. However, we minimise these impacts by keeping as much logic as possible in composable core logic modules.
Discrete event simulation for a microservice architecture
Readying core logic for simulation
In a microservice architecture, we package our core logic alongside other code, for example, code to access a database dependency or at least code to receive requests over an external communication protocol such as HTTP. To run core logic inside a simulation runner, we must decouple the logic code from these production concerns.
By separating the core logic from the other production concerns, we can plug it into a simulation runner. Making the core logic a pure function without outside dependencies, pushing any required outside data in as function parameters and receiving any output data as function return values, minimises the surface area of the interface and avoids exposing internal production implementation details. Where a pure function is not possible, we can use dependency injection instead. With dependency injection, we can inject doubles of the dependencies when running the simulation, for example, replacing a database dependency with a local file data source. The simulation runner can then use these simulation doubles.
We can then export this core logic as an independent code module with a clean language-level interface.
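As an illustrative Python sketch (the order and routing logic here is invented), core logic with an injected data dependency can accept a simulation double in place of a real database client:

```python
class FileDataSource:
    """Simulation double for a database dependency, backed by local data."""

    def __init__(self, records: dict):
        self._records = records

    def get(self, key: str) -> dict:
        return self._records[key]

def assign_route(order_id: str, data_source) -> str:
    """Core logic with its data dependency injected, so the simulation runner
    can supply a double instead of a real database client."""
    order = data_source.get(order_id)
    return "express" if order["priority"] > 5 else "standard"

# In the simulation runner, a local double stands in for the database.
double = FileDataSource({"order-1": {"priority": 8}, "order-2": {"priority": 2}})
assert assign_route("order-1", double) == "express"
assert assign_route("order-2", double) == "standard"
```

Because `assign_route` only depends on the narrow `get` interface, the production database client and the simulation double are interchangeable.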
Building a simulation runner
A simulation runner for a microservice architecture ties together all the different domains and microservices into a single simulation that can analyse the impact of changes in one part of the wider system on the output of the wider system.
Take, for example, a route scheduling system. At its core is a schedule optimiser that searches for the best solution to a scheduling problem, deciding which deliveries go on which routes. Such a scheduling problem trades off many dimensions of what a good route schedule looks like, for example, drive time, on-time deliveries, and vehicle capacity, while meeting delivery windows. A schedule optimiser requires many data inputs, including, for example, drive time predictions. Because minimising the drive time on the schedule isn't the only goal, it isn't obvious how an improvement to the accuracy or reliability of drive time predictions will affect on-time deliveries or even total drive time. By plugging prospective changes to the drive time prediction logic into a simulation of the end-to-end system, we can simulate the impact and use this insight to steer implementation and rollout.
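The idea can be illustrated with a deliberately toy sketch: swap a candidate drive time predictor into the same simulated scenario and compare a macro-level metric such as the on-time rate (both predictors and all numbers below are invented):

```python
def baseline_predictor(distance_km: float) -> float:
    """Current production stand-in: flat 40 km/h, result in minutes."""
    return distance_km / 40.0 * 60.0

def candidate_predictor(distance_km: float) -> float:
    """Hypothetical alternative model under evaluation."""
    return distance_km / 40.0 * 60.0 * 1.1

def simulate_on_time_rate(predictor, legs, windows) -> float:
    """Toy end-to-end measure: fraction of legs predicted to fit their window."""
    on_time = sum(1 for d, w in zip(legs, windows) if predictor(d) <= w)
    return on_time / len(legs)

legs = [10.0, 25.0, 40.0]        # leg distances in km
windows = [20.0, 40.0, 60.0]     # delivery windows in minutes
base = simulate_on_time_rate(baseline_predictor, legs, windows)
cand = simulate_on_time_rate(candidate_predictor, legs, windows)
# Comparing base and cand on identical scenarios isolates the predictor change.
```

In a real simulation runner, the predictors would be the actual production modules and the scenario would span thousands of simulated routes, but the comparison structure is the same.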
The simulation runner joins together the logic from many microservices and can eliminate the overheads of inter-service communication by replacing it with in-process or inter-process communication within a single machine. The simulation runner imports core logic modules into the framework, using the dependency management tool at compile time or the class loader at runtime. We enable easy portability of production code into the simulation by mirroring the production system in terms of how logic modules connect to each other.
Pulling the logic into a monolithic simulation runner avoids needing to instrument a distributed system for simulation purposes, needing to build deterministic logic throughout the entirety of the distributed system, and coordinating discrete events across the distributed system.
Simulation-driven product discovery
Once you have built a framework for simulation-driven development, its usefulness extends beyond just development. A simulation framework greatly benefits product discovery, supporting rapid validation and informed prioritisation.
Size of the prize
By plugging in mock logic modules that achieve perfect performance, for example using historical data, you can size how well the system would work if you could implement a perfect version of the production logic. In practice, you won't be able to get to perfect, but you can use this to identify promising areas for further investigation and investment.
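For instance, a "perfect" mock predictor can simply replay historical actuals, and the gap between its error and the production model's error bounds the size of the prize (the models and numbers below are invented):

```python
def production_eta_predictor(leg: dict) -> float:
    """Stand-in for the current model: systematically over-estimates by 20%."""
    return leg["actual_minutes"] * 1.2

def perfect_eta_predictor(leg: dict) -> float:
    """'Size of the prize' mock: replays the historical actual as the prediction."""
    return leg["actual_minutes"]

def mean_abs_error(predictor, history: list) -> float:
    return sum(abs(predictor(l) - l["actual_minutes"]) for l in history) / len(history)

history = [{"actual_minutes": m} for m in (30.0, 45.0, 60.0)]
gap = (mean_abs_error(production_eta_predictor, history)
       - mean_abs_error(perfect_eta_predictor, history))
# 'gap' bounds how much prediction error a better model could ever remove.
```

If the gap, propagated through the full simulation, barely moves the macro metrics, that's a strong signal to invest elsewhere.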
Sensitivity analysis
Sensitivity analysis is a technique to see how much an output factor varies with respect to a change in an input factor. For example, how much does route efficiency increase when I increase the reliability of my drive time predictions? By implementing mock modules that achieve slightly better performance than your current production version, you can measure how much a marginal improvement to the logic will affect your macro-level metrics. Again this helps you prioritise the work with the highest cost of delay.
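Numerically, this is a finite-difference estimate of the output's gradient with respect to one input factor; the response curve below is a made-up stand-in for a full simulation run:

```python
def route_efficiency(prediction_error: float) -> float:
    """Toy response curve: efficiency drops as prediction error grows.
    In practice this would be a full simulation run, not a formula."""
    return 100.0 - 8.0 * prediction_error

def sensitivity(metric, x: float, dx: float) -> float:
    """Finite-difference sensitivity of an output metric to one input factor."""
    return (metric(x + dx) - metric(x)) / dx

s = sensitivity(route_efficiency, x=2.0, dx=0.1)
# s estimates how much efficiency changes per unit of prediction error.
```

Running this for each candidate improvement gives a rough ranking of which input factor moves the macro metrics most per unit of effort.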
Prototyping
Quick prototypes and proofs-of-concept can be implemented and tested before committing further effort into productionising the change. Knowing more about the likely value informs prioritisation and roadmap decisions, ensuring you are working on high-value delivery. Prototyping is especially valuable when a full implementation would be costly, and you have broad uncertainty on the benefits.
Simulation-driven operations planning
Taking the simulation framework one step further by allowing self-service usage by operations managers via a UI leverages the investment even more. Users can simulate operational changes without costing real-world resources, quickly iterating on operational improvements. Sensitivity analysis can be done rapidly along many dimensions, global optima can be found more quickly by trying changes that would be risky in real operations, and the Pareto frontier can be mapped.
Productising the simulation by making it self-service is a significant investment, but an investment that can drive operational excellence and efficiency. Self-service also makes using the simulation in the discovery process easier.
In this article, we've seen how simulation-driven development can improve the pace of your development and the measurement of outcomes, and how you can apply a productised, self-service simulation framework to support product discovery and real-world operations planning. Let me know what you think in the comments! If you have any topic requests or suggestions, please drop these in the comments too.
