How Paddle uses chaos engineering practices to identify system weaknesses, improve fault tolerance, and enhance the resilience of their microservices architecture.
Building digital software that’s distributed at massive scale comes with layers of complexity, some of which can cause failures and some of which cause disruption.
While we can’t control or avoid every failure in engineering, we can control a failure’s blast radius and optimize the time it takes to recover and restore our systems. How? By purposefully triggering as many failures as possible to build confidence in the system’s resilience.
Chaos testing is key to our engineering practice at Paddle, so we wanted to share some of the best practices we use, the experiences we’ve had, and why chaos testing should become part of your routine.
What is chaos engineering?
Chaos engineering is the practice of intentionally introducing controlled chaos into a system, with the goal of identifying weaknesses and increasing the system’s resilience.
In a microservices architecture, the interactions between services can produce unexpected scenarios even when each service is working properly, and those scenarios can turn into production issues.
The diagram below shows an example of running chaos engineering in a system with different microservices.
```mermaid
graph TB
    subgraph chaos ["Chaos engineering test"]
        A2[Checkout service ❓] -.-> C2[Customer service ❌]
        B2[Invoice service ❓] -.-> C2
        style C2 fill:#ff9999
        style A2 stroke:#666,stroke-dasharray: 5 5
        style B2 stroke:#666,stroke-dasharray: 5 5
    end
    subgraph stable ["Stable scenario"]
        A1[Checkout service] --> C1[Customer service]
        B1[Invoice service] --> C1
    end
    classDef stableService fill:#e1f5fe
    classDef failed fill:#ffebee
    class A1,B1,C1 stableService
    class C2 failed
```
Let’s imagine a system where two services depend on a third. In this case, Checkout and Invoice are two independent services that call Customer service. An example of running chaos engineering in this system would be making Customer service unavailable.
During chaos engineering tests, we want to understand how Checkout and Invoice services will behave if Customer service is down.
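To make this concrete, here is a minimal sketch in Go of how a caller such as Checkout might fail fast and surface a clear error when Customer service is unavailable, instead of hanging on the dead dependency. The endpoint, timeout value, and function names are illustrative, not our actual code:

```go
package checkout

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// ErrCustomerUnavailable lets callers of the checkout flow distinguish
// "the dependency is down" from other failures.
var ErrCustomerUnavailable = errors.New("customer service unavailable")

// fetchCustomer calls the (hypothetical) Customer service with a short,
// explicit timeout so Checkout fails fast when the dependency is down.
func fetchCustomer(ctx context.Context, baseURL, customerID string) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/customers/"+customerID, nil)
	if err != nil {
		return nil, err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Network error or timeout: surface a typed error the caller can act on.
		return nil, fmt.Errorf("%w: %v", ErrCustomerUnavailable, err)
	}
	if resp.StatusCode >= 500 {
		resp.Body.Close()
		return nil, ErrCustomerUnavailable
	}
	return resp, nil
}
```

A chaos experiment on this flow checks exactly this behavior: does Checkout return a clear, bounded error, or does it hang or crash?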
Considerations before you get started
As an engineering team, we have a responsibility to test our systems, understand their behavior, and increase the resilience of the production environment.
Before you get started, you need to identify all the dependencies between your services. This will help you understand how they interact and pinpoint where in your system you can introduce chaos.
While chaos engineering testing is a generic concept, the scenarios you run will depend on your system and priorities. For example, they might cover downtime, network failures, malformed responses, timeouts, cloud provider outages, or other factors.
A more complex scenario might involve a system that calls an external service, where the chaos engineering test checks how the system behaves if the external service returns timeouts.
```mermaid
---
title: Stable scenario
---
graph LR
    A[Checkout service] --> B[Customer service]
    B --> C[Tax service]
    C --> D[External service]
    classDef stable fill:#e1f5fe,stroke:#0277bd
    class A,B,C,D stable
```
```mermaid
---
title: Chaos engineering test
---
graph LR
    A[Checkout service ❓] -.-> B[Customer service ❓]
    B -.-> C[Tax service ❌]
    C --x D[External service ❌]
    classDef uncertain fill:#fff3e0,stroke:#f57f17
    classDef failed fill:#ffebee,stroke:#d32f2f
    class A,B uncertain
    class C,D failed
```
In our Paddle example, to simulate this scenario and run the tests, we need to update Tax service to simulate a timeout from the external service. Once we run the experiment, we need to investigate how Checkout and Customer service respond. Are the services down? What timeout do their clients use? How many retries do the clients make?
Chaos engineering tests help us answer these questions and improve the resilience of our systems.
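These questions are much easier to answer when the timeout and retry policy are explicit in the client code. The sketch below, in Go with illustrative values rather than our actual configuration, shows the kind of client settings a chaos experiment like this would exercise:

```go
package client

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// callWithRetry makes the client-side timeout and retry policy explicit,
// so a chaos experiment can verify the values rather than guess them.
func callWithRetry(ctx context.Context, url string) (*http.Response, error) {
	client := &http.Client{Timeout: 3 * time.Second} // per-attempt timeout
	const maxAttempts = 3

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("upstream returned %d", resp.StatusCode)
		}
		lastErr = err
		time.Sleep(time.Duration(attempt) * 200 * time.Millisecond) // simple backoff
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}
```

With the policy written down like this, the experiment can confirm that retries are bounded and that the caller gives up within a predictable amount of time.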
Running a chaos engineering test
Chaos testing shouldn’t be a one-off exercise: run testing rounds periodically. Your engineers should constantly be forming new hypotheses to test, to keep improving the overall performance of your systems.
In the following diagram we show the different steps to follow:
```mermaid
---
title: Chaos engineering test steps
---
graph TD
    A["<b>Steady state</b><br/>Identify service dependencies when the system is stable"] --> B["<b>Hypothesis</b><br/>Create scenarios for potential issues and system impacts"]
    B --> C["<b>Run experiment</b><br/>Execute in a controlled environment with proper logging"]
    C --> D["<b>Verify</b><br/>Check logs and compare results to hypotheses"]
    D --> E["<b>Improve</b><br/>Create an improvement plan and prioritize fixes"]
    E --> A
```
Steady state
This is the point in time when your services are stable because the variables that affect them are constant or unchanging. Now is the time to identify all the dependencies between your services.
Hypothesis
After identifying the dependencies between your services, create hypotheses about where you think an unplanned state could cause an issue for a service.
When you work to prepare these hypotheses, you need to have a global vision of all your services, as the unexpected behavior of one service might impact other parts of your system.
Once you identify where and how you want to introduce the chaos into your system, consider all the potential behaviors given that specific scenario.
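One lightweight way to make a hypothesis concrete is to write it down in a structured form before the testing round, so you can compare the observed behavior against it later. The Go sketch below is purely illustrative; the fields and example values are hypothetical:

```go
package experiments

// Experiment records a hypothesis before a testing round, so the team can
// compare observed behavior against it afterwards.
type Experiment struct {
	Name        string   // e.g. "Customer service downtime"
	FaultType   string   // "downtime", "timeout", "malformed response", ...
	Target      string   // service where the chaos is injected
	Hypothesis  string   // expected behavior of the dependent services
	BlastRadius []string // services that could be affected
}

// Example entry for the scenario discussed above (values are illustrative).
var customerDowntime = Experiment{
	Name:        "Customer service downtime",
	FaultType:   "downtime",
	Target:      "customer-service",
	Hypothesis:  "Checkout and Invoice return a clear error within 2s and do not retry indefinitely",
	BlastRadius: []string{"checkout-service", "invoice-service"},
}
```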
Run the experiment
Prepare the systems and the code to run the planned experiment. It is very important to run the experiment in a controlled environment, and you should notify all users of the environment about this testing.
The most common environments to run the experiment in are development and staging. Remember to add logging to your services so you can debug and verify the behavior of the systems afterwards.
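As an illustration of what this preparation can look like (not our actual tooling), fault injection can be as simple as an HTTP middleware that is only switched on through environment variables in the test environment and logs every fault it injects. The environment variable names and values here are hypothetical:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"os"
	"strconv"
	"time"
)

// chaosMiddleware injects latency and errors in front of a real handler.
// It is driven by environment variables so it can be enabled only in a
// controlled environment, and it logs every injected fault for later review.
func chaosMiddleware(next http.Handler) http.Handler {
	rate, _ := strconv.ParseFloat(os.Getenv("CHAOS_ERROR_RATE"), 64) // e.g. "0.5"
	delayMS, _ := strconv.Atoi(os.Getenv("CHAOS_DELAY_MS"))          // e.g. "3000"

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if delayMS > 0 {
			log.Printf("chaos: delaying %s for %dms", r.URL.Path, delayMS)
			time.Sleep(time.Duration(delayMS) * time.Millisecond)
		}
		if rate > 0 && rand.Float64() < rate {
			log.Printf("chaos: injecting 503 for %s", r.URL.Path)
			http.Error(w, "injected failure", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", chaosMiddleware(mux)))
}
```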
Verify
Once you finish the experiment, check the logs of all the services involved to see how your systems behaved. Compare the results with the hypotheses you put forward at the beginning.
Improve
Because you introduced chaos into the system, you might observe behavior that differs from what you expected.
In this case, put together an improvement plan for your systems, and discuss with your team how to prioritize each improvement.
How to organize testing rounds
Because chaos testing is crucial to helping us identify potential errors, we began running these tests early on at Paddle. Testing spans all of Paddle Billing’s microservices and involves every engineering team.
To run the tests properly, each engineering team nominated an owner to prepare their own microservices for testing. The owner also prepared the different scenarios and the changes in the code required to run the tests.
This way, we got all the engineering teams involved in testing all our planned scenarios. As each owner was present during the testing rounds, this exercise also improved the understanding teams had of other services.
During the tests we focused on different scenarios, specifically downtime of a specific service, timeouts, and malformed responses.
These scenarios help us test in detail how the system behaves when a call times out or another service is not available.
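For the malformed-response scenario, the chaos can even live in an ordinary test: the sketch below stubs a dependency with Go’s net/http/httptest, returns truncated JSON, and checks that the caller fails cleanly. The response shape and names are hypothetical:

```go
package checkout

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// customer is a hypothetical response shape used only for this test.
type customer struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// TestMalformedCustomerResponse simulates the "malformed response" scenario:
// a stub stands in for the dependency and returns truncated JSON, and the
// test asserts the caller surfaces a decode error instead of silently
// accepting a half-parsed customer.
func TestMalformedCustomerResponse(t *testing.T) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"id": "cus_123", "name": `)) // deliberately truncated
	}))
	defer stub.Close()

	resp, err := http.Get(stub.URL + "/customers/cus_123")
	if err != nil {
		t.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	var c customer
	if err := json.NewDecoder(resp.Body).Decode(&c); err == nil {
		t.Fatal("expected a decode error for the malformed response, got nil")
	}
}
```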
With this strategy, we successfully tested all the microservices and planned scenarios, involving ten different engineering teams.
The upside of embracing chaos
Chaos engineering tests give us an approach to building and maintaining robust, reliable systems for Paddle Billing’s microservices, ultimately reducing the risk of failures and enhancing Paddle’s resilience.
From our work at Paddle, here are some of the benefits of running chaos testing rounds:
- Identifying unexpected behaviors: This practice gives us an opportunity to learn, in a controlled environment, how our systems behave in unexpected scenarios.
- Enhanced fault tolerance: With a better understanding of how the system behaves in different failure conditions, engineers can take steps to make it more fault-tolerant and resilient.
- Early detection of system weaknesses: During these tests the system is not operating under normal conditions, which allows us to identify weaknesses early and prevent future issues or outages.
- Reduced downtime: As the team continues to identify and learn from potential failure scenarios, we can improve incident response times.
- Increased collaboration between teams: As these tests are executed across various parts of our systems, it helps to increase the expertise and collaboration between the engineering teams.
The next phase of chaos engineering at Paddle
As we are constantly improving and adding new features to Paddle Billing, we want to empower each engineering team to run their own chaos engineering tests, giving more independence and agility to the teams.
We’re creating an internal package that all the microservices will use (a sketch of the idea follows the list below), with these benefits:
- No team needs to implement chaos engineering tests in their own way.
- Since all microservices will use the same package, all engineers will understand how it works and can run testing rounds without depending on other teams.
- In the future we can automate chaos engineering tests across all microservices in the same way, since all microservices will implement the same package.
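To give a flavor of the idea, here is a rough sketch in Go of the kind of shared interface such a package could expose. The package name, fault types, and environment variables are illustrative and not the real implementation:

```go
// Package chaos sketches a shared fault-injection package that every
// microservice could wire into its HTTP stack (illustrative only).
package chaos

import (
	"net/http"
	"os"
	"time"
)

// Fault describes one kind of chaos a service can opt into.
type Fault struct {
	Name    string
	Enabled func() bool                              // e.g. read a feature flag or env var
	Inject  func(w http.ResponseWriter) (handled bool)
}

// Downtime answers every request with 503 while the given environment
// variable is set, simulating the service being down.
func Downtime(envVar string) Fault {
	return Fault{
		Name:    "downtime",
		Enabled: func() bool { return os.Getenv(envVar) != "" },
		Inject: func(w http.ResponseWriter) bool {
			http.Error(w, "chaos: simulated downtime", http.StatusServiceUnavailable)
			return true
		},
	}
}

// Latency delays every request by d, simulating timeouts in downstream callers.
func Latency(envVar string, d time.Duration) Fault {
	return Fault{
		Name:    "latency",
		Enabled: func() bool { return os.Getenv(envVar) != "" },
		Inject: func(w http.ResponseWriter) bool {
			time.Sleep(d)
			return false // continue to the real handler after the delay
		},
	}
}

// Middleware applies the enabled faults in order before calling the real handler.
func Middleware(next http.Handler, faults ...Fault) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, f := range faults {
			if f.Enabled() && f.Inject(w) {
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}
```

A service would then wrap its router once, for example `chaos.Middleware(mux, chaos.Downtime("CHAOS_DOWNTIME"), chaos.Latency("CHAOS_LATENCY", 3*time.Second))`, and a testing round would only need to flip environment variables in the test environment.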
Since we began building Paddle Billing, our team has consistently implemented chaos engineering tests to our systems. Today, it’s a process everyone is familiar with and a key part of our team’s culture.
About Jordi Pallarés
Jordi Pallarés is an Engineering Manager in the Web2App team at Paddle, building the next generation of payment experiences on mobile.