Designing for Downtime: Understanding the Cranky Path

As a self-taught, long-time application developer, I’ve learned many hard lessons. Most are common to us all: back up the database before dropping tables, ground yourself before messing with your server’s motherboard, review infinite loops that might spam the CEO, etc. Recently, I was reminded of one of the toughest lessons to learn: network calls will inevitably fail. Whether it’s Skyping with a big client, a database call to write a new customer record, or an API call to a third-party service, it is critical to prepare for transient errors and downtime. I’ve learned that the best way to keep customers happy is to design for downtime from a user experience perspective. Check out the last section for more information about an experimental feature in our QuickBooks SDK that will give you a head start.

Understanding the Cranky Path

When designing for downtime, the first step is to fully understand the user experience. We like to think of each step of the user experience in terms of the happy path, where everything is beautiful: the user clicks the right buttons (in the right order) and ends up on a congratulatory page, celebrating their delightful experience. Regrettably, the happy path is the easy part. Designing for the cranky path is what protects our users from a frustrating experience, one in which they lose sight of the greater benefit of the application and ultimately dismiss our countless hours of work and neat lines of code.

Understanding the cranky path lets us prioritize application fixes, determine retry policies, set user expectations and develop thoughtful messaging. Here are a few of the methods that I hope will help you design for the cranky path and keep customers delighted.

Idempotent Requests

Idempotency, or however you pronounce it, sounds daunting, but it is very important, especially when dealing with accounting data. Request idempotency ensures that repeating a request to the QuickBooks Online service will not create a duplicate transaction and leave the data inaccurate. When making a service call, you must consider that a failure may occur after an underlying service has successfully executed the request. In that case the application cannot tell whether the write happened, so a blind retry may duplicate it and giving up may abandon it. Without request idempotency, the application is unable to guarantee data accuracy, and risks duplicate or abandoned transactions. For a more detailed description, my colleague, Sridhar Kalaga, recently wrote an excellent post on the subject here. Consider idempotency a prerequisite to any and all retry policies.
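To make the idea concrete, here is a minimal sketch of request idempotency. It does not call the real QuickBooks Online service: FakeLedgerService, create_invoice and new_request_id are hypothetical names invented for illustration, and the in-memory dictionary stands in for whatever the real service does to recognize a repeated request.

```python
import uuid


class FakeLedgerService:
    """Hypothetical in-memory stand-in for a remote accounting service."""

    def __init__(self):
        self._responses_by_request_id = {}
        self.invoices = []

    def create_invoice(self, request_id, amount):
        # A repeated request ID returns the stored response instead of
        # writing a second invoice.
        if request_id in self._responses_by_request_id:
            return self._responses_by_request_id[request_id]
        invoice = {"id": len(self.invoices) + 1, "amount": amount}
        self.invoices.append(invoice)
        self._responses_by_request_id[request_id] = invoice
        return invoice


def new_request_id():
    # Generate the ID once per logical operation and reuse it on every retry.
    return str(uuid.uuid4())


service = FakeLedgerService()
request_id = new_request_id()
first = service.create_invoice(request_id, 125.00)
# Simulate a retry after the response to the first call was lost in transit.
second = service.create_invoice(request_id, 125.00)
```

The important design choice is that the client creates the request ID before the first attempt and keeps it for every retry of that same operation. Generating a fresh ID per attempt would defeat the purpose.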

Retry Policies

Starting from the user’s perspective, we can pick a retry policy for each particular case. Fixed, incremental and exponential policies are common retry algorithms that define when the application should retry the request to the service. If the user is performing an action and expects an immediate response, a fixed, one-time retry, or no retry at all, may be most suitable. In this case, presenting the user with a thoughtful error message would likely make the difference in the user experience. However, if the application is running an unattended sync, longer intervals between retries are acceptable and increase the likelihood of a successful retry. This can be implemented with policies such as an incremental increase or exponential backoff. In this particular case, the goal of the application should be to finish the unattended sync successfully, resulting in a complete success from the user’s perspective. If you are using the .NET or Java QuickBooks SDKs to develop your application, fixed, incremental, and exponential policies are built in.
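The three policies differ only in how the wait between attempts grows. The sketch below is not the SDKs' implementation; the function names, parameters and the choice to retry only on ConnectionError are assumptions made for illustration.

```python
import time


def fixed_delays(interval, max_retries):
    # Same wait before every retry, e.g. 2, 2, 2.
    return [interval] * max_retries


def incremental_delays(initial, increment, max_retries):
    # Wait grows by a constant step, e.g. 1, 3, 5.
    return [initial + increment * i for i in range(max_retries)]


def exponential_delays(initial, factor, max_retries):
    # Wait is multiplied by a constant factor, e.g. 1, 2, 4, 8.
    return [initial * factor ** i for i in range(max_retries)]


def call_with_retry(operation, delays, sleep=time.sleep,
                    retry_on=(ConnectionError,)):
    """Run operation; after each failure wait for the next delay and retry.

    With n delays there are at most n + 1 attempts. The exception from
    the last attempt is allowed to propagate to the caller.
    """
    for delay in delays:
        try:
            return operation()
        except retry_on:
            sleep(delay)
    return operation()
```

For an interactive action, something like call_with_retry(save, fixed_delays(1, 1)) gives a single quick retry before showing an error message. For an unattended sync, exponential_delays(1, 2, 6) spreads the attempts over about a minute.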

When implementing a retry policy, it is also imperative to consider when to short-circuit. If a service has a complete outage, the fancy exponential backoff algorithm that you’ve written is rendered useless. Once the application hits the defined maximum number of retries, you must short-circuit and wait an extended period of time before retrying. Consider calling a lightweight API as a health check before resuming the application’s process, which saves resources, time, and logs full of errors.
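One common way to short-circuit is a small circuit breaker. The sketch below is one possible shape, not a prescribed design: CircuitBreaker, CircuitOpenError and the health_check callable are hypothetical names, and treating ConnectionError as the failure signal is an assumption.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the service looks down."""


class CircuitBreaker:
    def __init__(self, max_failures, cooldown_seconds, health_check,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.health_check = health_check  # cheap call returning True/False
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means calls are allowed through

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("cooling down, call skipped")
            if not self.health_check():
                # Still down: restart the cooldown without a real call.
                self.opened_at = self.clock()
                raise CircuitOpenError("health check failed, call skipped")
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the breaker is open, the expensive operation is never attempted, so a full outage costs one health check per cooldown period instead of a log full of failed requests.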

Atomic Operations

When reviewing the service calls made by your application, consider which operation bundles must be atomic. This becomes very important when you consider sales or other transactions that directly affect accounting and reporting data. For example, if the user expects an Invoice with a linked Payment, but the application is only able to write the Invoice, it is essential to alert the user and retry the Payment operation until the Invoice is properly marked as paid.
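The Invoice and Payment example could be handled roughly as follows. This is a sketch under assumptions: record_paid_invoice and the three callables passed to it are hypothetical, and the real create calls would go through your service layer with idempotent requests so that payment retries cannot double-pay.

```python
def record_paid_invoice(create_invoice, create_payment, notify_user,
                        max_payment_attempts=3):
    """Write an Invoice and its linked Payment as one logical unit.

    If the Payment cannot be written, the user is alerted once and the
    result is flagged so a later sync can finish the job.
    """
    invoice = create_invoice()
    user_alerted = False
    for _ in range(max_payment_attempts):
        try:
            payment = create_payment(invoice)
            return {"invoice": invoice, "payment": payment,
                    "status": "paid"}
        except ConnectionError:
            if not user_alerted:
                notify_user("Invoice saved, but the payment is not "
                            "recorded yet. We are retrying.")
                user_alerted = True
    # Attempts exhausted: never report the bundle as complete.
    return {"invoice": invoice, "payment": None,
            "status": "payment_pending"}
```

The key point is that the function never reports success for half of the bundle. A "payment_pending" result is something the application can queue and retry later, rather than leaving an unpaid Invoice that the user believes is settled.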

Introducing Chaos

Introducing chaos into your application is another great way to understand the cranky path, test the application’s retry policies and fully understand how your users would feel when services fail. Your application can mock the web service exceptions using tools like WireMock, MockServer, or even your own stubbed service.
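If you go the route of your own stubbed service, the stub can be very small. The sketch below is a hypothetical wrapper, unrelated to WireMock, MockServer or any SDK feature; ChaosStub, failure_rate and the injected ConnectionError are assumptions chosen for illustration.

```python
import random


class ChaosStub:
    """Hypothetical wrapper that makes a fraction of calls fail."""

    def __init__(self, operation, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.operation = operation
        self.failure_rate = failure_rate
        # Pass a seeded random.Random to make a test run repeatable.
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 0 never fails
        # and a rate of 1 always fails.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.operation(*args, **kwargs)


def fetch_customer(customer_id):
    # Stand-in for a real service call.
    return {"id": customer_id, "name": "Example Customer"}


chaotic_fetch = ChaosStub(fetch_customer, failure_rate=0.3,
                          rng=random.Random(42))
```

Wrapping service calls this way in a test environment lets you watch how the retry policy, the short-circuit logic and the error messaging behave together before a real outage does it for you.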

Limited, Experimental Program to Test Chaos Mode

To help you start the process, we are offering a limited-availability, experimental program for developers to test a Chaos Mode built directly into one of our QuickBooks SDKs. With Chaos Mode enabled, the SDK will throw mock service errors at a configurable tolerance level. We hope this feature helps developers deliver a delightful customer experience, even when services are not cooperating. To apply for access to the experimental program, click here.

Questions, comments or concerns? Share them on our developer forums.