State machine patterns, such as Stateful Workflows, Sagas, and Replicated State Machines, improve message reliability, sync consistency, and recovery.
Abstract
As a backend project lead for a global messaging platform serving millions of users daily, I was responsible for several efforts to improve the stability and fault tolerance of our backend services. We rebuilt critical parts of the system around state machine patterns, most notably Stateful Workflows. This approach eliminated long-standing problems in message delivery, read receipt visibility, and device sync, such as mismatched phone directories across devices.
This article shares the practical lessons and pitfalls of bringing these architectures into production, so that readers can keep their own messaging infrastructure highly available and adaptable.
Introduction
When dealing with distributed systems, you should always assume that failure will happen. In our messaging platform, it became clear very quickly that unpredictable behavior was not a once-in-a-blue-moon occurrence; it was the standard state of affairs. Our infrastructure had to cope not only with network partitions and push notification delays but also with user device crashes, and our engineers worked hard to keep up with these problems.
Rather than scattering service-level retry logic across the codebase, we chose a more systematic approach: state machines. Once we reimagined our business-critical workflows as stateful entities, we found we could automate failure recovery in a way that was predictable, observable, and consistent.
This piece focuses on the three main designs we used, Stateful Workflows, Sagas, and Replicated State Machines, and how they allowed us to build a resilient system that responds to failure scenarios gracefully.
Using Stateful Workflows for Message Delivery
Message delivery is, without a doubt, the most crucial part of our system. Initially, we used a stateless, queue-based pipeline to deliver messages to devices. Unfortunately, we kept running into cases where the process stopped partway through, leaving the user with a message that arrived late or not at all.
We tackled this problem by introducing the Stateful Workflow Pattern with the help of Temporal:
Message Workflow States
- Send Message Initiated
- Message Stored
- Push Notification Dispatched
- Delivery Confirmed
- Read Acknowledged
Every state transition was driven by events, with timers and retries attached. When a push notification failed to go out (typically due to APNs/FCM issues), the system retried with exponential backoff. If a delivery confirmation did not arrive within the expected window, we recorded the event and, if the customer opted in, triggered fallback mechanisms such as email notifications.
Each step was persisted to the database, which allowed workflows to resume from the point where they last stopped, even after a crash or node restart. As a result, message loss dropped significantly, and error states became visible in our monitoring tools.
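To make the shape of this concrete, here is a minimal sketch of such a delivery workflow using the Temporal Python SDK. The activity names (store_message, dispatch_push), signal names, timeouts, and retry settings are illustrative assumptions, not our production definitions.

```python
import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Exponential backoff for the push dispatch step (APNs/FCM hiccups).
PUSH_RETRY = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=5),
    maximum_attempts=8,
)


@workflow.defn
class MessageDeliveryWorkflow:
    """Drives one message through the states listed above."""

    def __init__(self) -> None:
        self._delivered = False
        self._read = False

    @workflow.run
    async def run(self, message_id: str) -> str:
        # Message Stored: persist the message before doing anything else.
        await workflow.execute_activity(
            "store_message",  # activity implemented and registered elsewhere
            message_id,
            start_to_close_timeout=timedelta(seconds=10),
        )

        # Push Notification Dispatched: retried with exponential backoff.
        await workflow.execute_activity(
            "dispatch_push",
            message_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=PUSH_RETRY,
        )

        # Delivery Confirmed: wait for the client's ack signal, with a deadline.
        try:
            await workflow.wait_condition(
                lambda: self._delivered, timeout=timedelta(minutes=10)
            )
        except asyncio.TimeoutError:
            # Escalation (e.g., an email fallback) would be triggered here.
            return "delivery_unconfirmed"

        # Read Acknowledged: terminal state for this sketch.
        await workflow.wait_condition(lambda: self._read)
        return "read_acknowledged"

    @workflow.signal
    def delivery_confirmed(self) -> None:
        self._delivered = True

    @workflow.signal
    def read_acknowledged(self) -> None:
        self._read = True
```

Because the engine persists every transition in the workflow's history, a restarted worker resumes from the last recorded state rather than from the beginning, which is exactly the recovery property described above.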
Implementing the Saga Pattern for Multi-Device Sync
Another vital requirement is that the read status of messages stays identical across all of a user's devices: if the user reads a message on one device, the change should be reflected on every other device almost instantly.
We implemented this as a straightforward Saga:
- Step 1: Mark the message as read on Device A.
- Step 2: Sync to cloud state.
- Step 3: Push read receipt to Devices B and C.
Each step was a local transaction. If one of them failed, we ran the corresponding compensations so that no inconsistency was left behind. For example, if the cloud sync failed, we rolled the state back and notified Device A of the problem, so that no partial changes remained.
This approach let us achieve eventual consistency without global locks or distributed transactions, both of which are intricate and accident-prone.
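A minimal, engine-agnostic sketch of that compensation logic is shown below. The step and compensation functions are hypothetical placeholders for the local transactions described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # the local transaction
    compensate: Callable[[], None]   # how to undo it


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical local operations for the read-receipt flow above.
def mark_read_on_device_a() -> None: ...
def unmark_read_on_device_a() -> None: ...
def sync_to_cloud() -> None: ...
def revert_cloud_state() -> None: ...
def push_receipt_to_devices() -> None: ...
def retract_receipt_from_devices() -> None: ...


succeeded = run_saga([
    SagaStep("mark_read", mark_read_on_device_a, unmark_read_on_device_a),
    SagaStep("sync_cloud", sync_to_cloud, revert_cloud_state),
    SagaStep("push_receipts", push_receipt_to_devices, retract_receipt_from_devices),
])
```

Running compensations in reverse order keeps the rollback symmetric with the forward path, which is what prevents partial state from leaking to other devices.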
Using Replicated State Machines for Metadata Storage
To keep metadata such as conversation state and user preferences consistent, we employed Replicated State Machines built on the Raft consensus protocol. This design enabled us to:
- Appoint a leader to manage writes
- Copy the changes to all followers
- Rebuild state by replaying the log after a crash
This approach was particularly valuable for the chat indexing service and group membership management, where every replica needed a correct, durable view of the state.
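The consensus layer itself (Raft in systems such as etcd or Consul) handles leader election and log replication; what makes the pattern work is that every replica applies the same committed log deterministically. The sketch below illustrates only that deterministic apply-and-replay step; the command format and class names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MetadataStateMachine:
    """Applies committed log entries to local state, deterministically.

    Replicas that apply the same log in the same order converge on an
    identical view of the metadata; after a crash, a replica rebuilds its
    state by replaying the committed log.
    """

    state: Dict[str, str] = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, index: int, command: Tuple[str, str, str]) -> None:
        # Skip entries already applied so replays stay idempotent.
        if index <= self.last_applied:
            return
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        self.last_applied = index

    def replay(self, committed_log: List[Tuple[str, str, str]]) -> None:
        # Crash recovery: re-execute the committed log from the beginning.
        for i, command in enumerate(committed_log, start=1):
            self.apply(i, command)


# Two replicas replaying the same committed log end up in the same state.
log = [("set", "group:42:owner", "alice"), ("set", "chat:7:topic", "launch", )]
a, b = MetadataStateMachine(), MetadataStateMachine()
a.replay(log)
b.replay(log)
assert a.state == b.state
```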
Comparative Analysis of Patterns
I compared the most common state machine-based fault tolerance patterns to arrive at a solution that worked well for us.
| Aspect | Replicated State Machine | Stateful Workflow | Saga Pattern |
|---|---|---|---|
| Primary Goal | Strong consistency & availability | Long-running orchestration | Distributed transaction coordination |
| Consistency Model | Strong (linearizable) | Eventually consistent (recoverable) | Eventually consistent |
| Failure Recovery | Re-execution from logs | Resume from persisted state | Trigger compensations |
| Tooling Examples | Raft (etcd, Consul), Paxos | Temporal, AWS Step Functions | Temporal, Camunda, Netflix Conductor |
| Ideal For | Consensus, leader election, config stores | Multi-step business workflows | Business processes with rollback needs |
| Complexity | High (due to consensus) | Moderate | High (compensating logic needed) |
| Execution Style | Synchronous (log replication) | Asynchronous, event-driven | Asynchronous, loosely coupled |
Results and Benefits
Implementing state machine patterns brought the following measurable improvements:
- Message delivery retries fell by 60%.
- Read receipt sync issues were cut down by 45%.
- Recovery time after service crashes dropped to under 200 ms.
- Improved observability reduced incident resolution time.
Furthermore, we built internal tools, such as dashboards that visualize workflow state per message, which proved invaluable during on-call incidents.
Conclusion
In a messaging system, reliability is not an add-on; it is a requirement. Users expect their messages to be delivered, read, and synchronized in the same moment. By modeling our essential workflows as state machines, we built a fault-tolerant system that recovers gracefully from failures. The combination of Stateful Workflows, Sagas, and Replicated State Machines gave us the means to treat failures as first-class concerns in our architecture.
Although the implementation took real effort, the gains in robustness, clarity, and operational efficiency were significant. These patterns are now the foundation of how we think about building resilient services across the organization.