State machine patterns, such as Stateful Workflows, Sagas, and Replicated State Machines, improve message reliability, sync consistency, and recovery.
Abstract
As a backend project lead for a global messaging platform serving millions of users daily, I was responsible for several efforts to improve the stability and fault tolerance of our backend services. We rebuilt critical parts of the system around state machine patterns, most notably Stateful Workflows. This approach eliminated long-standing problems in message delivery, read receipt visibility, and device sync, such as mismatched phone directories across devices.
This article shares the practical lessons and pitfalls of bringing these architectures into production, so that readers can keep their own messaging infrastructure highly available and adaptable.
Introduction
When dealing with distributed systems, you should always assume that failure will happen. In our messaging platform, it became clear very quickly that unpredictable behavior was not a once-in-a-blue-moon occurrence; it was the standard state of affairs. Our infrastructure had to cope not only with network partitions and push notification delays but also with user device crashes, and our engineers worked hard to keep up with these problems.
Rather than scattering service-level retry logic across the codebase, we chose a more systematic approach: state machines. Once we reimagined our business-critical workflows as stateful entities, we found we could automate failure recovery in a way that was predictable, observable, and consistent.
This piece focuses on the three main designs we used, Stateful Workflows, Sagas, and Replicated State Machines, and how they allowed us to build a resilient system that responds to failure scenarios gracefully.
Using Stateful Workflows for Message Delivery
Message delivery is, without a doubt, the most crucial part of our system. Initially, we used a stateless, queue-based pipeline to deliver messages to devices. Unfortunately, we kept running into cases where the process stopped partway through, leaving the user with a message that arrived late or not at all.
We tackled this problem by introducing the Stateful Workflow Pattern with the help of Temporal:
Message Workflow States
- Send Message Initiated
- Message Stored
- Push Notification Dispatched
- Delivery Confirmed
- Read Acknowledged
Every state transition was driven by events, with timers and retries attached. When a push notification failed to go out (typically due to APNs/FCM issues), the system retried with exponential backoff. If a delivery confirmation did not arrive within the expected window, we recorded the event and, if the customer opted in, triggered fallback mechanisms such as email notifications.
Each step was persisted to the database, which allowed workflows to resume from the point where they last stopped, even after a crash or node restart. As a result, message loss dropped significantly, and error states became visible in our monitoring tools.
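To make the shape of this concrete, here is a minimal sketch of such a delivery workflow using the Temporal Python SDK. The activity names (store_message, dispatch_push), signal names, timeouts, and retry settings are illustrative assumptions, not our production definitions.

```python
import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Exponential backoff for the push dispatch step (APNs/FCM hiccups).
PUSH_RETRY = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=5),
    maximum_attempts=8,
)


@workflow.defn
class MessageDeliveryWorkflow:
    """Drives one message through the states listed above."""

    def __init__(self) -> None:
        self._delivered = False
        self._read = False

    @workflow.run
    async def run(self, message_id: str) -> str:
        # Message Stored: persist the message before doing anything else.
        await workflow.execute_activity(
            "store_message",  # activity implemented and registered elsewhere
            message_id,
            start_to_close_timeout=timedelta(seconds=10),
        )

        # Push Notification Dispatched: retried with exponential backoff.
        await workflow.execute_activity(
            "dispatch_push",
            message_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=PUSH_RETRY,
        )

        # Delivery Confirmed: wait for the client's ack signal, with a deadline.
        try:
            await workflow.wait_condition(
                lambda: self._delivered, timeout=timedelta(minutes=10)
            )
        except asyncio.TimeoutError:
            # Escalation (e.g., an email fallback) would be triggered here.
            return "delivery_unconfirmed"

        # Read Acknowledged: terminal state for this sketch.
        await workflow.wait_condition(lambda: self._read)
        return "read_acknowledged"

    @workflow.signal
    def delivery_confirmed(self) -> None:
        self._delivered = True

    @workflow.signal
    def read_acknowledged(self) -> None:
        self._read = True
```

Because the engine persists every transition in the workflow's history, a restarted worker resumes from the last recorded state rather than from the beginning, which is exactly the recovery property described above.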
Implementing the Saga Pattern for Multi-Device Sync
Another vital requirement is that the read status of messages stays identical across all of a user's devices: if the user reads a message on one device, the change should be reflected on every other device almost instantly.
We implemented this as a straightforward Saga:
- Step 1: Mark the message as read on Device A.
- Step 2: Sync to cloud state.
- Step 3: Push read receipt to Devices B and C.
Each step was a local transaction. If one of them failed, we ran the corresponding compensations so that no inconsistency was left behind. For example, if the cloud sync failed, we rolled the state back and notified Device A of the problem, so that no partial changes remained.
This approach let us achieve eventual consistency without global locks or distributed transactions, both of which are intricate and accident-prone.
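A minimal, engine-agnostic sketch of that compensation logic is shown below. The step and compensation functions are hypothetical placeholders for the local transactions described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # the local transaction
    compensate: Callable[[], None]   # how to undo it


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical local operations for the read-receipt flow above.
def mark_read_on_device_a() -> None: ...
def unmark_read_on_device_a() -> None: ...
def sync_to_cloud() -> None: ...
def revert_cloud_state() -> None: ...
def push_receipt_to_devices() -> None: ...
def retract_receipt_from_devices() -> None: ...


succeeded = run_saga([
    SagaStep("mark_read", mark_read_on_device_a, unmark_read_on_device_a),
    SagaStep("sync_cloud", sync_to_cloud, revert_cloud_state),
    SagaStep("push_receipts", push_receipt_to_devices, retract_receipt_from_devices),
])
```

Running compensations in reverse order keeps the rollback symmetric with the forward path, which is what prevents partial state from leaking to other devices.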
Using Replicated State Machines for Metadata Storage
To keep metadata such as conversation state and user preferences consistent, we employed Replicated State Machines built on the Raft consensus protocol. This design enabled us to:
- Appoint a leader to manage writes
- Copy the changes to all followers
- Rebuild state by replaying the log after a crash
This approach was particularly valuable for the chat indexing service and group membership management, where every replica needed a correct, durable view of the state.
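The consensus layer itself (Raft in systems such as etcd or Consul) handles leader election and log replication; what makes the pattern work is that every replica applies the same committed log deterministically. The sketch below illustrates only that deterministic apply-and-replay step; the command format and class names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MetadataStateMachine:
    """Applies committed log entries to local state, deterministically.

    Replicas that apply the same log in the same order converge on an
    identical view of the metadata; after a crash, a replica rebuilds its
    state by replaying the committed log.
    """

    state: Dict[str, str] = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, index: int, command: Tuple[str, str, str]) -> None:
        # Skip entries already applied so replays stay idempotent.
        if index <= self.last_applied:
            return
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        self.last_applied = index

    def replay(self, committed_log: List[Tuple[str, str, str]]) -> None:
        # Crash recovery: re-execute the committed log from the beginning.
        for i, command in enumerate(committed_log, start=1):
            self.apply(i, command)


# Two replicas replaying the same committed log end up in the same state.
log = [("set", "group:42:owner", "alice"), ("set", "chat:7:topic", "launch", )]
a, b = MetadataStateMachine(), MetadataStateMachine()
a.replay(log)
b.replay(log)
assert a.state == b.state
```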
Comparative Analysis of Patterns
I compared the most common state machine-based fault tolerance patterns to arrive at a solution that worked well for us.
| Aspect | Replicated State Machine | Stateful Workflow | Saga Pattern |
|---|---|---|---|
| Primary Goal | Strong consistency & availability | Long-running orchestration | Distributed transaction coordination |
| Consistency Model | Strong (linearizable) | Eventually consistent (recoverable) | Eventually consistent |
| Failure Recovery | Re-execution from logs | Resume from persisted state | Trigger compensations |
| Tooling Examples | Raft (etcd, Consul), Paxos | Temporal, AWS Step Functions | Temporal, Camunda, Netflix Conductor |
| Ideal For | Consensus, leader election, config stores | Multi-step business workflows | Business processes with rollback needs |
| Complexity | High (due to consensus) | Moderate | High (compensating logic needed) |
| Execution Style | Synchronous (log replication) | Asynchronous, event-driven | Asynchronous, loosely coupled |
Results and Benefits
Implementing state machine patterns brought the following measurable improvements:
- Message delivery retries fell by 60%.
- Read receipt sync issues were cut down by 45%.
- Recovery time after service crashes dropped to under 200 ms.
- Improved observability reduced incident resolution time.
Furthermore, we built internal tools, such as dashboards that visualize workflow state per message, which proved invaluable during on-call incidents.
Conclusion
In a messaging system, reliability is not an add-on; it is a requirement. Users expect their messages to be delivered, read, and synchronized in the same moment. By modeling our essential workflows as state machines, we built a fault-tolerant system that recovers gracefully from failures. The combination of Stateful Workflows, Sagas, and Replicated State Machines gave us the means to treat failures as first-class concerns in our architecture.
Although the implementation took real effort, the gains in robustness, clarity, and operational efficiency were significant. These patterns are now the foundation of how we think about building resilient services across the organization.