Skip to main content

Command Palette

Search for a command to run...

Choreographing Distributed Sagas with State Machines

Leveraging State Machines for Distributed Saga Management

Updated
Choreographing Distributed Sagas with State Machines

Ever wondered how your backend knows what a user can and can't do? Welcome to the world of Finite State Machines—where business logic meets bulletproof guarantees, and where every state change becomes a conversation with your entire system.

Think of user registration like a carefully choreographed dance. But here's the thing: it's not a solo performance. When a user transitions from "pending verification" to "active," that dance move needs to ripple across your entire application ecosystem. Miss that signal, and suddenly your billing service is charging unverified users while your email service never sends the welcome sequence.

The Scattered Logic Problem

Most developers wing it with state management. They scatter if-statements throughout their code, hoping they've covered all the edge cases. But what happens when a user tries to reset their password before confirming their email? Or when they attempt to delete an account that's already been suspended?

// Instead of this chaos:
if (user.isActive && !user.isDeleted && user.emailVerified) {
  // Maybe allow password reset?
  if (user.lastLoginAttempt < maxAttempts) {
    // Actually, what if they're suspended?
    // And who needs to know about this change?
    // This is getting messy...
  }
}

Enter the FSM—your business logic's best friend and your distributed system's reliable narrator.

Building the Foundation: Local State Integrity

export class UserEntity extends BaseEntity<UserState, UserAction> {
  transition(action: UserAction): this {
    super.transition(action); // FSM validates the transition
    return this;
  }
}

// Usage: Clean, explicit, impossible to mess up
const entity = UserEntity.create(user)
  .transition(UserAction.REGISTER)
  .transition(UserAction.CONFIRM_EMAIL)
  .persist(repo, trx);

But here's where it gets really interesting—we've accidentally created a domain-specific language that speaks to both humans and machines.

const userTransitions = buildTransitionMap([
  allow(UserState.NEW, UserAction.REGISTER, UserState.PENDING_VERIFICATION),
  allow(UserState.PENDING_VERIFICATION, UserAction.CONFIRM_EMAIL, UserState.ACTIVE),
  allow(UserState.ACTIVE, UserAction.SUSPEND, UserState.SUSPENDED),
  allow(UserState.ACTIVE, UserAction.DELETE, UserState.DELETED),
  // What's NOT here? You CAN'T go from DELETED to ACTIVE. Ever.
]);

The Distributed Reality: When State Changes Must Travel

Now here's where most FSM tutorials stop—but where real systems begin. In a distributed architecture, your user's journey from PENDING_VERIFICATION → ACTIVE isn't just an internal state flip. It's a domain event that needs to cascade through your entire system:

  • Billing Service: "Start the 14-day trial"

  • Email Service: "Send welcome sequence"

  • Analytics Service: "User activated - update conversion metrics"

  • Recommendation Engine: "Begin building user profile"

Every state transition is actually a promise to the rest of your system that something meaningful has happened.

FSM + Event-Driven Architecture = System-Wide Consistency

export class UserEntity extends BaseEntity<UserState, UserAction> {
  transition(action: UserAction): this {
    const previousState = this.currentState;
    super.transition(action);

    // FSM transition succeeded - now tell the world
    this.addDomainEvent(new UserStateChangedEvent({
      userId: this.id,
      from: previousState,
      to: this.currentState,
      action,
      timestamp: new Date()
    }));

    return this;
  }
}

But wait—we've just created the classic dual-write problem. What if the database transaction succeeds but the event publishing fails? Your local state says "ACTIVE" but the billing service never got the memo.

The Outbox Pattern: Guaranteed Event Delivery

This is where the Transactional Outbox Pattern becomes your FSM's distributed companion:

// Within the same database transaction
await trx.batch([
  userRepo.save(entity), // Update user state
  outboxRepo.save(entity.domainEvents) // Store events to publish
]);

// Separate process reliably publishes events
const outboxEvents = await outboxRepo.findUnpublished();
for (const event of outboxEvents) {
  await messageQueue.publish(event);
  await outboxRepo.markPublished(event.id);
}

Now your FSM transitions become atomic broadcasts—either the state changes and all downstream services get notified, or nothing happens at all.

Real-World Event Flow

Let's trace what happens when a user confirms their email:

  1. FSM validates: PENDING_VERIFICATION → CONFIRM_EMAIL → ACTIVE ✅

  2. Database transaction: User state updated + Event stored in outbox

  3. Event published: UserActivatedEvent → SNS/SQS/Kafka

  4. Services react:

    • Billing: Creates trial subscription

    • Email: Sends welcome series

    • Analytics: Records activation

    • Auth: Enables full feature access

If any step fails, the entire transition is rolled back. Your business rules aren't just enforced locally—they're guaranteed system-wide.

Choreographed Sagas: Failure States as Compensation Triggers

Here's where FSMs truly shine in distributed systems: failure states become distributed compensation coordinators.

When user activation involves billing setup, email verification, and resource provisioning, any step can fail. Instead of introducing a centralized saga orchestrator (another thing to maintain, another point of failure), let the FSM failure states choreograph compensation:

const userTransitions = buildTransitionMap([
  // Happy path
  allow(UserState.ACTIVATING, UserAction.ACTIVATION_SUCCESS, UserState.ACTIVE),

  // Failure states - each represents a specific failure scenario
  allow(UserState.ACTIVATING, UserAction.BILLING_SETUP_FAILED, UserState.BILLING_FAILED),
  allow(UserState.ACTIVATING, UserAction.EMAIL_VERIFICATION_FAILED, UserState.EMAIL_FAILED),

  // Compensation paths
  allow(UserState.BILLING_FAILED, UserAction.COMPENSATE, UserState.COMPENSATING),
  allow(UserState.COMPENSATING, UserAction.COMPENSATION_COMPLETE, UserState.PENDING),
]);

When billing setup fails, the FSM transitions to BILLING_FAILED state and publishes a UserBillingFailedEvent with full context:

{
  userId: "user-123",
  fromState: "ACTIVATING",
  toState: "BILLING_FAILED",
  failureReason: "Payment method declined",
  timestamp: "2025-09-29T10:30:00Z",
  context: {
    attemptedTrialId: "trial-456",
    emailVerificationStatus: "completed"
  }
}

Each service receives the same event and handles its own compensation:

// Billing Service - knows how to undo its own work
class BillingEventHandler {
  async onUserBillingFailed(event: UserBillingFailedEvent) {
    const trial = await this.findTrial(event.context.attemptedTrialId);
    if (trial) {
      await this.cancelTrial(trial.id);
      await this.releaseReservedQuota(trial.id);
    }
  }
}

// Email Service - manages its own side effects
class EmailEventHandler {
  async onUserBillingFailed(event: UserBillingFailedEvent) {
    await this.cancelScheduledWelcomeSequence(event.userId);
    await this.sendActivationFailedNotice(event.userId, event.failureReason);
  }
}

// Auth Service - handles its domain
class AuthEventHandler {
  async onUserBillingFailed(event: UserBillingFailedEvent) {
    await this.revokeTemporaryPermissions(event.userId);
    await this.clearActivationSession(event.userId);
  }
}

No centralized orchestrator. No additional coordination layer. Just:

  1. FSM transitions to failure state

  2. Event published with full context

  3. Each service independently compensates what it created

  4. Services report completion back via events

  5. FSM transitions to next appropriate state

This is single responsibility at the system level. Each service:

  • Owns its compensation logic

  • Reacts to relevant failure states

  • Doesn't need to know about other services' compensation strategies

  • Can evolve its compensation independently

The Architecture Patterns That Emerge

Choreographed Sagas: FSM failure states coordinate distributed compensations without a central orchestrator. Services autonomously react to failure events, each handling their own rollback.

Event Sourcing Flavor: Instead of just storing current state, persist the transition events themselves. Your FSM becomes both the state machine and the audit log.

CQRS Integration: Commands trigger FSM transitions, queries read from projections built from the events. Clean separation of concerns.

Why This Transforms Your Entire System

🛡️ Impossible states stay impossible—across all services
🎯 Business rules live in one place—but their effects are system-wide
🔍 Distributed debugging—trace events across service boundaries
🚀 Service decoupling—services react to events, not direct calls
📝 Living documentation—your FSM + event schema IS your system contract

The Real-World Impact

In our production auth system since implementing FSM + Event-driven architecture:

  • Zero distributed state inconsistencies (user active in auth but inactive in billing)

  • 90% reduction in "user stuck in weird state" support tickets

  • Cross-service feature rollouts that used to take weeks now take hours

  • New team members understand user lifecycle across all services in their first day

Beyond User Management

This pattern transforms any stateful domain:

Order Processing: NEW → PAID → FULFILLING → SHIPPED → DELIVERED

  • Each transition notifies warehouse, shipping, customer service

Content Moderation: SUBMITTED → UNDER_REVIEW → APPROVED → PUBLISHED

  • Events trigger ML pipelines, human reviewers, CDN updates

Subscription Management: TRIAL → ACTIVE → PAST_DUE → CANCELED

  • Billing, feature access, and communication all stay in sync

The Questions This Raises

Think about your most complex business workflows. Where do you have:

  • Services getting out of sync?

  • "How did this entity get into this state?" debugging sessions?

  • Fear of changing business rules because the ripple effects are unpredictable?

These are all FSM + Event-driven architecture problems waiting to be solved.

The FSM doesn't just manage local state—it becomes the authoritative narrator of your domain's story, ensuring every service hears the same tale at the same time.

What distributed state challenges are keeping you up at night? There's probably an FSM + messaging solution hiding in plain sight.

#SoftwareArchitecture #DomainDrivenDesign #StateManagement #EventDrivenArchitecture #DistributedSystems #TypeScript #BackendEngineering