Leader Election and Agents

Who's in charge?

Wolverine can distribute stateful background work by assigning running agents to specific nodes within an application cluster. To do so, Wolverine has a built-in leader election feature that makes exactly one node run a "leadership agent", which continuously ensures that every known and supported agent is running on a single node within the system.

Here's an illustration of that work distribution:

[Figure: Work Distribution across Nodes]

Within Wolverine itself, there are a few types of "agents" that Wolverine distributes:

The "durability agents" that poll against message stores for any stranded inbox or outbox messages that might need to be recovered and pushed along. Wolverine runs exactly one agent for each message store in the system and distributes these across the cluster.
"Exclusive listeners", which Wolverine uses when you direct it to listen to a queue, topic, or message subscription on only a single node. This happens when you use the strictly ordered listening option.
In conjunction with Marten, the Wolverine managed projection and subscription distribution uses Wolverine's agent assignment capability to make sure each projection or subscription is running on exactly one node.

Enabling Leader Election

Leader election is on by default in Wolverine if you have any type of message persistence configured for your application and some mechanism for cross-node communication. First though, let's talk about message persistence. You can persist messages with PostgreSQL:

var builder = WebApplication.CreateBuilder(args);
var connectionString = builder.Configuration.GetConnectionString("postgres");

builder.Host.UseWolverine(opts =>
{
    // Setting up Postgresql-backed message storage
    // This requires a reference to Wolverine.Postgresql
    opts.PersistMessagesWithPostgresql(connectionString);

    // Other Wolverine configuration
});

// This is rebuilding the persistent storage database schema on startup
// and also clearing any persisted envelope state
builder.Host.UseResourceSetupOnStartup();

var app = builder.Build();

// Other ASP.Net Core configuration...

// Using JasperFx opens up command line utilities for managing
// the message storage
return await app.RunJasperFxCommands(args);

or by SQL Server:

var builder = WebApplication.CreateBuilder(args);
var connectionString = builder.Configuration.GetConnectionString("sqlserver");

builder.Host.UseWolverine(opts =>
{
    // Setting up Sql Server-backed message storage
    // This requires a reference to Wolverine.SqlServer
    opts.PersistMessagesWithSqlServer(connectionString);

    // Other Wolverine configuration
});

// This is rebuilding the persistent storage database schema on startup
// and also clearing any persisted envelope state
builder.Host.UseResourceSetupOnStartup();

var app = builder.Build();

// Other ASP.Net Core configuration...

// Using JasperFx opens up command line utilities for managing
// the message storage
return await app.RunJasperFxCommands(args);

or through the Marten integration:

// Adding Marten
builder.Services.AddMarten(opts =>
    {
        var connectionString = builder.Configuration.GetConnectionString("Marten");
        opts.Connection(connectionString);
        opts.DatabaseSchemaName = "orders";
    })

    // Adding the Wolverine integration for Marten.
    .IntegrateWithWolverine();

or by RavenDb:

var builder = Host.CreateApplicationBuilder();

// You'll need a reference to RavenDB.DependencyInjection
// for this one
builder.Services.AddRavenDbDocStore(raven =>
{
    // configure your RavenDb connection here
});

builder.UseWolverine(opts =>
{
    // That's it, nothing more to see here
    opts.UseRavenDbPersistence();
    
    // The RavenDb integration supports basic transactional
    // middleware just fine
    opts.Policies.AutoApplyTransactions();
});

// continue with your bootstrapping...

Next, we need to have some kind of mechanism for cross node communication within Wolverine in the form of control queues for each node. When Wolverine bootstraps, it uses the message persistence to save information about the new node including a Uri for a control endpoint where other Wolverine nodes should send messages to "control" agent assignments.

If you're using any of the message persistence options above, there's a fallback mechanism that uses the associated database as a simplistic message queue between nodes. Some of the transports in Wolverine can instead use a non-durable queue for each node, which will probably give better results. At the time this guide was written, the Rabbit MQ transport and the Azure Service Bus transport support this feature.

Disabling Leader Election

If you want to disable leader election and all the cross node traffic, or maybe if you just want to optimize automated testing scenarios by making a newly launched process automatically start up all possible agents immediately, you can use the DurabilityMode.Solo setting as shown below:

var builder = Host.CreateApplicationBuilder();

builder.UseWolverine(opts =>
{
    opts.Services.AddMarten("some connection string")

        // This adds quite a bit of middleware for
        // Marten
        .IntegrateWithWolverine();

    // You want this maybe!
    opts.Policies.AutoApplyTransactions();

    if (builder.Environment.IsDevelopment())
    {
        // But wait! Optimize Wolverine for usage as
        // if there would never be more than one node running
        opts.Durability.Mode = DurabilityMode.Solo;
    }
});

using var host = builder.Build();
await host.StartAsync();

For testing, you also have this helper:

// This is bootstrapping the actual application using
// its implied Program.Main() set up
// For non-Alba users, this is using IWebHostBuilder 
Host = await AlbaHost.For<Program>(x =>
{
    x.ConfigureServices(services =>
    {
        // Override the Wolverine configuration in the application
        // to run the application in "solo" mode for faster
        // testing cold starts
        services.RunWolverineInSoloMode();

        // And just for completion, disable all Wolverine external 
        // messaging transports
        services.DisableAllExternalWolverineTransports();
    });
});

Likewise, any DurabilityMode setting other than Balanced (the default) will disable leader election.

Writing Your Own Agent Family

To write your own family of "sticky" agents and use Wolverine to distribute them across an application cluster, you'll first need to make implementations of this interface:

/// <summary>
///     Models a constantly running background process within a Wolverine
///     node cluster
/// </summary>
public interface IAgent : IHostedService // Standard .NET interface for background services
{
    /// <summary>
    ///     Unique identification for this agent within the Wolverine system
    /// </summary>
    Uri Uri { get; }
    
    // Not really used for anything real *yet*, but 
    // hopefully becomes something useful for CritterWatch
    // health monitoring
    AgentStatus Status { get; }
}

public class CompositeAgent : IAgent
{
    private readonly List<IAgent> _agents;
    public Uri Uri { get; }

    public CompositeAgent(Uri uri, IEnumerable<IAgent> agents)
    {
        Uri = uri;
        _agents = agents.ToList();
    }

    public async Task StartAsync(CancellationToken cancellationToken)
    {
        foreach (var agent in _agents)
        {
            await agent.StartAsync(cancellationToken);
        }

        Status = AgentStatus.Running;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        foreach (var agent in _agents)
        {
            await agent.StopAsync(cancellationToken);
        }

        // All child agents have been stopped at this point
        Status = AgentStatus.Stopped;
    }

    public AgentStatus Status { get; private set; } = AgentStatus.Stopped;
}

Note that you could use BackgroundService as a base class.

The Uri property just needs to be unique and match up with our next service interface. Wolverine uses that Uri as a unique identifier to track where and whether the known agents are executing.
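As a rough illustration of that uniqueness requirement, the sketch below builds agent Uris that share one scheme and checks that a set of them contains no duplicates. The AgentUris class, the "reports" scheme, and the agent names are all hypothetical and are not part of Wolverine; this is only one possible naming convention.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only, not a Wolverine type
public static class AgentUris
{
    // Assumed scheme name shared by every agent in this made-up family
    public const string Scheme = "reports";

    // Builds a Uri such as reports://daily from a simple, host-safe agent name
    public static Uri For(string agentName)
    {
        return new Uri($"{Scheme}://{agentName}");
    }

    // True when the Uri carries this family's scheme
    public static bool BelongsToFamily(Uri uri)
    {
        return string.Equals(uri.Scheme, Scheme, StringComparison.OrdinalIgnoreCase);
    }

    // True when no two agents in the sequence share the same Uri
    public static bool AllUnique(IEnumerable<Uri> uris)
    {
        var list = uris.ToList();
        return list.Distinct().Count() == list.Count;
    }
}
```

With this convention, AgentUris.For("daily") and AgentUris.For("weekly") identify two distinct agents under the same scheme, while two calls with the same name produce equal Uris and would fail the uniqueness check.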

The next service is the actual distributor. To plug into Wolverine, you need to build an implementation of this service:

/// <summary>
///     Pluggable model for managing the assignment and execution of stateful, "sticky"
///     background agents on the various nodes of a running Wolverine cluster
/// </summary>
public interface IAgentFamily
{
    /// <summary>
    ///     Uri scheme for this family of agents
    /// </summary>
    string Scheme { get; }

    /// <summary>
    ///     List of all the possible agents by their identity for this family of agents
    /// </summary>
    /// <returns></returns>
    ValueTask<IReadOnlyList<Uri>> AllKnownAgentsAsync();

    /// <summary>
    ///     Create or resolve the agent for this family
    /// </summary>
    /// <param name="uri"></param>
    /// <param name="wolverineRuntime"></param>
    /// <returns></returns>
    ValueTask<IAgent> BuildAgentAsync(Uri uri, IWolverineRuntime wolverineRuntime);

    /// <summary>
    ///     All supported agent uris by this node instance
    /// </summary>
    /// <returns></returns>
    ValueTask<IReadOnlyList<Uri>> SupportedAgentsAsync();

    /// <summary>
    ///     Assign agents to the currently running nodes when new nodes are detected or existing
    ///     nodes are deactivated
    /// </summary>
    /// <param name="assignments"></param>
    /// <returns></returns>
    ValueTask EvaluateAssignmentsAsync(AssignmentGrid assignments);
}

In this case, you can plug custom IAgentFamily strategies into Wolverine by just registering a concrete service in your DI container against that IAgentFamily interface (services.AddSingleton<IAgentFamily, MySpecialAgentFamily>();). Wolverine does a simple IServiceProvider.GetServices<IAgentFamily>() during its bootstrapping to find them.

As you can probably guess, the Scheme should be unique, and the Uri structure needs to be unique across all of your agents. EvaluateAssignmentsAsync() is your hook to create distribution strategies, with a simple “just distribute these things evenly across my cluster” strategy possible like this example from Wolverine itself:

public ValueTask EvaluateAssignmentsAsync(AssignmentGrid assignments)
{
    assignments.DistributeEvenly(Scheme);
    return ValueTask.CompletedTask;
}

If you go looking for it, the equivalent in Wolverine’s distribution of Marten projections and subscriptions is a tiny bit more complicated. It uses knowledge of node capabilities to support blue/green semantics, so work is only distributed to the servers that “know” how to use particular agents (like version 3 of a projection that doesn’t exist on “blue” nodes):

public ValueTask EvaluateAssignmentsAsync(AssignmentGrid assignments)
{
    assignments.DistributeEvenlyWithBlueGreenSemantics(SchemeName);
    return new ValueTask();
}

The AssignmentGrid tells you the current state of your application in terms of which node is the leader, what all the currently running nodes are, and which agents are running on which nodes. Beyond the even distribution, the AssignmentGrid has fine grained API methods to start, stop, or reassign individual agents to specific running nodes.
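To make the idea of "distribute these things evenly" concrete, here is a minimal, self-contained sketch of a round-robin assignment. It does not use or reproduce Wolverine's AssignmentGrid; the EvenDistributionSketch class, its Distribute method, and the use of plain integers as node numbers are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an even distribution strategy, not Wolverine's implementation
public static class EvenDistributionSketch
{
    // Maps each agent Uri to a node number so that node loads differ by at most one
    public static Dictionary<Uri, int> Distribute(
        IReadOnlyList<Uri> agents,
        IReadOnlyList<int> nodeNumbers)
    {
        if (nodeNumbers.Count == 0)
        {
            throw new ArgumentException("At least one node is required", nameof(nodeNumbers));
        }

        // Sort both inputs so the result is deterministic regardless of input order
        var orderedNodes = nodeNumbers.OrderBy(x => x).ToList();
        var orderedAgents = agents
            .OrderBy(x => x.ToString(), StringComparer.Ordinal)
            .ToList();

        var assignments = new Dictionary<Uri, int>();
        for (var i = 0; i < orderedAgents.Count; i++)
        {
            // Round-robin: agent i goes to node (i mod node count)
            assignments[orderedAgents[i]] = orderedNodes[i % orderedNodes.Count];
        }

        return assignments;
    }
}
```

This sketch ignores where agents are currently running, so it may move more agents than necessary when the set of nodes changes. A production strategy would likely take the existing assignments into account to limit that churn.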

To wrap this up, I’m trying to guess at the questions you might have and see if I can cover all the bases:

Is some kind of persistence necessary? Yes, absolutely. Wolverine has to have some way to “know” what nodes are running and which agents are really running on each node.
How does Wolverine do health checks for each node? If you look in the wolverine_nodes table when using PostgreSQL or Sql Server, you’ll see a heartbeat column with a timestamp. Each Wolverine application is running a polling operation that updates its heartbeat timestamp and also checks that there is a known leader node. In normal shutdown, Wolverine tries to gracefully mark the current node as offline and send a message to the current leader node if there is one telling the leader that the node is shutting down. In real world usage though, Kubernetes or who knows what is frequently killing processes without a clean shutdown. In that case, the leader node will be able to detect stale nodes that are offline, eject them from the node persistence, and redistribute agents.
Can Wolverine switch over the leadership role? Yes, and that should be relatively quick. Wolverine also keeps trying to start a leader election if no leader is found. Still, it’s an imperfect world where things can go wrong, so there will 100% be the ability to either kickstart or assign the leader role from the forthcoming CritterWatch user interface.
How does the leadership election work? Crudely and relatively effectively. All of the storage mechanics today have some kind of sequential node number assignment for all newly persisted nodes. In a kind of simplified “Bully Algorithm,” Wolverine will always try to send “try assume leadership” messages to the node with the lowest sequential node number which will always be the longest running node. When a node does try to take leadership, it uses whatever kind of global, advisory lock function the current persistence uses to get sole access to write the leader node assignment to itself, but will back out if the current node detects from storage that the leadership is already running on another active node.
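The health check and election answers above can be sketched in a few lines of plain C#. This is only an in-memory approximation under stated assumptions: the NodeRecord type, the LeaderElectionSketch class, and the staleAfter threshold are hypothetical, and the sketch leaves out the parts Wolverine handles through storage and messaging (the advisory lock, the "try assume leadership" messages, and the persistence of node records).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a persisted node row with a heartbeat timestamp
public sealed record NodeRecord(int AssignedNumber, DateTimeOffset LastHeartbeat);

public static class LeaderElectionSketch
{
    // Nodes whose last heartbeat is older than the threshold are treated as stale
    public static IReadOnlyList<NodeRecord> FindStaleNodes(
        IEnumerable<NodeRecord> nodes,
        DateTimeOffset now,
        TimeSpan staleAfter)
    {
        return nodes
            .Where(n => now - n.LastHeartbeat > staleAfter)
            .ToList();
    }

    // Among nodes with a fresh heartbeat, the lowest assigned number is the candidate.
    // Returns null when no node has a fresh heartbeat.
    public static NodeRecord ChooseLeaderCandidate(
        IEnumerable<NodeRecord> nodes,
        DateTimeOffset now,
        TimeSpan staleAfter)
    {
        return nodes
            .Where(n => now - n.LastHeartbeat <= staleAfter)
            .OrderBy(n => n.AssignedNumber)
            .FirstOrDefault();
    }
}
```

In this sketch, a node that was killed without a clean shutdown simply stops updating its heartbeat, shows up in FindStaleNodes, and is skipped by ChooseLeaderCandidate, so the next lowest numbered live node becomes the candidate.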
