Modeling Rolling Deployments in Alloy and TypeScript

13th February 2019

Introduction

This post has a few goals. One is to outline how to think about a simple imperative process in a declarative language/system like Alloy. Another is to make a case for why this kind of formalization is a worthwhile exercise for software engineers by showing how formalizations can be used as guides for actual implementations.

There are 3 sections: informal definitions, formalization in Alloy, and implementation in TypeScript. If you're familiar with rolling deployments then the first section is safe to skip.

Hillel Wayne has also written about deployments in the context of TLA+, so if you want to see another angle on this I recommend taking a look.

Rolling Deployments

Folks that are familiar with rolling deployments and have dealt with them in a practical setting can safely skip this section. Everyone else should read on and make sure it makes sense, because it is the foundation for the Alloy specification/model presented in the next section.

There are a few ways to distribute software/code updates for a typical web application sitting behind a load balancer. One way is to provision a new/disjoint set of servers and reconfigure the load balancer to gracefully (or abruptly) transfer connections from the old servers to the new ones. This is called a blue/green deployment. Another way is to skip provisioning new servers entirely and roll out the new code in stages/batches to the existing servers by taking them out of the load balancer, updating them, and putting them back. This is called a rolling deployment.

Depending on various requirements and restrictions one or the other strategy might be the only viable option, but at large enough scale rolling deployments are often the only reasonable choice because the cost and time involved in spinning up new servers can be prohibitive or downright impossible (think bare metal servers or account-based limits in cloud environments). The main issue with blue/green deployments is that they require 1x slack in the system: half of our capacity must be set aside as either the green or the blue fleet, not doing any useful work behind the load balancer. Rolling deployments only need enough slack to take one or a few servers out of rotation at a time, with no idle standby fleet. There is a way to play with the blue/green ratios to reduce the required slack, but then it ends up being a hybrid of blue/green and rolling. I'm going to leave exploration of that extension as an exercise for the reader.

Most of this also holds in any environment where compute capacity can be doled out in chunks. The load balancer is just one way of distributing resources so you can also imagine rolling deployments happening in other kinds of environments and instead of a load balancer there is some other general purpose scheduler. The same ideas apply in the more generic context as long as there is some notion of being able to mark resources as out of commission.

With the informal foundation out of the way we can start thinking about how to model what seems like an imperative process in a declarative context.

Modelling Rolling Deployments in Alloy

There is a bit of an art to specification and modelling when it comes to bridging the gap between reality and formality. We want our models to tell us useful things but we also can’t model every detail so we must choose which aspects we want to focus on and decide for ourselves whether the resulting model is a good approximation of what we’re trying to understand.

In the previous section I mentioned load balancers and schedulers, but the specific thing to be concerned about is the state of each server. We really only care about what step of the recommissioning process a server is currently at. This is a deliberate modelling choice; someone else could have made another decision and gone down another path. Although the specification below is a good model, it is not the only way to specify rolling deployments.

open util/ordering[State]
 
some sig Machine { }
 
// State is defined by the aggregate state of the machines
sig State {
  // Working with old software
  old: set Machine,
  // Working with new software
  new: set Machine,
  // Being updated and not currently doing work
  undefined: set Machine
} {
  // Machines can't randomly disappear
  Machine = old + new + undefined
  // There must always be at least 1 machine in the old or new states
  // doing some useful work
  some (old + new)
  // The machines can't be in multiple states at once
  disj[old, new, undefined]
}

Notice that we place several restrictions on what comprises the aggregate state of a set of servers/machines:

// Machines can't randomly disappear
Machine = old + new + undefined
// There must always be at least 1 machine in the old or new states
// doing some useful work
some (old + new)
// The machines can't be in multiple states at once
disj[old, new, undefined]

I initially did not have these restrictions and discovered them interactively in Alloy’s graphical analyzer. It is hard to think of all the implicit assumptions we have in our mental models and Alloy is an excellent tool for uncovering those hidden assumptions.
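For readers who find code easier to scan than Alloy, the same three restrictions can be sketched as a runtime check in TypeScript. The FleetState shape and the function name here are my own invention for illustration, not part of the model above:

```typescript
// A snapshot of the fleet, mirroring the Alloy State signature.
interface FleetState {
  old: Set<string>;    // running the old software
  new: Set<string>;    // running the new software
  undef: Set<string>;  // mid-update, out of the load balancer
}

// Check the three facts from the Alloy signature against one snapshot.
function isValidState(all: Set<string>, s: FleetState): boolean {
  const union = new Set<string>();
  s.old.forEach(m => union.add(m));
  s.new.forEach(m => union.add(m));
  s.undef.forEach(m => union.add(m));
  // Machines can't randomly disappear: old + new + undefined = Machine
  let complete = union.size === all.size;
  all.forEach(m => { if (!union.has(m)) complete = false; });
  // There must always be at least 1 machine doing useful work
  const someWorking = s.old.size + s.new.size > 0;
  // The three sets must be pairwise disjoint
  const disjoint = union.size === s.old.size + s.new.size + s.undef.size;
  return complete && someWorking && disjoint;
}
```

A real deployment controller could assert an invariant like this after every step it takes.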

The next thing we need to do is specify the dynamics of rolling deployments. The way this is accomplished in most declarative languages is by having some kind of ordering between states and then specifying what counts as a valid transition from one state to the next by relating the ordered states in a way that respects the ordering. For our purposes we'll only need to worry about the current state and the next one:

pred transition(s, s': State) {
  // New machines stay where they are.
  // This essentially pins the state of new machines.
  s.new in s'.new
  // Old machines can't magically become new machines.
  no (s.old & s'.new)
  // Now we specify the mapping for what happens to machines
  // that are being updated.
  some s.undefined => {
    // If there are machines in an undefined state then they
    // must move to the new state but they don't all have to
    // move together
    some (s.undefined & s'.new)
    // We don't specify what is happening to the old machines.
    // Think about why this is ok (or not).
  } else {
    // There are no machines in an undefined state so pick
    // some from old ones and move them to an undefined state.
    some (s.old & s'.undefined)
  }
}

Just like with the specification for the aggregate state, I did not come up with this on the first try. Initially there were many low level details and the model was more restrictive than necessary; I had implicitly and inadvertently excluded concurrent operations. The important part of the specification is how set intersections are used to specify the compatibility between different states, e.g. some (s.undefined & s'.new). This condition says that on each transition at least one server must move from an undefined state to being a new/updated server. The other conditions are similar; by leveraging the set-based specification we give the Alloy analyzer enough slack to perform concurrent operations when finding models.
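The predicate also translates fairly directly into a TypeScript check. The Fleet shape and helper functions are my own illustrative names, and t plays the role of s':

```typescript
// A fleet snapshot, mirroring the Alloy State signature.
type Fleet = { old: Set<string>; new: Set<string>; undef: Set<string> };

function intersectionSize(a: Set<string>, b: Set<string>): number {
  let n = 0;
  a.forEach(x => { if (b.has(x)) n++; });
  return n;
}

function isSubset(a: Set<string>, b: Set<string>): boolean {
  let ok = true;
  a.forEach(x => { if (!b.has(x)) ok = false; });
  return ok;
}

// Mirrors the Alloy transition predicate line by line.
function isValidTransition(s: Fleet, t: Fleet): boolean {
  // New machines stay where they are: s.new in s'.new
  if (!isSubset(s.new, t.new)) return false;
  // Old machines can't magically become new: no (s.old & s'.new)
  if (intersectionSize(s.old, t.new) > 0) return false;
  if (s.undef.size > 0) {
    // Some machines being updated must finish: some (s.undefined & s'.new)
    return intersectionSize(s.undef, t.new) > 0;
  }
  // Otherwise some old machines must start updating: some (s.old & s'.undefined)
  return intersectionSize(s.old, t.undef) > 0;
}
```

Feeding each consecutive pair of states from the example trace shown below through isValidTransition should come back true.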

There is nothing left to do now other than run the specification and see what Alloy comes up with:

run {
  first.old = Machine
  last.new = Machine
  all s: State, s': s.next | transition[s, s']
} for 4 State, exactly 3 Machine

I highly recommend trying this at home and seeing what transitions Alloy comes up with. It really feels kinda magical when declarative descriptions become “animated” in Alloy’s graphical analyzer. Here’s a formatted version of one of the examples Alloy came up with:

State0 = {
  old: {M0, M1, M2}
}
State1 = {
  old: {M1}
  undefined: {M0, M2}
}
State2 = {
  new: {M2}
  undefined: {M0, M1}
}
State3 = {
  new: {M0, M1, M2}
}

Next, I'm going to use this model as a guideline for writing an implementation/simulation of rolling deployments in TypeScript. The implementation will be in the form of a discrete time simulation which can be further refined into a real implementation talking to real servers and load balancers.

Implementation in TypeScript

The first thing we need is some way to work with machines and their state (I will outline the interfaces and signatures for the various classes and leave further exploration of the code to the reader):

type MachineState = 'old' | 'new' | 'updating' | 'updated';

/**
 * This is really a state machine even though it doesn't look like one.
 * The transitions are of the following form:
 * 'old' -> 'updating' -> ... -> 'updated' -> 'new'. Each machine can
 * be in the 'updating' state for a configured amount of "time" to
 * simulate the fact that different machines can finish their
 * update process at different times because of various
 * vagaries of networked processes.
 */
class Machine {

  // 'old' is the start state for all machines.
  private state: MachineState = 'old';

  /**
   * @param updateDelay Length of "time" the machine should be updating.
   */
  constructor(
    private updateDelay: number = 2 // Default length of "time" for updating
  ) { }

  /**
   * Perform the necessary violence to update the server. If the requisite
   * amount of "time" has passed then we also update the internal state to 'updated'.
   */
  doUpdateWork() { ... }

  /**
   * Start the update process for an old server. Just transitions from 'old' to 'updating'.
   */
  startUpdating() { ... }

  /**
   * Put the server into production by updating the state from 'updated' to 'new'.
   */
  moveToProduction() { ... }

  get isUpdated() { ... }
  get isUpdating() { ... }
  get isOld() { ... }
  get isNew() { ... }
}
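The bodies above are deliberately elided; here is one plausible way to fill them in, as a sketch under my own assumptions rather than the author's original code (the MachineState union is repeated so the snippet compiles on its own):

```typescript
type MachineState = 'old' | 'new' | 'updating' | 'updated';

class Machine {
  // 'old' is the start state for all machines.
  private state: MachineState = 'old';
  // How many ticks of update work have happened so far.
  private ticksSpentUpdating = 0;

  constructor(private updateDelay: number = 2) { }

  // One tick of update work; flip to 'updated' once enough "time" passed.
  doUpdateWork() {
    if (this.state !== 'updating') return;
    this.ticksSpentUpdating++;
    if (this.ticksSpentUpdating >= this.updateDelay) {
      this.state = 'updated';
    }
  }

  // 'old' -> 'updating'
  startUpdating() {
    if (this.state === 'old') this.state = 'updating';
  }

  // 'updated' -> 'new'
  moveToProduction() {
    if (this.state === 'updated') this.state = 'new';
  }

  get isUpdated() { return this.state === 'updated'; }
  get isUpdating() { return this.state === 'updating'; }
  get isOld() { return this.state === 'old'; }
  get isNew() { return this.state === 'new'; }
}
```

Note that the guards in each method make illegal transitions (like 'old' straight to 'new') silently impossible, which is the state-machine discipline the doc comment describes.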

With the state machines for our simulated servers out of the way we can think about how the simulation driver should interact with them. I chose to split things up into two classes: one worries about the servers collectively and acts like a data access layer, and the other "drives" the simulation:

/**
 * Where we interact with the servers.
 */
class Environment {

  /**
   * @param machines List of machines in the environment. Passed in by whoever is running the simulation.
   */
  constructor(
    private machines: Machine[]
  ) { }

  get updatedServers() { ... }
  get updatingServers() { ... }
  get allServers() { ... }
  get newServers() { ... }

  /**
   * Pick some number of old servers to start the update process on them
   * @param count Number of old servers we want to concurrently start updating.
   */
  pickOldServers(count: number) { ... }

}
/**
 * The simulation driver.
 */
class Simulation {

  constructor(
    private environment: Environment, // Someone else must give us the environment
    private concurrency: number = 2 // How many machines can we mess with at a time
  ) { }

  /**
   * One tick of the simulation:
   * 1. Move updated servers to production
   * 2. Verify updating servers are making progress
   * 3. Start updating old servers if we are below concurrency limit
   * 4. Check if all servers are up to date and mark process as completed
   */
  doTick() { ... }

  /**
   * Run the simulation to completion, i.e. perform the
   * update process to convert all old servers to new servers
   * in a rolling fashion.
   */
  run() { ... }

}
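To tie the pieces together, here is a compact, self-contained sketch of the whole loop. It condenses the design above by inlining the Environment layer and exposing machine state directly; treat it as one possible reading, not the author's code:

```typescript
type Phase = 'old' | 'updating' | 'updated' | 'new';

class Machine {
  state: Phase = 'old';
  private ticks = 0;
  constructor(private updateDelay: number = 2) { }
  // One tick of update work; flips to 'updated' once enough "time" passed.
  doUpdateWork() {
    if (this.state === 'updating' && ++this.ticks >= this.updateDelay) {
      this.state = 'updated';
    }
  }
}

class Simulation {
  constructor(private machines: Machine[], private concurrency: number = 2) { }

  private inState(p: Phase) { return this.machines.filter(m => m.state === p); }

  // One tick, following the four steps from the doc comment above;
  // returns true once every machine runs the new software.
  doTick(): boolean {
    // 1. Move updated servers to production.
    this.inState('updated').forEach(m => { m.state = 'new'; });
    // 2. Let updating servers make progress.
    this.inState('updating').forEach(m => m.doUpdateWork());
    // 3. Start updating old servers up to the concurrency limit.
    const room = this.concurrency - this.inState('updating').length;
    this.inState('old').slice(0, Math.max(room, 0))
      .forEach(m => { m.state = 'updating'; });
    // 4. Check whether the rollout is complete.
    return this.inState('new').length === this.machines.length;
  }

  // Run to completion and report how many ticks it took.
  run(): number {
    let ticks = 0;
    while (!this.doTick()) ticks++;
    return ticks;
  }
}
```

At no point does a tick have more than `concurrency` machines out of rotation, which mirrors the slack argument from the first section and the set-based slack in the Alloy transition predicate.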

Note that this is not a statistical simulation; it is closer to a skeleton that can potentially be extended into a full-fledged deployment controller. That refinement process will involve making further choices and extensions/rewrites of the "implementation".

Final Words

Hopefully this shows some of the thinking process involved in specifying models in a real-world context and how formal methods can help with everyday tasks. Formal specification doesn't have to be an academic exercise; it can be fruitfully applied to more pedestrian processes in day-to-day software engineering.

For more examples of this kind of thing in day-to-day software engineering I recommend checking out more of Hillel's posts: Proving Games are Winnable with Alloy, Formally Specifying a Package Manager, Formally Specifying UIs, Modelling Message Queues in TLA+.
