Validating a Coordination Protocol with Alloy

20th February 2019

Introduction

One good way to use model checkers like the Alloy analyzer is to verify that a set of requirements is logically consistent. If the Alloy analyzer can find a model, then the requirements are consistent and we can be reasonably sure we are not trying to do something nonsensical.

I recently outlined a coordination protocol and gave some informal reasons why it worked as I expected it would work. The trick with informal reasoning though is that it's very easy to accidentally introduce inconsistencies. One way to increase our confidence in the correctness of the protocol is to model it with Alloy.

Modeling Temporal Properties with Alloy

Alloy uses the language of first order relational logic, which is not the first thing most people reach for when modelling algorithms with temporal properties. Fortunately, first order relational logic is expressive enough, and Alloy gives us several abstractions, like ordered sets, to model the passage of time.

Reminder of what the protocol looked like

  1. Initialize the appendable data structure A
  2. Append to it a unique identifier I
  3. Poll until A has a specific length/size L
  4. While polling make sure I is still in A and re-add if it is not
  5. Once A reaches length/size L continue with the rest of the work

The standard interpretation of the protocol is about a set of interacting network processes. The appendable data structure is accessible over the network by HTTP (for example) and other nodes talk to that server and evolve the state/data structure.

Temporality

There are a few ways to model the protocol but I'm going to use an evented model. By an evented model I mean that the evolution of the state in the model will be driven by a set of events at a point in time and their effects on the state at a subsequent point in time.

// We will be dealing with sequences of events and states indexed by time
open util/ordering[Time]

// Events and states will be ordered/indexed by Time
some sig Time { }

// An event happens at a point in time and is related to some nodes
abstract sig Event {
  nodes: Node set -> Time
}

// There are 2 events sent/created by nodes: Reset, Register
some sig Register extends Event { }
some sig Reset extends Event { } {
  // Reset events can't happen at the last moment
  no nodes.last
}

I've chosen to model the interactive aspects of the protocol with two event types: Register and Reset. At any point in time some set of nodes (Node set -> Time) can send a Reset event to initialize the data structure, and at some later point in time the same or other nodes can send Register events to add their unique identifiers to the data structure. Since the unique identifiers are abstract, we'll just identify them with the nodes themselves when modelling the process/data structure that responds to initialization and registration queries.

Notice that the initialization/reset event has a restriction saying it can't happen at the very last moment. This is not inherent to the protocol; it's just there to make sure we find some non-trivial models. The protocol is technically unbounded in time, but we can only work with finite time in Alloy, so this restriction guarantees that at least something happens before we run out of time. You can think of the finiteness of time in Alloy as an implicit timeout condition. There are also a few other restrictions that potentially make this one redundant, but that's not obvious, so keeping it at this stage of specification doesn't hurt (we can always remove it later when we spell out more details).

State Evolution

With the query/event/time components specified we can start specifying the processes/elements/actors that participate in the protocol.

some sig Node { } {
  // Each node is part of exactly one reset event
  one (Reset.nodes[this])
  // Register and reset events don't overlap
  no (Reset.nodes[this] & Register.nodes[this])
  // All register events must happen after the reset event
  Register.nodes[this] in Reset.nodes[this].nexts
}

// The state consists of a set of registered nodes at some point in time
some sig State {
  time: Time,
  registrants: Node set -> time
}

// We have 1-1 mapping between State and Time
fact {
  time in State one -> one Time
}

The important parts of the snippet above concern the nodes and how they relate to events. Each node can only send out a reset/initialization event at a single point in time (one (Reset.nodes[this])), registration and reset events happen at different points in time (no (Reset.nodes[this] & Register.nodes[this])), and all registration events happen after the reset event because that's what we have in our protocol (Register.nodes[this] in Reset.nodes[this].nexts).

That last restriction is where we start using the ordering abstractions that Alloy gives us with util/ordering[Time]. For a specific point in time t we can ask for all subsequent points in time with t.nexts. So in plain English, the last restriction says that the time points of registration events form an arbitrary subset of the time points that follow the reset event. If the reset event happened at time 1 and we had 10 time points total, then registration events would happen at some arbitrary subset of times 2 through 10.
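To build intuition for these ordering helpers, here is a small sanity check (my own illustrative addition, not part of the protocol model): for any time t, its predecessors, t itself, and its successors together cover the whole timeline.

```alloy
// Illustrative sanity check (not part of the protocol model):
// prevs, the point itself, and nexts together cover the timeline.
check {
  all t: Time | t.prevs + t + t.nexts = Time
} for exactly 2 Node, 3 Time, 3 State, 3 Event
```

Since util/ordering imposes a total order, Alloy should find no counterexample to this at any scope.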

We didn't specify that there must be a registration event: it could be that a node sends a reset event and never gets around to registering. When specifying models it's always better to impose as few restrictions as possible. With this model we can also explore potential failure scenarios where some nodes never register. Adding a restriction that forces every node to register would rule out those failure scenarios, and it's not clear that would be justified, so we err on the side of minimalism and leave room for failure.

Temporal State Evolution

So far our model is static: there are no restrictions on what can happen at different points in time other than what we have already spelled out. To fill in the relevant details of the protocol we need to specify what happens at adjacent points in time. The standard way to do that is to specify a transition relation/predicate.

pred transition(t, t': Time) {
  some (Reset.nodes.t) => {
    // Reset event clears the registrants at the next point in time
    no State.registrants.t'
  } else {
    // Otherwise there are no reset events at this time so register the nodes
    State.registrants.t' = State.registrants.t + Register.nodes.t
  }
}

In plain English the transition predicate says the following: if at time t there are some reset events, then at the next time point t' we wipe all the nodes that had registered; if there are no reset events at time t, then at t' the set of registered nodes must contain what it had before plus the nodes that registered at time t.
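As a small convenience (my addition, not something from the original write-up), the adjacency quantifier that the commands below repeat can be packaged into a single predicate. It's equivalent because last has no successor, so quantifying over Time - last covers exactly the adjacent pairs.

```alloy
// Optional convenience predicate (my sketch): a full trace is one
// where every adjacent pair of time points respects the transition.
pred trace {
  all t: Time - last | transition[t, t.next]
}
```

A run or check command could then just say trace instead of quantifying over t and t' each time.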

Consistency with Intuitions

With the above we can verify that our intended/standard interpretation lines up with models that Alloy finds.

run {
  // Initially there are no registered nodes
  no State.registrants.first
  // At the last moment we must have all the nodes
  Node in State.registrants.last
  // All intermediate points must respect the transition relation
  all t: Time, t': t.next | transition[t, t']
} for exactly 2 Node, 3 Time, 3 State, 3 Event

The models that Alloy finds are mostly permutations of the "same" model because we are working in such a small scope. What we can conclude from this is that our requirements are consistent and there is at least one sequence of events that validates our intuitions.
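Since the Time scope acts as the implicit timeout, one cheap extra experiment (a variant I'm adding here, not from the original post) is to rerun the same command with a longer timeline and an extra node; if the requirements are truly consistent, Alloy should still find models at the larger scope.

```alloy
// Sketch: the same consistency run at a larger scope. The extra
// Time atoms lengthen the implicit timeout; the constraints are
// unchanged.
run {
  no State.registrants.first
  Node in State.registrants.last
  all t: Time, t': t.next | transition[t, t']
} for exactly 3 Node, 6 Time, 6 State, 6 Event
```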

It's also worthwhile to explore failure modes, and the only one I can think of is failing to have all the nodes registered by the last moment in time.

fact {
  no State.registrants.first
  all t: Time, t': t.next | transition[t, t']
}

check {
  Node in State.registrants.last
}

Alloy finds a counterexample: a single node sends a reset event and then never sends a registration event. This again lines up with our intuitions. We could rule out this counterexample by forcing every node to register at least once. I highly recommend exploring the resulting model, because there will still be failure modes depending on the order in which nodes send reset and registration events.
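That suggested restriction can be sketched as a single extra fact (my sketch, not code from the original model). Note that it only removes the never-registering counterexample; the ordering-related failure modes mentioned above remain to be explored.

```alloy
// Sketch of the suggested restriction: every node must appear in at
// least one Register event.
fact {
  all n: Node | some Register.nodes[n]
}
```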

Conclusion

I'm now reasonably sure that the protocol as it stands is workable. Alloy validated our informal intuitions and also allowed us to explore various failure modes in terms of sequencing reset and registration events. I'm going to leave adding further restrictions to rule out the obvious counterexamples as exercises for the reader.

The code (slightly cleaned up and improved) can be found over at GitHub.
