This is a small update and tutorial on something I've been working on over the past few weekends to make cloud management better. There are a few pain points I'm trying to address, and my solution is to treat the cloud as a Prolog database. Infrastructure as code is nice if that's your starting point and you already have "code", but most people start without the code and then hit a wall because clicking around in a GUI stops scaling. Even those who do start with "code" often find that some failure along the way invalidates an assumption or invariant, and the tool used for the code part is missing a feature that would allow restoring the violated invariant, or makes doing so much more inconvenient than it needs to be. My hypothesis is that Prolog eliminates this entire class of problems because it is Turing complete and truly declarative.
If you read the previous article in this series then this will mostly be review, but if you didn't, here is a short summary.
The cloud can be viewed as a graph database. The various resources are nodes in the graph and the edges describe attributes and dependencies on other resources: for example, an EC2 instance lives in a VPC, has associated security groups that may reference other security groups, has an IP address, sits behind a load balancer, and serves traffic on some port. Putting all these facts together gives us a kind of attributed graph, and if we express these facts in Prolog then we can write predicates to query and operate on it.
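To make this concrete, here is a sketch of what such facts might look like. The predicate names and identifiers are illustrative assumptions on my part, not the exact output of the current code:

```prolog
% Hypothetical facts describing a small slice of one region.
% All identifiers and predicate names are illustrative.
ec2("i-0abc123").
vpc("vpc-11aa22").
security_group("sg-33cc44").

% Edges: attributes of and dependencies between resources.
link("i-0abc123", vpc, "vpc-11aa22").
link("i-0abc123", security_group, "sg-33cc44").
link("i-0abc123", tag, "type", "web").
link("sg-33cc44", name, "web").
```

The nodes are unary facts and the edges are `link/3` and `link/4` facts, which is all a query predicate needs to walk the graph.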
If you're familiar with AWS then you can probably imagine a few use cases for such a database. The most frequent issues I deal with involve security groups and policies around snapshot management. So one thing I can do with a database like this is write a predicate that enforces an association between EC2 instances and security groups, i.e. every "type" of EC2 instance must be tagged with a type tag and must have a specific security group associated with it. You could certainly write a script for this use case, or just force everyone to use a specific workflow that enforces the "invariant". I have tried both approaches enough times to know that neither scales. Forcing people into a specific workflow curtails their choices if they don't understand the tools the workflow relies on (not everyone knows, or needs to know, how Terraform modules work), and it makes people think that infrastructure and operations are someone else's problem. By decoupling the policies from personal workflows as much as possible, we let people use whatever works best for them and, through proper tooling, provide the feedback they need to learn why they should change their workflows without being authoritarian about it.
The policy described in the previous paragraph is just a few lines of Prolog (this is an aspirational example, so don't expect it to work with the current code just yet):
valid_sg_association(E) :-
    ec2(E),
    link(E, tag, "type", V),
    security_group(S),
    link(S, name, V),
    link(E, security_group, S).
The above predicate says that an EC2 instance E has a valid security group association if it has a tag with key "type" and the value V of that tag is the name of a security group that is associated with it. So by iterating through the set of EC2 instances we can find all the instances that satisfy the policy and then take the complement to get all the instances that violate it. With a bit more tooling we could automatically tell people their instance is in violation of the above policy by pinging them on Slack or sending them an email with instructions on how to rectify the situation.
Ideally this tooling would also be written in Prolog, but it doesn't have to be.
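Taking the complement is just negation as failure. A sketch, assuming the policy predicate above and the same hypothetical `ec2/1` facts:

```prolog
% An instance violates the policy if it is an EC2 instance
% for which valid_sg_association/1 fails (negation as failure).
violating_instance(E) :-
    ec2(E),
    \+ valid_sg_association(E).

% Collect every violator in one query:
% ?- findall(E, violating_instance(E), Violators).
```

The `findall/3` query is what the notification tooling would consume before pinging anyone.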
The predicates/policies can be as complicated or as simple as we want because we are working in a Turing complete, declarative language. There are no theoretical limits to what we can express, and simple properties end up being a few lines of Prolog instead of hundreds of lines of Ruby or Python. My opinion is that the effort in Prolog scales sub-linearly with the complexity of the policy whereas other languages scale super-linearly and hit a wall pretty quickly.
Having this database and being able to enforce policies isn't the end of the story. Ideally we'd treat this database as the source of truth and evolve it with "migrations" to keep things synchronized. I'm still working out how best to do this. Currently there is code to take the graph from the cloud provider and turn the resources into a set of Prolog files that can be loaded into any compliant interpreter (I've been using SWI-Prolog). The next milestone will be figuring out how to integrate it with an imperative component that will take a "migration" and effect the remote state. I have a few ideas around this as well. The current plan is to have a sub-process that takes action plans from the Prolog process, executes them, and then reports back the results to the main Prolog process so the database can be updated with the relevant results.
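One way the imperative side could look, purely as a sketch of the plan described above: the action term `attach_sg/2`, the predicate `run_plan/2`, and the choice of the AWS CLI as the executor are all invented for illustration.

```prolog
:- use_module(library(process)).

% Hypothetical executor for one kind of action plan: attach a
% security group to an instance by shelling out to the AWS CLI.
% On success the caller would assert the new link/3 fact so the
% database reflects the remote state.
run_plan(attach_sg(Instance, SG), Status) :-
    process_create(path(aws),
                   ['ec2', 'modify-instance-attribute',
                    '--instance-id', Instance,
                    '--groups', SG],
                   [process(PID)]),
    process_wait(PID, Status).
```

In the eventual design this would live in the sub-process rather than the main Prolog process, but the shape of the interface (action term in, exit status out) would be the same.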
The code I have so far lives at cwacop, which stands for "cloudy with a chance of prolog". It's a typical Ruby project, so cloning and running bundle will get you everything you need (except for a Prolog interpreter):
$ git clone https://github.com/cloudbootup/cwacop.git
$ cd cwacop && bundle
$ bundle exec ruby graphs.rb
The last line is what generates the graph, and right now it just gathers resources and facts from a few AWS regions. The resulting Prolog files are segmented by region for convenience, but nothing prevents you from loading all the regional files into the interpreter and querying multiple regions at once. Feedback is welcome, and if these ideas sound appealing to you then don't hesitate to get in touch.
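A session might then look something like this (the file names are hypothetical; they depend on which regions the generator saw):

```prolog
% Load two hypothetical regional fact files and query across both.
?- consult('us-east-1.pl'), consult('eu-west-1.pl').
true.

% Find every instance, in any loaded region, that violates
% the security group policy from earlier.
?- findall(E, (ec2(E), \+ valid_sg_association(E)), Violators).
```

Because the regional files are just sets of facts, cross-region queries fall out for free once everything is consulted into the same interpreter.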