Build Caching and Strace

28th January 2019

Introduction

If you look at a build process abstractly then it is basically a function that uses some files as inputs and creates some files as outputs. We can peek into this input/output process by invoking the build script with strace and then asking it to log all file operations. After we recover the input and output files we can retrofit a caching mechanism on top of the build process by hashing the input files and using that as a key to save the output files.

This is not a new idea: there are even build systems [1] that use this approach, invoking every build command under strace to recover the dependencies in a language-agnostic way. I'm going to outline the idea in concrete steps, hopefully with enough detail that you can apply it to your own build scripts.

Stracing

To make things more concrete I’m going to use a simple script as a stand-in for a build process:

#!/bin/bash -eu
set -o pipefail
 
# Outputs
cat input/a input/b > output/ab
cat input/a input/b input/c > output/abc

If you’re following along, your folder structure should look like this:

.
├── build.sh
├── output
└── input
    ├── a
    ├── b
    └── c

In practice the build script and folder structure won’t be so simple, but this is good enough for a demonstration.

Now if we invoke this script with strace we can monitor all file operations and “reverse engineer” what the build script is doing:

$ strace -f -s 500 -e trace=file -o build.output ./build.sh

Here -f follows child processes (such as the cat commands the script spawns), -s 500 raises the maximum printed string length so long paths are not truncated, -e trace=file limits the log to system calls that take a file name as an argument, and -o build.output writes the log to a file instead of stderr.

Your output might not look exactly like mine, but it should be close enough:

589   execve("./build.sh", ["./build.sh"], 0x7fffd6c4ac88 /* 18 vars */) = 0
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
589   openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
589   openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
# ...

The first time I did this I was surprised by how many file operations such a basic script involves. Real build scripts will have even more going on, but the calls we care about are stat and openat, because those are the system calls that tell us which files the build script reads from and writes to.
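
To get a feel for how much is going on in the log, a small helper can count how often each system call appears. This is only a sketch: the function name syscall_histogram is made up for illustration, and it assumes the log format produced by the strace invocation above (a PID column followed by the call).

```shell
#!/bin/bash
set -eu

# syscall_histogram LOG: count how often each system call appears in a strace log.
# (Hypothetical helper name; assumes the "PID  call(args) = result" layout of strace -f -o.)
syscall_histogram() {
  awk '
    {
      # The second field starts with the call name, e.g. openat(AT_FDCWD,
      name = $2
      sub(/\(.*/, "", name)
      # Skip non-call lines such as "+++ exited with 0 +++" or signal notices.
      if (name ~ /^[a-z_0-9]+$/) count[name]++
    }
    END { for (name in count) print count[name], name }
  ' "$1" | sort -rn
}

# Only run when the log from the strace step is present.
if [ -f build.output ]; then
  syscall_histogram build.output
fi
```

The calls with the highest counts are usually the library and locale lookups, which is the cruft we filter out in the next steps.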

If we look at just the stat calls, nothing interesting is going on here (mostly the shell searching the PATH for cat), although in a real build script they would probably carry some useful information:

$ grep 'stat' build.output
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("./build.sh", {st_mode=S_IFREG|0777, st_size=116, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/usr/local/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/local/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0

Now let’s look at which files the script read from (there is again some extraneous cruft, but it can easily be filtered out if desired):

$ grep 'O_RDONLY' build.output
# ...
590   openat(AT_FDCWD, "input/a", O_RDONLY) = 3
590   openat(AT_FDCWD, "input/b", O_RDONLY) = 3
# ...
591   openat(AT_FDCWD, "input/a", O_RDONLY) = 3
591   openat(AT_FDCWD, "input/b", O_RDONLY) = 3
591   openat(AT_FDCWD, "input/c", O_RDONLY) = 3

This mirrors exactly what we had in our build script: we first took 2 files to create one output and then took 3 files to create another. A real build script would be more complicated, but the gist is that looking at which files are read during the build tells us what the inputs are. So we have just recovered the input files of our build process:

input/a
input/b
input/c

Now we need to recover the output files, which is easy: instead of looking for files that were read from, we look for files that were written to:

$ grep 'O_WRONLY' build.output
590   openat(AT_FDCWD, "output/ab", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
591   openat(AT_FDCWD, "output/abc", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

This again mirrors what we had in the build script, so this gives us the output files:

output/ab
output/abc
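
The two grep steps can be folded into a small helper that prints just the paths. This is a sketch under a few assumptions: the name opened_paths and the files inputs.txt and outputs.txt are made up for illustration, only relative paths are treated as belonging to the project, failed opens are dropped, and paths containing double quotes are not handled. Files opened with O_RDWR, or touched through calls other than openat, would need extra rules.

```shell
#!/bin/bash
set -eu

# opened_paths FLAG LOG: print the unique relative paths that were successfully
# opened with FLAG (for example O_RDONLY or O_WRONLY) according to a strace log.
# (Hypothetical helper; assumes the log format shown earlier.)
opened_paths() {
  # Splitting on double quotes makes $2 the text inside the first pair of
  # quotes, which is the path argument of openat.
  # Keep a line only if it is an openat call, mentions FLAG, did not fail
  # (" = -1 "), and its path is relative (does not start with "/").
  awk -F'"' -v flag="$1" '
    /openat\(/ && index($0, flag) && $0 !~ / = -1 / && $2 !~ /^\// { print $2 }
  ' "$2" | sort -u
}

# Only run when the log from the strace step is present.
if [ -f build.output ]; then
  opened_paths O_RDONLY build.output > inputs.txt
  opened_paths O_WRONLY build.output > outputs.txt
fi
```

For the example log this should leave the three input paths in inputs.txt and the two output paths in outputs.txt, without the library and locale lookups.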

We now have everything to retrofit a cache system on top of the build process.

Caching

The first thing we need is a key for the output files, and that key should depend on the content of the input files in a non-trivial way. One way to accomplish this is to hash the contents of the input files in a deterministic order. I’m piping a tar file into shasum because in practice this seems to be the fastest way to generate a content hash that depends on various files and folders. Save the following as keygen.sh:

#!/bin/bash -eu
set -o pipefail
 
find input -type f -print0 | sort -z > inputs
tar -P --mtime='1970-01-01' --null \
  --format=ustar \
  --files-from=inputs \
  -cf - | shasum -

Running that script should give you output like the following (the exact hash depends on your input files), and we can use the hash as a key for saving the output files:

$ ./keygen.sh
6428f5771007cf005037d47c9aeac9bfcc8925f9  -

Now we can use that key to generate the cache file by compressing all the output files into an archive named after the key we just generated:

#!/bin/bash -eu
set -o pipefail
 
key="$(./keygen.sh | awk '{print $1}')"
rm -f "${key}.txz"
tar cJf "${key}.txz" output

Conclusion

And that’s it. We just retrofitted a caching system onto a build process after reverse engineering its input and output files with strace. To use this cache in a production setting you’d write another script that computes the cache key, checks whether the matching cache file exists and, if it does, skips the build and just decompresses the cache file.
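
A minimal sketch of such a wrapper could look like the following. It assumes keygen.sh and build.sh from this post sit in the current directory and that xz is installed. The function name cached_build and the CACHE_DIR variable (defaulting to .build-cache) are made up for illustration.

```shell
#!/bin/bash
set -eu

# Hypothetical cache location; override by setting CACHE_DIR.
CACHE_DIR="${CACHE_DIR:-.build-cache}"

# cached_build: restore output/ from the cache when an archive for the current
# key exists, otherwise run the real build and store its result under that key.
cached_build() {
  key="$(./keygen.sh | awk '{print $1}')"
  if [ -z "$key" ]; then
    echo "could not compute a cache key" >&2
    return 1
  fi
  archive="${CACHE_DIR}/${key}.txz"
  if [ -f "$archive" ]; then
    echo "cache hit: ${key}"
    # Start from a clean folder so stale outputs do not survive the restore.
    rm -rf output
    tar xJf "$archive"
  else
    echo "cache miss: ${key}"
    ./build.sh
    mkdir -p "$CACHE_DIR"
    # Write to a temporary name first so an interrupted run never leaves a
    # truncated archive under the final name.
    tar cJf "${archive}.tmp" output
    mv "${archive}.tmp" "$archive"
  fi
}

# Only run when both scripts from this post are present.
if [ -x ./keygen.sh ] && [ -x ./build.sh ]; then
  cached_build
fi
```

The first run reports a cache miss and performs the build. Later runs with unchanged inputs report a cache hit and only unpack the archive. Keeping the cache folder outside input matters, since otherwise the archives themselves would change the key.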

If you're interested in speeding up and streamlining your build processes then get in touch and I'm happy to help out.


1: Thanks to Joe Ardent for pointing this out.