Viewed abstractly, a build process is basically a function that takes some files as inputs and produces some files as outputs. We can peek into this input/output process by invoking the build script with strace and asking it to log all file operations. Once we've recovered the input and output files, we can retrofit a caching mechanism on top of the build process by hashing the input files and using that hash as a key under which to save the output files.
This is not a novel idea; there are even build systems[1] that use this approach, invoking all build commands under strace to recover dependencies in a language-agnostic way. I'm going to outline the idea in concrete steps, hopefully with enough detail that you can apply it to your own build scripts.
To make things more concrete, I'm going to use a simple script as a stand-in for a build process:
#!/bin/bash -eu
set -o pipefail
# Outputs
cat input/a input/b > output/ab
cat input/a input/b input/c > output/abc
If you're following along, your folder structure should look like this:
.
├── build.sh
├── output
└── input
├── a
├── b
└── c
In practice the build script and folder structure won't be this simple, but it's good enough for a demonstration.
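If you want to follow along, the whole example project can be recreated with a few commands (the single-letter file contents are my own stand-ins; anything works):

```shell
# Recreate the example layout in a scratch directory.
cd "$(mktemp -d)"
mkdir -p input output
printf 'a\n' > input/a
printf 'b\n' > input/b
printf 'c\n' > input/c

# The build script from above, written out verbatim.
cat > build.sh <<'EOF'
#!/bin/bash -eu
set -o pipefail
# Outputs
cat input/a input/b > output/ab
cat input/a input/b input/c > output/abc
EOF
chmod +x build.sh
```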
Now if we invoke this script with strace we can monitor all file operations and "reverse engineer" what the build script is doing:
$ strace -f -s 500 -e trace=file -o build.output ./build.sh
When you look at the output it might not match mine exactly, but it should be close enough:
589 execve("./build.sh", ["./build.sh"], 0x7fffd6c4ac88 /* 18 vars */) = 0
589 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
589 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
589 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3
589 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
589 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
589 openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
589 openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
589 stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
# ...
The first time I did this I was surprised by how many file operations are involved in such a basic script. In real build scripts there will be even more going on, but what we're interested in are the stat and openat operations: those are the system calls that tell us which files the build script is reading from and writing to.
If we look at just the stat calls, nothing interesting is going on here, but in a real build script they would probably carry some useful information:
$ grep 'stat' build.output
589 stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589 stat("./build.sh", {st_mode=S_IFREG|0777, st_size=116, ...}) = 0
589 stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589 stat("/usr/local/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589 stat("/usr/local/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589 stat("/usr/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589 stat("/usr/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589 stat("/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
Now let's look at which files the script was reading from (there is again some extraneous cruft, but it can easily be filtered out if desired):
$ grep 'O_RDONLY' build.output
# ...
590 openat(AT_FDCWD, "input/a", O_RDONLY) = 3
590 openat(AT_FDCWD, "input/b", O_RDONLY) = 3
# ...
591 openat(AT_FDCWD, "input/a", O_RDONLY) = 3
591 openat(AT_FDCWD, "input/b", O_RDONLY) = 3
591 openat(AT_FDCWD, "input/c", O_RDONLY) = 3
This mirrors exactly what we had in our build script: we first read two files to create one output, then read three files to create another. In a real build script things would be more complicated, but the gist is that looking at which files are read during the build tells us what the inputs are. So we have just recovered the input files to our build process:
input/a
input/b
input/c
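Turning the raw trace into a clean list like that is mechanical; a minimal sketch (the embedded trace lines here stand in for build.output, and the same pipeline works for O_WRONLY):

```shell
# Sketch: extract the unique set of read-only paths from strace output.
# The here-variable stands in for build.output; use the real file in practice.
trace='590 openat(AT_FDCWD, "input/a", O_RDONLY) = 3
590 openat(AT_FDCWD, "input/b", O_RDONLY) = 3
591 openat(AT_FDCWD, "input/a", O_RDONLY) = 3
591 openat(AT_FDCWD, "input/c", O_RDONLY) = 3'

# Keep O_RDONLY lines, strip everything but the quoted path, deduplicate.
inputs="$(echo "$trace" | grep 'O_RDONLY' | sed 's/.*"\(.*\)".*/\1/' | sort -u)"
echo "$inputs"
```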
Recovering the output files is just as easy: instead of looking for files that were read from, we look for files that were written to:
$ grep 'O_WRONLY' build.output
590 openat(AT_FDCWD, "output/ab", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
591 openat(AT_FDCWD, "output/abc", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
This again mirrors what we had in the build script, giving us the output files:
output/ab
output/abc
We now have everything we need to retrofit a caching system on top of the build process.
The first thing we need is a key for the output files, and we want that key to depend on the content of the input files in a non-trivial way. One way to accomplish this is to hash the contents of the input files in a deterministic order (I'm piping a tar archive into shasum because in practice this seems to be the fastest way to generate a content hash over a set of files and folders):
#!/bin/bash -eu
set -o pipefail
find input -type f -print0 | sort -z > inputs
tar -P --mtime='1970-01-01' --null \
    --format=ustar \
    --files-from=inputs \
    -cf - | shasum -
Running that script should give you the following output, and we can use the hash as a key for saving the output files:
$ ./keygen.sh
6428f5771007cf005037d47c9aeac9bfcc8925f9 -
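It's worth sanity-checking two properties of this key: running it twice over unchanged inputs gives the same hash, and changing any input's content changes it. A minimal demo of that check (I use sha1sum in place of shasum here, with an inline copy of the keygen pipeline; this is a throwaway test, not part of the build):

```shell
# Sketch: verify the cache key is deterministic and content-sensitive.
cd "$(mktemp -d)"
mkdir input
printf 'hello' > input/a

keygen() {
    find input -type f -print0 | sort -z |
        tar -P --mtime='1970-01-01' --null --format=ustar --files-from=- -cf - |
        sha1sum | awk '{print $1}'
}

k1="$(keygen)"
k2="$(keygen)"              # unchanged inputs -> identical key
printf 'changed' > input/a
k3="$(keygen)"              # changed content -> different key
echo "$k1 $k2 $k3"
```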
Now we can generate the cache file by compressing all the output files into an archive named after the key we just generated:
#!/bin/bash -eu
set -o pipefail
key="$(./keygen.sh | awk '{print $1}')"
rm -f "${key}.txz"
tar cJf "${key}.txz" output
And that's it. We just retrofitted a caching system onto a build process after reverse engineering its input and output files with strace. To use this cache in a production setting, you'd write another script that computes the cache key, checks whether the cache file exists, and skips the build if it does by simply decompressing the cache file.
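Such a wrapper could look like the sketch below. To keep it runnable on its own, it first recreates the example project in a temp directory and uses sha1sum in place of shasum; only the if/else at the end is the wrapper itself, and in practice it would just sit next to build.sh and keygen.sh:

```shell
# Recreate the example project (setup only; skip this in a real checkout).
cd "$(mktemp -d)"
mkdir input output
printf 'aaa' > input/a; printf 'bbb' > input/b; printf 'ccc' > input/c

cat > build.sh <<'EOF'
#!/bin/bash -eu
set -o pipefail
cat input/a input/b > output/ab
cat input/a input/b input/c > output/abc
EOF

cat > keygen.sh <<'EOF'
#!/bin/bash -eu
set -o pipefail
find input -type f -print0 | sort -z > inputs
tar -P --mtime='1970-01-01' --null --format=ustar --files-from=inputs -cf - | sha1sum -
EOF
chmod +x build.sh keygen.sh

# The actual wrapper: restore outputs on a cache hit, build and save on a miss.
key="$(./keygen.sh | awk '{print $1}')"
if [ -f "${key}.txz" ]; then
    tar xJf "${key}.txz"        # cache hit: decompress saved outputs
else
    ./build.sh                  # cache miss: run the real build
    tar cJf "${key}.txz" output # and save its outputs under the key
fi
```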
If you're interested in speeding up and streamlining your build processes then get in touch and I'm happy to help out.
[1]: Thanks to Joe Ardent for pointing this out.