Tinkering: 2017-12-23: Regex Parser in C With Continuation Passing and O(1) Stack

A few days ago, I posted an article about implementing a regular expression (regex) parser using continuation passing style (CPS) in C. It was an experiment to show that one-phase parsing and matching of a regex is possible this way. The drawbacks are the somewhat obscure style of programming and that stack usage is practically out of control. In that first example, caring for stack usage was not my goal – and I still think the technique may be usable if the input regex only comes only from trusted sources, because then, a stack overflow is essentially a self-denial of service. Here's the code again:

regexp.c
regexp.h
regexp_internal.h
test.c
Makefile
gpl-3.0.txt

An interesting discussion about the idea followed on Hacker News. That discussion in turn sparked my interest to drive this further to achieve O(1) stack usage with this technique.

Program Structure

The first will be an intermediate solution where the previous code is restructured to be actually optimisable into O(1) stack. This step will neglect memory management, which will be tackled in the final solution.

To allow the compiler to use tail calls, we need to make sure that no data is live across a recursive function invocation, and that no result of a recursive call is used except in the top-level call.

Easy thing first: no data passed to a recursive call by reference (via a pointer) must be locally stored on the stack. This happens in functions like the following:

void bar(int a, cont_t cont, data_t *data)
{
    data_t data2 = { cont, data };
    foo(a, bar_cont, &data2);
}

This will prevent optimisation, because xdata2 is live across the call to foo and can only be discarded after the recursion returns, so bar's stack frame cannot be discarded before the recursion, and as a result, the C compiler cannot tail-call optimise the call.

In an intermediate solution, we will solve this by allocating a heap object instead of storing data2 on the stack. We will not deallocate anything here, so this will cause memory leaks that will be fixed later:

data_t *clone(const data_t *orig)
{
    data_t *copy = malloc(sizeof(*copy));
    *copy = *orig;
    return copy;
}

void bar(int a, cont_t cont, data_t *data)
{
    data_t data2 = { cont, data };
    foo(a, bar_cont, clone(&data2));
}

In this code, data2 is dead and can be discarded before foo is invoked, and the tail-call can be optimised.

The more complicated things that prevent tail-call optimisation need some thinking: in the first examples, there is code like this:

int foo(int a, cont_t cont, data_t *data)
{
    return
        cont(a*a, data) &&
        cont(a*a*a, data);
}

In this example, in order to invoke cont(a*a*a, data) after invoking cont(a*a, data), the value of a needs to be kept on the stack across the recursive call, and the result of that recursive call is checked to decide how to continue. So the first call cannot be transformed into a tail call. To make this possible, all code needs to immediately return from a recursion, i.e., the code must have an immediate return as follows for all invocations of continuations, without exception:

...
    return recurse(...);
...

However, to restructure this is not trivial. foo surely needs to be split and the cont(a*a*a, data) part (and the '&&') needs to be executed in a separate continuate (a separate function) after the recursion into cont(a*a, data) finishes.

To solve this, a second continuation will be passed down the recursion to be used at the very end of the recursion instead of just returning.

But first, to make this a little easier to handle, the continuation function pointer can be stored inside the data_t. data_t will be renamed to cont_t to make it clear that it is not only the continuation data. Here's a simple example rewritten in this style. It will print 5*5 (i.e., 25) and 5*5*5 (i.e., 125). main_cont will return the program's exit code, 0 or 1, and foo will return 1 immediately if the first call returns non-0.

#include <stdio.h>
#include <stdlib.h>

typedef struct cont cont_t;

typedef int (*code_t)(int, cont_t *);

struct cont {
    code_t code;
    cont_t *cont;
};

cont_t *new_cont(code_t code, cont_t *cont)
{
    cont_t *copy = malloc(sizeof(*copy));
    copy->code = code;
    copy->cont = cont;
    return copy;
}

int foo(int a, cont_t *cont)
{
    return
        cont->code(a*a, cont) ||
        cont->code(a*a*a, cont);
}

int main_cont(int a, cont_t *cont)
{
    printf("%d\n", a);
    return 0;
}

int main(void)
{
    return foo(5, new_cont(main_cont, NULL));
}

main now needs to allocate a continuation in order to pass the function pointer.

With this structure, the continuation to be run at the very end (in main_cont) can be introduced, in order to rewrite foo to return immediately after the invocation of the continuation. The new continuation will be called final. So now, two continuations will be passed down: one for the immediately next step, and one for returning after the result was processed (printed, in this case). This also requires some space for an integer in the context data to store foo's local variables, and to pass on the global result code.

#include <stdio.h>
#include <stdlib.h>

typedef struct cont cont_t;

typedef int (*code_t)(int, cont_t *, cont_t *);

struct cont {
    code_t code;
    int a;
    cont_t *cont;
    cont_t *final;
};

cont_t *new_cont(code_t code, int a, cont_t *cont, cont_t *final)
{
    cont_t *copy = malloc(sizeof(*copy));
    copy->code = code;
    copy->a = a;
    copy->cont = cont;
    copy->final = final;
    return copy;
}

int foo_final(int a, cont_t *cont, cont_t *final)
{
    if (a) { /* this implements '||' used previously */
        return a;
    }
    return final->cont->code(
        final->a * final->a * final->a,
        final->cont, final->final);
}

int foo(int a, cont_t *cont, cont_t *final)
{
    return cont->code(a*a, cont,
        new_cont(foo_final, a, cont, final));
}

int main_cont(int a, cont_t *cont, cont_t *final)
{
    printf("%d\n", a);
    return final->code(0, cont, final);
}

int main_final(int a, cont_t *cont, cont_t *final)
{
    return a;
}

int main(void)
{
    return foo(5,
        new_cont(main_cont,  0, NULL, NULL),
        new_cont(main_final, 0, NULL, NULL));
}

Now finally, every function immediately returns after invoking the continuation call, and no local data is left live across any continuation call. So now, the compiler should be able to apply tail-call optimisation in each function.

Indeed, for the above program, Gcc does that. The only call instructions left are those to printf, to malloc, and the initial call from main to foo. All other calls are converted to jmp instructions.

However, this is not the case for the more complex regexp example presented in the previous post. Although it was rewritten to use exactly the structure described above, Gcc's tail-call optimisation still does not kick in 100% of the time.

Optimisation Problems

Examining the reasons why the regexp example does not optimise as expected, I found the following problems:

Functions that are too complex are sometimes not optimised. It seems that Gcc loses overview, or optimises in other ways that stand in the way of the tail-call optimisation.
Gcc cannot optimise well if functions with different prototypes call each other, i.e., different numbers or sizes of parameters of the caller and the callee may cause the optimisation to fail.

The above toy code has no functions that are too complex to be optimised, so it works. But for the regex code, I had to split a few functions to make simpler ones. And some functions with different prototypes, I had to make into macros to enforce inlining. In the resulting code, all functions to be tail-call optimised use exactly the same prototype now.

Then I found that since I made all functions static, two other Gcc optimisations kick in and ruin my efforts:

Gcc auto-inlines static functions for heuristic reasons, and this causes functions to become to complex again for tail-call optimisation. (Also some of my manually split functions were merged again my this auto-inlining.)
Gcc changes the calling conventions to use more register parameters, i.e., it uses regparm(3) instead of regparm(0) as prescribed by the extern ABI. This means despite having the same prototype in C, functions are regarded different, and this makes the tail-call optimisation fail.

The answer to these problems was to use __attribute__((regparm(0))) for all static functions explicitly to avoid the calling convention to be changed, and to use -fno-inline to switch off auto-inline optimisation.

This achieved my goal: no call instructions to recursive functions anymore, but only jmp instructions. This code finally uses O(1) stack. But only on Gcc, with special hacks. And maybe only on the tested x86-32 architecture. And maybe only with exactly this version of Gcc.

And with memory leaks...

I also tried Clang: not a single tail-call was optimised, so this is a highly compiler specific method. In production code, surely one cannot rely of the O(1) stack with this method.

The following is the code of the regex parser/matcher of this intermediate stage.

regexp2.c
regexp.h
regexp2_internal.h
test.c
Makefile
gpl-3.0.txt

Compiler and Architecture Independent O(1) Stack

The answer to removing the dependency on compiler optimisations is relatively easy: since all functions now immediately return after their tail call, we can rewrite them to return the new function and parameters to invoke instead of invoking it and returning its result. There will be a central loop that executes those function invocations. (This was already mentioned as a way to become compiler independent in the HN discussion of the first article.)

Luckily, the cont_t type already has all the ingredients to store the function pointer and the function's parameters. So cont_t can be returned instead of int. By convention, computation will terminate when code == NULL.

#include <stdio.h>
#include <stdlib.h>

typedef struct cont cont_t;

typedef cont_t (*code_t)(int, cont_t *, cont_t *);

struct cont {
    code_t code;
    int a;
    cont_t *cont;
    cont_t *final;
};

cont_t *new_cont(code_t code, int a, cont_t *cont, cont_t *final)
{
    cont_t *copy = malloc(sizeof(*copy));
    copy->code = code;
    copy->a = a;
    copy->cont = cont;
    copy->final = final;
    return copy;
}

cont_t foo_final(int a, cont_t *cont, cont_t *final)
{
    if (a) {
        return (cont_t){ .a = a };
    }
    return (cont_t){ final->cont->code,
        final->a * final->a * final->a,
        final->cont, final->final };
}

cont_t foo(int a, cont_t *cont, cont_t *final)
{
    return (cont_t){ cont->code, a*a, cont,
        new_cont(foo_final, a, cont, final) };
}

cont_t main_cont(int a, cont_t *cont, cont_t *final)
{
    printf("%d\n", a);
    return (cont_t){ final->code, 0, cont, final };
}

cont_t main_final(int a, cont_t *cont, cont_t *final)
{
    return (cont_t){ .a = a };
}

int main(void)
{
    cont_t c = { foo, 5,
        new_cont(main_cont,  0, NULL, NULL),
        new_cont(main_final, 0, NULL, NULL) };
    while (c.code != NULL) {
        c = c.code(c.a, c.cont, c.final);
    }
    return c.a;
}

This code now has O(1) stack usage regardless of how well the compiler does tail-call optimisation, e.g., even with Clang, this is now good, because the code does not rely on Gcc, not on -fno-inline and not on __attribute__((regparm(0))) to work. I.e., the O(1) stack usage is not a proof of concept anymore, but is actually achieved using only pure C.

Admittedly, all this formal restructuring does not make the code more easy to follow, because there is more and more boilerplate overshadowing the few lines of significant code that does the squaring, the cubing, and the printing.

After making O(1) stack compiler independent, the final problem that needs tackling is the memory leaks that were introduced to get data away from the stack.

Removing Memory Leaks

Tackling the memory management is not easy in the regex code, because pointers to objects are passed on via parameters, stored in nested structs, etc. Basically, it was too complicated for me to see an easy way to sprinkle the code with a few free calls and be done. Instead, I implemented reference counting.

The idea is as follows:

new_cont returns a new object with refcnt==0,
when linking an object into another struct, this increments refcnt,
the main loop refs and unrefs the parameters to first protect and then to collect them.

That's really all that is needed, because no local objects are created and forgotten: all stuff is either a parameter, or will be stored in a cont_t object. So luckily, reference counting is easy.

The following code implements reference counting for the above example.

#include <stdio.h>
#include <stdlib.h>

typedef struct cont cont_t;

typedef cont_t (*code_t)(int, cont_t *, cont_t *);

struct cont {
    code_t code;
    int a;
    cont_t *cont;
    cont_t *final;
    int refcnt;
};

void unref(cont_t *cont)
{
    if (cont == NULL) return;
    if (--cont->refcnt == 0) {
        unref(cont->cont);
        unref(cont->final);
        free(cont);
    }
}

void ref(cont_t *cont)
{
    if (cont == NULL) return;
    cont->refcnt++;
}

cont_t *new_cont(code_t code, int a, cont_t *cont, cont_t *final)
{
    ref(cont);
    ref(final);
    cont_t *copy = malloc(sizeof(*copy));
    copy->code = code;
    copy->a = a;
    copy->cont = cont;
    copy->final = final;
    copy->refcnt = 0;
    return copy;
}

cont_t foo_final(int a, cont_t *cont, cont_t *final)
{
    if (a) {
        return (cont_t){ NULL, a, NULL, NULL };
    }
    return (cont_t){ final->cont->code,
        final->a * final->a * final->a,
        final->cont, final->final };
}

cont_t foo(int a, cont_t *cont, cont_t *final)
{
    return (cont_t){ cont->code, a*a, cont,
        new_cont(foo_final, a, cont, final) };
}

cont_t main_cont(int a, cont_t *cont, cont_t *final)
{
    printf("%d\n", a);
    return (cont_t){ final->code, 0, cont, final };
}

cont_t main_final(int a, cont_t *cont, cont_t *final)
{
    return (cont_t){ NULL, a, NULL, NULL };
}

int main(void)
{
    cont_t c = { foo, 5,
        new_cont(main_cont,  0, NULL, NULL),
        new_cont(main_final, 0, NULL, NULL) };
    cont_t c2 = { NULL, 0, NULL, NULL };
    while (c.code != NULL) {
        ref(c.cont);
        ref(c.final);
        unref(c2.cont);
        unref(c2.final);
        c2 = c;
        c = c.code(c.a, c.cont, c.final);
    }
    unref(c2.cont);
    unref(c2.final);
    return c.a;
}

In the corresponding regex parser/matcher version of this structure, I added some instrumentation to count objects and to confirm that all objects are correctly collected. Indeed, all memory leaks are gone.

Eliminating Recursive Unref()

But there is a new problem: the unref function is recursive, and, indeed, the problem if deletion is recursive, because whole trees may be deallocated at once, e.g., when an exception is thown after a longer computation – a lot of objects will then deallocated after the loop in main. I could not come up with a guarantee of O(1) stack usage in unref by just thinking a bit. Essentially, the ref/unref now destroys the O(1) stack usage feature of the code.

In a final step, O(1) is reintroduced by restructuring the unref to use a linked list of moribund objects that are freed in a loop. This requires another pointer in cont_t to point to the next unreferenced object, and it requires rewriting unref as follows.

struct cont {
    code_t code;
    int a;
    cont_t *cont;
    cont_t *final;
    int refcnt;
    cont_t *next;
};

cont_t *chain_unref(cont_t *chain, cont_t *cont)
{
    if (cont != NULL) {
        if (--cont->refcnt == 0) {
            cont->next = chain;
            chain = cont;
        }
    }
    return chain;
}

cont_t *chain_drop(cont_t *chain, cont_t *cont)
{
    chain = chain_unref(chain, cont->cont);
    chain = chain_unref(chain, cont->final);
    free(cont);
    return chain;
}

void unref(cont_t *cont)
{
    cont_t *chain = chain_unref(NULL, cont);
    while (chain != NULL) {
        chain = chain_drop(chain->next, chain);
    }
}

Now finally, this is the result. It uses continuation passing style (CPS) to implement one-phase regex parsing and matching, it has O(1) stack usage independent of the compiler optimisation by eliminating all recursion, with data on the heap, and a memory management of that data without leaks.

Here is the regex version of this. Again, it becomes longer and more obscure. Sorry! I also added some instrumentation to count the computation steps (time) and object count (space). The version regexp3b.c is the same as regexp3.c without instrumentation.

regexp3.c
regexp3b.c
regexp.h
regexp3_internal.h
test.c
Makefile
gpl-3.0.txt

Unbounded Computation

Since this code does not compile the regex into a deterministic finite automaton (DFA), it is theoretically prone to unbounded computation, e.g. in patterns like (o?)* when the matcher starts searching for the longest sequence of empty strings... This trap can be avoided for successful matches by first matching the non-empty case before trying the empty case. This works here because no particular order of matching is required here, and the code does not need to find the longest sequence, because it only matches, but does not capture.

For unsuccessful matches, however, avoiding the problem needs extra code: trying to match "()*" against "a" or "()*a" against "" will cause unbounded recursion or unbounded object creation and/or unbounded runtime. It can be fixed by prohibiting two consecutive unsuccessful zero matches for sub-patterns (see perlre manpage for an explanation). The following code was added to handle this:

    if ((d->incarn > lo) && (s == d->str)) {
        return RE_E_FAIL;
    }

And equivalently, in the regexp3 version:

    if ((f->incarn > f->lo) && (f->str == f->cont->str)) {
        return (re_cont_t){ f->final->code, d, f->final, p, s, .err = RE_E_FAIL };
    }

Again, I very much appreciate comments and bug reports! :-)