Notes
2026/03/31
The goal is to build a language that uses the same bytecode as input and output, so that the same bytecode VM can operate in stages.
A TUPLE instruction shouldn't just encode how many (logical) elements on the stack it contains (that's the bytecode-as-input perspective) but also how many stack slots the entire tuple occupies, which is the sum of the sizes of its elements and all their descendants (that's the bytecode-as-runtime-representation perspective).

A CALL instruction specifies at which stage it is called, from 0 to n. This number effectively represents a "delay": calls at stage 0 are executed immediately, while calls at stage 1 will be executed when the output bytecode of stage 0 is evaluated in the next stage.

When a stage finishes, the stage numbers of the CALL instructions in the resulting output bytecode are decremented by the minimum stage value among them.

What about CALL instructions targeting the stage currently being evaluated? This would make it possible for a stage to decide whether or not it's done, but would break the guarantee of having a fixed number of stages. It nevertheless seems like supporting an arbitrary number of (dynamic) stages would be a valuable feature, as it allows a stage to signal that some of its calls aren't done, while allowing the rest of the evaluation of that stage to continue. A stage would then have to emit a CALL instruction of stage 0 (the one currently being executed), possibly through a special DELAYED_CALL instruction.

In the surface syntax, it might make sense to invert the 0 to n semantics and to keep runtime calls unannotated, while annotating comptime stages, e.g. with @foo(...) for a comptime call that happens before runtime and @@foo(...) for a call that happens at the comptime stage before the comptime stage before runtime. (It might also make sense to have a separate "delay" annotation that goes the opposite way and marks function calls as happening in a later runtime stage.)

Since a function whose body contains calls of stage n might itself be called in stage n, we want to guarantee that staged evaluation happens bottom-up, so that functions fully evaluate any staged calls inside their bodies before they are themselves called.
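The two perspectives on TUPLE can be made concrete with a small sketch. This is not the actual encoding — the function and field names are made up, scalars are assumed to take one slot, and the TUPLE tag itself is assumed to take none — but it shows how the logical arity and the total slot count diverge as soon as tuples nest.

```python
# Hypothetical sketch of a TUPLE encoding that carries both views:
# the logical element count (bytecode-as-input) and the total number
# of stack slots occupied (bytecode-as-runtime-representation).

def slot_size(value):
    """Slots a value occupies: assume scalars take 1 slot, and a tuple
    takes the sum of the sizes of its elements and their descendants."""
    if isinstance(value, tuple):
        return sum(slot_size(v) for v in value)
    return 1

def encode_tuple(elements):
    t = tuple(elements)
    # (opcode, logical arity, total stack slots)
    return ("TUPLE", len(t), slot_size(t))

# A nested tuple has 2 logical elements but occupies 3 slots.
instr = encode_tuple((1, (2, 3)))
# instr == ("TUPLE", 2, 3)
```

With this, a consumer reading the bytecode as input can iterate over 2 elements, while the VM can pop or copy the whole value by moving 3 slots at once.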
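The stage-number-as-delay idea can likewise be sketched as a single staging pass. Everything here is hypothetical (instruction shape, calling convention): CALLs at stage 0 run immediately, and every other CALL is re-emitted into the output bytecode with its stage decremented by one, so it runs one stage later.

```python
# Minimal sketch of one staging pass over a list of CALL instructions.
# Instruction shape (assumed): ("CALL", stage, function_name, args).

def run_stage(bytecode, functions):
    """Execute stage-0 calls now; delay the rest by one stage.
    Returns (results of immediate calls, output bytecode)."""
    results, output = [], []
    for op, stage, fn, args in bytecode:
        assert op == "CALL"
        if stage == 0:
            results.append(functions[fn](*args))      # called immediately
        else:
            output.append((op, stage - 1, fn, args))  # one stage closer
    return results, output

funcs = {"square": lambda x: x * x, "add": lambda a, b: a + b}
stage0 = [("CALL", 0, "square", (3,)), ("CALL", 1, "add", (1, 2))]

res0, stage1 = run_stage(stage0, funcs)   # square runs now
res1, stage2 = run_stage(stage1, funcs)   # add runs in the next stage
# res0 == [9], res1 == [3], stage2 == []
```

A DELAYED_CALL would fit into this loop as an instruction that, even though it targets the stage currently being executed, is appended to `output` at stage 0 instead of being called, letting the current stage re-run it later.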
If function instructions are assumed to point to an offset in the (input) bytecode, we need to maintain a mapping from input bytecode offsets to (already evaluated) functions.