The previous article is here: Mainframes, Monads, and Stack Machines: Extending Siunertaq to the Batch Layer
This is the second follow-up to Dhall-to-Effect and provably safe task orchestration and Mainframes, Monads, and Stack Machines. The short version of both: we model build graphs as norm-bounded quivers, verify them with Z3/Yices before any effect fires, and orchestrate steps using JCL-inspired ADTs backed by Dhall and Spring Batch + Pekko.
The previous posts worked at the graph level: vertices, edges, the shape of dependencies. This one goes one level down β to registers.
The insight that forced this descent was noticing that Ref[IO, Array[Byte]] is a dangerous type: anyone who calls .get holds a mutable alias to the array and can change its contents without going through the Ref. We needed something with the guarantees of a GPU register file: fixed width, typed, atomically updated, never aliased. That observation turned out to connect directly to SIMD/SIMT architecture β and from there, to complex Berkovich heights and braid group theory.
Here is how it unfolded.
ποΈ The Problem with Ref[IO, Array[Byte]]
Consider the naive state representation for a Golay codeword stored in a Cats Effect Ref:
val codewordRef: Ref[IO, Array[Byte]] = ???
// Looks safeβ¦
codewordRef.get.flatMap { arr =>
arr(0) = 0xFF.toByte // β mutation outside the Ref boundary. Ref never sees this change.
IO.unit
}
This is the array-aliasing problem. Array[Byte] is mutable; Ref only protects against concurrent writes through the Ref, not against mutation of the value you extracted. The fix isn't synchronized β it's to make the inner value structurally immutable.
IArray[Byte] (Scala 3's immutable array wrapper) closes that gap. But there is a second problem: Ref[IO, IArray[Byte]] carries no information about which codeword this is, or how long it should be. The length constraint (exactly 23 bytes for a binary Golay word) has to live somewhere. In a GPU, the register width is a hardware invariant; in Scala, we encode it as an opaque type.
π opaque type as a Register Width Guarantee
opaque type GolayWord = IArray[Byte]
object GolayWord:
def apply(bytes: IArray[Byte]): Either[String, GolayWord] =
if bytes.length == 23 then Right(bytes)
else Left(s"Golay(23) requires exactly 23 bytes; got ${bytes.length}")
val zero: GolayWord = IArray.fill(23)(0.toByte)
extension (w: GolayWord)
def xor(other: GolayWord): GolayWord =
IArray.tabulate(23)(i => (w(i) ^ other(i)).toByte)
// xor is closed: result is always 23 bytes, so the invariant is preserved
Outside the GolayWord companion, callers hold an abstract token. They cannot:
- Construct a
GolayWordof the wrong length. - Call
IArraymethods directly and accidentally truncate it. - Pass it where an arbitrary
IArray[Byte]is expected without an explicit.bytescall.
This is identical to what a typed register does in hardware: the width is enforced at the boundary, and the CPU never sees a "34-bit float" just because someone miscounted.
A second opaque layer wraps the Ref itself:
opaque type GolayWordRef = Ref[IO, GolayWord]
object GolayWordRef:
def make(initial: GolayWord): IO[GolayWordRef] = Ref.of[IO, GolayWord](initial)
extension (ref: GolayWordRef)
def xorWith(other: GolayWord): IO[Unit] = ref.update(_.xor(other))
def hammingWeight: IO[Int] = ref.get.map(_.count(_ != 0))
// The raw Ref is never exposed; only algebra-preserving updates are permitted.
The invariant is now structural: because GolayWord.xor is closed (it always returns 23 bytes), every permitted update through GolayWordRef preserves the length. You cannot break the 23-byte constraint without defeating the opaque type β which requires touching the companion object source.
π SIMD Lane Counts: the Orbit Sizes
Now consider the Golay code's five weight classes: W0, W8, W12, W16, W24.
W0 β orbit size 1 (the zero codeword)
W8 β orbit size 759 (octads; Steiner S(5,8,24) blocks)
W12 β orbit size 2576 (dodecads)
W16 β orbit size 759 (complement octads)
W24 β orbit size 1 (the all-ones codeword)
Total: 4096 = 2ΒΉΒ²
Each weight class is not a single value β it is a lane group. When the Frobenius morphism acts on W8, it acts on all 759 octads simultaneously. This is SIMD in the mathematical sense: one instruction (FVRole.Frobenius), 759 data elements, all updated in lockstep.
extension (gw: GolayWeight)
def orbitSize: Int = gw match
case GolayWeight.W8 => 759 // β 759 SIMD lanes
case GolayWeight.W12 => 2576 // β 2576 SIMD lanes (the densest class)
...
The SIMT side is the Ref[IO, State] cells β one per quiver vertex. Each cell is a thread register running its own fiber under Cats Effect:
class BSDQuiverManager(
val leechRef: Ref[IO, LeechState], // Thread 0 register file
val affineRef: Ref[IO, AffineDualState], // Thread 1 register file
val padicRef: Ref[IO, PadicState], // Thread 2 register file
val selmerRef: Ref[IO, SelmerState] // Thread 3 register file
)
IO.parTraverse over the BSDArrow list is the warp: all four fibers execute the same instruction stream (executeArrow) on their own registers, concurrently. The quiver morphism is the instruction; the Ref cells are the per-thread register file.
| GPU | Siunertaq |
|---|---|
| SIMD lane |
GolayWeight.orbitSize elements per weight class |
| SIMT thread register |
Ref[IO, State] per BSDVertex |
| Warp instruction |
BSDArrow[S, T] morphism |
| Warp execution |
IO.parTraverse over arrow list |
| Thread mask (predication) |
galoisHeight semistability bound |
| Register width guarantee |
opaque type GolayWord / opaque type PhantomCarabinerRef
|
| IEEE 754 machine epsilon | MachineConstants.machineEpsilonReal = 2β»β΅Β² |
| Mantissa bits | MachineConstants.valuationDepth = 52 |
machineEpsilonReal and valuationDepth are not decorative. The qAdicEquivalent tolerance check uses machineEpsilonReal as its floor, and the p-adic valuation tower truncates at depth 52 β the point where the floating-point representation runs out. The theory is literally grounded in the hardware.
π§΅ The Braid Word Problem
BSDArrow is a phantom-typed enum:
enum BSDArrow[Src <: BSDVertex, Tgt <: BSDVertex](
val src: Src, val tgt: Tgt, val role: FVRole, val effect: IO[Unit])
A List[BSDArrow[? <: BSDVertex, ? <: BSDVertex]] is a braid word: a sequence of typed generators where each generator knows which strands it connects. Reading the list left-to-right and applying each generator is the JVM analogue of evaluating a braid group element one letter at a time.
The complication: BSDArrow has no unapply, so you cannot destructure it with a constructor pattern:
// ILLEGAL β the compiler rejects this
case BSDArrow(src, tgt, FVRole.Frobenius, _) => ...
// Error: BSDArrow cannot be used as an extractor; lacks unapply
The fix is what we call a braid pop: extract only the field you need, and let the declared type bound discharge the existential automatically.
// Pop .role β it is a concrete FVRole; no existential type involved.
a.role match
case FVRole.Frobenius => ThresholdConstraint.FrobeniusGE(a.src, a.tgt)
case FVRole.Verschiebung => ThresholdConstraint.VerschiebungLE(a.src, a.tgt)
// a.src has type ? <: BSDVertex; since BSDVertex is the upper bound,
// it is accepted wherever BSDVertex is expected β no cast, no ascription.
Alternatively, a type-test pop fully resolves the existential by naming the concrete case:
case a: BSDArrow.TensorBang =>
ThresholdConstraint.FrobeniusGE(a.src, a.tgt)
// a.src: Leech.type β a singleton type, provably <: BSDVertex
// a.tgt: AffineDual.type β same; the compiler checks this at compile time
The type-test form is used when exhaustiveness matters (e.g. BSDQuiverManager.executeArrow, which must cover all four cases). The .role match form is used when the semantic partition β Frobenius vs Verschiebung β is what matters (e.g. ThresholdProblem.fromArrows). Either way, no runtime cast is needed; the phantom types do the work.
π Lifting to the Complex Plane
The discrete GolayWeight encodes five points on the Berkovich tree. To model continuous motion along the tree β which is what the IUT Ξ-link and p-adic Frobenius actually do β we need a complex-valued position.
opaque type ComplexWeight = (Double, Double)
// re: Berkovich height h β [0, 8] (where does the point sit on the tree?)
// im: obstruction residual degree (how far off the real axis?)
This is the step from u8 (discrete weight) to c64 (complex float) in GPU register-type terminology. The width is still enforced by the opaque type; the algebra is now richer.
PhantomCarabiner holds one ComplexWeight and a β€/4β€ phase, and exposes three core operations:
complement w β¦ 6 β conj(w) self-duality; h + hΜ = 6 on β
verschiebung w β¦ w / 2 Witt-vector V for p = 2; contracts toward 0
thetaLink w β¦ βwΒ·i IUT Ξ-link; rotates by βΟ/2; period 4
thetaLink has β€β periodicity: apply it four times and you return to the start. This is exactly the Pauli group action, and it mirrors how a 90Β° rotation in a SIMD lane accumulates to a full 360Β° after four passes.
π Berry Phase: the Observable That Survives Orbit-Averaging
After a long sequence of quiver morphisms, what can you actually read out of a PhantomCarabiner?
def berryPhaseAngle: Double // arg(weight) β (βΟ, Ο]
def weightNormSq: Double // |weight|Β² = reΒ² + imΒ²
These two quantities are gauge-invariant: they do not depend on the path taken through the braid word, only on the accumulated holonomy.
-
weightNormSqis preserved bythetaLink(rotation, no scaling) and halved byverschiebungΒ². -
berryPhaseAngleadvances by exactly βΟ/2 under eachthetaLinkand is unchanged byverschiebung.
In physics terms, berryPhaseAngle is the holonomy of the adiabatic connection β the phase your state picks up when you move it around a closed loop in parameter space. In GPU terms, it is the value in the accumulator register after the warp finishes: the only thing the CPU can measure once the thread's intermediate states are gone.
PhantomCarabinerRef wraps the mutable state and exposes only the gauge-invariant projections as reads:
opaque type PhantomCarabinerRef = Ref[IO, PhantomCarabiner]
extension (ref: PhantomCarabinerRef)
def berryPhaseAngle: IO[Double] = ref.get.map(_.berryPhaseAngle)
def weightNormSq: IO[Double] = ref.get.map(_.weightNormSq)
def applyThetaLink: IO[Unit] = ref.update(_.thetaLink)
// The raw Ref is inaccessible; only algebra-preserving transitions are exposed.
π Two Heights, Two Semistability Conditions
One of the more subtle points in the codebase is that there are two height functions, used for two different purposes:
octadHeight(w) = w / 3 linear [0, 8]; Berkovich tree position
galoisHeight(n) = 8 Β· log(n) / log(24) logarithmic; GIT semistability mask
octadHeight is the per-thread register value: where is this thread on the tree? It satisfies h(w) + h(24-w) = 8 (the fan functional equation), making the Golay route self-dual.
galoisHeight is the thread mask. Orbits whose "instruction count" grows faster than log are semistability-filtered out before the SMT solver even sees them β exactly as divergent SIMT threads are predicated off before the instruction is retired. The two functions are numerically different:
galoisHeight(8) β 5.23 β representation-theoretic weight
octadHeight(8) = 2.67 β geometric height on the Berkovich tree
Conflating them (the mock galoisHeight(tau) = tau.toDouble in earlier versions of BSDQuiver.scala) is the equivalent of using the lane index as the accumulator value β a type error that the compiler cannot catch but the math will eventually surface.
π What's Next: Wiring the Register File into the Batch Layer
With PhantomCarabinerRef in place, the remaining step is to wire it into BSDQuiverManager so that berryPhaseAngle is tracked across arrow firings:
// Planned: each vertex Ref also holds a PhantomCarabiner
class BSDQuiverManager(
val leechRef: Ref[IO, (LeechState, PhantomCarabiner)],
...
):
def executeArrow[S, T](arrow: BSDArrow[S, T]): IO[Boolean] = ???
// After SMT confirms the norm bound is satisfiable:
// fire arrow.effect
// update the phantom register for tgt via thetaLink or verschiebung
// expose berryPhaseAngle for the monitoring layer
Once that is in place, the batch layer gains a continuous diagnostic signal: the Berry phase accumulated across the job graph tells you how much the evaluation has rotated away from the real axis β in other words, how much p-adic obstruction the completed steps have introduced into the downstream norm bounds.
The mlir-bridge will then map BSDArrow decompositions to MLIR Affine Dialect norm constraints, JIT-compiling the verified bounds via LLVM IR. At that point, the stack machine in core is no longer just an evaluator β it is an IR generator, and the Dhall REPL becomes a live assembler for a JIT-compiled Berkovich-tree traversal.
The codebase is at github.com/Yoshyhyrro/Siunertaq. The Lean 4 specs for PhantomCarabiner, YangBaxterBanach, Carabiner, and MachineConstants are in the repository. PRs improving proof coverage β especially closing the open sorry on complement_theta_link_comm and yang_baxter_eq β are very welcome. π
Top comments (0)