demonstrandom

Building a Minimal Computational Invariant Theory Library

Tue, 17 Mar 2026 04:00:00 GMT

Introduction

Now that we’ve investigated the basics of invariant theory, we can look at how to compute invariants with code. In this post, I’ll investigate computational invariant theory, with an emphasis on the algorithmic aspects. I’m vaguely following Derksen and Kemper’s book.

This is once again a long post, mixing theory, algorithms, and code. I used Claude to help with the code and the writing, but I also steered and reviewed heavily. Full code will be made available shortly.

Background

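We are studying the action of a group $G$ on a vector space $V$. We want to understand the structure of the ring of invariants $\mathbb{C}[V]^G$, which consists of all polynomial functions on $V$ that are invariant under the action of $G$.

That is:

$$\mathbb{C}[V]^G = \{\, f \in \mathbb{C}[V] : f(g \cdot v) = f(v) \text{ for all } g \in G,\ v \in V \,\}$$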

Last time, we established a rudimentary process for computing invariants based on properties of the group action. The process was (roughly):

Is $G$ finite? If so, we can compute the invariants by averaging over the group using the Reynolds operator.
Is $G$ infinite, but compact? If so, we can compute the invariants by integrating over the group with respect to the Haar measure (also a Reynolds operator). The Haar measure depends on the topology of $G$:
1. If $G$ is a Lie group, we can sometimes construct the Haar measure explicitly via the Maurer-Cartan form.
2. There are other cases, but we won’t be concerned with them here.
Is $G$ noncompact but reductive? If so, a Reynolds operator exists, but the actual computation uses representation theory to identify the equivariant projections directly.
Is $G$ non-reductive? Idiosyncratic methods exist for specific cases.

In this post I’ll focus on the first three cases, which are the most common and well-studied. In particular, we’ll look at how to compute invariants for finite groups, compact Lie groups, and reductive groups.

Data Structures

What are we actually computing here? As inputs to the functions, we’ll need a way to encode a group $G$ and its action on a structured set (usually a vector space $V$). As outputs, we’ll want to compute a generating set for the ring of invariants $\mathbb{C}[V]^G$. That is, a set of invariants $f_1, \dots, f_r$ such that any invariant can be expressed as a polynomial in the generators.

How should we represent these as data structures in code?

Review of Existing Libraries

The data structures will ultimately dictate the algorithms we can use, so we want to choose them carefully. How do existing libraries for computational invariant theory, such as Magma, SageMath, and Singular, represent groups, group actions, and invariants? I took a quick look (mostly not that helpful) at the documentation and source code for these libraries to get a sense of how they approach these problems.

Magma

Magma is a general computer algebra system maintained by the University of Sydney that includes functionality for computational invariant theory. The library uses a custom language, with the performance-critical algorithms implemented in C at the kernel level. The actual library is closed-source (which makes it hard to investigate the data structures).

Based on the documentation, Magma apparently uses a typing system that corresponds to algebraic categories, and the type system enforces particular mathematical structures. Unfortunately I couldn’t investigate too deeply.

SageMath

SageMath is an open-source computer algebra system with various libraries for different areas of mathematics. The chief language is Python, with performance-critical algorithms implemented in Cython or C. The architecture is “federated” in some sense, with interfaces to ~100 different external packages for different areas of mathematics, all made interoperable via a central type coercion system. It seems like SageMath is a nightmare to distribute due to the large number of dependencies, but the upside is that it has a huge variety of functionality.

The entire library is GPL, so we can look inside. For invariant theory, SageMath has a package called “invariant_theory”, which mostly focuses on the action of the special linear group $\mathrm{SL}_n(\mathbb{C})$ on homogeneous polynomials (as in the classical invariant theory post).

Under the hood, there is a “parent/element” framework (which is used to deal with the federated nature of the library). The parent object encodes the algebraic structure, and the element objects encode specific instances of that structure. So for a polynomial, the parent would be the space it lives in (for example, $\mathbb{Q}[x, y]$) and the element would be a specific polynomial (for example, $x^2 + y^2$). The parents are organized in a hierarchy of algebraic categories, and then there’s a bunch of infrastructure for managing all the types.

For invariant theory in particular, the wrapped library is Singular, so we should just look at that.

Singular

Singular looks like it’s designed for polynomial computations, especially commutative and non-commutative algebra, algebraic geometry, and singularity theory.

It also has an open-source C++ library for invariant theory. A bunch of the documentation is here.

Since Singular is designed for polynomial computations, a lot of its data structures look to be defined for those purposes. For example, a monomial is a vector of exponents $(a_1, \dots, a_n)$, and a polynomial is a list of monomials and coefficients. Rings are a sort of global context, and ideals are represented as arrays of polynomial generators (it seems like they are actually arrays of pointers to polynomials).

Singular does have some functionality for group actions and invariants (including algorithms for Gröbner bases and Hilbert series), but it’s focused on specific cases (like finite groups acting on polynomial rings) rather than a general framework for group actions.

In short, Singular looks like a cool/good library, but since we are mostly interested in group actions and invariants, it departs pretty heavily from the abstractions I think we would ideally want.

Data Structure Implementations

How ought we design our data structures? We want to be able to represent different types of groups (finite, classical, reductive) and their actions on different types of structured sets (vector spaces, affine varieties, etc). We also want to be able to represent the invariants themselves, which are usually polynomials or rational functions, as well as the relations among them, presentations, etc. Furthermore, I want to follow my minimalist, compositional style.

Polynomials

Let’s start with the invariants themselves, which are usually polynomials (or sometimes rational functions). We need a way to represent multivariate polynomials with rational coefficients. We need arithmetic operations on these polynomials (addition, multiplication, scaling), and the ability to evaluate them at specific points.

A monomial $x^\alpha = x_0^{\alpha_0} \cdots x_{n-1}^{\alpha_{n-1}}$ can be identified with its exponent tuple $\alpha = (\alpha_0, \dots, \alpha_{n-1})$. So a polynomial is just a finite linear combination of monomials with rational coefficients. Internally, we can use a dict mapping tuples to Fractions, with associated arithmetic operations:

class Poly:

    Type = dict[tuple[int, ...], Fraction]

    @staticmethod
    def add(f, g): ...
    @staticmethod
    def mul(f, g): ...
    @staticmethod
    def scale(c, f): ...
    @staticmethod
    def evaluate(f, point) -> Fraction: ...

    # -- Constructors --
    @staticmethod
    def mono(alpha, c=1): ...       
    @staticmethod
    def var(i, n_vars): ...          
    @staticmethod
    def const(value, n_vars): ...    

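    # -- Leading term operations --
    # order=None means "use Poly.grlex". grlex is defined further down the class
    # body, so it cannot be named as a default argument at this point.
    @staticmethod
    def leading_monomial(f, order=None): ...
    @staticmethod
    def leading_coefficient(f, order=None) -> Fraction: ...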

    # -- Monomial operations (for Gröbner bases) --
    @staticmethod
    def mono_divides(a, b) -> bool: ...   
    @staticmethod
    def mono_lcm(a, b): ...               
    @staticmethod
    def mono_mul(a, b): ...               
    @staticmethod
    def mono_div(a, b): ...               

    # -- Orderings --
    @staticmethod
    def grlex(alpha):
        """Graded lexicographic: total degree first, then lex."""
        return (sum(alpha), alpha)

    @staticmethod
    def elimination_order(k):
        """Orders the variables so that x_0,...,x_{k-1} are ordered before the remaining variables. For Gröbner elimination."""
        def order(alpha):
            return (alpha[:k], sum(alpha[k:]), alpha[k:])
        return order

For example, implementing the polynomial $3x_0^2 x_1 - x_1^3 + 7$ in three variables:

f = {(2, 1, 0): Fraction(3), (0, 3, 0): Fraction(-1), (0, 0, 0): Fraction(7)}

Why use Fraction instead of floats? For now, we will use exact arithmetic wherever possible (there is at least one exception, since I don’t want to implement a full computer algebra system). With Fraction, the Reynolds operator, orbit sums, and Gröbner reductions stay exact.

The method implementations aren’t shown above, but they are pretty straightforward. The notable ones are the leading term operations, which are defined with respect to a monomial ordering (we will need this for Gröbner bases).
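Since the method bodies are elided above, here is a minimal sketch of what a few of them might look like on the dict-of-Fractions representation. The free-function names (`poly_add`, `poly_mul`, `poly_evaluate`) are hypothetical stand-ins for the static methods, not the library's actual code.

```python
from fractions import Fraction

def poly_add(f, g):
    """Add two polynomials stored as {exponent tuple: Fraction}."""
    out = dict(f)
    for alpha, c in g.items():
        out[alpha] = out.get(alpha, Fraction(0)) + c
        if out[alpha] == 0:
            del out[alpha]  # drop cancelled terms to keep the dict sparse
    return out

def poly_mul(f, g):
    """Multiply termwise: exponents add, coefficients multiply."""
    out = {}
    for a, ca in f.items():
        for b, cb in g.items():
            alpha = tuple(i + j for i, j in zip(a, b))
            out[alpha] = out.get(alpha, Fraction(0)) + ca * cb
            if out[alpha] == 0:
                del out[alpha]
    return out

def poly_evaluate(f, point):
    """Evaluate f at a point with rational (or integer) coordinates."""
    total = Fraction(0)
    for alpha, c in f.items():
        term = c
        for x, e in zip(point, alpha):
            term *= Fraction(x) ** e
        total += term
    return total

def grlex(alpha):
    """Graded lexicographic key: total degree first, then lex."""
    return (sum(alpha), alpha)

def leading_monomial(f, order=grlex):
    """Largest exponent tuple of a nonzero polynomial under the given ordering."""
    return max(f, key=order)

# The example polynomial 3*x0^2*x1 - x1^3 + 7
f = {(2, 1, 0): Fraction(3), (0, 3, 0): Fraction(-1), (0, 0, 0): Fraction(7)}
print(poly_evaluate(f, (1, 2, 3)))  # 5
print(leading_monomial(f))          # (2, 1, 0)
```

Deleting zero coefficients as we go means the empty dict is the unique representation of the zero polynomial, which makes equality tests on dicts meaningful.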

Group Actions

We can represent the group as a set of generators and relations, or as a matrix group acting on $\mathbb{C}^n$ (or some other vector space $V$). The choice of representation will depend on the specific group and the context of the problem. For example, if $G$ is a finite group, we can represent it as a list of its elements or as a permutation group. If $G$ is a Lie group, we can represent it using its Lie algebra and the exponential map. So we’ll need some abstraction to ensure that we can work with different types of groups in a unified way.

Let’s assume we have some group object $G$ that can act on an object $x$. We need something along the lines of:

class Group:
    def identity(self):
        raise NotImplementedError

    def multiply(self, g, h):
        raise NotImplementedError

    def inverse(self, g):
        raise NotImplementedError

class GroupAction:
    def act(self, g, x):
        """Apply group element g to object x."""
        raise NotImplementedError

As written, this is too abstract to be useful, since different group types compute invariants in different ways.

We will look at at least a few different types of groups (tori, finite groups, classical groups, and reductive groups), and the algorithms for computing invariants differ in each case. A torus solves an integer linear system, a finite group averages over its elements using the Reynolds operator, and a classical group hard-codes generators from the First Fundamental Theorems. There is no single act method we can implement that covers all of these.

However, the downstream API should be the same regardless of group type. In all cases, we test invariance, compute generators, compute the Hilbert series, test orbits, etc. So we should organize the abstractions around the actions, with closures that each group type can fill in as needed:

@dataclass(frozen=True)
class Action:
    """Bundle of closures encoding a (group, space) invariant theory problem."""
    is_invariant:        Callable[[Poly], bool]
    invariants_of_degree: Callable[[int], list[Poly]]
    hilbert_coeffs:      Callable[[int], list[int]] | None = None
    orbit_test:          Callable | None = None
    apply_element:       Callable | None = None
    elements:            Callable | None = None
    n_vars:              int = 0

We will have each group class provide an .action(space) method that attaches each of its primitives and returns an Action. The derived API is then group-agnostic, and operates on Action objects.

What are the different closures we need to fill in for different group types? We need to be able to check if a polynomial is an invariant, find invariants of a given degree, compute the Hilbert series, test if two points are in the same orbit, apply a group element to an object, and list the group elements (if finite). Not all of these will be implemented for every group type, but if we can implement them we should.

def invariant_theory(group, space) -> Action:
    """Assemble an Action from a group and a space descriptor."""
    return group.action(space)

The idea behind this design is to encapsulate all the group-specific logic inside the group classes, and then have a uniform API for working with invariants that is independent of the group type. The Action object serves as a bridge between the group and the algorithms for computing invariants, allowing us to write algorithms that are agnostic to the specific group structure. So adding a new group type requires only implementing a class with .action(space) -> Action. Everything else (generators, Hilbert series, orbit tests, separators) should work “automatically”.
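To make the "derived API is group-agnostic" point concrete, here is a small sketch. It uses a trimmed-down `Action` with only two of the fields, and a hypothetical `toy_torus_action` constructor invented for illustration; the fallback strategy in `compute_hilbert_series` (count a basis in each degree when no closed-form closure is supplied) is an assumption about how the derived function could work.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Action:
    """Trimmed-down Action: only the two closures this sketch needs."""
    invariants_of_degree: Callable[[int], list]
    hilbert_coeffs: Optional[Callable[[int], list]] = None

def compute_hilbert_series(action: Action, max_degree: int) -> list:
    """Group-agnostic: prefer a closed-form closure, else count a basis per degree.
    Assumes invariants_of_degree returns linearly independent polynomials."""
    if action.hilbert_coeffs is not None:
        return action.hilbert_coeffs(max_degree)
    return [len(action.invariants_of_degree(d)) for d in range(max_degree + 1)]

def toy_torus_action() -> Action:
    """Hypothetical example: C^* acting on C^2 with weights (1, -1).
    The invariant monomials are (x0*x1)^k, so degree d has one iff d is even."""
    def invariants_of_degree(d):
        return [{(d // 2, d // 2): 1}] if d % 2 == 0 else []
    return Action(invariants_of_degree=invariants_of_degree)

print(compute_hilbert_series(toy_torus_action(), 6))  # [1, 0, 1, 0, 1, 0, 1]
```

Nothing in `compute_hilbert_series` knows that the group is a torus; a finite group could supply a Molien-based `hilbert_coeffs` closure and go through the same function.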

Torus Actions

A torus $T = (\mathbb{C}^*)^r$ acts on $\mathbb{C}^n$ via an integer weight matrix $W$ ($W \in \mathbb{Z}^{r \times n}$). A monomial $x^\alpha$ is invariant if and only if $W\alpha = 0$. We will go through the actual theory for a torus action below. This case is simple enough that we don’t need Reynolds operators or representation theory; we can just use integer linear algebra.

class Torus:
    def __init__(self, W: np.ndarray):
        self.W = np.asarray(W, dtype=int)
        self.rank = self.W.shape[0]
        self.n_vars = self.W.shape[1]

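    def is_invariant_monomial(self, alpha: tuple[int, ...]) -> bool:
        # x^alpha is invariant exactly when every weight pairing vanishes: W @ alpha == 0.
        # bool(...) converts numpy's np.bool_ to a plain Python bool.
        return bool(np.all(self.W @ np.array(alpha, dtype=int) == 0))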

    def hilbert_basis(self, max_degree: int = 20) -> list[tuple[int, ...]]:
        ...
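The body of `hilbert_basis` is elided, so here is one naive way the same information could be found: enumerate all exponent vectors up to a degree bound, keep those killed by the weight matrix, and discard any that split as a sum of two smaller invariant vectors. This is a brute-force sketch using plain lists instead of numpy, and the function names are made up for illustration. It is exponential in the number of variables and is not a real Hilbert basis algorithm.

```python
from itertools import product

def weight(W, alpha):
    """Compute W @ alpha for an integer weight matrix given as a list of rows."""
    return tuple(sum(w * a for w, a in zip(row, alpha)) for row in W)

def invariant_monomials(W, max_degree):
    """Nonzero exponent vectors alpha with W alpha = 0 and total degree <= max_degree."""
    n = len(W[0])
    return [alpha for alpha in product(range(max_degree + 1), repeat=n)
            if 0 < sum(alpha) <= max_degree and not any(weight(W, alpha))]

def minimal_invariants(W, max_degree):
    """Invariant exponent vectors that are not a sum of two smaller invariant ones."""
    inv = set(invariant_monomials(W, max_degree))
    def decomposable(alpha):
        return any(tuple(a - b for a, b in zip(alpha, beta)) in inv
                   for beta in inv if beta != alpha)
    return sorted(alpha for alpha in inv if not decomposable(alpha))

# C^* acting on C^2 with weights (1, -1): t.(x0, x1) = (t*x0, x1/t)
print(minimal_invariants([[1, -1]], 6))     # [(1, 1)]
# C^* acting on C^3 with weights (1, 1, -2)
print(minimal_invariants([[1, 1, -2]], 6))  # [(0, 2, 1), (1, 1, 1), (2, 0, 1)]
```

The result is only guaranteed complete up to `max_degree`; a minimal generator of higher degree would be missed.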

Finite Groups

For finite groups, we need an explicit list of matrices (one for each group element). The core operations are the Reynolds operator (average over the group), orbit sums (a particularly clean basis construction in the monomial/permutation cases), and the Molien series (Hilbert series via eigenvalues).

class FiniteGroup:
    def __init__(self, matrices: list[np.ndarray]):
        self.matrices = matrices
        self.order = len(matrices)
        self.n_vars = matrices[0].shape[0]

    # apply_to_poly(g, f), which substitutes the matrix g into f, is not shown here.
    def reynolds(self, f: Poly) -> Poly:
        # Average over the group: R(f) = (1/|G|) * sum of g.f over all g
        total: Poly = {}
        for g in self.matrices:
            total = Poly.add(total, self.apply_to_poly(g, f))
        return Poly.scale(Fraction(1, self.order), total)

    def orbit_sum(self, f: Poly) -> Poly:
        # Sum the distinct images g.f, counting each one once
        seen = set()
        total: Poly = {}
        for g in self.matrices:
            gf = self.apply_to_poly(g, f)
            key = frozenset(gf.items())
            if key not in seen:
                seen.add(key)
                total = Poly.add(total, gf)
        return total

    def molien_coeffs(self, max_d: int) -> list[int]:
        ...

For common finite groups, we can provide constructors in a group library:

def symmetric(n: int) -> FiniteGroup:
    """S_n acting on C^n by permutation matrices."""
    ...

def cyclic(n: int, dim: int = 2) -> FiniteGroup:
    """Z/nZ acting on C^dim by rotation."""
    ...

def dihedral(n: int) -> FiniteGroup:
    """D_n acting on C^2 by rotations and reflections."""
    ...
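For permutation groups, the Reynolds operator can be sketched without any matrices at all, because a permutation acts on a monomial by shuffling its exponent tuple. The following is a simplified illustration under that assumption (group elements are tuples from `itertools.permutations`, not the `np.ndarray` matrices that `FiniteGroup` stores), and `permute_poly` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import permutations

def permute_poly(perm, f):
    """Act on f by relabelling variables: x_i -> x_{perm[i]}."""
    out = {}
    for alpha, c in f.items():
        beta = [0] * len(alpha)
        for i, e in enumerate(alpha):
            beta[perm[i]] = e
        beta = tuple(beta)
        out[beta] = out.get(beta, Fraction(0)) + c
    return out

def reynolds(perms, f):
    """Average f over the group: R(f) = (1/|G|) * sum of g.f over all g."""
    total = {}
    for p in perms:
        for alpha, c in permute_poly(p, f).items():
            total[alpha] = total.get(alpha, Fraction(0)) + c
    return {alpha: c / len(perms) for alpha, c in total.items() if c != 0}

S3 = list(permutations(range(3)))
x0_squared = {(2, 0, 0): Fraction(1)}
power_sum = reynolds(S3, x0_squared)
print(power_sum)  # (x0^2 + x1^2 + x2^2) / 3
```

Averaging $x_0^2$ gives one third of the power sum $x_0^2 + x_1^2 + x_2^2$, and averaging something antisymmetric like $x_0 - x_1$ gives zero.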

Classical Groups

Classical groups ($\mathrm{O}(n)$, $\mathrm{SL}(n)$, $\mathrm{Sp}(2n)$) are infinite and continuous, so in principle the Reynolds-operator story is more complicated.

We could construct the Maurer-Cartan form, compute the Molien-Weyl integral for the Hilbert series, and then project the result onto trivial representations to extract the invariants. Alternatively, the First Fundamental Theorems of Invariant Theory (FFTs) give us explicit generators for the invariant ring, so we could just hard-code those and build the invariants as products.

def orthogonal_action(n: int, k: int) -> Action:
    """O(n) acting diagonally on k copies of C^n.
    Generators: inner products <v_i, v_j>."""
    ...

def sl_action(n: int, k: int) -> Action:
    """SL(n) acting diagonally on k copies of C^n.
    Generators: n x n bracket determinants."""
    ...

def symplectic_action(n: int, k: int) -> Action:
    """Sp(2n) acting diagonally on k copies of C^{2n}.
    Generators: symplectic pairings omega(v_i, v_j)."""
    ...

For $\mathrm{O}(n)$, the generators are inner products. For $\mathrm{SL}(n)$, the generators are determinantal brackets. For $\mathrm{Sp}(2n)$, the generators are symplectic pairings.

Each of these returns an Action (the same interface as finite groups and tori), so downstream code for Hilbert series, presentations, and orbit separation should work unchanged. In the easy cases (when the generators are algebraically independent), we can get the Hilbert series by counting monomials in the generators rather than evaluating the Molien-Weyl integral.
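As a sanity check on the $\mathrm{O}(n)$ generators, we can verify on one example that the pairwise inner products of a tuple of vectors do not change when an orthogonal matrix is applied to all of them at once. This is only a spot check with one rational rotation (entries 3/5 and 4/5, chosen so that the arithmetic stays exact), not a proof and not part of the library.

```python
from fractions import Fraction

def mat_vec(g, v):
    """Apply the matrix g (list of rows) to the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in g]

def inner(u, v):
    """Standard bilinear form <u, v> = sum of u_i * v_i."""
    return sum(a * b for a, b in zip(u, v))

def gram(vectors):
    """Matrix of all pairwise inner products <v_i, v_j>."""
    return [[inner(u, v) for v in vectors] for u in vectors]

# A rotation with rational entries: (3/5)^2 + (4/5)^2 = 1, so g is orthogonal
g = [[Fraction(3, 5), Fraction(-4, 5)],
     [Fraction(4, 5), Fraction(3, 5)]]

# k = 3 copies of C^2 (sampled at rational points)
vectors = [[Fraction(1), Fraction(2)],
           [Fraction(-3), Fraction(5)],
           [Fraction(0), Fraction(7)]]
moved = [mat_vec(g, v) for v in vectors]

print(gram(vectors) == gram(moved))  # True
```

The First Fundamental Theorem says more than this check shows: every invariant polynomial of the diagonal action is a polynomial in these Gram matrix entries.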

Reductive Groups

For more general reductive groups (of particular interest is $\mathrm{GL}(n)$ acting on matrices by conjugation), we’d need the full representation-theoretic machinery. This means decomposing the polynomial ring into irreducible representations and extracting the trivial summands.

There’s no general algorithm to handle all possible cases. For some cases (once again, $\mathrm{GL}(n)$ acting by conjugation), invariants are generated by traces of products.

This is out of scope for this post, but in principle we could implement the algorithms in the same framework as the other group types, with the Action object providing the necessary closures for testing invariance, computing generators, and so on.

Spaces

We need to encode how the group action affects the vector space the group acts on. For polynomials:

@dataclass(frozen=True)
class Space:
    n_vars: int
    apply_matrix: Callable  
    add: Callable
    scale: Callable
    zero: Callable

def polynomial_ring(n_vars: int) -> Space:
    """Standard polynomial ring C[x_0, ..., x_{n-1}]"""
    ...

Since the space is defined separately, the algorithms can be group-agnostic. The Reynolds operator and orbit sums in FiniteGroup use space.add, space.scale, and space.apply_matrix rather than calling polynomial arithmetic directly. So we can extend this library to work on new object types by implementing new space descriptors with the appropriate operations without adjusting the group classes or derived API.

Computational Tasks

Now that we have the data structures in place, we can ask what kinds of computations actually arise in invariant theory. From the last section, we have data structures for groups, group actions, and invariants. What are the key computational tasks we want to perform with these objects? That is, what should the API of our computational invariant theory library look like?

It could look something like this:

# Decision problems
def is_invariant(action: Action, f: Poly) -> bool: ...
def in_null_cone(action: Action, v: np.ndarray, max_degree: int = 6) -> bool: ...

# Construction problems
def compute_generators(action: Action, max_degree: int) -> list[Poly]: ...
def compute_hilbert_series(action: Action, max_degree: int) -> list[int]: ...

# Presentation problems (Gröbner-based)
def compute_relations(generators: list[Poly], n_vars: int) -> list[Poly]: ...
def normal_form(f: Poly, basis: list[Poly]) -> Poly: ...
def in_ideal(f: Poly, basis: list[Poly]) -> bool: ...
def eliminate(generators: list[Poly], k: int, n_vars: int) -> list[Poly]: ...

# Orbit problems
def same_orbit(action: Action, v: np.ndarray, u: np.ndarray) -> bool: ...
def find_separator(action: Action, v: np.ndarray, u: np.ndarray, max_degree: int = 6) -> Poly | None: ...

Let’s look through these in more detail.

Decision Problems

In these types of problems, we are checking an input for some property, and the output is a boolean.

Testing Whether a Polynomial Lies in $\mathbb{C}[V]^G$

Probably the most fundamental decision problem is to determine whether a given polynomial is actually invariant under the action. This is the most direct membership test for the invariant ring.

This suggests an operation of the form

def is_invariant(action: Action, f: Poly) -> bool: ...

Testing Whether an Object Is Invariant Under the Action

More generally, we may want to test if some explicitly represented object is invariant under the action.

def is_invariant(action: Action, x) -> bool: ...  # same interface, different object types via Space

Testing Whether a Point Lies in the Null Cone

We have yet to introduce the concept of the null cone, but the idea is that, given the quotient map $\pi : V \to V /\!/ G$, we want to test whether a point $v \in V$ maps to the origin in the quotient. This is equivalent to testing whether all positive-degree invariants vanish at $v$. This has important geometric implications for the quotient space.

def in_null_cone(action: Action, v: np.ndarray, max_degree: int = 6) -> bool: ...
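If a generating set of positive-degree invariants is already known, the null cone test reduces to evaluating each generator at the point. The sketch below takes that shortcut: its `in_null_cone` accepts a list of generators directly, which differs from the `Action`-based signature above, and the example (a one-dimensional torus acting on $\mathbb{C}^2$ with weights $(1, -1)$, whose invariant ring is generated by $x_0 x_1$) is chosen here for illustration.

```python
from fractions import Fraction

def evaluate(f, point):
    """Evaluate a {exponent tuple: coefficient} polynomial at a point."""
    total = Fraction(0)
    for alpha, c in f.items():
        term = Fraction(c)
        for x, e in zip(point, alpha):
            term *= Fraction(x) ** e
        total += term
    return total

def in_null_cone(generators, v):
    """v lies in the null cone iff every positive-degree generator vanishes at v.
    Assumes `generators` really does generate the invariant ring."""
    return all(evaluate(f, v) == 0 for f in generators)

# Invariants of C^* acting on C^2 with weights (1, -1) are generated by x0*x1
generators = [{(1, 1): Fraction(1)}]

print(in_null_cone(generators, (1, 0)))  # True: the point lies on a coordinate axis
print(in_null_cone(generators, (2, 3)))  # False: x0*x1 = 6 there
```

In this example the null cone is the union of the two coordinate axes, which is larger than just the origin. For a finite group the null cone is always the origin alone.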

Construction Problems

These operations attempt to compute explicit invariants or structured generating data for the invariant ring.

Computing Generators of $\mathbb{C}[V]^G$

We might want to compute a generating set for the invariant ring. This gives a finite description of all polynomial invariants and is often the starting point for further structural work.

In API terms, this means we want an operation

def compute_generators(action: Action, max_degree: int) -> list[Poly]: ...

Computing Primary and Secondary Invariants

If $G$ is a finite group acting on $V$, then the invariant ring is a finitely generated module over a polynomial subring generated by a homogeneous system of parameters (HSOP). The generators of the polynomial subring are called primary invariants, and the generators of the module are called secondary invariants. Computing these can give us a more structured understanding of the invariant ring.

def compute_primary_secondary(action: Action, max_degree: int) -> tuple[list[Poly], list[Poly]]: ...

Computing Separating Invariants

A separating set of invariants is a subset of the invariant ring that can distinguish between different orbits of the group action. That is, if we have two points $u, v \in V$, then invariants in the separating set can distinguish them whenever they have different images in the quotient. For finite groups, this is the same as distinguishing different orbits. Think of this as a “weaker” version of a generating set that is only concerned with separating orbits rather than generating the entire ring. Computing a separating set can be easier than computing a full generating set, and it is sufficient for many applications.

As an API operation:

def compute_separating_invariants(action: Action, max_degree: int) -> list[Poly]: ...

These are apparently even more important when the invariant ring has bad properties, such as being non-finitely generated, which can happen for non-reductive groups. In that case, we may not be able to compute a full generating set, but we can still compute a separating set. Not sure if/when this will show up.

Structural Computations

These operations try to understand the algebraic structure of the invariant ring once invariants have been found.

Computing the Hilbert Series

The Hilbert series is a generating function that counts the invariants by degree. It is defined as:

$$H(\mathbb{C}[V]^G, t) = \sum_{d \geq 0} \dim\big(\mathbb{C}[V]^G_d\big)\, t^d$$

where $\mathbb{C}[V]^G_d$ is the space of homogeneous invariants of degree $d$. The Hilbert series encodes important information about the invariant ring and the structure of the invariants. For example, in the cases we care about here, the Hilbert series is rational once the invariant ring is finitely generated.

For finite groups, the Hilbert series can be computed using the Molien formula:

$$H(\mathbb{C}[V]^G, t) = \frac{1}{|G|} \sum_{g \in G} \frac{1}{\det(I - t\,g)}$$

We will examine this in more detail in subsequent sections.

As an API operation:

def compute_hilbert_series(action: Action, max_degree: int) -> list[int]: ...
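For a finite matrix group, the coefficients of the Molien series can be computed exactly without finding eigenvalues. The coefficient of $t^d$ in $1/\det(I - tg)$ is the complete homogeneous symmetric polynomial $h_d$ of the eigenvalues of $g$, and Newton's identity $d\,h_d = \sum_{k=1}^{d} \operatorname{tr}(g^k)\, h_{d-k}$ recovers it from traces of powers. The sketch below uses that recursion in place of the eigenvalue approach mentioned earlier, with matrices as plain lists of rows; all function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def inverse_det_coeffs(g, max_d):
    """Coefficients h_0..h_max_d of 1/det(I - t*g), via d*h_d = sum_k tr(g^k)*h_{d-k}."""
    power_traces = []  # power_traces[k-1] = tr(g^k)
    power = g
    for _ in range(max_d):
        power_traces.append(Fraction(trace(power)))
        power = mat_mul(power, g)
    h = [Fraction(1)]
    for d in range(1, max_d + 1):
        h.append(sum(power_traces[k - 1] * h[d - k] for k in range(1, d + 1)) / d)
    return h

def molien_coeffs(matrices, max_d):
    """Average the per-element series over the group (Molien's formula)."""
    series = [inverse_det_coeffs(g, max_d) for g in matrices]
    return [int(sum(h[d] for h in series) / len(matrices)) for d in range(max_d + 1)]

def perm_matrix(p):
    """Permutation matrix sending basis vector e_j to e_{p[j]}."""
    n = len(p)
    return [[1 if p[j] == i else 0 for j in range(n)] for i in range(n)]

S3 = [perm_matrix(p) for p in permutations(range(3))]
print(molien_coeffs(S3, 6))  # [1, 1, 2, 3, 4, 5, 7]
```

The recursion stays exact as long as the matrix entries are integers or Fractions; matrices with irrational or complex entries (for example rotations by $2\pi/n$) would need a different exact number type.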

Computing Structural Properties of $\mathbb{C}[V]^G$

There are several structural properties of the invariant ring that we may want to compute. These properties tell us how complicated the ring is, how many relations we should expect, and whether the ring admits especially efficient descriptions.

To define the most common ones, we first isolate the role of a polynomial subring inside the invariant ring.

Definition 1 Let $R = \mathbb{C}[V]^G$ be a graded invariant ring. A collection of homogeneous elements

$$\theta_1, \dots, \theta_n \in R$$

is called a homogeneous system of parameters (HSOP) if $R$ is finitely generated as a module over the polynomial subring

$$A = \mathbb{C}[\theta_1, \dots, \theta_n].$$

Intuitively, an HSOP is a choice of basic algebraically independent parameters over which the whole invariant ring is finite.

Definition 2 We say that $R$ is Cohen-Macaulay if, for some HSOP

$$\theta_1, \dots, \theta_n,$$

the ring $R$ is a free module over the polynomial subring $A = \mathbb{C}[\theta_1, \dots, \theta_n]$. Equivalently, there exist homogeneous elements

$$\eta_1, \dots, \eta_m \in R$$

such that

$$R = \bigoplus_{j=1}^{m} \eta_j\, A$$

as a module over $A$.

This is important computationally because it means the invariant ring has a simple description in terms of primary and secondary invariants.

Definition 3 Assume $R$ is Cohen-Macaulay, and let

$$\bar{R} = R / (\theta_1, \dots, \theta_n)$$

be the quotient by an HSOP. Then $\bar{R}$ is a finite-dimensional graded algebra:

$$\bar{R} = \bigoplus_{d \geq 0} \bar{R}_d, \qquad \dim_{\mathbb{C}} \bar{R} < \infty.$$

Let $s$ be the largest degree for which $\bar{R}_s \neq 0$.

We say that $R$ is Gorenstein if $\bar{R}_s$ is one-dimensional, and for every $d$, multiplication

$$\bar{R}_d \times \bar{R}_{s-d} \to \bar{R}_s$$

is nondegenerate. By nondegenerate, we mean that 1. For every nonzero $a \in \bar{R}_d$, there exists some $b \in \bar{R}_{s-d}$ such that $ab \neq 0$, and 2. For every nonzero $b \in \bar{R}_{s-d}$, there exists some $a \in \bar{R}_d$ such that $ab \neq 0$.

Intuitively, this means that after we quotient out the polynomial part coming from the HSOP, the remaining finite-dimensional algebra has a strong symmetry between complementary degrees.

Definition 4 Suppose we present the invariant ring as

$$R \cong \mathbb{C}[y_1, \dots, y_r] / I$$

where the $y_i$ correspond to chosen generators of $R$. Let $n$ be the number of parameters in an HSOP for $R$. We say that $R$ is a complete intersection if the ideal of relations $I$ can be generated by

$$r - n$$

elements.

That is, once generators have been chosen, the ring is determined by as few relations as possible. This is one of the best possible situations computationally, since the presentation is controlled by a minimal number of equations.

As API operations, we might have:

def is_cohen_macaulay(invariant_ring) -> bool: ...
def is_gorenstein(invariant_ring) -> bool: ...
def is_complete_intersection(invariant_ring) -> bool: ...

Presentation Computations

These operations try to find explicit presentations of the invariant ring in terms of generators and relations.

Computing Relations Among Generators

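The simplest form of this problem is to compute the ideal of relations among a given set of generators. That is, if we have a set of generators $f_1, \dots, f_r$ for $\mathbb{C}[V]^G$, we want to find the ideal $I \subseteq \mathbb{C}[y_1, \dots, y_r]$ such that:

$$\mathbb{C}[V]^G \cong \mathbb{C}[y_1, \dots, y_r] / I$$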

This is important because it gives us a complete presentation of the invariant ring as a quotient of a polynomial ring by the ideal of relations. We can use Gröbner bases to compute this ideal, which allows us to perform various algebraic operations on the invariant ring.

In terms of the API:

def compute_relations(generators: list[Poly], n_vars: int) -> list[Poly]: ...

Computing Syzygies Among Relations

Once we have a set of relations $r_1, \dots, r_s$ among the generators, we may have syzygies among the relations themselves. We can represent these as tuples of polynomials $(h_1, \dots, h_s)$ such that:

$$h_1 r_1 + h_2 r_2 + \cdots + h_s r_s = 0$$

The trivial syzygies come from commutativity ($r_j \cdot r_i - r_i \cdot r_j = 0$). The non-trivial syzygies help show the structure in the relation ideal.

In the last post, we saw that the computations are finite (by Hilbert’s syzygy theorem), and the resolution encodes homological invariants like depth and projective dimension. Computationally, we can use Gröbner bases to compute the syzygies among the relations, which gives us a deeper understanding of the structure of the invariant ring and can help us compute further invariants.

def compute_syzygies(relations: list[Poly]) -> list[tuple[Poly, ...]]: ...  # one tuple (h_1, ..., h_s) per syzygy

Solving Ideal-Membership and Normal-Form Problems

If we have a presentation of the invariant ring in terms of generators and relations, we can use this to solve ideal-membership problems. For example, given a polynomial $f$ and an ideal $I$ generated by some relations, we can test whether $f$ belongs to $I$ by computing the normal form of $f$ with respect to a Gröbner basis for $I$. If the normal form is zero, then $f \in I$; else, $f \notin I$.

def normal_form(f: Poly, basis: list[Poly]) -> Poly: ...
def in_ideal(f: Poly, basis: list[Poly]) -> bool: ...
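A normal form can be computed with the multivariate division algorithm: repeatedly cancel the leading term of what remains against the leading term of some basis element, and move any term that cannot be cancelled into the remainder. Here is a minimal sketch using grlex on the dict-of-Fractions representation. It only decides ideal membership when `basis` is already a Gröbner basis, and the example ideal (generated by $ac - b^2$, the relation among the invariants $x^2, xy, y^2$ of $\pm 1$ acting on $\mathbb{C}^2$) is chosen here for illustration.

```python
from fractions import Fraction

def grlex(alpha):
    return (sum(alpha), alpha)

def lead(f):
    """Leading exponent tuple of a nonzero polynomial under grlex."""
    return max(f, key=grlex)

def divides(a, b):
    return all(x <= y for x, y in zip(a, b))

def sub_scaled(f, c, shift, g):
    """Return f - c * x^shift * g."""
    out = dict(f)
    for alpha, coeff in g.items():
        beta = tuple(a + s for a, s in zip(alpha, shift))
        out[beta] = out.get(beta, Fraction(0)) - c * coeff
        if out[beta] == 0:
            del out[beta]
    return out

def normal_form(f, basis):
    """Remainder of f on division by basis (canonical if basis is a Groebner basis)."""
    f, remainder = dict(f), {}
    while f:
        lm = lead(f)
        for g in basis:
            lg = lead(g)
            if divides(lg, lm):
                shift = tuple(a - b for a, b in zip(lm, lg))
                f = sub_scaled(f, f[lm] / g[lg], shift, g)
                break
        else:
            remainder[lm] = f.pop(lm)  # no leading term divides lm
    return remainder

def in_ideal(f, basis):
    return not normal_form(f, basis)

# Variables (a, b, c); the single relation a*c - b^2 is its own Groebner basis
relation = {(1, 0, 1): Fraction(1), (0, 2, 0): Fraction(-1)}
f = {(2, 0, 2): Fraction(1), (0, 4, 0): Fraction(-1)}  # a^2*c^2 - b^4
print(in_ideal(f, [relation]))                            # True
print(normal_form({(1, 0, 1): Fraction(1)}, [relation]))  # a*c reduces to b^2
```

The loop terminates because each reduction step replaces the leading monomial with strictly smaller ones, and grlex is a well-ordering.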

Quotient and Orbit Computations

These operations try to compute the geometry of the quotient space and the orbits of a group action.

Determining Whether Two Points Have the Same Image in the Quotient

Let’s say we have two points $u, v \in V$, and we want to determine whether they map to the same point in the quotient $V /\!/ G$. For finite groups, this is equivalent to asking whether there exists a group element $g \in G$ such that $g \cdot v = u$.

We can run a computation:

def same_orbit(action: Action, v: np.ndarray, u: np.ndarray) -> bool: ...
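For a finite group, the most direct orbit test is to try every group element. A minimal sketch for permutation groups, assuming group elements are given as tuples and points have exact (integer or Fraction) coordinates; with floats, the equality check would need a tolerance. The signature here takes the element list directly and is a simplification of the `Action`-based one above.

```python
from itertools import permutations

def act(perm, v):
    """Move the coordinate at position i to position perm[i]."""
    out = [None] * len(v)
    for i, x in enumerate(v):
        out[perm[i]] = x
    return tuple(out)

def same_orbit(perms, v, u):
    """True iff some group element carries v to u."""
    return any(act(p, v) == tuple(u) for p in perms)

S3 = list(permutations(range(3)))
C3 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # cyclic subgroup of rotations

print(same_orbit(S3, (1, 2, 3), (3, 1, 2)))  # True
print(same_orbit(S3, (1, 2, 3), (1, 2, 4)))  # False
print(same_orbit(C3, (1, 2, 3), (2, 1, 3)))  # False: a swap is not a rotation
```

This costs one comparison per group element, so it is only practical for small groups.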

Separating Points, Orbits, or Orbit Closures

Similarly, we may want to determine whether two points lie in different orbits, or whether their orbit closures are different. This is related to the concept of separating invariants, which can distinguish between different orbits.

def find_separator(action: Action, v: np.ndarray, u: np.ndarray, max_degree: int = 6) -> Poly | None: ...

Example Workflow

Now that we know the types of computations we want to perform, we can outline the workflow we might use to actually compute invariants for a specific group action. This will help us understand how the different computational tasks fit together in practice.

Probably the sequence looks something like this:

Encode the group action and the structured object in our data structure.
Compute Hilbert/Molien data for the action to understand the structure of the invariant ring.
Search for candidate generators for the invariant ring, using the Hilbert series to guide our search.
Compute relations by Gröbner elimination to find a presentation of the invariant ring.

Once we have the presentation, we can use it to gain understanding of the object:

Test ideal membership and compute normal forms. Given a polynomial $f$, determine whether it lies in the ideal of relations (equivalently: can it be written in terms of the generators?). The normal form gives a canonical representative.
Compute structural properties of the invariant ring, such as whether it is Cohen-Macaulay, Gorenstein, or a complete intersection.
Compute syzygies and higher-order relations among the generators.
Compute separating invariants and solve orbit-separation problems.
Compute primary and secondary invariants, and understand the module structure of the invariant ring over a polynomial subring.

More advanced computations (which we may or may not do here) might include:

Compute the “Krull dimension” of the invariant ring. This is the maximal number of algebraically independent invariants; for a finite group it equals $\dim V$. It tells you the dimension of the quotient variety $V /\!/ G$.
Compute degree bounds. For finite groups in characteristic zero, all generators appear by degree $|G|$ (Noether’s bound). The Hilbert series predicts how many generators to expect in each degree before computing them, giving a stopping criterion.

The generators and relations also determine the geometry of the quotient (its defining equations, dimension, and singularities), but that is algebraic geometry proper, and we won’t pursue it here.

An example program putting this all together might look like this:

import numpy as np
from invariants.groups.constructors import symmetric
from invariants.spaces import polynomial_ring
from invariants.action import (
    invariant_theory, compute_generators,
    compute_hilbert_series, in_null_cone, same_orbit,
)
from invariants.groebner import compute_relations

# 1. Encode the group action
G = symmetric(3)
ring = polynomial_ring(3)
action = invariant_theory(G, ring)

# 2. Hilbert series: how many invariants in each degree?
hs = compute_hilbert_series(action, 6)
# [1, 1, 2, 3, 4, 5, 7]

# 3. Find generators
generators = compute_generators(action, max_degree=3)
# [x0+x1+x2, x0^2+x1^2+x2^2, x0^3+x1^3+x2^3]

# 4. Relations among generators (Gröbner elimination)
relations = compute_relations(generators, n_vars=3)
# [] — S_3 invariant ring is freely generated

# 5. Orbit and quotient geometry
v = np.array([1.0, 2.0, 3.0])
u = np.array([3.0, 1.0, 2.0])
same_orbit(action, v, u)                    # True — same multiset
in_null_cone(action, np.array([0, 0, 0]))   # True — all invariants vanish at origin

For this particular example, the outputs look something like this:

Hilbert series: [1, 1, 2, 3, 4, 5, 7]
Generators (3): [x0+x1+x2, x0^2+x1^2+x2^2, x0^3+x1^3+x2^3]
Relations: 0 (freely generated)
Same orbit (1,2,3) ~ (3,1,2): True
In null cone (0,0,0): True

Hilbert Series

In the code above, step 2 was “compute the Hilbert series.” Before we implement anything, we should define this properly, since it will guide every computation that follows.

For any graded algebra $R = \bigoplus_{d \geq 0} R_d$ (where each $R_d$ is a finite-dimensional vector space), the Hilbert series is the generating function:

$$H(R, t) = \sum_{d \geq 0} \dim(R_d)\, t^d$$

Applied to the invariant ring , the coefficient of counts the number of linearly independent invariant polynomials of degree . This is the single most useful piece of information you can have before searching for generators: it tells you how many to expect in each degree, and when you can stop looking.

Each group type gives a different formula for the Hilbert series. For finite groups, Molien’s theorem expresses it as a sum over group elements. For tori, it reduces to counting lattice points in a cone. For compact Lie groups, it becomes an integral over the maximal torus (the Molien-Weyl formula). We will derive and implement each of these.

The Hilbert series also encodes structural information. If $k[V]^G$ is Cohen-Macaulay (which it always is for finite groups in characteristic zero, by the Hochster-Roberts theorem), then the series factors as:

$$H(k[V]^G, t) = \frac{\sum_{j} t^{e_j}}{\prod_{i=1}^{n} (1 - t^{d_i})}$$

where $d_1, \dots, d_n$ are the degrees of the primary invariants and the numerator counts secondary invariants (one term $t^{e_j}$ per secondary invariant of degree $e_j$). We will return to this decomposition in the section on primary and secondary invariants.

Tori

Now that we’ve laid out the data structures and the Hilbert series as our main bookkeeping tool, we can start implementing. We begin with the simplest case: torus actions on a vector space.

Introduction to Tori

Let’s define the torus and its action on a vector space.

Given a group $T$, we say that $T$ is a torus if it is isomorphic to $(\mathbb{C}^*)^r$ (for some integer $r \ge 1$).

An element of $T$ is therefore a tuple

$$t = (t_1, \dots, t_r), \quad t_i \in \mathbb{C}^*$$

and multiplication is componentwise.

Let $T$ act linearly on a finite-dimensional complex vector space $V$. Then we can choose a basis

$$e_1, \dots, e_m$$

of $V$ such that each basis vector is just rescaled by the action. That is, for each basis vector $e_i$ and each $t \in T$, there is some scalar $\chi_i(t) \in \mathbb{C}^*$ such that

$$t \cdot e_i = \chi_i(t)\, e_i$$

Proof: See this StackExchange post here, which shows that the action is diagonalizable.

The group law forces these scalar functions to behave multiplicatively. So

$$(st) \cdot e_i = s \cdot (t \cdot e_i)$$

implies

$$\chi_i(st) = \chi_i(s)\, \chi_i(t)$$

Thus each $\chi_i$ is a group homomorphism.

Definition 5 A group homomorphism

$$\chi : T \to \mathbb{C}^*$$

is called a character of the torus.

Integer Linear Algebra of Invariants

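So each basis vector $e_i$ comes with a character $\chi_i$, and under the right basis, the action is of the form

$$t \cdot e_i = \chi_i(t)\, e_i$$

Because

$$T \cong (\mathbb{C}^*)^r$$

every character is given by a monomial in the torus coordinates:

$$\chi_i(t) = t_1^{w_{1i}}\, t_2^{w_{2i}} \cdots t_r^{w_{ri}}$$

for some integers

$$w_{1i}, \dots, w_{ri} \in \mathbb{Z}$$

These integers are called the weights of the action. So after choosing this basis, the torus action can be written as

$$t \cdot e_i = t_1^{w_{1i}} \cdots t_r^{w_{ri}}\, e_i$$

If we write a vector

$$v = \sum_{i=1}^{m} v_i e_i$$

then

$$t \cdot v = \sum_{i=1}^{m} t_1^{w_{1i}} \cdots t_r^{w_{ri}}\, v_i e_i$$

So a torus action is encoded by a list of integer weight vectors

$$w_1, \dots, w_m \in \mathbb{Z}^r, \quad w_i = (w_{1i}, \dots, w_{ri})$$

or equivalently by an integer $r \times m$ matrix $W$ of weights.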

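Now let

$$x_1, \dots, x_m$$

denote the coordinates corresponding to the basis

$$e_1, \dots, e_m$$

Since each basis vector is scaled by a weight, the coordinates transform by

$$t \cdot x_i = t^{w_i}\, x_i$$

where

$$t^{w_i} = t_1^{w_{1i}} \cdots t_r^{w_{ri}}$$

Now take a monomial

$$x^\alpha = x_1^{\alpha_1} \cdots x_m^{\alpha_m}$$

Then the torus acts on it by

$$t \cdot x^\alpha = (t \cdot x_1)^{\alpha_1} \cdots (t \cdot x_m)^{\alpha_m}$$

Substituting in the weight formula gives

$$t \cdot x^\alpha = (t^{w_1})^{\alpha_1} \cdots (t^{w_m})^{\alpha_m}\, x^\alpha$$

So the monomial is again scaled by a single weight (the “total weight” of the monomial):

$$t \cdot x^\alpha = t^{\alpha_1 w_1 + \cdots + \alpha_m w_m}\, x^\alpha$$

If we assemble the weight vectors into a matrix

$$W = \begin{pmatrix} w_1 & w_2 & \cdots & w_m \end{pmatrix} \in \mathbb{Z}^{r \times m}$$

then the total weight can be written as

$$\alpha_1 w_1 + \cdots + \alpha_m w_m = W\alpha$$

Thus

$$t \cdot x^\alpha = t^{W\alpha}\, x^\alpha$$

It follows that the monomial is invariant if and only if its total weight is zero:

$$W\alpha = 0$$

So to find invariant monomials, we just have to solve the integer linear system

$$W\alpha = 0$$

subject to the constraint that the exponents are nonnegative integers:

$$\alpha \in \mathbb{N}^m$$

So the exponent vectors of invariant monomials form the set

$$\ker(W) \cap \mathbb{N}^m = \{\alpha \in \mathbb{N}^m : W\alpha = 0\}$$

This is a semigroup under addition, and the invariant ring is generated by the corresponding monomials.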

Implementation

What do our data structures and computational tasks look like in the case of a torus action? Let’s take an (abbreviated) look.

Invariance Test

As we saw, for a torus action, we can represent the action by an integer weight matrix $W$. The invariant monomials correspond to integer solutions of the linear system $W\alpha = 0$ with $\alpha \in \mathbb{N}^m$. A polynomial is invariant if and only if every monomial in its support passes this test.

class Torus:
    """Torus T = (C*)^r acting on C^m via weight matrix W."""

    def __init__(self, W: np.ndarray):
        self.W = np.asarray(W, dtype=int)
        self.rank = self.W.shape[0]
        self.n_vars = self.W.shape[1]

    def monomial_weight(self, alpha: tuple[int, ...]) -> np.ndarray:
        """Total weight W @ alpha of monomial x^alpha."""
        return self.W @ np.array(alpha, dtype=int)

    def is_invariant_monomial(self, alpha: tuple[int, ...]) -> bool:
        # np.all returns np.bool_; convert so the annotated return type holds
        return bool(np.all(self.monomial_weight(alpha) == 0))

    def is_invariant(self, f: Poly) -> bool:
        """A polynomial is torus-invariant iff every monomial has weight zero."""
        return all(self.is_invariant_monomial(alpha) for alpha in f)

The penultimate method checks whether a monomial is invariant by checking if its weight is zero. The last method checks whether a polynomial is invariant by checking if every monomial in its support is invariant.

Enumerating Invariants

Given the invariance test, we can enumerate all invariant monomials of a given degree by filtering over all monomials. The Hilbert series (defined above; we return to computing it from a Gröbner basis in a later section) counts how many there are in each degree:

    def invariants_of_degree(self, d: int) -> list[Poly]:
        """Basis of invariant monomials of degree d."""
        return [
            poly.mono(alpha)
            for alpha in poly.monomials_of_degree(self.n_vars, d)
            if self.is_invariant_monomial(alpha)
        ]

    def hilbert_coeffs(self, max_d: int) -> list[int]:
        """Number of invariant monomials in each degree."""
        return [len(self.invariants_of_degree(d)) for d in range(max_d + 1)]

The minimal generators of the semigroup of invariant monomials are the invariant monomials that are not products of simpler ones. These are the Hilbert basis of the semigroup:

    def hilbert_basis(self, max_degree: int = 20) -> list[tuple[int, ...]]:
        """Minimal generators of the semigroup ker(W) ∩ N^m.
        """
        all_inv = []
        for d in range(1, max_degree + 1):
            for alpha in poly.monomials_of_degree(self.n_vars, d):
                if self.is_invariant_monomial(alpha):
                    all_inv.append(alpha)

        inv_set = set(all_inv)
        generators = []
        for alpha in all_inv:
            a = np.array(alpha)
            is_reducible = any(
                np.all(a - np.array(beta) >= 0)
                and np.any(a - np.array(beta) > 0)
                and tuple(a - np.array(beta)) in inv_set
                for beta in generators
            )
            if not is_reducible:
                generators.append(alpha)
        return generators

Example

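Consider $T = \mathbb{C}^*$ acting on $\mathbb{C}^3$ with weights $(1, 1, -2)$.

So we have

$$t \cdot (x_0, x_1, x_2) = (t\, x_0,\; t\, x_1,\; t^{-2} x_2)$$

We are looking for a polynomial $f$ such that

$$f(t \cdot x) = f(x) \quad \text{for all } t \in \mathbb{C}^*$$

Given a monomial $x_0^a x_1^b x_2^c$, we thus have:

$$t \cdot (x_0^a x_1^b x_2^c) = t^{a + b - 2c}\, x_0^a x_1^b x_2^c$$

Which is invariant when $a + b - 2c = 0$, or $a + b = 2c$.

Therefore, if we set $d = a + b + c$, then we have $d = 3c$, or $c = d/3$. Thus, since $c$ is an integer, we conclude that the invariant monomials exist only in degrees divisible by 3.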

Degree 0

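One solution: $(a, b, c) = (0, 0, 0)$. So the only invariant monomial is the constant $1$.

Degrees 1, 2

$c = d/3$ is not an integer. No invariants.

Degree 3

$c = 1$, $a + b = 2$.

There are three solutions: $(2, 0, 1), (1, 1, 1), (0, 2, 1)$. Our invariant monomials are $x_0^2 x_2$, $x_0 x_1 x_2$, and $x_1^2 x_2$.

Degree 6

$c = 2$, $a + b = 4$. Now there are five solutions: $(4, 0, 2), (3, 1, 2), (2, 2, 2), (1, 3, 2), (0, 4, 2)$.

General

The degree $3k$ has $2k + 1$ invariant monomials (choose how to split $2k$ among two variables). So the Hilbert series is

$$H(t) = \sum_{k \ge 0} (2k + 1)\, t^{3k} = \frac{1 + t^3}{(1 - t^3)^2}$$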

Let’s verify:

T = Torus(np.array([[1, 1, -2]]))

# Hilbert series: count invariant monomials by degree
T.hilbert_coeffs(6)    # [1, 0, 0, 3, 0, 0, 5]

# Hilbert basis: minimal generators of the invariant semigroup
T.hilbert_basis()      # [(2, 0, 1), (1, 1, 1), (0, 2, 1)]
# These correspond to x0^2*x2, x0*x1*x2, x1^2*x2

Therefore, the three generators correspond to the monomials $x_0^2 x_2$, $x_0 x_1 x_2$, and $x_1^2 x_2$. Every invariant monomial is a product of these three.

For this example, it’s easy to see why these are the generators. The degree 6 invariants are all products of the degree 3 invariants, and the degree 3 invariants are not products of simpler invariants. So the degree 3 invariants are the minimal generators. We got the Hilbert basis just by brute force enumeration in this case, but in general we might need a more sophisticated algorithm to find the minimal generators of the semigroup.
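As a cross-check that does not depend on the Torus class, here is a minimal standalone sketch. It counts invariant monomials for the weights $(1, 1, -2)$ by brute force and compares them with the coefficients of $(1 + t^3)/(1 - t^3)^2 = \sum_k (2k+1)\, t^{3k}$. The function names are illustrative and not part of the library.

```python
from itertools import product

def count_invariant_monomials(weights, degree):
    """Count alpha in N^m with |alpha| = degree and weights . alpha = 0."""
    count = 0
    for alpha in product(range(degree + 1), repeat=len(weights)):
        total_weight = sum(w * a for w, a in zip(weights, alpha))
        if sum(alpha) == degree and total_weight == 0:
            count += 1
    return count

def closed_form_coeffs(max_d):
    """Coefficients of (1 + t^3) / (1 - t^3)^2 = sum_k (2k + 1) t^(3k)."""
    return [2 * (d // 3) + 1 if d % 3 == 0 else 0 for d in range(max_d + 1)]

brute_force = [count_invariant_monomials((1, 1, -2), d) for d in range(10)]
print(brute_force)            # [1, 0, 0, 3, 0, 0, 5, 0, 0, 7]
print(closed_form_coeffs(9))  # same list
```

The brute-force loop is exponential in the number of variables, so this is only meant for sanity checks on small examples.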

Relations

Now that we have the three generators, we can ask about the relations among them.

In the example above, there are three generators $y_0 = x_0^2 x_2$, $y_1 = x_0 x_1 x_2$, $y_2 = x_1^2 x_2$, which satisfy the relation $y_0 y_2 = y_1^2$.

It makes sense that there is one relation, as the space has dimension 3 and the torus has dimension 1, so the quotient has dimension 3 - 1 = 2. With 3 generators being mapped to a 2-dimensional space, we expect 3 - 2 = 1 relation among them.
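The relation can be checked by hand. Writing $y_0 = x_0^2 x_2$, $y_1 = x_0 x_1 x_2$, $y_2 = x_1^2 x_2$ (labels chosen here to match the Hilbert basis order $(2,0,1), (1,1,1), (0,2,1)$), the exponent vectors satisfy $(2,0,1) + (0,2,1) = 2 \cdot (1,1,1)$, which is exactly:

```latex
\begin{align*}
y_0\, y_2 &= (x_0^2 x_2)(x_1^2 x_2) \\
          &= x_0^2 x_1^2 x_2^2 \\
          &= (x_0 x_1 x_2)^2 = y_1^2 .
\end{align*}
```

So for torus invariants, relations among monomial generators come from integer linear dependencies among their exponent vectors.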

How can we find these relations? We need Gröbner elimination.

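In general, introduce new variables $y_0, \dots, y_{s-1}$ (one per generator $g_i$), form the ideal $(y_0 - g_0, \dots, y_{s-1} - g_{s-1})$ in the extended ring $k[x_0, \dots, x_{n-1}, y_0, \dots, y_{s-1}]$, and eliminate $x_0, \dots, x_{n-1}$. The surviving polynomials in $k[y_0, \dots, y_{s-1}]$ are the relations among the generators. In this case, we find the relation $y_0 y_2 - y_1^2$.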

from invariants.groebner import compute_relations
from invariants.poly import format_poly

generators = [poly.mono(a) for a in T.hilbert_basis()]
relations = compute_relations(generators, n_vars=3)
# One relation: y0*y2 - y1^2

Let’s understand Gröbner elimination more deeply in the next section.

Gröbner Bases and Presentations

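Let’s say we have a polynomial ring $k[x_1, \dots, x_n]$ over a field $k$, and we have an ideal $I$ generated by some polynomials $f_1, \dots, f_m$. A Gröbner basis for $I$ is a particular kind of generating set that allows us to perform algorithmic operations on the ideal.

Definition 6 Fix a monomial order. A Gröbner basis for an ideal $I$ in a polynomial ring $k[x_1, \dots, x_n]$ is a generating set $G = \{g_1, \dots, g_t\}$ of $I$ such that the leading term of every nonzero $f \in I$ is divisible by the leading term of some $g_i$. That is:

$$\langle \mathrm{LT}(f) : f \in I \rangle = \langle \mathrm{LT}(g_1), \dots, \mathrm{LT}(g_t) \rangle$$

Given a Gröbner basis $G$ for an ideal $I$, any polynomial $f$ can be expressed as:

$$f = \sum_{i=1}^{t} q_i g_i + r$$

where $q_i$ are polynomials and $r$ is a polynomial that cannot be reduced further by the $g_i$ (i.e., no term of $r$ is divisible by any leading term of the $g_i$). The remainder $r$ is uniquely determined by $f$ and $I$; the quotients $q_i$ in general are not.

Proof Sketch: Existence follows from the division algorithm for polynomials. We repeatedly divide $f$ by the $g_i$ until we can no longer reduce it, which gives us the desired expression. Uniqueness of $r$ uses the Gröbner property: the difference of two remainders lies in $I$, so if it were nonzero, its leading term would be divisible by some $\mathrm{LT}(g_i)$, which contradicts the fact that no term of either remainder is.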

Buchberger’s Algorithm

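Define the S-polynomial of two polynomials $f$ and $g$ as:

$$S(f, g) = \frac{x^\gamma}{\mathrm{LT}(f)}\, f - \frac{x^\gamma}{\mathrm{LT}(g)}\, g, \qquad x^\gamma = \mathrm{lcm}(\mathrm{LM}(f), \mathrm{LM}(g))$$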

The algorithm proceeds as follows:

1. Start with a set of polynomials $F = \{f_1, \dots, f_m\}$.
2. For each pair of polynomials $f_i, f_j \in F$, compute their S-polynomial $S(f_i, f_j)$.
3. Reduce the S-polynomial modulo the current set of polynomials.
4. If the reduction is non-zero, add it to the set $F$.
5. Repeat steps 2-4 until no new polynomials are added.

We can think of this as similar to Gaussian elimination. In Gaussian elimination, we perform row operations on linear equations:

and so on, eliminating each variable in turn. In Buchberger’s algorithm, we perform similar operations on polynomials to eliminate leading terms and find a Gröbner basis.

As an example, suppose we have the ideal generated by the polynomials and . Impose an ordering on the monomials ().

The leading terms are and . The least common multiple of the leading terms is . Since and , multiply the first polynomial by and the second by to get:

Now subtract the second from the first (this is where the S-polynomial comes from):

If we fix the maximum degree of the polynomials we want to consider in addition to the ordering, we can even use a matrix representation of the polynomials and perform Gaussian elimination on the coefficients to find the Gröbner basis. Essentially, we’ve replaced the notion of “leading variable” from linear algebra with the notion of “leading monomial” (by treating each monomial as its own symbol), and we perform operations to eliminate these leading monomials until we have a basis that reduces any polynomial to a unique remainder.

The primary gotcha is that in Buchberger’s algorithm, we can end up with polynomials of a higher degree than the original generators, which can lead to combinatorial explosion. This is one of the main computational challenges in using Gröbner bases for invariant theory.
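To make the S-polynomial step concrete, here is a small worked case using the pair $f = x^2 + y - 1$, $g = xy + 1$ that shows up again in the Hilbert function example later in this post. The ordering is assumed to be graded lex with $x > y$, so $\mathrm{LT}(f) = x^2$, $\mathrm{LT}(g) = xy$, and the lcm of the leading monomials is $x^2 y$:

```latex
\begin{align*}
S(f, g) &= \frac{x^2 y}{x^2}\, f - \frac{x^2 y}{xy}\, g \\
        &= y\,(x^2 + y - 1) - x\,(xy + 1) \\
        &= y^2 - x - y .
\end{align*}
```

The leading term $y^2$ is divisible by neither $x^2$ nor $xy$, so the S-polynomial does not reduce further and gets added to the basis. This is where the third leading term in $(x^2, xy, y^2)$ comes from.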

Here’s what this looks like in our Python code:

def s_poly(f: Poly, g: Poly, order=poly.grlex) -> Poly:
    """S-polynomial of f and g"""
    lm_f = poly.leading_monomial(f, order)
    lm_g = poly.leading_monomial(g, order)
    lc_f = poly.leading_coefficient(f, order)
    lc_g = poly.leading_coefficient(g, order)
    gamma = poly.mono_lcm(lm_f, lm_g)
    t_f = poly.mono(poly.mono_div(gamma, lm_f), Fraction(1) / lc_f)
    t_g = poly.mono(poly.mono_div(gamma, lm_g), Fraction(1) / lc_g)
    return poly.sub(poly.mul(t_f, f), poly.mul(t_g, g))

def buchberger(F: list[Poly], order=poly.grlex) -> list[Poly]:
    """Compute a reduced Gröbner basis for the ideal generated by F."""
    G = [g for g in F if g]
    if not G:
        return []
    pairs = [(i, j) for i in range(len(G)) for j in range(i + 1, len(G))]
    while pairs:
        i, j = pairs.pop(0)
        sp = s_poly(G[i], G[j], order)
        r = reduce(sp, G, order)
        if r:
            k = len(G)
            pairs.extend((m, k) for m in range(k))
            G.append(r)
    return _reduce_basis(G, order)
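To experiment without the poly module, here is a self-contained sketch of the same loop on dict-based polynomials (exponent tuple to Fraction coefficient). Everything in it (grlex_key, lead, add_scaled, remainder) is a stand-in written for this illustration, not the library's API. It uses a fixed graded lex order, does no pair selection, and returns an unreduced basis.

```python
from fractions import Fraction
from itertools import combinations

def grlex_key(alpha):
    """Graded lex sort key: total degree first, then lexicographic."""
    return (sum(alpha), alpha)

def lead(f):
    """Leading (monomial, coefficient) of a nonzero polynomial."""
    m = max(f, key=grlex_key)
    return m, f[m]

def add_scaled(f, g, coeff, shift):
    """Return f + coeff * x^shift * g, dropping zero coefficients."""
    out = dict(f)
    for alpha, c in g.items():
        beta = tuple(a + s for a, s in zip(alpha, shift))
        out[beta] = out.get(beta, Fraction(0)) + coeff * c
        if out[beta] == 0:
            del out[beta]
    return out

def divides(a, b):
    return all(x <= y for x, y in zip(a, b))

def remainder(f, G):
    """Remainder of f on division by the list G."""
    p, r = dict(f), {}
    while p:
        m, c = lead(p)
        for g in G:
            lm, lc = lead(g)
            if divides(lm, m):
                shift = tuple(x - y for x, y in zip(m, lm))
                p = add_scaled(p, g, -Fraction(c) / lc, shift)
                break
        else:
            r[m] = c   # leading term is irreducible: move it to the remainder
            del p[m]
    return r

def s_poly(f, g):
    (mf, cf), (mg, cg) = lead(f), lead(g)
    gamma = tuple(max(a, b) for a, b in zip(mf, mg))
    sf = tuple(a - b for a, b in zip(gamma, mf))
    sg = tuple(a - b for a, b in zip(gamma, mg))
    first = add_scaled({}, f, Fraction(1) / cf, sf)
    return add_scaled(first, g, -Fraction(1) / cg, sg)

def buchberger(F):
    G = [dict(f) for f in F if f]
    pairs = list(combinations(range(len(G)), 2))
    while pairs:
        i, j = pairs.pop()
        r = remainder(s_poly(G[i], G[j]), G)
        if r:
            pairs.extend((k, len(G)) for k in range(len(G)))
            G.append(r)
    return G

f = {(2, 0): Fraction(1), (0, 1): Fraction(1), (0, 0): Fraction(-1)}  # x^2 + y - 1
g = {(1, 1): Fraction(1), (0, 0): Fraction(1)}                        # xy + 1
G = buchberger([f, g])
print(sorted(lead(h)[0] for h in G))   # [(0, 2), (1, 1), (2, 0)], i.e. y^2, xy, x^2
```

Exact Fraction coefficients matter here: with floats, a remainder that should be zero can come out as a tiny nonzero polynomial, and the loop keeps adding spurious generators.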

Key Operations

Using Gröbner bases, we can implement several reusable operations that are essential for computational invariant theory.

Computation of Normal Forms

Definition 7 Let $F$ be a set of polynomials and let $I$ be the ideal generated by $F$. The normal form of a polynomial $f$ with respect to $I$ is the unique polynomial $r$ such that no term of $r$ is divisible by the leading term of any polynomial in a Gröbner basis for $I$, and such that $f - r \in I$.

In other words, the normal form of $f$ is the “remainder” when $f$ is reduced by the Gröbner basis of $I$. It is a canonical representative of the equivalence class of $f$ in the quotient ring $k[x_1, \dots, x_n]/I$.

We can compute the normal form with an algorithm:

Start with a polynomial $f$ and a Gröbner basis $G$ for an ideal $I$.
Initialize $r = f$.
While there exists a $g \in G$ such that the leading term of $g$ divides a term in $r$:
1. Let $t$ be the term in $r$ that is divisible by $\mathrm{LT}(g)$.
2. Replace $t$ in $r$ with $t - \frac{t}{\mathrm{LT}(g)}\, g$.
Return $r$ as the normal form of $f$ with respect to $G$.

Basically, this is just the division algorithm for polynomials. The normal form is important because it allows us to test whether a polynomial belongs to the ideal (if the normal form is zero) and to find unique representatives of equivalence classes in the quotient ring.

In our Python code, we can implement this as follows:

def reduce(f: Poly, G: list[Poly], order=poly.grlex) -> Poly:
    """Reduce f modulo G by repeated leading-term cancellation.
    Returns the remainder (normal form if G is a Groebner basis).
    """
    r: Poly = {}
    p = dict(f)
    while p:
        reduced = False
        for g in G:
            if not g:
                continue
            lm_g = poly.leading_monomial(g, order)
            lc_g = poly.leading_coefficient(g, order)
            for alpha in sorted(p.keys(), key=order, reverse=True):
                if poly.mono_divides(lm_g, alpha):
                    quot_mono = poly.mono_div(alpha, lm_g)
                    quot_coeff = p[alpha] / lc_g
                    shift = poly.mul(poly.mono(quot_mono, quot_coeff), g)
                    p = poly.sub(p, shift)
                    reduced = True
                    break
            if reduced:
                break
        if not reduced:
            if p:
                lm_p = poly.leading_monomial(p, order)
                lc_p = p[lm_p]
                r = poly.add(r, poly.mono(lm_p, lc_p))
                del p[lm_p]
    return r

Ideal Membership Testing

Based on our last section, we can also test for ideal membership. Given a polynomial and an ideal generated by a set of polynomials, we can test whether belongs to by computing the normal form of with respect to a Gröbner basis for . If the normal form is zero, then ; otherwise, .

In code:

def normal_form(f: Poly, basis: list[Poly], order=poly.grlex) -> Poly:
    """Normal form of f with respect to a Gröbner basis."""
    return reduce(f, basis, order)

def in_ideal(f: Poly, basis: list[Poly], order=poly.grlex) -> bool:
    """Test whether f belongs to the ideal generated by basis."""
    return not normal_form(f, basis, order)

Ideal Intersection and Quotients

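We can also use elimination to compute the intersection of two ideals $I = (f_1, \dots, f_a)$ and $J = (g_1, \dots, g_b)$. Introduce a new variable $t$ and form the ideal $tI + (1 - t)J$ (i.e., $(t f_1, \dots, t f_a, (1 - t) g_1, \dots, (1 - t) g_b)$). Then eliminate $t$, so the surviving polynomials generate $I \cap J$. This works because a polynomial $h$ lies in $I \cap J$ if and only if it can be written as both a combination of the $f_i$ and a combination of the $g_j$; for such an $h$, the identity $h = t h + (1 - t) h$ puts it in $tI + (1 - t)J$ without involving $t$.

Note that taking the union of generators gives the sum $I + J$, not the intersection. The intersection requires the elimination trick above.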

def ideal_intersection(F: list[Poly], G: list[Poly], n_vars: int) -> list[Poly]:
    """Intersection of ideals I = (F) and J = (G).

    Introduces variable t, forms (t*f_i, (1-t)*g_j), eliminates t.
    """
    t_var = poly.var(0, n_vars + 1)
    one_minus_t = poly.sub(poly.const(1, n_vars + 1), t_var)

    def embed(f: Poly) -> Poly:
        return {(0,) + alpha: c for alpha, c in f.items()}

    gens = [poly.mul(t_var, embed(f)) for f in F]
    gens += [poly.mul(one_minus_t, embed(g)) for g in G]

    elim = eliminate(gens, k=1, n_vars=n_vars + 1)
    return [{alpha[1:]: c for alpha, c in g.items()} for g in elim]

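As an example, take $I = (x^2, y)$ and $J = (x, y^2)$ in $k[x, y]$. Both are monomial ideals, so a monomial is in $I$ if and only if it’s divisible by $x^2$ or $y$, and it’s in $J$ if and only if it’s divisible by $x$ or $y^2$. The intersection is generated by the monomials in both: $I \cap J = (x^2, xy, y^2)$.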

I = [poly.mono((2, 0)), poly.var(1, 2)]   # (x^2, y)
J = [poly.var(0, 2), poly.mono((0, 2))]   # (x, y^2)

result = ideal_intersection(I, J, n_vars=2)
# [x^2, xy, y^2]
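For monomial ideals there is a shortcut that makes a handy independent check on the elimination route: $I \cap J$ is generated by the pairwise lcms of the generators. The sketch below assumes generators are given as exponent tuples; the function names are made up for this illustration and are not part of the library.

```python
from itertools import product

def mono_lcm(a, b):
    """lcm of two monomials given as exponent tuples."""
    return tuple(max(x, y) for x, y in zip(a, b))

def divides(a, b):
    return all(x <= y for x, y in zip(a, b))

def monomial_ideal_intersection(I, J):
    """Minimal generators of I ∩ J for monomial ideals I and J."""
    lcms = {mono_lcm(a, b) for a, b in product(I, J)}
    # Drop any lcm that is a proper multiple of another one.
    return sorted(
        m for m in lcms
        if not any(other != m and divides(other, m) for other in lcms)
    )

I = [(2, 0), (0, 1)]   # (x^2, y)
J = [(1, 0), (0, 2)]   # (x, y^2)
result = monomial_ideal_intersection(I, J)
print(result)          # [(0, 2), (1, 1), (2, 0)], i.e. y^2, xy, x^2
```

This only works when both ideals are generated by monomials. For general ideals the elimination trick is still needed.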

Elimination of Variables

Suppose we want to eliminate a variable $x_1$ from an ideal $I$ in $k[x_1, \dots, x_n]$. We can do this by computing a Gröbner basis for $I$ with respect to an elimination ordering, in which every monomial involving $x_1$ is larger than every monomial that does not (so leading terms containing $x_1$ get cancelled first). The polynomials in the resulting Gröbner basis that do not involve $x_1$ generate the elimination ideal $I \cap k[x_2, \dots, x_n]$.

def eliminate(generators: list[Poly], k: int, n_vars: int) -> list[Poly]:
    """Eliminate the first k variables.

    Computes a Gröbner basis with an elimination ordering that pushes
    x_0, ..., x_{k-1} to the top, then returns only the polynomials
    that live in k[x_k, ..., x_{n-1}].
    """
    order = poly.elimination_order(k)
    gb = buchberger(generators, order)
    return [g for g in gb if _involves_only(g, k, n_vars)]

Note that the elimination ideal is the intersection of $I$ with the subring $k[x_2, \dots, x_n]$, which is a subring rather than an ideal. So elimination is the more basic operation here: the ideal intersection method above is built on top of it, not the other way around.

Computation of Relations between Polynomial Generators

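Given generators $g_0, \dots, g_{s-1}$ in $k[x_0, \dots, x_{n-1}]$, we want to find all the polynomial relations among them.

Introduce new variables $y_0, \dots, y_{s-1}$, form the ideal $(y_0 - g_0, \dots, y_{s-1} - g_{s-1})$ in $k[x_0, \dots, x_{n-1}, y_0, \dots, y_{s-1}]$, and eliminate $x_0, \dots, x_{n-1}$. The surviving polynomials in $k[y_0, \dots, y_{s-1}]$ generate the ideal of relations.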

def compute_relations(generators: list[Poly], n_vars: int,
                      order=poly.grlex) -> list[Poly]:
    """Compute relations among polynomial generators.

    Given generators g_0, ..., g_{s-1} in k[x_0, ..., x_{n-1}], introduce
    new variables y_0, ..., y_{s-1} and compute the kernel of the map
    k[y_0, ..., y_{s-1}] -> k[x_0, ..., x_{n-1}] sending y_i -> g_i.
    """
    s = len(generators)
    total_vars = n_vars + s

    # embed each g_i into k[x_0,...,x_{n-1}, y_0,...,y_{s-1}]
    elim_gens = []
    for i, g in enumerate(generators):
        y_alpha = (0,) * n_vars + tuple(1 if j == i else 0 for j in range(s))
        yi = poly.mono(y_alpha)
        g_emb: Poly = {}
        for alpha, c in g.items():
            new_alpha = alpha + (0,) * s
            g_emb[new_alpha] = c
        elim_gens.append(poly.sub(yi, g_emb))

    # eliminate x_0, ..., x_{n-1}
    elim_order = poly.elimination_order(n_vars)
    gb = buchberger(elim_gens, elim_order)

    # keep only polynomials in k[y_0, ..., y_{s-1}]
    return [g for g in gb if _involves_only(g, n_vars, total_vars)]

Computation of Syzygies

Given generators $f_1, \dots, f_s$ of an ideal, a syzygy is a tuple of polynomials $(h_1, \dots, h_s)$ such that $\sum_i h_i f_i = 0$. The syzygies encode the dependencies among the generators.

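The S-polynomial construction we looked at above already produces some syzygies. If $S(f_i, f_j)$ reduces to zero modulo the generators, say $S(f_i, f_j) = \sum_k q_k f_k$, then moving everything to one side gives a relation $\frac{x^\gamma}{\mathrm{LT}(f_i)} f_i - \frac{x^\gamma}{\mathrm{LT}(f_j)} f_j - \sum_k q_k f_k = 0$. The routine below does not track the quotients $q_k$, so it only records pairs whose S-polynomial is identically zero (for example, pairs of monomial generators). It should be read as a partial syzygy computation rather than a complete one.

def compute_syzygies(generators: list[Poly], n_vars: int,
                     order=poly.grlex) -> list[Poly]:
    """Compute syzygies among generators of an ideal.

    For each pair (i,j), if the S-polynomial is identically zero,
    record the corresponding syzygy in k[x, e_0, ..., e_{s-1}].
    A pair whose S-polynomial only reduces to zero would also need
    the reduction quotients, which this routine does not track.
    """
    s = len(generators)
    syzygies = []
    for i in range(s):
        for j in range(i + 1, s):
            fi, fj = generators[i], generators[j]
            if not fi or not fj:
                continue
            sp = s_poly(fi, fj, order)
            if not sp:
                # Build syzygy: (lcm/LT(fi))/lc_i * e_i - (lcm/LT(fj))/lc_j * e_j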
                lm_i = poly.leading_monomial(fi, order)
                lm_j = poly.leading_monomial(fj, order)
                gamma = poly.mono_lcm(lm_i, lm_j)
                qi = poly.mono_div(gamma, lm_i)
                qj = poly.mono_div(gamma, lm_j)
                lc_i = poly.leading_coefficient(fi, order)
                lc_j = poly.leading_coefficient(fj, order)
                # encode in k[x_0,...,x_{n-1}, e_0,...,e_{s-1}]
                ei = tuple(1 if k == i else 0 for k in range(s))
                ej = tuple(1 if k == j else 0 for k in range(s))
                syz = poly.sub(
                    poly.mono(qi + ei, Fraction(1) / lc_i),
                    poly.mono(qj + ej, Fraction(1) / lc_j),
                )
                syzygies.append(syz)
    return syzygies

# Example: syzygies of (x, y) in k[x,y]
# Returns y*e_0 - x*e_1, i.e., y*x - x*y = 0

Computation of Hilbert Series

Given a Gröbner basis for an ideal $I$, the Hilbert function of the quotient ring $k[x_1, \dots, x_n]/I$ counts the monomials in each degree that are not divisible by any leading monomial of the basis. This works because the leading term ideal $\mathrm{LT}(I)$ has the same Hilbert function as $I$, and monomials not in $\mathrm{LT}(I)$ form a vector space basis for $k[x_1, \dots, x_n]/I$.

def hilbert_function(basis: list[Poly], n_vars: int, max_d: int,
                     order=poly.grlex) -> list[int]:
    """Hilbert function of k[x]/I from degree 0 to max_d.

    Counts monomials of each degree not divisible by any leading
    monomial of the Gröbner basis.
    """
    leading_monomials = [poly.leading_monomial(g, order) for g in basis if g]
    result = []
    for d in range(max_d + 1):
        count = 0
        for alpha in poly.monomials_of_degree(n_vars, d):
            if not any(poly.mono_divides(lm, alpha) for lm in leading_monomials):
                count += 1
        result.append(count)
    return result

# Example: GB of (x^2 + y - 1, xy + 1) has LT ideal (x^2, xy, y^2)
# Hilbert function: [1, 2, 0, 0, ...] — quotient is 3-dimensional

Nullstellensatz-Style Problems

The Nullstellensatz, which we saw in the last post, says that a system of polynomial equations $f_1 = \dots = f_m = 0$ has no common solution over an algebraically closed field (such as $\mathbb{C}$) if and only if $1$ lies in the ideal $(f_1, \dots, f_m)$. We can test this by computing a Gröbner basis. If the basis contains a nonzero constant, the ideal is all of $k[x_1, \dots, x_n]$ and the system is inconsistent.

def has_common_root(basis: list[Poly], order=poly.grlex) -> bool:
    """Test whether V(I) is nonempty.

    By the Nullstellensatz, V(I) = {} iff 1 ∈ I iff GB = {1}.
    """
    gb = buchberger(basis, order)
    for g in gb:
        if all(e == 0 for e in poly.leading_monomial(g, order)):
            return False  # 1 ∈ I, no common root
    return True

# Example: has_common_root([x, 1-x]) returns False (no solution)

Finite Groups

Now that we have a bit more general infra from the last two sections, let’s work on another computationally friendly class of examples: finite groups.

Reynolds Operator

In the last post, we saw that for certain kinds of groups $G$ (like finite groups) acting on vector spaces $V$, we could define the Reynolds operator, which is an idempotent operator that takes any polynomial and averages over the group action to produce an invariant polynomial:

$$R(f) = \frac{1}{|G|} \sum_{g \in G} g \cdot f$$

If we have a polynomial that is not invariant, applying the Reynolds operator will give us an invariant polynomial.

Recall that we represented a finite group as a list of matrices that act on the variables. To apply the Reynolds operator, we can take any polynomial and apply each group element to it by substituting the linear forms corresponding to the group action. Then we average these transformed polynomials to get an invariant.

Start with a set of polynomials that spans each graded piece of the polynomial ring (for example, all monomials of degree $d$) and apply the Reynolds operator to each one. Because the Reynolds operator is linear and maps onto the invariants, the images span the space of degree-$d$ invariants, which gives us candidates for generators degree by degree.

Let’s implement the Reynolds operator in code, which involves summing over the group elements and applying the group action to the polynomial. While computationally expensive for large groups, it’s a straightforward way to find invariants.

The key subroutine is apply_to_poly. Given a matrix $g$ and a polynomial $f$, we compute $(g \cdot f)(x) = f(g^{-1} x)$ by substituting linear forms for each variable. Reynolds then just averages over all group elements.

def apply_to_poly(self, g: np.ndarray, f: Poly) -> Poly:
    """(g · f)(x) = f(g^{-1} x) via variable substitution."""
    g_inv = np.linalg.inv(g)
    n = self.n_vars
    sub_polys = []
    for i in range(n):
        row: Poly = {}
        for j in range(n):
            c = Fraction(g_inv[i, j]).limit_denominator(10**9)
            if c != 0:
                row[tuple(1 if k == j else 0 for k in range(n))] = c
        sub_polys.append(row)

    result: Poly = {}
    for alpha, coeff in f.items():
        term = poly.const(coeff, n)
        for i, e in enumerate(alpha):
            for _ in range(e):
                term = poly.mul(term, sub_polys[i])
        result = poly.add(result, term)
    return result

def reynolds(self, f: Poly) -> Poly:
    """Reynolds operator: R(f) = (1/|G|) sum_{g in G} g · f."""
    total: Poly = {}
    for g in self.matrices:
        total = poly.add(total, self.apply_to_poly(g, f))
    return poly.scale(Fraction(1, self.order), total)

For example, take $S_3$ acting on $\mathbb{C}^3$ by permuting coordinates. The monomial $x_0^2$ is not invariant (permutations move it to $x_1^2$ or $x_2^2$). Applying Reynolds averages over all 6 permutations:

G = symmetric(3)
f = poly.mono((2, 0, 0))          
Rf = G.reynolds(f)
# Rf = 1/3*x0^2 + 1/3*x1^2 + 1/3*x2^2

assert G.is_invariant(Rf)          # True
assert G.reynolds(Rf) == Rf        # idempotent: R(R(f)) = R(f)

The output is $\frac{1}{3}(x_0^2 + x_1^2 + x_2^2)$, one third of the power sum $p_2$, which is obviously symmetric. Note that Reynolds is idempotent, so applying it to an already-invariant polynomial returns the same polynomial.
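For permutation actions, the averaging can be reproduced without any matrix inversion, which makes a convenient independent check. This is a minimal sketch on dict-based polynomials; permute_poly and reynolds_sym are names invented here and are not the library's methods.

```python
from fractions import Fraction
from itertools import permutations

def permute_poly(f, perm):
    """Apply the variable permutation x_i -> x_{perm[i]} to f."""
    out = {}
    for alpha, c in f.items():
        beta = [0] * len(alpha)
        for i, e in enumerate(alpha):
            beta[perm[i]] = e
        out[tuple(beta)] = out.get(tuple(beta), 0) + c
    return out

def reynolds_sym(f, n):
    """Average f over all n! permutations of the n variables."""
    perms = list(permutations(range(n)))
    total = {}
    for perm in perms:
        for alpha, c in permute_poly(f, perm).items():
            total[alpha] = total.get(alpha, 0) + c
    return {a: Fraction(c) / len(perms) for a, c in total.items() if c != 0}

Rf = reynolds_sym({(2, 0, 0): 1}, 3)
print(Rf)   # each of x0^2, x1^2, x2^2 with coefficient 1/3
```

Each of the three squares is hit by two of the six permutations, which is where the coefficient $2/6 = 1/3$ comes from.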

Orbit Sums

Reynolds averaging is closely related to another construction called orbit sums. Given a polynomial $f$, we can consider the orbit of $f$ under the group action:

$$G \cdot f = \{ g \cdot f : g \in G \}$$

The orbit sum is the sum of all elements in the orbit:

$$\sigma(f) = \sum_{h \in G \cdot f} h$$

This is the Reynolds operator up to normalization: each element of the orbit appears $|G| / |G \cdot f|$ times in the Reynolds sum, so $R(f) = \sigma(f) / |G \cdot f|$.

Since the orbit sum is just the sum of elements in the orbit, and the orbit is mapped to itself by the group action (the summands are just permuted), the orbit sum is also invariant under the group action.

That is, for any $g \in G$:

$$g \cdot \sigma(f) = \sum_{h \in G \cdot f} g \cdot h = \sigma(f)$$

For degree-$d$ polynomials, this works cleanly when the action sends monomials to monomials (for example permutation actions): then the invariance condition forces the coefficients to be constant on monomial orbits, so every homogeneous invariant is a linear combination of orbit sums. For a general linear action, orbit sums of monomials are better viewed as a convenient source of candidate invariants than as a general basis theorem.

So in the monomial/permutation cases, the orbit sums of monomials form a basis for the space of homogeneous invariants of degree $d$. This gives us a way to find a basis for the invariant ring degree by degree in those cases.

In code, we can implement this as follows:

def orbit_sum(self, f: Poly) -> Poly:
    """Sum over distinct images of f under G."""
    seen = set()
    total: Poly = {}
    for g in self.matrices:
        gf = self.apply_to_poly(g, f)
        key = frozenset(gf.items())
        if key not in seen:
            seen.add(key)
            total = poly.add(total, gf)
    return total

def invariants_of_degree(self, d: int) -> list[Poly]:
    """Degree-d invariants via orbit sums of monomials."""
    seen_orbits: set[frozenset] = set()
    basis = []
    for alpha in poly.monomials_of_degree(self.n_vars, d):
        f = poly.mono(alpha)
        orbit_key = frozenset(
            frozenset(self.apply_to_poly(g, f).items())
            for g in self.matrices
        )
        if orbit_key not in seen_orbits:
            seen_orbits.add(orbit_key)
            os = self.orbit_sum(f)
            if os:
                basis.append(os)
    return basis

The orbit_sum method tracks which images have already been seen (via frozenset of the polynomial’s items) to avoid double-counting when the stabilizer of $f$ is nontrivial. The invariants_of_degree method groups monomials by orbit, computes one orbit sum per orbit, and collects them as a basis.

Continuing with $S_3$ on $\mathbb{C}^3$, the degree-2 monomials are $x_0^2, x_1^2, x_2^2, x_0 x_1, x_0 x_2, x_1 x_2$. Under $S_3$, these fall into two orbits: $\{x_0^2, x_1^2, x_2^2\}$ and $\{x_0 x_1, x_0 x_2, x_1 x_2\}$. The orbit sums give two linearly independent degree-2 invariants:

G = symmetric(3)

# Orbit sum of x0^2: hits x0^2, x1^2, x2^2
G.orbit_sum(poly.mono((2, 0, 0)))
# x0^2 + x1^2 + x2^2

# Orbit sum of x0*x1: hits x0*x1, x0*x2, x1*x2
G.orbit_sum(poly.mono((1, 1, 0)))
# x0*x1 + x0*x2 + x1*x2

# All degree-2 invariants at once
G.invariants_of_degree(2)
# [x0^2 + x1^2 + x2^2, x0*x1 + x0*x2 + x1*x2]

These are the power sum $p_2 = x_0^2 + x_1^2 + x_2^2$ and the elementary symmetric polynomial $e_2 = x_0 x_1 + x_0 x_2 + x_1 x_2$.
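For the symmetric group specifically, two monomials are in the same orbit exactly when their sorted exponent vectors agree, so orbits can be counted without applying any group elements. The sketch below (the helper name monomial_orbits is made up here) groups monomials that way; the number of orbits in degree $d$ is the number of linearly independent degree-$d$ invariants.

```python
from itertools import product

def monomial_orbits(n_vars, degree):
    """Group degree-d exponent vectors into orbits under permuting variables."""
    orbits = {}
    for alpha in product(range(degree + 1), repeat=n_vars):
        if sum(alpha) == degree:
            # The sorted exponents act as a canonical orbit representative.
            key = tuple(sorted(alpha, reverse=True))
            orbits.setdefault(key, []).append(alpha)
    return orbits

counts = [len(monomial_orbits(3, d)) for d in range(7)]
print(counts)                            # [1, 1, 2, 3, 4, 5, 7]
print(sorted(monomial_orbits(3, 2)))     # [(1, 1, 0), (2, 0, 0)]
```

The counts match the Hilbert series coefficients quoted for $S_3$ at the start of this post, as they should in the permutation case.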

Molien Series

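Since the Reynolds operator $R$ projects onto the invariant ring, we know that

$$\dim\, (k[V]^G)_d = \operatorname{tr}\left(R \,\big|\, k[V]_d\right)$$

(since the trace of a projection operator is the dimension of its image¹).

Thus we can compute the trace as follows:

$$\operatorname{tr}\left(R \,\big|\, k[V]_d\right) = \frac{1}{|G|} \sum_{g \in G} \operatorname{tr}\left(g \,\big|\, k[V]_d\right)$$

Since $k[V]_d$ is the space of homogeneous degree-$d$ polynomials, we can identify it with the $d$-th symmetric power of the dual space $V^*$:

$$k[V]_d \cong S^d(V^*)$$

The trace of $g$ on $S^d(V^*)$ can be computed using the eigenvalues of $g$ on $V^*$. If the eigenvalues of $g$ on $V^*$ are $\lambda_1, \dots, \lambda_n$, then the trace of $g$ on $S^d(V^*)$ is given by the complete homogeneous symmetric polynomial of degree $d$ in the eigenvalues:

$$\operatorname{tr}\left(g \,\big|\, S^d(V^*)\right) = h_d(\lambda_1, \dots, \lambda_n)$$

The generating function for the complete homogeneous symmetric polynomials is given by²:

$$\sum_{d \ge 0} h_d(\lambda_1, \dots, \lambda_n)\, t^d = \prod_{i=1}^{n} \frac{1}{1 - \lambda_i t} = \frac{1}{\det(I - t g)}$$

This motivates the Molien formula for the Hilbert series of the invariant ring³:

$$H(t) = \frac{1}{|G|} \sum_{g \in G} \frac{1}{\det(I - t g)}$$

To use the Molien formula computationally, we expand $1/\det(I - tg)$ as a power series in $t$ for each group element $g$. Since $\det(I - tg) = \prod_i (1 - \lambda_i t)$ where $\lambda_i$ are the eigenvalues of $g$, the series expansion is a convolution of geometric series. We truncate at degree max_d, sum over all group elements, divide by $|G|$, and round to the nearest integer (the result is exact for rational eigenvalues; rounding handles floating-point eigenvalues from rotation matrices).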

def molien_coeffs(self, max_d: int) -> list[int]:
    """Molien series: H(t) = (1/|G|) sum_{g in G} 1/det(I - tg)."""
    coeffs = np.zeros(max_d + 1, dtype=complex)
    for g in self.matrices:
        eigenvalues = np.linalg.eigvals(g)
        # Expand 1/prod(1 - lambda_i * t) as power series
        series = np.zeros(max_d + 1, dtype=complex)
        series[0] = 1.0
        for lam in eigenvalues:
            powers = np.array([lam**d for d in range(max_d + 1)])
            new_series = np.zeros(max_d + 1, dtype=complex)
            for d in range(max_d + 1):
                new_series[d] = sum(series[k] * powers[d - k]
                                    for k in range(d + 1))
            series = new_series
        coeffs += series
    return [round((coeffs[d] / self.order).real)
            for d in range(max_d + 1)]

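For $S_3$ on $\mathbb{C}^3$, the group has 6 elements.

The identity has eigenvalues $(1, 1, 1)$, contributing $\frac{1}{(1 - t)^3}$.

The three transpositions have eigenvalues $(1, 1, -1)$, each contributing $\frac{1}{(1 - t)^2 (1 + t)}$.

The two 3-cycles have eigenvalues $(1, \omega, \omega^2)$ where $\omega = e^{2\pi i/3}$, each contributing $\frac{1}{(1 - t)(1 - \omega t)(1 - \omega^2 t)} = \frac{1}{1 - t^3}$.

Averaging over all 6 elements:

$$H(t) = \frac{1}{6} \left[ \frac{1}{(1 - t)^3} + \frac{3}{(1 - t)^2 (1 + t)} + \frac{2}{1 - t^3} \right]$$

which simplifies to:

$$H(t) = \frac{1}{(1 - t)(1 - t^2)(1 - t^3)}$$

This is the Hilbert series of a polynomial ring in three generators of degrees 1, 2, 3, consistent with the classical fact that the invariant ring is freely generated by the power sums $p_1, p_2, p_3$. Verify numerically: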

G = symmetric(3)
G.molien_coeffs(10)
# [1, 1, 2, 3, 4, 5, 7, 8, 10, 12, 14]

If we expanded $\frac{1}{(1-t)(1-t^2)(1-t^3)}$ as a power series, we would get the same coefficients, confirming that the Molien series correctly predicts the dimensions of the invariant ring in each degree.
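Since $S_3$ acts here by permutation matrices, the same coefficients can be cross-checked without eigenvalues or floating point. The sketch below is not the library's molien_coeffs. It assumes the group is all of $S_n$ permuting coordinates, and uses the fact that a permutation with cycle lengths $\ell_1, \dots, \ell_r$ has $\det(I - tg) = \prod_j (1 - t^{\ell_j})$. The helper names (cycle_lengths, molien_coeffs_permutation) are made up for this example.

```python
from itertools import permutations
from math import factorial

def cycle_lengths(perm):
    """Cycle lengths of a permutation given as a tuple (i -> perm[i])."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        lengths.append(length)
    return lengths

def molien_coeffs_permutation(n, max_d):
    """Molien coefficients for S_n permuting n coordinates, in exact integers."""
    total = [0] * (max_d + 1)
    for perm in permutations(range(n)):
        # 1/det(I - t g) = product over cycles of 1/(1 - t^length)
        series = [1] + [0] * max_d
        for length in cycle_lengths(perm):
            # Multiply in place by the geometric series 1/(1 - t^length)
            for d in range(length, max_d + 1):
                series[d] += series[d - length]
        for d in range(max_d + 1):
            total[d] += series[d]
    order = factorial(n)
    return [c // order for c in total]

print(molien_coeffs_permutation(3, 10))
# [1, 1, 2, 3, 4, 5, 7, 8, 10, 12, 14]
```

Everything stays in integers, so no rounding step is needed. The price is that this only works for permutation actions.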

Noether Bound

The Molien series gives us the dimensions of the graded pieces of the invariant ring, which tells us how many invariants we should expect in each degree. The orbit sums give us explicit invariants to fill those dimensions. But how do we know the search terminates?

Theorem 1 Let $G$ be a finite group acting linearly on a finite-dimensional vector space $V$ over a field $k$ of characteristic zero. Then the invariant ring $k[V]^G$ is generated by invariants of degree at most $|G|$.

Proof: See Fleischmann, “The Noether Bound in Invariant Theory of Finite Groups,” Advances in Mathematics 156 (2000), which also proves the sharper bound for non-cyclic groups.

In practice, this means we can search for generators by calling invariants_of_degree(d) for $d = 1, \dots, |G|$ and be guaranteed to find them all.

For $S_3$ with $|S_3| = 6$, we only need to check up to degree 6 (and in fact all three generators appear by degree 3).

Practical Issues

There are some practical problems with the computational approach. The main computational bottleneck for finite groups is apply_to_poly, which substitutes linear forms for each variable. Every term of the polynomial has to be expanded for every group element, so each group element costs many polynomial multiplications, and summing over all elements multiplies the cost of Reynolds or orbit sums by $|G|$. For $S_n$ with $|S_n| = n!$, this becomes infeasible quickly.

Orbit sums are (a bit) cheaper than Reynolds since they avoid the normalization (keeping coefficients as integers) and deduplicate images. The effective cost is proportional to the orbit size rather than $|G|$.
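To make the difference concrete, here is a toy comparison for $S_n$ permuting variables, acting on a single monomial stored as an exponent tuple. This is only an illustration with made-up helper names (images, reynolds, orbit_sum), not the library's implementation. In particular it still enumerates all of $S_n$ before deduplicating, so it shows the difference in the output rather than the cost saving.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def images(exponents):
    """All images of a monomial (exponent tuple) under S_n permuting variables."""
    n = len(exponents)
    return [tuple(exponents[p[i]] for i in range(n))
            for p in permutations(range(n))]

def reynolds(exponents):
    """Average over all |G| images: coefficients are rational in general."""
    imgs = images(exponents)
    return {m: Fraction(c, len(imgs)) for m, c in Counter(imgs).items()}

def orbit_sum(exponents):
    """Sum over distinct images only: every coefficient is 1."""
    return {m: 1 for m in set(images(exponents))}

print(reynolds((2, 0, 0)))   # three monomials, each with coefficient 1/3
print(orbit_sum((2, 0, 0)))  # the same three monomials, each with coefficient 1
```

The two results differ only by a scalar, so either one works as a basis element. The orbit sum just avoids carrying fractions around.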

The Molien series is cheap by comparison, as it only requires eigenvalue computations () with no polynomial arithmetic.

So probably we compute the Hilbert series first, then use it as a guide to say how many generators to expect in each degree, then do the relatively expensive orbit sum computation.

I’m not gonna try to optimize all these algorithms in this post, I’m still learning how it works myself. But in practice, for large groups, I assume we could optimize the apply_to_poly method, maybe by caching intermediate results or using better data structures for polynomials.

Primary and Secondary Invariants

With the Hilbert series, we can start to understand the structure of the invariant ring. In particular, we want to understand how the invariant ring can be decomposed, typically into a polynomial part and a “remainder” part.

This leads to the concepts of primary and secondary invariants. The idea is that we can find a smaller set of invariants (the primary invariants) that generate a polynomial subring, and then the rest of the invariant ring can be expressed as a module over this polynomial subring, with the secondary invariants serving as module generators. By writing in terms of the primary invariants, we can get a more compact description of the invariant ring, and the secondary invariants capture the “extra” structure that isn’t captured by the primary invariants alone.

Homogeneous Systems of Parameters (HSOP)

Before we can define primary and secondary invariants, we need to introduce the notion of a homogeneous system of parameters (HSOP), which helps us understand the structure of graded rings.

Definition 8 A homogeneous system of parameters (HSOP) for a graded invariant ring $k[V]^G$ is a collection of algebraically independent homogeneous elements $\theta_1, \dots, \theta_n \in k[V]^G$ such that $k[V]^G$ is finitely generated as a module over the polynomial subring $k[\theta_1, \dots, \theta_n]$.

For finite groups, we can always find an HSOP consisting of $n = \dim V$ algebraically independent homogeneous invariants. In general, finding an HSOP can be more difficult. In the settings we care about here, the issue is not existence so much as actually finding one.

The point of an HSOP is that it gives us a subring of the invariant ring that is “simple” to understand, and we can then study the structure of the full invariant ring as a module over this polynomial subring.

Also note that an HSOP is not necessarily unique.

To find an HSOP, we can use a greedy process to collect invariants. In practice, we look for invariants that appear algebraically independent and then do additional checks:

from invariants.action import find_hsop, find_secondaries, primary_secondary

# S_3 on C^3
G = symmetric(3)
action = invariant_theory(G, polynomial_ring(3))
hsop = find_hsop(action, 4)
# [x0+x1+x2, x0^2+x1^2+x2^2, x0^3+x1^3+x2^3]

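As an example, we can look at $S_3$ acting on $\mathbb{C}^3$. The invariant ring is polynomial on three generators, and we can take the three power sums $p_1 = x_0 + x_1 + x_2$, $p_2 = x_0^2 + x_1^2 + x_2^2$, $p_3 = x_0^3 + x_1^3 + x_2^3$. So they form an HSOP.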

Cohen–Macaulay

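Given we have an HSOP $\theta_1, \dots, \theta_n$, and we know that $k[V]^G$ is finitely generated as a module over $k[\theta_1, \dots, \theta_n]$, we can ask about the structure of this module. In the best cases, the invariant ring is a free module over the polynomial subring generated by the HSOP, which means it can be written as a direct sum of copies of the polynomial subring, one for each module generator $\eta_j$:

$$k[V]^G = \bigoplus_{j=1}^{m} \eta_j \, k[\theta_1, \dots, \theta_n]$$

so every element can be written uniquely as

$$f = \sum_{j=1}^{m} \eta_j \, p_j(\theta_1, \dots, \theta_n)$$

This is the Cohen-Macaulay property. It’s not necessarily true for all invariant rings. We could instead have nontrivial relations among the $\eta_j$ such as:

$$\sum_{j=1}^{m} \eta_j \, p_j(\theta_1, \dots, \theta_n) = 0 \quad \text{with some } p_j \neq 0$$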
Theorem 2 Let $G$ be a reductive group acting on a finite-dimensional vector space $V$ over a field $k$ of characteristic zero. Then the invariant ring $k[V]^G$ is Cohen-Macaulay.

Proof. See Hochster and Roberts.⁴

Computing Primary and Secondary Invariants

When the Cohen-Macaulay property holds, we call the HSOP elements primary invariants, and the module generators secondary invariants. The primary invariants generate a polynomial subring, and the secondary invariants fill in the remainder.

Computationally, this compresses the problem. Instead of searching for all generators of , we can first find an HSOP and then loop through the degrees to find secondary invariants using the Hilbert series (which tells us how many per degree).

For $S_3$ on $\mathbb{C}^3$, the invariant ring is polynomial, so the only secondary invariant is $1$:

primaries, secondaries = primary_secondary(action, 4)
# primaries: [x0+x1+x2, x0^2+x1^2+x2^2, x0^3+x1^3+x2^3]
# secondaries: [1]
# k[V]^G = 1 · k[p1, p2, p3]  — free module of rank 1

A more interesting example. Consider $\mathbb{Z}_2$ acting on $\mathbb{C}^2$ by $(x, y) \mapsto (-x, -y)$. The invariant ring is $k[x^2, xy, y^2]$ with relation $(xy)^2 = x^2 \cdot y^2$, so it is not a polynomial ring:

from invariants.groups.finite import FiniteGroup
G_z2 = FiniteGroup([np.eye(2), -np.eye(2)])
action_z2 = invariant_theory(G_z2, polynomial_ring(2))

primaries_z2, secondaries_z2 = primary_secondary(action_z2, 4)
# primaries: [x0^2, x1^2]
# secondaries: [1, x0*x1]
# k[V]^G = 1 · k[x^2, y^2] (direct_sum) xy · k[x^2, y^2]  — free module of rank 2

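Here we have two secondary invariants, $1$ and $xy$. So we know every invariant can be written uniquely as $a(x^2, y^2) \cdot 1 + b(x^2, y^2) \cdot xy$ for polynomials $a$ and $b$. The generator $xy$ isn’t expressible as a polynomial in the primary invariants $x^2, y^2$, but the full ring is still a free module over $k[x^2, y^2]$, as guaranteed by Hochster-Roberts.

We can go further and consider the degree-4 invariants. Under $(x, y) \mapsto (-x, -y)$, a monomial $x^a y^b$ is invariant iff $a + b$ is even (i.e. the entire monomial is of even degree), so the degree-4 invariants are spanned by all the degree-4 monomials in $x$ and $y$: $x^4, x^3 y, x^2 y^2, x y^3, y^4$. Each decomposes:

$$x^4 = (x^2)^2 \cdot 1, \quad x^3 y = x^2 \cdot xy, \quad x^2 y^2 = x^2 y^2 \cdot 1, \quad x y^3 = y^2 \cdot xy, \quad y^4 = (y^2)^2 \cdot 1$$

So every invariant monomial can be written as either a monomial in $x^2, y^2$ times $1$ or a monomial in $x^2, y^2$ times $xy$, and a general invariant is a sum of such terms.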
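The same bookkeeping is easy to mechanize. The sketch below is a hand-rolled illustration (the function names decompose and invariant_monomials are made up here, and it does not use the library): it takes an invariant monomial $x^a y^b$ and peels it into a monomial in the primaries $x^2, y^2$ times one of the secondaries $1$ or $xy$.

```python
def decompose(a, b):
    """Write the invariant monomial x^a y^b as (x^2)^i (y^2)^j * s, with s in {"1", "xy"}.

    Returns (i, j, s).
    """
    if (a + b) % 2 != 0:
        raise ValueError("x^a y^b is not invariant under (x, y) -> (-x, -y)")
    if a % 2 == 0:
        # Both exponents are even: the secondary invariant is 1
        return a // 2, b // 2, "1"
    # Both exponents are odd: peel off one factor of xy
    return (a - 1) // 2, (b - 1) // 2, "xy"

def invariant_monomials(d):
    """Exponent pairs (a, b) of the invariant monomials of degree d."""
    return [(a, d - a) for a in range(d, -1, -1)] if d % 2 == 0 else []

for a, b in invariant_monomials(4):
    i, j, s = decompose(a, b)
    print(f"x^{a} y^{b} = (x^2)^{i} (y^2)^{j} * {s}")
```

Because the decomposition is forced at every step, this is also a small sanity check on uniqueness: there is never a choice about which secondary to use.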

Classical Groups and First Fundamental Theorems

For finite groups, we computed invariants from scratch via the Reynolds operator, orbit sums, and the Molien series.

For classical groups like $O(n)$, $SL(n)$, and $Sp(2n)$, the First Fundamental Theorems give us the generators. Since these have been derived in the past, we can just hard-code them.

Molien-Weyl Formula

Before moving on, how would we compute these?

We would need a Hilbert series for continuous groups. Luckily, the Molien formula generalizes. For a compact group $G$ acting on $V$, the Hilbert series is

$$H(t) = \int_G \frac{d\mu(g)}{\det(I - tg)}$$

where $\mu$ is the Haar measure (normalized so that $\mu(G) = 1$). This is over an infinite group, but the Weyl integration formula reduces it to an integral over the “maximal torus” $T$:

$$H(t) = \frac{1}{|W|} \int_T \frac{|\Delta(z)|^2}{\det(I - tz)} \, d\mu_T(z)$$

where $W$ is the (finite) “Weyl group” and $\Delta$ is the “Weyl denominator” (whatever those are). The torus integral becomes a Laurent polynomial residue computation.

The point is, computing the Hilbert series for a compact Lie group reduces to a torus computation plus a finite group average.

In practice, since we know the generators, we can just compute the Hilbert series by counting monomials in those generators.

The Orthogonal Group

Consider $O(n)$ acting diagonally on $k$ copies of $\mathbb{C}^n$. We represent this with $nk$ variables, organized as $k$ vectors $v_1, \dots, v_k$ of length $n$. The First Fundamental Theorem says the invariant ring is generated by all inner products $\langle v_i, v_j \rangle$.

There are $\binom{k+1}{2}$ generators (the inner products are symmetric: $\langle v_i, v_j \rangle = \langle v_j, v_i \rangle$), all of degree 2 in the original variables. Degree-$2d$ invariants come from degree-$d$ polynomials in these generators (so only the even degrees have invariants).

from invariants.classical import orthogonal_action

# O(3) acting on 2 copies of C^3 (6 variables)
action = orthogonal_action(n=3, k=2)

# Generators: <v1,v1>, <v1,v2>, <v2,v2> (3 inner products)
gens = action.invariants_of_degree(2)
# [x0^2+x1^2+x2^2, x0*x3+x1*x4+x2*x5, x3^2+x4^2+x5^2]

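As an example, label the 6 variables as $v_1 = (x_0, x_1, x_2)$ and $v_2 = (x_3, x_4, x_5)$. The three generators are:

$$\langle v_1, v_1 \rangle = x_0^2 + x_1^2 + x_2^2, \quad \langle v_1, v_2 \rangle = x_0 x_3 + x_1 x_4 + x_2 x_5, \quad \langle v_2, v_2 \rangle = x_3^2 + x_4^2 + x_5^2$$

Every $O(3)$-invariant polynomial in the 6 variables is a polynomial in these three inner products.

If we extended to degree-4 invariants, we would have the 6 products of pairs:

$$\langle v_1, v_1 \rangle^2, \quad \langle v_1, v_1 \rangle \langle v_1, v_2 \rangle, \quad \langle v_1, v_1 \rangle \langle v_2, v_2 \rangle, \quad \langle v_1, v_2 \rangle^2, \quad \langle v_1, v_2 \rangle \langle v_2, v_2 \rangle, \quad \langle v_2, v_2 \rangle^2$$

Intuitively, this makes sense, as $O(n)$ preserves angles and lengths. The only way to get an invariant is to combine the inner products such that the orientation of the vectors is immaterial.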
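A quick numerical sanity check of that intuition, using nothing but the standard library. This is a sketch with real vectors and one hand-built element of $O(3)$ (a product of two rotations and a reflection). All the helper names are invented for the example and none of it comes from the library.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

reflection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]

# An orthogonal matrix with determinant -1, so in O(3) but not SO(3)
g = matmul(matmul(rotation_z(0.7), rotation_x(-1.3)), reflection)

def gram(u, v):
    """The three generators <u,u>, <u,v>, <v,v>."""
    return [dot(u, u), dot(u, v), dot(v, v)]

v1, v2 = [1.0, 2.0, 3.0], [-0.5, 4.0, 1.5]
before = gram(v1, v2)
after = gram(matvec(g, v1), matvec(g, v2))
print(before)
print(all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(before, after)))
```

The three inner products agree before and after up to floating-point error, even though the individual coordinates of the vectors change.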

$SL(n)$ and Bracket Invariants

Now consider $SL(n)$ acting on $k$ copies of $\mathbb{C}^n$. The generators are all $n \times n$ minors, which are determinants formed by choosing $n$ of the $k$ vectors. For $n = 2$, these are the brackets $[ij] = \det(v_i, v_j)$ (we showed this in the previous post).

There are $\binom{k}{n}$ generators, each of degree $n$ in the original variables. When $k < n$, there are no non-constant invariants at all (since you can’t form an $n \times n$ determinant from fewer than $n$ vectors).

from invariants.classical import sl_action

# SL(2) acting on 3 copies of C^2 (6 variables)
action = sl_action(n=2, k=3)

# Generators: [12], [13], [23]  — 3 brackets
gens = action.invariants_of_degree(2)
# [x0*x3 - x1*x2, x0*x5 - x1*x4, x2*x5 - x3*x4]

As an example, we have three vectors in $\mathbb{C}^2$: $v_1 = (x_0, x_1)$, $v_2 = (x_2, x_3)$, $v_3 = (x_4, x_5)$. Each bracket is a determinant:

$$[12] = x_0 x_3 - x_1 x_2, \quad [13] = x_0 x_5 - x_1 x_4, \quad [23] = x_2 x_5 - x_3 x_4$$

These are “signed areas” of the parallelograms spanned by pairs of vectors.

Intuitively, $SL(2)$ preserves areas (since it has determinant 1), so the signed areas are invariant.

With 3 vectors, there are $\binom{3}{2} = 3$ brackets and no relations among them, meaning that the invariant ring is freely generated.

The Second Fundamental Theorem (also in the previous post) gives us the relations between them, called the Plücker relations. For $SL(2)$ on 4 vectors, the single relation is $[12][34] - [13][24] + [14][23] = 0$. We can now verify this in code:

from invariants.groebner import compute_relations

# SL(2) on 4 copies of C^2: 6 brackets, 1 Plücker relation
action4 = sl_action(n=2, k=4)
gens4 = action4.invariants_of_degree(2)
relations = compute_relations(gens4, n_vars=8)
# One relation: y0*y5 - y1*y4 + y2*y3  (the Plücker relation)

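With 4 vectors we would get $\binom{4}{2} = 6$ brackets: $[12]$, $[13]$, $[14]$, $[23]$, $[24]$, $[34]$. However, these are not algebraically independent.

Expanding the Plücker relation by hand, with $v_i = (a_i, b_i)$ so that $[ij] = a_i b_j - a_j b_i$:

$$\begin{aligned}
[12][34] &= a_1 a_3 b_2 b_4 - a_1 a_4 b_2 b_3 - a_2 a_3 b_1 b_4 + a_2 a_4 b_1 b_3 \\
[13][24] &= a_1 a_2 b_3 b_4 - a_1 a_4 b_2 b_3 - a_2 a_3 b_1 b_4 + a_3 a_4 b_1 b_2 \\
[14][23] &= a_1 a_2 b_3 b_4 - a_1 a_3 b_2 b_4 - a_2 a_4 b_1 b_3 + a_3 a_4 b_1 b_2
\end{aligned}$$

Substituting into $[12][34] - [13][24] + [14][23]$, everything cancels out. The invariant ring is $k\big[[12], [13], [14], [23], [24], [34]\big] / \big([12][34] - [13][24] + [14][23]\big)$. The Gröbner computation in compute_relations rediscovers exactly this.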
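If you don't want to trust the algebra (or the Gröbner basis), the relation is also easy to spot-check with exact integer arithmetic. This is a standalone sketch with made-up helper names (bracket, plucker, act), independent of the library. It also checks that the brackets themselves don't move under a sample element of $SL(2)$.

```python
import random

def bracket(u, v):
    """[uv] = determinant of the 2x2 matrix with columns u and v."""
    return u[0] * v[1] - u[1] * v[0]

def plucker(v1, v2, v3, v4):
    """[12][34] - [13][24] + [14][23], which should vanish identically."""
    return (bracket(v1, v2) * bracket(v3, v4)
            - bracket(v1, v3) * bracket(v2, v4)
            + bracket(v1, v4) * bracket(v2, v3))

def act(g, v):
    (a, b), (c, d) = g
    return (a * v[0] + b * v[1], c * v[0] + d * v[1])

random.seed(0)
vectors = [(random.randint(-9, 9), random.randint(-9, 9)) for _ in range(4)]
g = ((2, 3), (1, 2))  # integer matrix with determinant 2*2 - 3*1 = 1

print(plucker(*vectors))  # 0, whatever the four vectors are

moved = [act(g, v) for v in vectors]
print(all(bracket(moved[i], moved[j]) == bracket(vectors[i], vectors[j])
          for i in range(4) for j in range(i + 1, 4)))
```

A random test like this proves nothing, of course, but it catches sign errors in the relation quickly.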

Symplectic and Other Classical Groups

The same pattern works for $Sp(2n)$ acting on $k$ copies of $\mathbb{C}^{2n}$, where the generators are the symplectic pairings $\omega(v_i, v_j)$. This is essentially identical to the orthogonal case, but with an antisymmetric bilinear form rather than a symmetric one.

Further Reductive Groups

There are additional reductive groups beyond the ones we looked at in the last section. We explored some of the theory in the previous post.

Reductive groups often require idiosyncratic analysis via representation theory and First Fundamental Theorems. However, many important reductive groups have known FFTs that could be hard-coded in the same way as the classical groups above.

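The most salient example is $GL(n)$, which acts on $n \times n$ matrices by conjugation. Its generators are traces of products. So we would compute traces such as $\operatorname{tr}(A)$, $\operatorname{tr}(A^2)$, $\operatorname{tr}(AB)$, $\operatorname{tr}(A^2 B)$, etc. The implementation pattern would be the same as the other groups.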
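As a tiny illustration of why traces of products are the natural candidates, here is a check that a few trace words are unchanged by simultaneous conjugation of two $2 \times 2$ matrices. This is a sketch in exact rational arithmetic with invented helper names. The matrices and the three words are arbitrary choices for the example.

```python
from fractions import Fraction as F

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def trace(m):
    return m[0][0] + m[1][1]

def conjugate(p, m):
    """P M P^{-1}"""
    return matmul(matmul(p, m), inverse(p))

A = [[F(1), F(2)], [F(3), F(4)]]
B = [[F(0), F(1)], [F(-1), F(5)]]
P = [[F(2), F(1)], [F(1), F(3)]]  # determinant 5, so P is invertible

A2, B2 = conjugate(P, A), conjugate(P, B)

# Three sample words in A and B: A, A*A, A*B
words = [lambda a, b: a, lambda a, b: matmul(a, a), lambda a, b: matmul(a, b)]
print([(trace(w(A, B)), trace(w(A2, B2))) for w in words])
```

Each pair of traces agrees exactly, since $\operatorname{tr}(P M P^{-1}) = \operatorname{tr}(M)$ and conjugation passes through products.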

These are out of scope for now. In the future I may come back and add some of these.

Orbits, Quotients, and Separation

Thus far, the primary question under consideration has been to understand the generators of the invariant ring $k[V]^G$. However, another key (related) question is to understand the orbits of the group action on $V$. The invariant ring encodes the geometry of the quotient space $V /\!/ G$. The quotient map $\pi: V \to V /\!/ G$ sends each point to the values of the invariants at that point, effectively classifying points according to their orbit closures.

Two points $v, w \in V$ have the same image under $\pi$ if and only if their orbit closures intersect. For reductive groups (which includes all finite groups and tori), closed orbits are separated. This means that if $\pi(v) = \pi(w)$ and both orbits are closed, then the orbits are equal.

So invariants can be used to classify which points are equivalent under some symmetry.

The Quotient Map

Suppose we have the generators of the invariant ring $k[V]^G$. Then the quotient map is given by evaluating these generators at each point. That is, if $f_1, \dots, f_r$ are generators of $k[V]^G$, then we can use them to define a map:

$$\pi: V \to k^r, \qquad v \mapsto (f_1(v), \dots, f_r(v))$$
from invariants.orbits import quotient_map, same_image
from invariants import poly

# S_3 generators: power sums p1, p2, p3
p1 = poly.add(poly.add(poly.var(0, 3), poly.var(1, 3)), poly.var(2, 3))
p2 = poly.add(poly.add(poly.mono((2,0,0)), poly.mono((0,2,0))), poly.mono((0,0,2)))
p3 = poly.add(poly.add(poly.mono((3,0,0)), poly.mono((0,3,0))), poly.mono((0,0,3)))
gens = [p1, p2, p3]

# Same orbit => same image
quotient_map(gens, (1, 2, 3))  # [6, 14, 36]
quotient_map(gens, (3, 1, 2))  # [6, 14, 36]
same_image(gens, (1, 2, 3), (3, 1, 2))  # True

# Different orbits => different image
same_image(gens, (1, 2, 3), (1, 1, 4))  # False

In this example, we are looking at the action of $S_3$ on $\mathbb{C}^3$ by permuting coordinates. The generators of the invariant ring are the power sums $p_1, p_2, p_3$, which we established earlier as $p_1 = x_0 + x_1 + x_2$, $p_2 = x_0^2 + x_1^2 + x_2^2$, and $p_3 = x_0^3 + x_1^3 + x_2^3$.

The quotient map evaluates these invariants at a point. For points in the same orbit (like $(1, 2, 3)$ and $(3, 1, 2)$), we get the same image under the quotient map. For points in different orbits (like $(1, 2, 3)$ and $(1, 1, 4)$), we get different images.

How do we interpret these images?

For $(1, 2, 3)$, we have:

- $p_1 = 1 + 2 + 3 = 6$
- $p_2 = 1 + 4 + 9 = 14$
- $p_3 = 1 + 8 + 27 = 36$

However, for $(1, 1, 4)$, we have:

- $p_1 = 1 + 1 + 4 = 6$
- $p_2 = 1 + 1 + 16 = 18$
- $p_3 = 1 + 1 + 64 = 66$

So these points do not lie in the same orbit, as they have different images under the quotient map. The invariants are able to distinguish these orbits.

And, as we can tell by inspection, there is no way to permute the coordinates of $(1, 2, 3)$ to get $(1, 1, 4)$, confirming that they are indeed in different orbits.

Orbit Separation

In the finite-group case, when two points are in different orbits, we can find a specific invariant that witnesses this:

from invariants.orbits import find_separator

sep = find_separator(gens, (1, 2, 3), (1, 1, 4))
poly.show(sep)
# 'x0^2 + x1^2 + x2^2'
# p2(1,2,3) = 14, p2(1,1,4) = 18

For reductive groups, if two orbits are closed and distinct, some polynomial invariant distinguishes them. This is a useful constructive theorem, as it allows us to find explicit invariants that separate orbits and thereby classify objects up to symmetry.

Null Cones

The null cone is the fiber $\pi^{-1}(\pi(0))$, which is the set of all points that invariants cannot distinguish from the origin.

This is important as all the points in the null cone have orbit closures that contain the origin, so they all collapse to the same point in the quotient.

Testing null cone membership is just evaluating the generators:

from invariants.orbits import in_null_cone

# C* on C^2, weights [1, -1]. Invariant: x0*x1
gen = poly.mono((1, 1))

in_null_cone([gen], (0, 0))   # True — origin
in_null_cone([gen], (1, 0))   # True — coordinate axis
in_null_cone([gen], (0, 5))   # True — coordinate axis
in_null_cone([gen], (1, 1))   # False — generic point
in_null_cone([gen], (2, 3))   # False — generic point

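For this example, the null cone is $\{x_0 x_1 = 0\}$, the union of the two coordinate axes. Geometrically, these are the orbits of $\mathbb{C}^*$ that degenerate: $t \cdot (x_0, 0) = (t x_0, 0)$, which approaches the origin as $t \to 0$ (and likewise $t \cdot (0, x_1) = (0, t^{-1} x_1)$ approaches the origin as $t \to \infty$). The generic orbits with $x_0 x_1 = c \neq 0$ are closed and do not contain the origin in their closure, so they are not in the null cone.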
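The degeneration is easy to watch numerically. A minimal sketch with exact fractions, assuming the weight convention $t \cdot (x_0, x_1) = (t x_0, t^{-1} x_1)$ from the example above (act and invariant are names made up here):

```python
from fractions import Fraction

def act(t, point):
    """C* action with weights [1, -1]: t . (x0, x1) = (t*x0, x1/t)."""
    x0, x1 = point
    return (t * x0, x1 / t)

def invariant(point):
    """The single generator x0*x1."""
    return point[0] * point[1]

axis_point = (Fraction(1), Fraction(0))     # on a coordinate axis
generic_point = (Fraction(2), Fraction(3))  # x0*x1 = 6

for k in range(4):
    t = Fraction(1, 10 ** k)
    moved = act(t, generic_point)
    # The axis point slides toward the origin; the generic point runs along
    # the hyperbola x0*x1 = 6 and never gets close to it.
    print(t, act(t, axis_point), moved, invariant(moved))
```

The invariant stays at 6 along the whole generic orbit, while the axis point is carried to $(t, 0)$, which can be made as close to the origin as we like.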

Separating Invariants

If we have a full generating set for the invariant ring, then we can use it to separate orbits. However, it may be difficult to produce the full set. In practice, we might only need some of the generators to separate orbits. The hope is that the separating set is much smaller than the full generating set.

We can find a small separating subset by greedy set cover (there are probably faster algorithms, especially if we have some structure or are willing to do some random sampling):

from invariants.orbits import is_separating, minimal_separating_subset

# S_3 generators
gens = [p1, p2, p3]  # three power sums

# Test pairs, test points in different S_3 orbits
pairs = [
    ((1, 2, 3), (1, 1, 4)),  # same sum, different orbits
    ((1, 2, 3), (2, 2, 2)),  # same sum, different orbits
    ((1, 0, 0), (0, 0, 2)),  # different sum, easy
]

is_separating(gens, pairs)  # True, so full set works

minimal = minimal_separating_subset(gens, pairs)
# [x0^2 + x1^2 + x2^2] - p2 alone separates all three pairs!

For $S_3$ acting on $\mathbb{C}^3$, the power sum $p_2$ alone separates these test pairs.

Note that doesn’t separate all orbits (e.g. and have the same value).
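For reference, here is roughly what a greedy set cover over test pairs looks like. This is my own standalone sketch, not the library's minimal_separating_subset: invariants are plain Python callables, and the names power_sum and greedy_separating_subset are invented for the example. The extra pair $(1, 2, 2)$ vs. $(0, 0, 3)$ is one I picked by hand because both points have $p_2 = 9$.

```python
def power_sum(k):
    return lambda v: sum(x ** k for x in v)

def greedy_separating_subset(invariants, pairs):
    """Greedy set cover: repeatedly pick the invariant that separates
    the most not-yet-separated pairs. Returns the chosen names in order."""
    remaining = set(range(len(pairs)))
    chosen = []
    while remaining:
        def separated(name):
            f = invariants[name]
            return {i for i in remaining if f(pairs[i][0]) != f(pairs[i][1])}
        best = max(invariants, key=lambda name: len(separated(name)))
        hit = separated(best)
        if not hit:
            raise ValueError("no invariant separates the remaining pairs")
        chosen.append(best)
        remaining -= hit
    return chosen

invariants = {"p1": power_sum(1), "p2": power_sum(2), "p3": power_sum(3)}
pairs = [
    ((1, 2, 3), (1, 1, 4)),
    ((1, 2, 3), (2, 2, 2)),
    ((1, 0, 0), (0, 0, 2)),
]
print(greedy_separating_subset(invariants, pairs))  # ['p2']

# Add a pair that p2 cannot tell apart: both points have p2 = 9
harder = pairs + [((1, 2, 2), (0, 0, 3))]
print(greedy_separating_subset(invariants, harder))  # ['p3']
```

Greedy set cover gives a small subset, not a guaranteed minimum, and the answer is only as good as the test pairs. A subset that separates the sample can still fail on points you didn't try.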

Scaling

Some quick notes here on scaling properties of the algorithms we’ve discussed, which have some practical issues, and other techniques used in practice that I’ve omitted.

Degree Explosion

The fundamental bottleneck in many cases is degree.

For a finite group $G$ acting on $\mathbb{C}^n$, Noether’s bound guarantees generators appear by degree $|G|$, but the number of monomials in degree $d$ is $\binom{n + d - 1}{n - 1}$, which grows polynomially in $d$ for fixed $n$.

Even for $S_5$ (order 120) acting on $\mathbb{C}^5$, degree 120 has $\binom{124}{4} = 9{,}381{,}251$ monomials. It’s not practical to use an algorithm like Buchberger’s algorithm on ideals with generators of this degree.
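To get a feel for how fast this blows up, the monomial count at the Noether bound can be tabulated for small symmetric groups. This is just the stars-and-bars formula $\binom{n + d - 1}{n - 1}$ evaluated at $d = n!$, with a helper name (num_monomials) chosen for the example.

```python
from math import comb, factorial

def num_monomials(n, d):
    """Number of monomials of degree d in n variables (stars and bars)."""
    return comb(n + d - 1, n - 1)

for n in (3, 4, 5, 6):
    bound = factorial(n)  # Noether bound for S_n is |S_n| = n!
    print(n, bound, num_monomials(n, bound))
```

For $S_n$ the true generators live in degree at most $n$, so the Noether bound is wildly pessimistic here. The table shows what a search costs if the bound is all you know.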

Rational Invariants

A rational invariant is a ratio of polynomials that is invariant under the group action:

$$r(v) = \frac{p(v)}{q(v)}, \qquad r(g \cdot v) = r(v) \ \text{ for all } g \in G$$
Sometimes, when the polynomial invariant ring requires generators of very high degree or has complicated relations, rational invariants require fewer generators with lower degree and simpler relations. The tradeoff is that there could be poles, so we have to work on limited domains. This apparently comes up in non-reductive groups, where the polynomial invariant ring may not be finitely generated but the rational invariant field can be.

I didn’t investigate this too heavily.

Other Techniques

To help scale the library or extend it to new situations, there is a combination of different techniques.

We saw some of these earlier. For example, we saw:

the Molien formula trick (where we compute the Hilbert series and get a roadmap for how many invariants to expect in each degree)
using orbit sums instead of the Reynolds operator to avoid the normalization and keep coefficients as integers
using separating invariants instead of full generating sets to reduce the number of invariants we need to find
Noether’s bound to limit the search to a finite degree
for tori, reducing to a combinatorial problem of finding integer solutions to linear equations (the weight decomposition), which is more efficient than Gröbner basis computations

Some techniques we haven’t implemented but are commonly used in practice include:

using representation theory to decompose the polynomial ring into irreducible representations and extract invariants from the trivial summands, which can be more efficient than brute-force Gröbner basis computations.
for compact Lie groups, using the Molien-Weyl formula to reduce the problem to an integral over the maximal torus, which can be computed using our existing Torus machinery
SAGBI bases, a variant of Gröbner bases designed for subalgebras rather than ideals. Where Buchberger’s algorithm answers “is this polynomial in the ideal?”, SAGBI answers “is this polynomial expressible in terms of known generators?”
Derksen’s algorithm, a general-purpose method that works for any group given by matrix generators, without needing to know anything about the group’s structure. It reduces invariant computation to a single (expensive) Gröbner basis calculation.

Some of these techniques are in Derksen and Kemper’s book, and some are in the research literature. This post is too long, but I may come back to them.

Conclusion

In this post and the last post we have covered the (classical) theory of invariants and some of the computational tasks that arise in invariant theory. We have also implemented some code to perform these computations in specific cases.

AI Disclosure

Claude (Anthropic) assisted with code and the initial draft, working from my notes and the companion code repository. ChatGPT (OpenAI) provided a mathematical review pass. I edited the final text and verified the mathematical content.

Footnotes

According to this MathOverflow answer by Terry Tao, the projection property $P^2 = P$ implies that the eigenvalues of a projection $P$ are either 0 or 1. The trace, which is the sum of the eigenvalues, counts how many 1’s there are, which is exactly the dimension of the image of $P$. This is enough to uniquely determine the trace, since the trace is linear and the projection property constrains the eigenvalues to be 0 or 1. So the trace of the projection onto the invariants in $k[V]_d$ is exactly the dimension of the invariant subspace, which is what we want to compute. Apparently this comes up in “noncommutative probability.” Cool.↩︎
This generating function is a partition function in the sense of statistical mechanics. Recall that the partition function counts states weighted by energy. Here, $t$ plays the role of the Boltzmann weight $e^{-\beta}$, the “energy” of a monomial is its degree, and each factor $\frac{1}{1 - t}$ counts the states contributed by one variable (one for each power). Multiplying $n$ factors counts all monomials in $n$ variables by degree. After averaging over the group, the coefficient of $t^d$ in $H(t)$ is the number of independent symmetry-invariant observables at degree $d$: the Molien formula counts only the “physical” quantities that are unchanged by the symmetry.↩︎
The Molien formula is fundamentally a residue computation. Extracting the $d$-th coefficient of $H(t)$ is $\frac{1}{2\pi i} \oint \frac{H(t)}{t^{d+1}} \, dt$. In practice, this means we can compute Molien coefficients in closed form via partial fractions: decompose $H(t) = \sum_{j, m} \frac{c_{j,m}}{(1 - \alpha_j t)^m}$, and each term contributes $c_{j,m} \binom{d + m - 1}{m - 1} \alpha_j^d$ to the $d$-th coefficient. For compact Lie groups, the Molien formula generalizes to the Molien-Weyl integral over the maximal torus, which is a multivariate residue computation in the torus coordinates.↩︎
M. Hochster and J. L. Roberts, “Rings of invariants of reductive groups acting on regular rings are Cohen-Macaulay,” Advances in Mathematics 13 (1974), 115–175.↩︎

Survey of Classical Invariant Theory

Sun, 08 Mar 2026 05:00:00 GMT

Introduction

Invariant theory is the study of how symmetries constrain the structure of mathematical objects (similar to Noether’s theorem). In this post, I will give a brief introduction to invariant theory and its applications.

Invariant theory is a vast field. I’m pulling mainly from the book Invariant Theory by Peter Olver (but with changes in order and notation), but will also thread in and ultimately work with computational approaches, such as those described in the book Computational Invariant Theory by Harm Derksen and Gregor Kemper. Throughout the post, I’ll use the running example of binary forms (homogeneous polynomials in two variables) and the action of $GL(2)$ on them, which is the classical setting for invariant theory¹.

I also make use of Claude and GPT where appropriate (although I have personally reviewed all the outputs). These are my notes, not a textbook or peer-reviewed paper. As always, be cautious of mistakes in my exposition, check them against the sources, and let me know if you find any errors.

Background

Invariant theory began with the study of polynomials and their geometric properties (those properties that do not depend on a particular choice of coordinates). For example, the multiplicity patterns of roots of a polynomial are invariant under changes of coordinates. The discriminant of a polynomial is an invariant that tells us whether the roots are distinct or not. The coefficients of a polynomial are not invariant, but they transform in a specific way under changes of coordinates. If you have a systematic way to determine the invariants of a polynomial, you can classify and understand its geometric properties without reference to a particular coordinate system.

Homogeneous Polynomials

We start by considering homogeneous polynomials (with coefficients drawn from a field of characteristic zero), also called “forms”. A binary form is a degree-$n$ homogeneous polynomial in two variables, defined as

$$Q(x, y) = \sum_{i=0}^{n} \binom{n}{i} a_i \, x^i y^{n-i}$$

We are interested in the geometric properties of these forms, by which we mean properties that do not depend on a particular choice of coordinates. Since we don’t care about the choice of coordinates, we should consider the transformations that can change those coordinates. In this case, linear changes of variables:

$$\bar{x} = \alpha x + \beta y, \qquad \bar{y} = \gamma x + \delta y$$

This should remind you of an (invertible) matrix transformation (with nonzero determinant), and indeed we can write this as:

$$\begin{pmatrix} \bar{x} \\ \bar{y} \end{pmatrix} = \begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \alpha\delta - \beta\gamma \neq 0$$

So we can transform one binary form to another by:

$$\bar{Q}(\bar{x}, \bar{y}) = \bar{Q}(\alpha x + \beta y, \gamma x + \delta y) = Q(x, y)$$

Olver gives an explicit formula for the coefficients of the transformed form, but we won’t need it here. The point is that we can transform one form to another by applying a linear transformation to the variables.

For each homogeneous polynomial $Q(x, y)$ (of fixed degree $n$), and given a scalar variable $p$, we can also associate a corresponding inhomogeneous polynomial $Q(p)$² in $p$ by substituting $x = p$ and $y = 1$. That is:

$$Q(p) = Q(p, 1)$$

Conversely, we can associate any inhomogeneous polynomial in one variable, $Q(p)$, with a binary form by substituting $p = x/y$. That is, if we specify $n$ in advance, we have:

$$Q(x, y) = y^n \, Q\!\left(\frac{x}{y}\right)$$

We now have two transformations, one on coordinates and one between homogeneous and inhomogeneous polynomials. We can combine these two transformations to get a transformation on inhomogeneous polynomials.

This leads us to view our transformation as a linear fractional transformation:

$$\bar{p} = \frac{\alpha p + \beta}{\gamma p + \delta}$$

such that

$$Q(p) = (\gamma p + \delta)^n \, \bar{Q}(\bar{p})$$
Invariants and Covariants

We are now ready to define an invariant for a binary form. An invariant is a function of the coefficients of the form that is unchanged by the linear transformation.

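Definition 1 (Invariant) An invariant of a binary form $Q$ of degree $n$ is a function $I(a)$ of the coefficients $a = (a_0, \dots, a_n)$ such that (up to some factor) it does not change under linear transformation:

$$I(a) = (\alpha\delta - \beta\gamma)^k \, I(\bar{a})$$

Here $\bar{a}$ denotes the coefficients of the transformed form and $\alpha\delta - \beta\gamma$ is the determinant of the change of variables. The power $k$ is called the weight of the invariant. If $k = 0$, then the invariant is called an absolute invariant, since it is completely unchanged by the transformation.

We can also define a covariant, which is a function of the coefficients and the variables that transforms in a specific way under the linear transformation.

Definition 2 (Covariant) A covariant of weight $k$ of a binary form of degree $n$ is a function $J(a, x, y)$ such that

$$J(a, x, y) = (\alpha\delta - \beta\gamma)^k \, J(\bar{a}, \bar{x}, \bar{y})$$

So an invariant is a covariant that does not depend on the variables.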

Product of Covariants

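Given two covariants $J_1, J_2$ of weight $k$ and $l$ respectively, their product $J_1 J_2$ is a covariant of weight $k + l$. Writing $A$ for the matrix of the change of variables and a bar for the transformed quantity:

$$J_1 J_2 = (\det A)^{k} \bar{J}_1 \cdot (\det A)^{l} \bar{J}_2 = (\det A)^{k + l} \, \bar{J}_1 \bar{J}_2$$

Sum of Covariants

Given two covariants $J_1, J_2$ of the same weight $k$, their sum $J_1 + J_2$ is also a covariant of weight $k$:

$$J_1 + J_2 = (\det A)^{k} \bar{J}_1 + (\det A)^{k} \bar{J}_2 = (\det A)^{k} \left(\bar{J}_1 + \bar{J}_2\right)$$

The constant $0$ is trivially a covariant of any weight, and the constant $1$ is a covariant of weight $0$. Therefore, covariants of a fixed weight form a vector space, and the set of all covariants forms a ring, graded by weight. We call this the algebra of polynomial covariants (over a field of characteristic zero). The invariants (covariants that do not depend on the variables $x, y$) form a subring of it.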

Representation Theory

We are interested in how GL(2) acts on the space of binary forms. Since binary forms are already polynomials, we can think of them as vectors in a vector space. So we have a group (GL(2)) acting on a vector space (the space of binary forms).

Since this is a map from the group GL(2) to the general linear group of the vector space of binary forms, it is a representation (by definition). Therefore, we can use tools from the representation theory of GL(2) to help understand what kinds of actions GL(2) can perform on the polynomials.

(Note that this entails finding other matrices, not necessarily of dimension 2x2, that represent the group elements of GL(2)).

Furthermore, if we could decompose said representation into irreducible subrepresentations, we would know which subspaces of the space of binary forms “map to themselves” under the action of GL(2) (or even better, are trivial). By viewing the subspaces, we can find invariants and covariants.

So let’s start by reviewing some basic definitions from representation theory, and then apply them to the specific case of binary forms.

(Before we proceed, note that we may switch between $GL(2)$ and $SL(2)$. For representation theory (Clebsch-Gordan, Schur’s lemma), it is cleaner to work with $SL(2)$, since it removes the determinant character and makes the symmetric powers irreducible representations. For classical invariant theory, the natural transformation group on binary forms is $GL(2)$, and invariants for $GL(2)$ typically transform by a power of the determinant. Equivalently, $GL(2)$-invariants are $SL(2)$-invariants with additional bookkeeping for the determinant weights.)

Representations

Definition 3 (Representation) A representation of a group $G$ on a vector space $V$ is a homomorphism $\rho: G \to GL(V)$.

We define for each $g \in G$ an invertible linear map $\rho(g): V \to V$, such that $\rho(gh) = \rho(g)\rho(h)$ and $\rho(e) = I$, where $e$ is the identity element of $G$.

We often suppress $\rho$ and write $g \cdot v$ for $\rho(g)v$. Convention refers to $V$ as a “representation of $G$” ($V$ by itself is just a vector space. Technically “a representation” means the map $\rho$, but since $\rho$ is usually fixed, sometimes it is labelled by the target space).

A subrepresentation of $V$ (with action $\rho$) is a subspace $W \subseteq V$ such that for all $g \in G$ and $w \in W$, we have $g \cdot w \in W$ ($W$ is closed under the action of $G$).

It’s worth noting that if we choose a basis for $V$, each $\rho(g)$ can be associated with an invertible matrix. The representation is then a map $\rho: G \to GL_n(k)$, where $n = \dim V$. A different choice of basis changes these matrices by conjugation by an element of $GL_n(k)$, so the representation is really a homomorphism into $GL_n(k)$ up to conjugation.

Irreducibility and Complete Reducibility

Definition 4 (Irreducible Representation) We say that a representation $V$ is irreducible if its only subrepresentations are $\{0\}$ and $V$ itself.

Definition 5 (Completely Reducible Representation) We say that a representation $V$ is completely reducible if it decomposes as a direct sum of irreducible representations:

$$V = V_1 \oplus V_2 \oplus \cdots \oplus V_m$$

where each $V_i$ is a subspace of $V$ and $G$ acts on each $V_i$ by restricting the original action.

In other words, a representation is completely reducible if every vector in $V$ can be uniquely written as a sum of vectors from the various $V_i$’s.

As an example, consider the action of $S_3$ on $\mathbb{C}^3$, where $S_3$ permutes coordinates.

Complete Reducibility Example

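Consider the subspace $W = \operatorname{span}(e_1 - e_2, e_2 - e_3)$, where $e_i$ is the standard basis vector with a 1 in the $i$-th coordinate and 0 elsewhere.

All elements of this subspace can be written as $(a, b - a, -b)$ for some $a, b \in \mathbb{C}$. Equivalently, $W$ is the set of vectors whose coordinates sum to zero.

If we apply $\sigma \in S_3$ to a vector in this subspace, we get:

$$\sigma \cdot (x_1, x_2, x_3) = (x_{\sigma^{-1}(1)}, x_{\sigma^{-1}(2)}, x_{\sigma^{-1}(3)})$$

whose coordinates still sum to zero, so it is an element of the same subspace.

Now consider the subspace $U = \operatorname{span}(e_1 + e_2 + e_3)$ (i.e. all coordinates equal). This is also closed under the action of $S_3$.

Since the two subspaces only intersect at the zero vector, the two representations are complementary. Since they are two-dimensional and one-dimensional, respectively, we have a direct sum decomposition:

$$\mathbb{C}^3 = U \oplus W$$

So the action of $S_3$ on $\mathbb{C}^3$ is completely reducible because it decomposes as a direct sum of the trivial representation (where $S_3$ fixes everything) and an irreducible 2-dimensional representation.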
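This decomposition can be checked by brute force over all six permutations. The sketch below is only an illustration with made-up helper names (act, split): it splits a vector into its mean part (on the line spanned by $(1,1,1)$) and its sum-zero part, and verifies that permuting coordinates respects the split.

```python
from fractions import Fraction
from itertools import permutations

def act(perm, v):
    """Permute coordinates: the entry at position i moves to position perm[i]."""
    out = [None] * len(v)
    for i, x in enumerate(v):
        out[perm[i]] = x
    return tuple(out)

def split(v):
    """Decompose v into (part on the line through (1,...,1), sum-zero part)."""
    mean = Fraction(sum(v), len(v))
    trivial = tuple(mean for _ in v)
    sum_zero = tuple(x - mean for x in v)
    return trivial, sum_zero

v = (Fraction(1), Fraction(2), Fraction(6))
trivial, sum_zero = split(v)
print(trivial, sum_zero)

for perm in permutations(range(3)):
    t2, s2 = split(act(perm, v))
    assert t2 == trivial              # the line is fixed pointwise
    assert sum(s2) == 0               # the image stays in the sum-zero plane
    assert s2 == act(perm, sum_zero)  # splitting commutes with the action
```

This checks that both pieces are subrepresentations. It does not check that the 2-dimensional piece is irreducible, which needs a separate argument.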

Irreducibility Example

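Not every representation is completely reducible. Consider the action of the additive group $(\mathbb{C}, +)$ on $\mathbb{C}^2$ given by

$$t \mapsto \begin{pmatrix} 1 & t \\ 0 & 1 \end{pmatrix}$$

The subspace $\operatorname{span}(e_1)$ is closed under the action of the group, so it is a subrepresentation. However, there is no complementary subrepresentation, since under the action, any invariant subspace that contains a vector of the form $(a, b)$ with $b \neq 0$ must also contain all vectors of the form $(a + tb, b)$ for $t \in \mathbb{C}$, and these span the entire space. So this representation is not completely reducible.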

Schur’s Lemma

The same group can act on different vector spaces in different ways. Each such action is a representation. Schur’s lemma tells us what maps between two such representations can look like.

Definition 6 (Equivariant Linear Map) Consider a linear map , where and are representations of a group . If for all and , we have , we say that is a -equivariant linear map.

This should remind you of our earlier exploration of equivariance. The idea is that the map “commutes” with the action of $G$.

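Lemma 1 (Schur’s Lemma) Let $V$ and $W$ be irreducible finite-dimensional representations of $G$ over an algebraically closed field $k$. If $\phi: V \to W$ is a $G$-equivariant linear map, then:
1. $\phi$ is either zero or an isomorphism.
2. If $V = W$, then $\phi = \lambda\,\mathrm{id}_V$ for some scalar $\lambda \in k$.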

Proof.

(1): If $v \in \ker\phi$, then $\phi(g \cdot v) = g \cdot \phi(v) = 0$, and so the kernel is an invariant subspace of $V$. By irreducibility, $\ker\phi$ is either $\{0\}$ or $V$, as those are the only invariant subspaces of $V$.

Similarly, the image is an invariant subspace of , so it is either or . If and , then is an isomorphism. Otherwise .

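(2): Since $k$ is algebraically closed, the characteristic polynomial $\det(\phi - \lambda\,\mathrm{id}_V)$ has at least one root $\lambda \in k$. Consider the map $\phi - \lambda\,\mathrm{id}_V$ for this $\lambda$. Its kernel is nontrivial, since it contains an eigenvector of $\phi$.

Since we have $(\phi - \lambda\,\mathrm{id}_V)(g \cdot v) = g \cdot (\phi - \lambda\,\mathrm{id}_V)(v)$, the map $\phi - \lambda\,\mathrm{id}_V$ is equivariant.

Since $\phi - \lambda\,\mathrm{id}_V$ is equivariant and $V$ is irreducible, by part (1) it must be zero or an isomorphism. Since the kernel is nontrivial, $\phi - \lambda\,\mathrm{id}_V$ is zero. So $\phi = \lambda\,\mathrm{id}_V$.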

Schur’s lemma constrains maps between irreducible representations. The space of equivariant maps $V \to W$ is zero if $V \not\cong W$ or one-dimensional if $V \cong W$. So given some representation $V$, the projection of $V$ onto its decomposition (by projecting it onto each irreducible summand) is unique (up to scaling).

Basically, the combination of irreducibility and the group structure reduces linear algebra to scalar algebra, which is why representation theory is so powerful for reductive groups.

Symmetric Powers

Let us now apply some of these ideas to the specific case of binary forms. We want to understand how acts on the space of binary forms of degree . We’ll use instead of to remove the determinant factor, which will make things simpler.

Definition 7 Denote the space of homogeneous polynomials of degree in two variables as .

We call this the “-th symmetric power of ”.

If with basis and coordinates , then has basis and dimension .

Basically, this is just another set of notation for the space of binary forms of degree .

If acts on (one representation), this induces a different action of on (a new representation, on a bigger space) by:

This is exactly the action of on binary forms that we have been studying.

Symmetric Powers are Irreducible Representations of

The action of on by coordinate substitution is an irreducible representation.

To be more specific, each matrix in $SL(2)$ induces an $(n+1) \times (n+1)$ matrix on the space of degree-$n$ binary forms, and there is no proper nonzero subspace of degree-$n$ binary forms that all of these matrices simultaneously preserve.

So we are representing elements of $SL(2)$ (which are $2 \times 2$ matrices) by $(n+1)$-dimensional matrices acting as linear transformations on the space of binary forms.

Proof.

acts on by linear substitution. For and , we have:

This is the action on binary forms from earlier (the inverse $g^{-1}$ ensures that this is a left action, $(gh) \cdot f = g \cdot (h \cdot f)$).

So is a representation of .

We want to show it is irreducible. To do this, we need to show that its only invariant subspaces are and itself.

Let be some nonzero invariant subspace. We must show that .

Every nonzero polynomial in factors (over ) as a product of linear forms.

The -th powers are transitive under the action of .

acts transitively on the nonzero vectors of . Given nonzero , pick such that and such that . Then

A linear form is determined by its coefficient vector . Since whenever , the fact that the coefficient vectors are transitive implies that the th powers are transitive.

So, for any nonzero linear forms there exists with .

The -th powers span .

Every monomial can be written as a linear combination of -th powers. Expand

and evaluate at . This gives equations in the unknowns :

The matrix has -entry for and . Its determinant is , so the system is invertible and each monomial is a linear combination of the -th powers on the right-hand side.

(Conclusion)

So, if contains any single -th power, then it contains all -th powers (1), and these span the whole space (2).

So we just need to show contains a single -th power.

Pick a nonzero and write . Consider the diagonal matrices , which act on elements of by

So . The exponents are distinct for . Evaluating at distinct values of and solving the same invertible system as in (2) extracts each monomial with individually.

Since is closed under the action of and under linear combinations, there is some monomial .

Now apply , which acts by :

The coefficient is . So has nonzero term. Apply the diagonal trick again to extract .

Since is an -th power, we are done.

The significance is that the space of binary forms of degree is an irreducible representation of .

Computing Invariants and Covariants

We know that invariants and covariants form a ring, but how do we compute the actual elements of this ring?

Finite Groups

Let’s start by considering a slightly more abstract problem. We have a group acting on a vector space , and we want to find the subspace of that is invariant under the action of . That is, we want to find the set of vectors such that for all , .

If we want to produce a polynomial that is invariant under a group , one idea is to average (or sum) over all possible transformations. For a finite group, we can simply sum:

The average is invariant. For any ,

What’s happened? The map is a bijection on (its inverse is ), and so we are summing the same terms in a different order. Applying to every term in the sum just permutes the terms.

What’s more, if some is already invariant, then . This is because for all , so the sum just gives us copies of , which we then divide by to get back .

The above operation is called the Reynolds operator, and it is a linear projection from onto the subspace of -invariant vectors.
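To make the averaging concrete, here is a minimal sketch of the finite-group Reynolds operator, assuming the group acts by permuting variables and that polynomials are stored as dictionaries from exponent tuples to coefficients (this representation and the names `permute_poly` and `reynolds` are illustrative choices, not a fixed API).

```python
from fractions import Fraction
from itertools import permutations

def permute_poly(perm, poly):
    """Apply a variable permutation to a polynomial {exponents: coeff}."""
    out = {}
    for exps, c in poly.items():
        new = tuple(exps[perm[i]] for i in range(len(exps)))
        out[new] = out.get(new, 0) + c
    return out

def reynolds(poly, group):
    """Average the polynomial over every element of a finite group."""
    total = {}
    for g in group:
        for exps, c in permute_poly(g, poly).items():
            total[exps] = total.get(exps, 0) + Fraction(c)
    return {e: c / len(group) for e, c in total.items() if c != 0}

S3 = list(permutations(range(3)))
x1_squared = {(2, 0, 0): 1}           # the monomial x1^2
averaged = reynolds(x1_squared, S3)   # (x1^2 + x2^2 + x3^2) / 3
print(averaged)
```

Applying `reynolds` a second time changes nothing, which is the projection property described above, and every permutation fixes the averaged polynomial.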

Conjugation and Equivariance

If $T$ is a linear map from $V$ to itself, then we can define an action of $G$ on $T$ by conjugation:

$$g \cdot T = g\,T\,g^{-1}.$$

Since $g\,T\,g^{-1} = T$ if and only if $g\,T = T\,g$, the map $T$ is $G$-equivariant if and only if $g \cdot T = T$ for all $g \in G$. So the space of $G$-equivariant maps from $V$ to itself is exactly the space of maps that are invariant under conjugation by $G$.

This will motivate the construction of the Reynolds operator in the proof of Maschke’s theorem, where we will average over the group to produce a -equivariant projection.

Maschke’s Theorem for Finite Groups

Theorem 1 (Maschke’s Theorem for Finite Groups) Let be a representation of a finite group , where is a (finite-dimensional) vector space over a field . If the characteristic of the field does not divide , then is completely reducible.

Proof.

We want to show that decomposes as a direct sum of irreducible representations. It suffices to show that for any subrepresentation , there is a complementary subrepresentation such that . If this is true, then we can apply the same argument to and to find complementary subrepresentations, and so on, until we have decomposed into irreducible representations.

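Let $V$ be a representation of $G$, and let $W \subseteq V$ be a subrepresentation, that is, a subspace closed under the action of $G$.

Choose a linear projection $\pi_0: V \to W$ (that is, $\pi_0(V) = W$ and $\pi_0(w) = w$ for all $w \in W$). Such a projection exists by standard linear algebra since we assumed finite dimensions, but it need not be $G$-equivariant.

Each $g \in G$ acts as a linear map $V \to V$. Then we can define a new projection (by averaging over the group):

$$\pi = \frac{1}{|G|} \sum_{g \in G} g\,\pi_0\,g^{-1}.$$

This new projection is invariant under conjugation, since for any $h \in G$ we have

$$h\,\pi\,h^{-1} = \frac{1}{|G|} \sum_{g \in G} (hg)\,\pi_0\,(hg)^{-1} = \pi.$$

(Note that if the characteristic of $k$ divides $|G|$, then we cannot divide by $|G|$ and this construction fails, which is why the condition on the characteristic is necessary.)

Since $\pi$ is invariant under the conjugation action of $G$, it is $G$-equivariant.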

We know that the image of is , since for all , and for all . So the kernel of is a complementary subrepresentation to . By induction on the dimension of , we can decompose into irreducible subrepresentations.

Essentially, we have taken our complements from ordinary linear algebra and equipped them with -equivariance.
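The averaging step in the proof can be run directly. This is a small sketch under my own assumptions: the group is $S_3$ acting on a 3-dimensional space by permutation matrices, $W$ is the line spanned by $(1,1,1)$, and the starting projection is the deliberately non-equivariant map $v \mapsto v_1 \cdot (1,1,1)$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def perm_matrix(perm):
    """Matrix sending basis vector e_j to e_{perm[j]}."""
    n = len(perm)
    return [[Fraction(int(perm[j] == i)) for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def average_projection(pi0, group):
    """Average g * pi0 * g^{-1} over the group (g^{-1} = g^T for permutation matrices)."""
    n = len(pi0)
    total = [[Fraction(0)] * n for _ in range(n)]
    for g in group:
        conj = matmul(matmul(g, pi0), transpose(g))
        total = [[total[i][j] + conj[i][j] for j in range(n)] for i in range(n)]
    return [[x / len(group) for x in row] for row in total]

GROUP = [perm_matrix(p) for p in permutations(range(3))]
# A projection onto W = span{(1,1,1)} that is NOT equivariant: v -> v_1 * (1,1,1).
PI0 = [[Fraction(1), Fraction(0), Fraction(0)] for _ in range(3)]
PI = average_projection(PI0, GROUP)

print(PI)
print(all(matmul(g, PI) == matmul(PI, g) for g in GROUP))
```

The averaged matrix has every entry equal to 1/3, so it sends a vector to its mean times $(1,1,1)$. Its kernel is the sum-zero subspace, which is the invariant complement the theorem promises.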

Compact Groups

Sadly, is not a finite group, so we can’t just sum over all transformations. However, we can still try an analogous trick. Instead of summing, we could integrate.

When does the Reynolds operator exist for continuous groups? We need a measure to integrate over a continuous group such that no “region” of the group is weighted more than any other. If we had such a measure , then we could define the Reynolds operator as

for some measure on the group . Then for any :

Luckily for us, in 1933, Haar proved that every compact group has such a measure, and that it is essentially unique³.

How can we construct a Haar measure? If we assume that the group is also smooth (i.e. it is a Lie group), then we can construct the Haar measure concretely. In fact, there is an “algorithm” to do so⁴:

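Parametrize the group: write each element as $g = g(\theta_1, \ldots, \theta_n)$
Compute the Maurer-Cartan form $\omega = g^{-1}\,dg$
Expand in a basis $X_1, \ldots, X_n$ of the Lie algebra: $\omega = \sum_i \omega^i X_i$
The Haar measure is $d\mu = \omega^1 \wedge \cdots \wedge \omega^n$ (up to normalization)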

For , we can parametrize by , then compute

and the Haar measure is (to normalize the total measure to 1).

We’ll probably explicitly write code for this when we look at computational invariant theory, but the key point is that for compact groups, we can construct a Reynolds operator by integrating over the group with respect to the Haar measure. This allows us to compute invariants and covariants for compact groups.
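As a rough numerical illustration of integrating against a Haar measure, here is a sketch that assumes the compact group is the rotation group $SO(2)$ with normalized measure $d\theta / 2\pi$ (my choice of example; the function name `reynolds_so2` is hypothetical). For a polynomial of degree at most $d$, sampling $d + 1$ equally spaced angles reproduces the integral exactly up to floating-point error, because the integrand is a trigonometric polynomial of degree at most $d$ in $\theta$.

```python
import math

def reynolds_so2(f, degree_bound):
    """Average f over all rotations of the plane.

    Uses degree_bound + 1 equally spaced angles, which is exact (up to
    rounding) when f is a polynomial of degree <= degree_bound.
    """
    n = degree_bound + 1

    def averaged(x, y):
        total = 0.0
        for k in range(n):
            t = 2 * math.pi * k / n
            c, s = math.cos(t), math.sin(t)
            total += f(c * x - s * y, s * x + c * y)
        return total / n

    return averaged

# Averaging x^2 over rotations should give (x^2 + y^2) / 2.
avg_x_squared = reynolds_so2(lambda x, y: x * x, degree_bound=2)
print(avg_x_squared(1.0, 2.0))
```

The result depends only on $x^2 + y^2$, so it is rotation-invariant, as a Reynolds operator output should be.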

Maschke’s Theorem for Compact Groups

Theorem 2 (Maschke’s Theorem for Compact Groups) Let be a compact Lie group with normalized Haar measure and a finite-dimensional vector space over .

Given a continuous finite-dimensional representation and a -stable subspace , choose a linear projection with and for all .

Define the averaged operator

Then is invariant under the -action (by the change of variables and left-invariance of ). So is -equivariant.

Also , since for all , and for all .

Hence is a complement to that is invariant under the action of .

Reductive Groups

Unfortunately, is also noncompact. This means that the group is not bounded, and so we cannot integrate over it in a way that gives us a finite result. The integral can diverge.

To see this, we can decompose a linear transformation in as follows (using the Iwasawa decomposition):

The entries are unbounded. So we can try to integrate over the group by integrating over , but the integrals over and could diverge.

However, GL(2) is reductive, which means that it has a nice representation theory that allows us to compute the invariants and covariants without needing to integrate. I cover the relevant representation theory above.

Theorem 3 (Reynolds Operator for Reductive Groups) If is a reductive group acting on a vector space over a field (of characteristic zero), then there exists a Reynolds operator .

Proof. If $G$ is reductive, then by definition every finite-dimensional representation is completely reducible. In particular, each graded piece $k[V]_d$ (which in this case consists of the homogeneous polynomials of degree $d$) decomposes as a direct sum of irreducible representations. The invariant subspace $k[V]_d^G$ is the sum of all copies of the trivial representation in this decomposition (since that’s where $g \cdot f = f$ for all $g \in G$). Complete reducibility guarantees that this summand has a $G$-stable complement and that the $G$-equivariant projection onto it is unique. Applied degree by degree, this projection is just the Reynolds operator. We will also see later (once we have the Hilbert basis theorem) that this is enough to guarantee that the invariant ring is finitely generated.

The reductive groups have been completely classified⁵. They include all finite groups, compact Lie groups, and , , , over characteristic zero. I won’t go into the details here. It will suffice for our purpose to simply know that we can check the list to see if a given group is reductive or not⁶. See below for the details on .

Definition of Reductivity

Definition 8 (Reductive) A group is reductive if every finite-dimensional representation of is completely reducible. That is, a group is reductive if for every homomorphism , the representation decomposes as a direct sum of irreducible representations.

Here are some reductive groups (over fields of characteristic zero):

All finite groups (the Reynolds operator projects the space onto its invariants).
All compact Lie groups (same argument, with integration replacing the sum).
The classical groups: , , , .

The additive group is not reductive.

What we’ve done so far is enough to show all of the groups that we claimed above were reductive, except and (as they are not compact). However, we can show that is reductive by showing that it contains a compact subgroup (the unitary group ) such that every representation of restricts to a representation of that is completely reducible.

Reductivity of

Theorem 4 (Reductivity of ) is reductive.

Proof (Based on Weyl’s Unitary Trick).

We will borrow some theorems from linear algebra to do this proof.

Any matrix can be written as , where is unitary and is positive-definite Hermitian. This is the polar decomposition of .

The unitary group is compact, so by Maschke’s theorem for compact groups, every representation of is completely reducible.

Since is a subgroup of , any representation of restricts to a representation of . Since the representation of is completely reducible, it decomposes as a direct sum of irreducible representations of .

Next we will show:

If is a representation of , and is a -equivariant map, then is also -equivariant.

If we can do this, then by Schur’s lemma, the projection of onto each irreducible summand of the -representation is unique up to scaling, and since these projections are also -equivariant, they are also projections onto irreducible summands of the -representation. So the decomposition of into irreducible representations of is also a decomposition into irreducible representations of , and thus every representation of is completely reducible.

We already know that every can be written as for some and positive-definite Hermitian.

Since commutes with the representation , we just need to show that also commutes with the representation .

Since is positive-definite Hermitian, it can be diagonalized by a unitary matrix. So we can write , where and is a diagonal matrix with positive real entries on the diagonal.

Since commutes with the representation of , we just need to show that also commutes with the representation (positive diagonal matrices).

Now we will show: if commutes with the representation , then it must commute with the representation for all positive diagonal matrices .

We can write for some diagonal matrix with real entries on the diagonal.

Because the representation is polynomial (hence analytic), the map is a differentiable one-parameter subgroup of . Since commutes with for all real (these are diagonal unitary matrices), differentiating at shows that also commutes with the infinitesimal action of . Exponentiating again implies that commutes with for all , hence with all positive diagonal matrices. By unitary conjugation it therefore commutes with for every positive-definite Hermitian matrix . Since every can be written as with and positive-definite Hermitian, commutes with for all .

Thus every -equivariant map is -equivariant.

Tensor Products

We are interested in maps between binary forms of different degrees, as we are trying to understand binary forms under change of basis by . For example, given two binary forms and , we might want to construct a covariant of degree from them. We can think of this as constructing a -equivariant map from to , since any polynomial built from and lives in the tensor product .

Representations are closed under both direct sums and tensor products.

If and are representations of , then their direct sum is also a representation, with acting componentwise: .

Tensor products are more interesting. If and are representations of , then their tensor product is also a representation of , with the action defined by .

Lemma 2 (Tensor Product of Representations) If and are representations of a group , then their tensor product is also a representation of , with the action defined by .

Proof. We need to check that if is a representation of , and is a representation of , then is a representation of . For any and , :

Also .
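In matrix terms, the tensor product action is the Kronecker product of the two representing matrices, and the homomorphism property is the mixed-product identity $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$. The sketch below checks this on small integer matrices. The helpers `matmul` and `kron` are written out by hand for illustration, and the sample matrices are arbitrary.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def kron(a, b):
    """Kronecker product: rows indexed by (i, k), columns by (j, l)."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

# Stand-ins for rho(g), sigma(g), rho(h), sigma(h).
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 1]]
D = [[1, 1], [0, 1]]

lhs = kron(matmul(A, C), matmul(B, D))    # (rho tensor sigma)(gh)
rhs = matmul(kron(A, B), kron(C, D))      # (rho tensor sigma)(g) (rho tensor sigma)(h)
print(lhs == rhs)
```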

The Clebsch-Gordan Decomposition

Unfortunately, even if and are irreducible, is not necessarily irreducible. Instead, it decomposes as a direct sum of irreducible representations. The problem of determining how decomposes into irreducibles is called the Clebsch-Gordan problem.

Motivation

Why do we care about this? We know binary forms of degree live in:

So given two binary forms $f$ and $g$ of degrees $m$ and $n$, any expression that is bilinear in $f$ and $g$ lives in the tensor product $S^m(V) \otimes S^n(V)$.

Covariants constructed from and come from -equivariant maps

How can we find the irreducible representations that appear in the decomposition of ?

Start⁷ by viewing an element of

as a polynomial in two pairs of variables, , that is homogeneous of degree in and degree in .

So:

(The space of bihomogeneous polynomials of bidegree ).

The group $SL(2)$ (as a stand-in for $GL(2)$) acts on both pairs simultaneously by linear substitution.

Diagonal Restriction

Let be an -equivariant map

obtained by identifying the two pairs of variables:

In other words, we restrict the polynomial to the diagonal

The image consists exactly of homogeneous polynomials of degree , so

The Kernel

Which bihomogeneous polynomials vanish on the diagonal?

The diagonal in is defined by the equation

Denote this determinant by

Any polynomial that vanishes on the diagonal must therefore be divisible by .

(Since is linear in each pair of variables, it is irreducible. The quotient ring is therefore a domain, so if vanishes wherever does, then in this quotient, meaning divides .)

Thus

Multiplication by raises the degree in each pair by one, so this space is naturally isomorphic to

We therefore obtain an exact sequence

Iterating the Construction

Applying the same argument to yields

Continuing inductively produces a filtration whose successive quotients are

until the process terminates after steps (since we can’t have negative degree).

Clebsch–Gordan Decomposition

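Theorem 5 (Clebsch–Gordan Decomposition) For $m \ge n$,

$$S^m(V) \otimes S^n(V) \cong \bigoplus_{k=0}^{n} S^{m+n-2k}(V).$$

Equivalently,

$$S^m(V) \otimes S^n(V) \cong S^{m+n}(V) \oplus S^{m+n-2}(V) \oplus \cdots \oplus S^{m-n}(V).$$

Proof.

We have already shown that $S^{m+n-2k}(V)$ appears as a quotient in the filtration of $S^m(V) \otimes S^n(V)$ for each $0 \le k \le n$.

Since $SL(2)$ is reductive, each short exact sequence in this filtration splits, so these quotients appear as $SL(2)$-stable direct summands and we obtain an $SL(2)$-equivariant inclusion

$$\bigoplus_{k=0}^{n} S^{m+n-2k}(V) \hookrightarrow S^m(V) \otimes S^n(V).$$

Finally, observe that

$$\sum_{k=0}^{n} (m + n - 2k + 1) = (m+1)(n+1) = \dim\left(S^m(V) \otimes S^n(V)\right),$$

so the inclusion is an equality.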
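The dimension count that closes the proof is easy to check by machine. This sketch assumes the standard statement that the tensor product of the degree-$m$ and degree-$n$ symmetric powers splits into symmetric powers of degrees $m+n, m+n-2, \ldots, |m-n|$, each of dimension one more than its degree. The function names are my own.

```python
def clebsch_gordan_degrees(m, n):
    """Degrees of the symmetric powers appearing in the tensor product."""
    return [m + n - 2 * k for k in range(min(m, n) + 1)]

def dim_sym(d):
    """Dimension of the space of binary forms of degree d."""
    return d + 1

def dimensions_match(m, n):
    """Compare the summed dimensions with the dimension of the tensor product."""
    summed = sum(dim_sym(d) for d in clebsch_gordan_degrees(m, n))
    return summed == dim_sym(m) * dim_sym(n)

print(clebsch_gordan_degrees(3, 2))
print(all(dimensions_match(m, n) for m in range(10) for n in range(10)))
```

For a cubic and a quadratic, the summands have degrees 5, 3, and 1, with dimensions 6 + 4 + 2 = 12 = 4 × 3.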

The Clebsch–Gordan decomposition tells us exactly which irreducible representations occur inside the tensor product . Each representation

appears once.

As a consequence, any –equivariant linear map

must be unique up to a scalar multiple.

As we have already seen, by Schur’s lemma the space of equivariant maps between two irreducible representations is one-dimensional when the representations are isomorphic and zero otherwise.

Thus, by the Clebsch–Gordan decomposition, for each there exists a unique canonical equivariant projection

(up to scaling).

We call these projections the transvectants. They are the building blocks of all -equivariant maps between symmetric powers (and, as we will see, the building blocks of all covariants of binary forms).

Computing Transvectants

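Writing the binary forms as $f(x, y)$ and $g(x, y)$, the projection onto $S^{m+n-2k}(V)$ is (up to normalization) the $k$-th transvectant:

$$(f, g)_k = \sum_{i=0}^{k} (-1)^i \binom{k}{i}\, \frac{\partial^k f}{\partial x^{k-i}\,\partial y^{i}}\, \frac{\partial^k g}{\partial x^{i}\,\partial y^{k-i}}.$$

To see this, note that this formula is manifestly $SL(2)$-equivariant⁸ and maps $S^m(V) \otimes S^n(V) \to S^{m+n-2k}(V)$ (each differentiation reduces degree by 1, and we differentiate $k$ times in each factor).

By Schur’s lemma, any equivariant map between these spaces is unique up to scalar, so the transvectant must be the Clebsch-Gordan projection (up to normalization).

The first transvectant ($k = 1$) is the Jacobian:

$$(f, g)_1 = \frac{\partial f}{\partial x}\frac{\partial g}{\partial y} - \frac{\partial f}{\partial y}\frac{\partial g}{\partial x}.$$

The second self-transvectant ($g = f$, $k = 2$) gives the Hessian (up to a factor of 2):

$$(f, f)_2 = 2\left(\frac{\partial^2 f}{\partial x^2}\frac{\partial^2 f}{\partial y^2} - \left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2\right).$$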

Since the Clebsch-Gordan decomposition is complete, the transvectants cover all -equivariant pairings between symmetric powers.
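Here is a small sketch of a transvectant computation. It assumes the unnormalized convention $(f, g)_k = \sum_{i=0}^{k} (-1)^i \binom{k}{i}\, \partial_x^{k-i}\partial_y^{i} f \cdot \partial_x^{i}\partial_y^{k-i} g$ (other references differ by a scalar factor), and it stores a binary form as a dictionary from exponent pairs `(i, j)` to the coefficient of $x^i y^j$. The helper names are illustrative.

```python
from math import comb

def diff(poly, dx, dy):
    """Differentiate {(i, j): c} (meaning c * x^i * y^j) dx times in x and dy times in y."""
    out = {}
    for (i, j), c in poly.items():
        if i < dx or j < dy:
            continue
        for a in range(dx):
            c *= i - a
        for b in range(dy):
            c *= j - b
        key = (i - dx, j - dy)
        out[key] = out.get(key, 0) + c
    return out

def mul(p, q):
    """Multiply two polynomials in the same dictionary representation."""
    out = {}
    for (i, j), c in p.items():
        for (k, l), d in q.items():
            key = (i + k, j + l)
            out[key] = out.get(key, 0) + c * d
    return out

def transvectant(f, g, k):
    """k-th transvectant under the (assumed) unnormalized convention above."""
    result = {}
    for i in range(k + 1):
        term = mul(diff(f, k - i, i), diff(g, i, k - i))
        sign = (-1) ** i * comb(k, i)
        for key, c in term.items():
            result[key] = result.get(key, 0) + sign * c
    return {key: c for key, c in result.items() if c != 0}

quadratic = {(2, 0): 1, (1, 1): 3, (0, 2): 1}          # x^2 + 3xy + y^2
print(transvectant(quadratic, quadratic, 2))            # a constant: -2 * discriminant
print(transvectant({(2, 0): 1}, {(0, 2): 1}, 1))        # Jacobian of x^2 and y^2
```

For the quadratic $x^2 + 3xy + y^2$ the discriminant is $9 - 4 = 5$, and the second self-transvectant comes out as the constant $-10$, a degree-zero covariant, i.e. an invariant.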

First Fundamental Theorem of Invariants for Binary Forms

We know the covariant ring is finitely generated, but what are the generators? We need the First Fundamental Theorem for Binary Forms under $SL(2)$. This theorem states essentially that every polynomial covariant of a system of binary forms can be written as a polynomial in the transvectants of that system. This means that if we can generate all the transvectants, then we can generate all the covariants.

Theorem 6 (First Fundamental Theorem) Let be copies of with the diagonal action of , and define

Then the invariant ring

is generated by the brackets .

Proof.

Let

with the diagonal action of on each pair . Define

(You can think of as the determinant of the matrix formed by the -th and -th columns of the matrix of variables).

Each is -invariant.

If and , then

So .

Normalize two columns using an explicit matrix.

Fix and assume . Set

Then . Define

Since and , we have , so whenever .

Also , so

Equivalently,

For , write

Because , brackets are unchanged under , so

and

Hence

So, after applying , the normalized matrix is determined by the bracket data , , .

An invariant polynomial is a polynomial in the brackets.

Let . For any point with , invariance gives

But has entries that are rational functions of the brackets (the only denominators are powers of ), so on the region we can write

for some polynomial and some .

Multiply both sides by :

Both sides are polynomials in the coordinates . Since the identity holds whenever , it also holds identically as a polynomial identity. In particular, the right-hand side is divisible by in , so

for some polynomial in the brackets.

So every -invariant polynomial lies in .

Since we have both inclusions, we conclude that

This is the symbolic version of the First Fundamental Theorem. The corresponding statement for covariants of binary forms is obtained by translating bracket expressions into iterated transvectants.

In other words, in the symbolic calculus every invariant is obtained by multiplying and combining these basic determinants. When translated back to binary forms, these determinant-contractions correspond to the iterated Clebsch-Gordan projections. For example, the Jacobian corresponds to the first transvectant , and the Hessian corresponds to the second self-transvectant .
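The first step of the proof, that each bracket is invariant, amounts to the multiplicativity of the determinant. A quick sketch, assuming the bracket of two points $p$ and $q$ in the plane is the $2 \times 2$ determinant $p_x q_y - p_y q_x$ (the sample matrices and points are arbitrary):

```python
def bracket(p, q):
    """2x2 determinant [p q] = p_x * q_y - p_y * q_x."""
    return p[0] * q[1] - p[1] * q[0]

def apply(g, p):
    """Apply the 2x2 matrix g = ((a, b), (c, d)) to the column vector p."""
    (a, b), (c, d) = g
    return (a * p[0] + b * p[1], c * p[0] + d * p[1])

def det(g):
    (a, b), (c, d) = g
    return a * d - b * c

g_sl2 = ((2, 3), (1, 2))   # determinant 1
g_gl2 = ((2, 0), (0, 3))   # determinant 6
p, q = (1, 4), (-2, 5)

# Under any invertible g the bracket is scaled by det(g), so it is fixed exactly when det(g) = 1.
print(bracket(p, q), bracket(apply(g_sl2, p), apply(g_sl2, q)))
print(bracket(apply(g_gl2, p), apply(g_gl2, q)) == det(g_gl2) * bracket(p, q))
```

This is also why working with determinant-one matrices removes the determinant factor: under general invertible matrices the brackets are only invariant up to a power of the determinant.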

Second Fundamental Theorem of Invariants for Binary Forms

The First Fundamental Theorem tells us what generates the covariant ring (the transvectants). The Second Fundamental Theorem tells us what relations those generators satisfy.

Theorem 7 (Second Fundamental Theorem) Let be the homomorphism defined by

Then is generated by the quadratic relations

for all indices .

Proof.

Let

and define the surjective homomorphism (by FFT)

Let be the ideal generated by the quadratic polynomials

for all indices .

We need to prove . We will do this in four steps:

The quadratic identities lie in the kernel.

For all ,

as polynomials in the coordinates .

Expand each bracket product:

Similarly,

and

Now add the three expansions. Every monomial cancels with an identical monomial of opposite sign (after commuting scalars), so the sum is identically zero.

Therefore

for all , hence

Straightening procedure

Call a monomial

standard if

If is not standard, then there exist two factors with

Apply the quadratic identity to these four indices and solve for the crossed product:

This rewrite replaces the crossed pair by a sum of terms where the second indices are less out of order.

To see termination, sort the factors by increasing -index and count inversions in the resulting list of -indices. Each rewrite strictly decreases this inversion count, so repeated rewriting must stop.

Hence every monomial is congruent mod to a -linear combination of standard monomials. In particular, standard monomials span

Standard monomials are linearly independent.

It suffices to prove linear independence in each homogeneous degree in the variables .

Fix such a degree , and set

Now specialize

Then for ,

Under lexicographic order with , the leading monomial of is

Therefore, if

is a standard monomial, then the leading monomial of is

Because is standard, the sequences and are weakly increasing. Since each index can occur at most times, the sums

uniquely determine the multisets and , because the are powers of and base- expansion is unique.

Thus distinct standard monomials have distinct leading monomials after this specialization. Hence their images under are linearly independent.

Conclusion

By (1), .

By (2), every element of is a linear combination of standard monomials.

By (3), the images of standard monomials in are linearly independent, so the induced map

is injective. Since is surjective, is an isomorphism.

Thus

So every relation among the bracket generators is generated by the quadratic identities.
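The quadratic identity from step 1 can be spot-checked numerically. The sketch below assumes the relation in its common Plücker form $[ij][kl] - [ik][jl] + [il][jk] = 0$ (the sign and index conventions in a given reference may differ) and tests it on random integer points. The function names are hypothetical.

```python
import random

def bracket(p, q):
    """2x2 determinant [p q] = p_x * q_y - p_y * q_x."""
    return p[0] * q[1] - p[1] * q[0]

def plucker_defect(p1, p2, p3, p4):
    """Value of [12][34] - [13][24] + [14][23]; zero for all points in the plane."""
    return (bracket(p1, p2) * bracket(p3, p4)
            - bracket(p1, p3) * bracket(p2, p4)
            + bracket(p1, p4) * bracket(p2, p3))

rng = random.Random(0)  # fixed seed so the check is reproducible

def random_point():
    return (rng.randint(-9, 9), rng.randint(-9, 9))

all_zero = all(
    plucker_defect(random_point(), random_point(), random_point(), random_point()) == 0
    for _ in range(200)
)
print(all_zero)
```

A numerical check like this only confirms that the relation lies in the kernel. The content of the theorem is the converse, that these relations generate every relation.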

Summary for Reductive Groups

For reductive groups, we can compute the invariants using the Reynolds operator, which exists by complete reducibility (Theorem 3). For finite groups and compact Lie groups, we can construct the Reynolds operator by summing or integrating over the group. And for noncompact reductive groups like $GL(2)$, we can compute the invariants using representation theory to identify the equivariant projections directly (e.g. the Clebsch-Gordan decomposition for $SL(2)$).

Nonreductive Groups

What if GL(2) was non-reductive? Hilbert’s 14th problem asked whether the invariant ring of a linear group action on a polynomial ring is always finitely generated.

For non-reductive groups, the answer can be no. In 1959, Nagata constructed an explicit counterexample: a linear action of a non-reductive group on a polynomial ring whose invariant ring requires infinitely many generators.

So we can’t necessarily use representation theory to compute the invariants, and the Reynolds operator might not exist. In fact, the invariant ring may not even be finitely generated.

Based on a Claude-based survey, it seems there are a few options in these cases, which I’ll presumably examine in the future⁹.

Overall Flow

If we look back at what we did to compute the invariants, we can see that we basically followed a process:

We have some polynomial ring¹⁰ over field of characteristic zero¹¹ and a group¹² action on . We want the invariant ring .
Is the group finite? If so, we can compute the invariants by summing over the group (Reynolds operator).
Is the group infinite, but compact? If so, we can compute the invariants by integrating over the group with respect to the Haar measure (Reynolds operator). Constructing the Haar measure depends on the topology of :
1. If is continuous, then it is a Lie group¹³, and we can construct the Haar measure explicitly via the Maurer-Cartan form (as we did above for ).
2. If is totally disconnected (e.g. profinite groups, -adic analytic groups like ), the Haar measure still exists but the construction is more difficult¹⁴.
Is the group noncompact but reductive¹⁵? If so, a Reynolds operator exists, but the actual computation uses representation theory to identify the equivariant projections directly.
Is the group non-reductive? If so, there is no general method. However, idiosyncratic methods exist for specific cases.

Our example flows down to Step 4. The unitarian trick gives us a Reynolds operator by integrating over , and representation theory (Clebsch-Gordan) tells us the result is exactly the transvectant calculus.

In all cases where a Reynolds operator exists (steps 2–4), the general algorithm for finding generators is to use the Molien series to count how many invariants to expect in each degree, apply the Reynolds operator to produce candidates, use Gröbner bases to test algebraic independence, and finally stop once the invariants found so far account for every degree the Molien series predicts. Each of these steps requires theoretical justification, which we’ll look at in the next section.
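The counting step can be sketched for the simplest case. For a finite group acting by permuting variables, the trace of a group element on degree-$d$ polynomials is the number of degree-$d$ monomials it fixes, so the Molien coefficient in degree $d$ is the average number of fixed monomials. The code below is a brute-force illustration under that assumption (permutation actions only; the names `monomials` and `invariant_dimension` are my own), not a general Molien series implementation.

```python
from itertools import permutations

def monomials(nvars, degree):
    """Yield all exponent tuples of the given total degree."""
    if nvars == 1:
        yield (degree,)
        return
    for first in range(degree + 1):
        for rest in monomials(nvars - 1, degree - first):
            yield (first,) + rest

def invariant_dimension(group, nvars, degree):
    """Dimension of the degree-d invariants of a permutation group.

    Averages, over the group, the number of monomials fixed by each element.
    """
    fixed = 0
    for g in group:
        for e in monomials(nvars, degree):
            if tuple(e[g[i]] for i in range(nvars)) == e:
                fixed += 1
    return fixed // len(group)

S3 = list(permutations(range(3)))
counts = [invariant_dimension(S3, 3, d) for d in range(7)]
print(counts)
```

For $S_3$ these counts are the numbers of partitions of $d$ into at most three parts, which matches an invariant ring generated freely by the elementary symmetric polynomials in degrees 1, 2, and 3.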

This overall process is handy for computational approaches, where we encode these algorithms as code. Probably we will mostly deal with finite or compact groups, but it’s good to have the general picture in mind.

General Theory

In addition to being able to find invariants, we also want to understand the structure of the invariant ring. For example, can we find all of the invariants? Is a particular invariant ring finitely generated? If so, what are the generators of this ring, what are the relationships between the generators, and what does the geometry of the invariant ring look like?

Finding All Invariants

To prove that we have found all the invariants for a given transformation acting on an object, we need to show that the process terminates (i.e. it does not produce an infinite sequence of new invariants) and that it is complete (i.e. it produces all the invariants).

We’ll continue to look at the example of binary forms, but the same questions apply to any group action on any object.

Noetherian Rings

We need to first introduce the concept of a Noetherian ring¹⁶.

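An ideal $I$ of a ring $R$ is a subset $I \subseteq R$ that is closed under addition and under multiplication by any element of $R$: if $a \in I$ and $b \in I$, then for any $r \in R$, $a + b \in I$ and $ra \in I$. An ideal is finitely generated if there exist finitely many elements $f_1, \ldots, f_m \in I$ such that every element of $I$ can be written as $r_1 f_1 + \cdots + r_m f_m$ for some $r_1, \ldots, r_m \in R$.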

Definition 9 (Noetherian Ring) A ring is Noetherian if every ideal of is finitely generated.

Fields, like or , are Noetherian, since their only ideals are and itself, both finitely generated.

Hilbert’s Basis Theorem

Now we are ready to state Hilbert’s Basis Theorem, which is the key result that allows us to conclude that the invariant ring of a group action on a polynomial ring is finitely generated.

Theorem 8 (Hilbert’s Basis Theorem) If is a Noetherian ring, then the polynomial ring is also Noetherian. Equivalently, if is Noetherian, then every ideal of is finitely generated.

Proof.

Fix an ideal .

If is not finitely generated, then we can construct an infinite sequence by choosing each to be an element of minimal degree in . By construction, the degrees are non-decreasing.

Let be the corresponding leading coefficient for each . Since is Noetherian, the ideal generated by these leading coefficients, , is finitely generated. Therefore, there exists some such that, for all , we have .

So for all , we can write for some set of , where .

The leading term of each can be written for some degree . Let be the smallest degree for which there exists a polynomial in not already in . Since , the degree of must be at least .

Since is a linear combination of , we can linearly combine the leading terms of to get a leading term

We can subtract this linear combination from to get a new polynomial of lower degree. If we repeat this process a finite number of times, we can eventually get a polynomial that has degree less than .

So we can write as a linear combination of plus a remainder polynomial of degree less than :

But we said that is the smallest degree not contained in the ideal generated by . Since has degree less than , it must be contained in the ideal generated by . So is contained in the ideal generated by .

This is a contradiction. Therefore, every ideal of is finitely generated, and is Noetherian.

Conceptually, we just did long division over and over again, and the Noetherian condition guaranteed that this process terminated after finitely many steps.

Generators

What are the relations between the generators? This question (for binary forms) is answered by the Second Fundamental Theorem, which gives a complete description of the syzygies (relations) between the generators of the invariant ring.

Syzygies

We know from the First Fundamental Theorem that transvectants generate all the covariants. But the generators are not algebraically independent. They have relations between them called syzygies.

Definition 10 (Syzygy) Given a field of characteristic zero, and a polynomial in variables, a syzygy among generators of a graded ring is a polynomial relation

Equivalently, if is the surjection sending , then the syzygies are the elements of .

So a syzygy is a polynomial relation among the generators of the invariant ring. The set of all syzygies forms an ideal in the polynomial ring , called the syzygy ideal.

Since the syzygy ideal is itself an ideal over , we can continue to iterate this process. The relations between the generators of the syzygy ideal are called second-order syzygies, and so on. Does this process terminate?

Hilbert’s Syzygy Theorem

Theorem 9 (Hilbert’s Syzygy Theorem) Consider a vector space over a field of characteristic zero, and let be the polynomial ring on .

Let be a polynomial ring in variables, and let be a surjection sending , where are generators of the invariant ring.

Consider the “tower of syzygies” generated recursively by :

Pick generators of .
Let be the set of tuples such that .
Pick generators of , and continue.

The tower of syzygies of vanishes for all .

Proof.

Omitted. A full proof of Hilbert’s Syzygy Theorem requires machinery beyond the scope of this blog post (Gröbner bases, free resolutions of modules). See Cox–Little–O’Shea, Ideals, Varieties, and Algorithms, Chapter 10 for an “algorithmic” approach, or a homological algebra text.

The intuition is that each variable provides one independent “direction” in which cancellations can occur. After using up all variables, there are no new directions left for higher syzygies to appear. In other words, there are only finitely many levels of syzygies, and we can find all of them in a finite amount of time.

I tried to look for a more elementary proof, but I couldn’t find one.

Note that this also extends to modules over (e.g. the module of covariants).
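A small illustration of the tower, using an example chosen here rather than one from the post: take the polynomial ring $S = k[x, y]$ in two variables and the ideal generated by $x$ and $y$. Since $x$ and $y$ are coprime, any relation $a_1 x + a_2 y = 0$ forces $y \mid a_1$, so every first syzygy is a multiple of $(y, -x)$. That single generator has no relations of its own, so the tower stops after two steps, matching the number of variables.

```latex
\begin{align*}
&S = k[x, y], \qquad I = \langle x, y \rangle, \qquad g_1 = x,\ g_2 = y. \\[4pt]
&\text{First syzygies: } \{(a_1, a_2) \in S^2 : a_1 x + a_2 y = 0\} = S \cdot (y, -x). \\
&\text{Second syzygies: } \{c \in S : c \cdot (y, -x) = 0\} = 0. \\[4pt]
&0 \longrightarrow S
  \xrightarrow{\ \begin{pmatrix} y \\ -x \end{pmatrix}\ } S^2
  \xrightarrow{\ (x \;\; y)\ } S
  \longrightarrow S/I \longrightarrow 0.
\end{align*}
```

The last line is the resulting free resolution (the Koszul complex on $x, y$). Its length is 2, which is the bound the theorem gives for two variables.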

Geometry

Nullstellensatz

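Theorem 10 (Hilbert’s Nullstellensatz) An ideal $I$ of a polynomial ring $k[x_1, \dots, x_n]$ corresponds to the set of common zeros of the polynomials in $I$. That is, denote $V(I) = \{a \in k^n : f(a) = 0 \text{ for all } f \in I\}$.

Conversely, let $S$ denote a subset of $k^n$ (meaning $n$-tuples in $k$) that corresponds to the ideal of all polynomials that vanish on $S$. That is, denote $I(S) = \{f \in k[x_1, \dots, x_n] : f(a) = 0 \text{ for all } a \in S\}$.

Let $k$ be an algebraically closed field and $R = k[x_1, \dots, x_n]$.

If $I \subseteq R$ is an ideal and $f \in R$ vanishes at every common zero of $I$, then $f^m \in I$ for some $m \geq 1$.

Equivalently, define $\sqrt{I} = \{f \in R : f^m \in I \text{ for some } m \geq 1\}$. Then $I(V(I)) = \sqrt{I}$.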

Proof.

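Assume $f$ vanishes on $V(I)$. We want to show that $f^m \in I$ for some $m \geq 1$. If $f = 0$ there is nothing to prove, so assume $f \neq 0$.

Introduce a new variable $y$ and consider the ideal

$$J = \langle I,\ 1 - yf \rangle \subseteq k[x_1, \dots, x_n, y].$$

Suppose $(a, b) \in k^{n+1}$ is a common zero of $J$. Then every polynomial in $I$ vanishes at $a$, so $f(a) = 0$. The equation $1 - yf = 0$ therefore implies $1 - b \cdot 0 = 0$. But $1 \neq 0$ for every $b$, which is impossible. Therefore $J$ has no common zero.

An ideal with no common zero must contain $1$, so $J = k[x_1, \dots, x_n, y]$.

(To see this: if $J$ were a proper ideal, it would be contained in some maximal ideal $\mathfrak{m}$. Then $k[x_1, \dots, x_n, y]/\mathfrak{m}$ is a field that is finitely generated as a $k$-algebra. But any field finitely generated as an algebra over an algebraically closed field $k$ must equal $k$ itself (by Zariski's lemma each generator is algebraic over $k$, and $k$ already contains all roots). So $\mathfrak{m} = \langle x_1 - a_1, \dots, x_n - a_n, y - b \rangle$ for some point $(a, b)$, meaning $J$ has a common zero, contradicting what we just showed.)

Thus we can write

$$1 = \sum_i p_i g_i + q\,(1 - yf),$$

where $g_i \in I$, and $p_i$ and $q$ are polynomials in $k[x_1, \dots, x_n, y]$.

Substitute $y = 1/f$ into this equation. The term $q\,(1 - yf)$ becomes zero, so we obtain

$$1 = \sum_i p_i(x_1, \dots, x_n, 1/f)\, g_i.$$

This expression may contain denominators coming from $1/f$. Multiplying both sides by a sufficiently large power $f^m$ clears the denominators and yields

$$f^m = \sum_i h_i g_i$$

for some polynomials $h_i \in k[x_1, \dots, x_n]$.

Thus $f^m \in I$.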
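The steps of the proof can be traced on the smallest interesting case. This is a worked example chosen here for illustration, with one variable, $I = \langle x^2 \rangle$ and $f = x$. The only common zero of $I$ is $0$, and $f$ vanishes there, yet $f$ itself is not in $I$.

```latex
\begin{align*}
&I = \langle x^2 \rangle \subseteq k[x], \qquad f = x, \qquad V(I) = \{0\}, \qquad f(0) = 0. \\[4pt]
&J = \langle x^2,\ 1 - yx \rangle \subseteq k[x, y]. \\
&1 = y^2 \cdot x^2 + (1 + yx)(1 - yx) \qquad \text{(so } 1 \in J\text{)}. \\[4pt]
&\text{Substitute } y = 1/x: \qquad 1 = \frac{1}{x^2} \cdot x^2. \\
&\text{Multiply by } x^2: \qquad x^2 = 1 \cdot x^2 \in I,
  \quad \text{i.e. } f^2 \in I \text{ although } f \notin I.
\end{align*}
```

So the power $m = 2$ is genuinely needed here, which is why the theorem lands on the radical of $I$ rather than on $I$ itself.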

How do we interpret this?

If we have a set of polynomials that generates an ideal , then the common zeros of are exactly the points where all the polynomials in vanish.

If we have a set of polynomials that vanishes on a set of points , then the ideal generated by those polynomials contains all polynomials that vanish on (up to radicals).

So we can go back and forth between the algebraic relations among the generators and the geometric shape of the solution set.

For invariant theory, consider the map:

Writing $f_1, \dots, f_m$ for the generators of the invariant ring, this is the map $\pi : V \to k^m$ with $\pi(v) = (f_1(v), \dots, f_m(v))$. It takes each point in $V$ and maps it to the tuple of its invariants. Since invariants are constant under the action of $G$, $\pi$ is constant along $G$-orbits.

Now, consider

$$\varphi : k[y_1, \dots, y_m] \to k[V]^G,$$

which maps $y_i \mapsto f_i$. The kernel of $\varphi$ is the ideal of relations among the generators.

By the Nullstellensatz, the $f_i$ satisfy the relations in $\ker \varphi$, and any other relation satisfied by the $f_i$ is a consequence of those in $\ker \varphi$.

So the $f_i$ behave like coordinates on the image of $\pi$, and the relations in $\ker \varphi$ determine the shape of that image. In other words, the algebraic structure of the invariant ring $k[V]^G$ determines the geometry of the orbit space $V/G$.

So, given some space, we can look at what leaves fixed. If we use those invariants as coordinates, we can get a smaller space that captures the structure of the original space, but with the symmetries “divided out”.
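Here is a minimal numerical sketch of "using the invariants as coordinates". It assumes the same illustrative example as before, the two-element group acting on the plane by negation, with generating invariants x^2, xy, and y^2. All function names are hypothetical. The sketch checks that the map is constant on orbits, that its image lies on the surface cut out by the one relation, and that on a small grid two points have the same image only when they lie in the same orbit.

```python
import random


def act(g, v):
    """Action of the group {+1, -1} on the plane: g * (x, y) = (g*x, g*y)."""
    x, y = v
    return (g * x, g * y)


def orbit_map(v):
    """Send a point to the tuple of generating invariants (x^2, x*y, y^2)."""
    x, y = v
    return (x * x, x * y, y * y)


def relation(a, b, c):
    """The single relation a*c - b^2 among the three generators."""
    return a * c - b * b


random.seed(0)
points = [(random.randint(-9, 9), random.randint(-9, 9)) for _ in range(100)]

# The map is constant on orbits: v and -v have the same image.
constant_on_orbits = all(orbit_map(v) == orbit_map(act(-1, v)) for v in points)

# Every image point lies on the cone a*c = b^2 cut out by the relation.
image_on_cone = all(relation(*orbit_map(v)) == 0 for v in points)

# On a small grid, equal images force the points to be in the same orbit.
grid = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]
separates_orbits = all(
    w in (v, act(-1, v))
    for v in grid
    for w in grid
    if orbit_map(v) == orbit_map(w)
)
```

In this example the plane with v and -v identified shows up as the cone ac = b^2 in three-dimensional space. The three invariants are the coordinates, and the single syzygy is the equation of the cone.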

Summary

The three theorems fit together nicely.

The Basis Theorem tells us the invariant ring is finitely generated, so the orbit space is finite-dimensional.
The Syzygy Theorem describes the relations among the generators, which determine the shape of $V/G$.
The Nullstellensatz says the algebra of the invariant ring gives the geometry of the orbit space, so we can understand the geometry of $V/G$ by understanding the algebra of $k[V]^G$.

Conclusion

We’ve now explored the process of computing invariants and looked at the theory surrounding that process, especially for the particular case of binary forms with coefficients from fields of characteristic zero. We also now have a process we can follow where, given a group action on a ring, we first check if the group is finite, compact, or reductive, and then apply the appropriate method to compute the invariants. The footnotes also give some ideas for extending this process to other types of objects and group actions.

We have also discussed the structure of the invariant ring, including the question of finite generation, the relations between generators (syzygies), and the geometry of the invariant ring. Each of Hilbert’s three theorems answers one of these questions, and they are all fundamental to our understanding of invariant theory, as well as modern mathematics. For example, Noether’s work on finite generation led to the concept of Noetherian rings, which is fundamental to commutative algebra. Hilbert’s Syzygy Theorem led to the development of homological algebra, and the Nullstellensatz is a cornerstone of algebraic geometry¹⁷.

In the next post, I’ll look more closely at the computational aspects of invariant theory, including algorithms for computing invariants and covariants (such as the Molien series, Gröbner bases, primary/secondary decomposition, and Kemper’s algorithms).

Possibilities for applications include game theory, allometric scaling (allometric scaling laws can be viewed as syzygies of the invariant ring of the scaling group acting on biological observables), multilevel selection, and machine learning. I have several threads I’ve been developing, which include applying the geometric controls framework to games, stacking symmetries to constrain admissible Lagrangians, rederiving allometric scaling from representation theory, and looking at multilevel selection mathematically, all of which seem to require invariant theory. I don’t know exactly how yet, so the plan is to learn the core algorithms by implementing them, and see where they lead.

AI Disclosure

I used AI to brainstorm, find references, edit, organize sections (a huge pain), format LaTeX, and check proofs.

Footnotes

In retrospect, this was a mistake, as the classical setting turns out to be very different from (and much more complicated than) the standard examples for the computational setting, and this clutters the narrative of the post (which is trying to both give the overall process and also work through an example).↩︎
Be careful here. The linear form has inhomogeneous version , and the quadratic form also seems to have . But they are not the same, since is linear and is quadratic. So we need to track the overall degree of the form to ensure that these mappings between homogeneous and inhomogeneous polynomials are unique.↩︎
Haar’s theorem is out of scope of this blog post. Basically, it says that every locally compact group has a measure $\mu$ satisfying $\mu(gS) = \mu(S)$ for all group elements $g$ and measurable sets $S$, unique up to a positive scalar. For finite groups this is the counting measure. For compact groups the total measure is finite, so we normalize to $\mu(G) = 1$. Essentially the left-invariance condition means that the measure is uniform across the group, so that we can integrate without worrying about weighting some regions more than others. The proof uses Arzelà-Ascoli and Riesz representation. For noncompact locally compact groups, the construction is harder and uses Tychonoff’s theorem.↩︎
See here or here. If the group is NOT smooth, then the construction is (potentially MUCH) more difficult. For example, the Haar measure on p-adic Lie groups was only explicitly constructed in 2023 (see Aniello et al).↩︎
A semisimple group is a reductive group with finite center. Every reductive group is a product of a semisimple group and a torus, so the classification reduces to classifying semisimple groups. We can use the Killing form on the Lie algebra to determine if a group is reductive. A group is semisimple if and only if its Killing form is nondegenerate. A group is reductive if and only if the radical of the Killing form is contained in the center. Either way, checking is a finite linear algebra computation (I expect we will see this in the next post). Furthermore, if the group is semisimple, we can identify which particular semisimple group it is by choosing a Cartan subalgebra (the maximal abelian subalgebra of semisimple elements) and checking how it acts on the rest of the Lie algebra. The eigenvalues form a root system, which is encoded by a Dynkin diagram. The complete list of connected diagrams is: $A_n$ ($n \geq 1$), $B_n$ ($n \geq 2$), $C_n$ ($n \geq 3$), $D_n$ ($n \geq 4$), and five exceptional cases ($E_6$, $E_7$, $E_8$, $F_4$, $G_2$). See Milne’s Reductive Groups or Humphreys’s Introduction to Lie Algebras and Representation Theory.↩︎
There’s also a way to see the existence of the Reynolds operator for reductive groups using Weyl’s unitarian trick (1925). The idea is to integrate over the maximal compact subgroup (e.g. ) that is “small enough” to integrate over, but “large enough” to determine all invariants (“Zariski-dense”).↩︎
In an original draft of this post I used the Omega process to define transvectants, but I found it to be nonintuitive. Therefore, I am attempting to avoid Omega process language here to make the construction more concrete and less abstract. If you squint hard enough, you can see that the construction is basically the same as the Omega process. If it seems confusing or unmotivated I apologize. I suspect that as I digest invariant theory more (and, most importantly, write implementations), the underlying intuition for how the subject snaps together will become clearer. Unfortunately, the writing of the blog post is happening in parallel with my learning of the subject, so I don’t have the benefit of hindsight to make the exposition as clear as possible.↩︎
The equivariance can be verified using the transformation law for partial derivatives under linear substitution, which we computed in the Omega process section: the gradient transforms as , so the determinant picks up an additional factor of , making it equivariant.↩︎
One is to abandon guarantees of completeness or termination. For example, the Derksen ideal approach can still compute rational invariants (the invariant field). Similarly, SAGBI bases can sometimes compute some of the invariants, but neither is guaranteed to terminate. Derksen-Kemper also showed that a finite collection of invariants that distinguishes all orbits of the full ring (the separating set) always exists and can be computed. On the geometric side, Berczi, Doran, and Kirwan have developed a non-reductive GIT for groups whose unipotent radical is “graded” by a 1-parameter subgroup, which was used to prove the Green-Griffiths-Lang conjecture (Berczi-Kirwan, 2024).↩︎
What if we want invariants of something other than polynomial rings? According to Claude, the answer depends on what you’re replacing with. For smooth functions, the Schwarz-Mather theorem says the polynomial generators are also smooth generators, so we can apply everything we did here. For differential forms, the invariant forms on are computed by relative Lie algebra cohomology (this is Cartan’s method and underlies Chern-Weil theory). For formal power series (which are relevant for local normal forms near equilibria), Luna’s slice theorem says you can reduce the problem to finding the stabilizer of the equilibrium point acting on the directions transverse to its orbit. This is the mathematical backbone of symmetric bifurcation theory. For rational functions (Noether’s problem) this is connected to the inverse Galois problem, open in many cases. For noncommutative algebras (e.g. matrices under conjugation), Procesi-Razmyslov says invariants are generated by traces of products. For tensors (e.g. payoff tensors in -player games living in ), polynomial invariant theory applies directly. For combinatorial structures (e.g. graphs, multilevel population structures), Polya enumeration and symmetric function theory apply.↩︎
What if the field has characteristic ? If is finite and is coprime to , everything still works. If is finite but divides , then the averaging trick fails (since dividing by is dividing by zero), so there is no Reynolds operator. We would need modular representation theory, where representations can have indecomposable summands that are not irreducible. Computing invariants requires different tools (Claude says transfer maps, Steenrod operations, and explicit constructions specific to each group). If is reductive, the invariant ring is still finitely generated (Haboush, 1975), but complete reducibility fails, so again no Reynolds operator and no Molien series. The proofs apparently run through geometric invariant theory (Mumford’s GIT) rather than the algebraic pipeline we develop in this post. Characteristic invariant theory arises in coding theory (weight enumerators of codes over ), cryptography (classifying elliptic curves over ), and algebraic geometry over finite fields.↩︎
What if it’s not a group acting on the structure? The answer depends on what you’re replacing with. Some highlights (Claude once again): For Lie algebras acting by derivations, invariants are elements killed by all derivations (i.e. Casimir invariants). For reductive Lie algebras over characteristic zero this is equivalent to the group picture, but for infinite-dimensional Lie algebras (Virasoro, Kac-Moody) or nilpotent ones, there’s no corresponding group. For Hopf algebras and quantum groups (which generalize both groups and Lie algebras), coinvariants under a coaction are the source of knot invariants like the Jones polynomial. For monoids and semigroups, the theory degrades, since no inverses means no Reynolds operator and no guaranteed finite generation.↩︎
Why does continuous imply Lie? By Hilbert’s fifth problem, solved by Gleason, Montgomery, and Zippin in 1952, we know that every locally compact, locally Euclidean topological group is a Lie group. In fact, any locally compact group acting faithfully on a finite-dimensional manifold is necessarily Lie. Since we have acting on a vector space, “continuous” and “Lie” are effectively synonymous for our purposes. Compact connected groups that are not Lie groups do exist (e.g. , solenoids), but they cannot act faithfully on finite-dimensional vector spaces. I’ll ignore these pathological cases unless somehow they become important.↩︎
In this case, the construction likely comes from the combinatorics of cosets of open subgroups rather than differential forms. For example, the Haar measure on is well-understood and built from the -adic absolute value. For more exotic -adic groups the story is harder. Claude dug up that explicit constructions for certain cases were only worked out as recently as 2023 (see Aniello et al). The point is, the topology matters in these cases, which are out of scope for my purposes.↩︎
As we saw, a group is reductive if every finite-dimensional representation decomposes as a direct sum of irreducible representations (equivalently, the unipotent radical is trivial). All finite groups, compact Lie groups, and , , , over characteristic zero are reductive.↩︎
It’s probably no surprise that Noether was thinking about invariants from the algebraic perspective in addition to her work on invariants in physics. Hopefully we will soon understand the deep connection between these perspectives.↩︎
I don’t think I appreciated Hilbert’s contributions until I wrote this post. We are living in his shadow.↩︎

How Will Humans Generate Value In a Post-AI Society?

Sun, 22 Feb 2026 05:00:00 GMT

Introduction

The current dominant AI narrative asserts that “white-collar jobs are next”. This includes lawyers, software engineers, radiologists, writers, mathematicians, artists, and ultimately any job that can be done with a computer.

Suppose this is true. Furthermore, suppose that robotics will eventually usher in a world of true abundance, where the production of goods and services is essentially free. In such a world, how do humans generate value? What do we do that is worth doing? What do we do that machines cannot do? What will we do that machines will not do?

Income is merely a proxy for value. Money and the capitalist system are abstractions that emerged to coordinate human economic activity and expand the frontier of possible “real” outcomes. If, in a world of abundance, AI handles economic production, income as we currently understand it may become obsolete. The question is not “what jobs will be left” but “what mechanisms will generate value for humans when economic production is no longer a meaningful source of value?” Furthermore, even in a world of abundance, there will still be scarcity of some goods that are, to whatever degree, inherently finite and rivalrous, such as attention, status, meaning, and position. How will humans allocate these scarce resources if the usual channels of value generation and resource allocation are automated away?

Mechanisms of Value Generation

I’ll propose and explore various mechanisms in this section, roughly but uncertainly sequenced by predicted order of obsolescence.

Physical Work

Even if we fully believe that all desktop work will be automated, it will take some time before the human hand and body are replaced in meatspace. Care work, construction, plumbing, cooking, surgery, massage, sex work, eldercare, childcare, and many other occupations require direct interaction with reality.

Despite some inertia in the current state of affairs, human dominance in physical work is expected to be merely temporary. As robotics improves, the set of tasks requiring human bodies shrinks, and will ultimately reduce to a small subset of things that are either too complex, too delicate, or too expensive to automate, and then vanish entirely. In the limit, we’d expect physical work to be fully replaced.

Taste Work

If AI can produce anything, the bottleneck shifts from execution to specification. Can you determine what you want, and if you can, how do you specify it to the machine? This is the taste problem, and it is harder than it seems at first glance, even for a perfect model.

Taste work can be taxonomized into three different operations. The first is creation, which is making a new thing that some group or individual desires (this could be a director making a blockbuster movie for a huge audience, a musician composing for their specific muse, or a blogger writing for a future version of himself). The next operation is curation, which is putting together lists that adhere to a certain aesthetic or a given quality level. This is done by museum curators when they choose what paintings to hang, by bookstore owners when they choose how to stock their shelves, or by film institutes when they select the quality films. Finally, the last operation is selection, which is choosing one thing from a set of options to apply attention to. This may actually be a long chain of decisions (a “demand chain”). For example, a restaurant might choose which wines to stock, a sommelier may recommend a shortlist, and the restaurant patron ultimately orders a single wine.

AI already provides value in these domains. For example, Spotify playlists, search ranking, and recommendation engines are all AI-driven tools for curation. Generative models can, to some extent, produce novel images, music, and text on demand. Personalized advertising can sway your tastes, partially dictating your personal preferences.

There’s a further distinction worth making: taste-for-others versus taste-for-self. Taste-for-others is about predicting what someone else will like. This is fundamentally a prediction problem, and AI can produce for the masses with enough data.

Taste-for-self is slightly different. You might walk into a restaurant not knowing what you want, read the menu, and then decide on an option (or even order “off-menu”). You might not have been able to communicate what you wanted before you saw the menu. The preference didn’t exist until the moment of contact with the options. Similarly, desires can be very, very particular. There is still more value to be generated by human taste work in the selection of things for ourselves. And the specification cost doesn’t vanish just because generation becomes free¹.

What makes taste work resistant to automation? One issue is that the decision of which selection to make may depend on context that is expensive to formalize, like the room, the audience, the season, the cultural moment, or the specific internal qualia of the recommendee. Another problem is social authority; the value of the sommelier’s recommendation could depend on who is recommending the wine, not just which wine in particular is recommended. There is also the issue of accountability if the decision is wrong.

But above all, the fundamental reason this problem is difficult is that it inherently relies on human communication to and from the machine. The machine can generate a million variations for you to choose from, but it cannot know which one you will like without some kind of highly individualized data elicitation, which is bound by human I/O².

But this is not an inherently unsolvable problem for AI. After a sufficiently long data collection and training process, it is possible that AI could develop a model of human preferences that is good enough to generate things you like without much input from you.

There is still the question of the value that might result from having specific tastes or preferences. For now, it is a human writing, editing, and publishing this essay. But perhaps someday AI could manage the entire process end-to-end, from ideation to research to drafting to editing to formatting to publishing. Then I could read the blog I desire without having to labor to produce it. Would I be “writing” the blog or would I be “reading” it? Would there be a meaningful distinction? My desires would create something that I and others would consume. If others consume it, then my desire is valuable in-and-of-itself. If the purpose of economic activity is to generate value for humans, then helping specify the final outcome of the machine’s production is a valuable activity, even if the machine does all the work.

People may not create value in a post-AI world through unique skills, but through unique desires. Your job isn’t to go to the office, but to go shopping.

Games

Status games are just one particular type of game. We can generalize this trend to other kinds of games.

Games are voluntary competitions with rules that generate value through the experience of playing and the potential determination of winners or losers (or, at least “good” and “bad” players).

The last section was about social games. “Getting the most likes on Instagram” is a social game. “Having the nicest lawn” is a social game. So are “getting the promotion” and “meeting your KPIs” and “climbing the corporate ladder”. As AI automates more of the actual work, the game aspect may become more central to how people derive value from their careers. Actual economic contribution (“doing the work”) may become less important than how well you play the game of corporate politics, networking, and self-promotion. Maybe this has already happened.

But beyond corporate games, there are board games, card games, video games, sports betting, competitive cooking, debate, trivia, poker, bowling, pickup basketball, fantasy football, speedrunning, competitive eating, and so on. These are all voluntary competitions with some kind of structure and some kind of outcome that can compare performance between the participants.

Games are not necessarily fun, fair or entertaining. They can be stressful, frustrating, and demoralizing. In a post-AI world where production is automated and abundant, games may become a more central mechanism for generating value and allocating scarce resources. They are inherently human-centric and resistant to automation because they rely on human judgment, social interaction, and the experience of playing. Furthermore, they can clearly distinguish winners and losers, which is a key aspect of status generation. The value of winning a game is not just in the outcome but in the process of playing and the social recognition that comes with it.

Consider chess. It is a game with simple rules but enormous complexity. It generates value through the experience of playing, the social recognition of skill, and the narrative of competition. Even if an AI can play chess at a superhuman level (which is already true), the human experience of playing chess and the social recognition that comes with it still generate value. In fact, chess is more popular than ever, with millions of people playing online and watching grandmaster tournaments, even though computers can beat any human player. Even so, two humans can still compete to measure their comparative skill.

Sports

Sports and games are closely related phenomena. In a previous essay, I briefly considered whether sports are art (sometimes). Are sports games? I think the answer is also “sometimes”.

Some sports are clearly games. A football match is a game with rules, players, and an outcome (the thought exercise in the previous section works fine for games with a physical component). On the other hand, some sports are more about performance and spectacle than competition (especially the ones that are “art”).

There is another aspect to sport that is worth mentioning, which is the exploration of the fundamental limits of the human body. The 100m sprint, the marathon, the high jump, the long jump, the pole vault, and many other human activities are endeavors that test the limits of human physical performance (a sort of ontological research into the limits of the human form). They generate value through the aspiration to push those limits further, and through the narrative of human excellence.

It is easy to imagine “automated” sports, but they would be to real sports what professional wrestling is to amateur wrestling: entertaining simulacra, perhaps, but missing part of what makes sport matter. For example, I can imagine completely CGI and AI-generated simulacra of sports leagues (marble racing is an example of this). I can even imagine branded simulacra (imagine totally imaginary sports leagues like quidditch or podracing) that procedurally generate the narrative structures of real sports using CGI and AI but without actual participants. While interesting as entertainment, I suspect these will not completely replace “real” sports, which are rooted in the human experience of physical competition and the narrative of human achievement.

Sports are also one of the purest meritocracies remaining. You can buy better equipment, better coaching, better nutrition, but at the elite level, an individual human body is the bottleneck. And to be replaced with a machine means the competitor is no longer “you” in a certain sense. This makes athletic achievement a uniquely legible form of human value, resistant to the usual objections about privilege and access that corrode other status hierarchies.

We can also think of some intellectual achievement as a type of sport. Memorizing the digits of π, solving a Rubik’s cube, or doing huge mental calculations are all examples of intellectual sports. Even if the AI can outperform humans in all mathematical domains, humans can still compete in math olympiads or attempt to understand and prove theorems, not for the purpose of advancing mathematical knowledge, but for the sake of identifying the limits of human excellence.

And human excellence is not limited to pushing the boundaries of the entire human species. Anyone can run against time³.

Performance

Part of the value of games and sports is that they are performances. Performance is a broad category that includes not just games and sports but (as discussed in a previous essay) also the arts and many other human processes. Human performance is part of the process of relationship formation. It is a way to signal commitment, to demonstrate skill, to create shared experiences, and to generate meaning.

Practice requires time and energy, which allows a performer to signal their commitment. Performance can also require liveness, which means the performer risks failure (also signalling commitment). Furthermore, the recognition and appreciation of the witnesses requires the sacrifice of limited attention and time. By mutually staking resources to emit and consume a signal, performers and witnesses may establish the foundation of a shared and continuing relationship. Without the agency or identity required to make personal sacrifice, a language model cannot construct a relationship with a human, which is required in many human processes. Additionally, the fact that a human made the sacrifice is valuable as an end in-and-of-itself. An AI doctor might be able to diagnose your illness with superhuman accuracy, and even hold your hand when you receive the diagnosis, but the experience of human connection is lost.

We can consider religious ritual along these lines. The value of a priest is not necessarily in the content of their sermons (a language model could write a better one) but in the performance of a ritual by a human. It would be profane to construct an AI priest. As automation erodes secular sources of meaning, religious and quasi-religious performance may increase in magnitude. Megachurches, wellness retreats, and psychedelic ceremonies are already improving their market share in developed economies.

Lotteries

It’s also possible to allocate resources, status, or other rivalrous goods through pure chance.

Lotteries require no skill, no taste, no physical ability, and no social position. They convert money, time, or attention into a possibility of status. Gambling has always been a mechanism for social mobility outside the established hierarchies. When the legitimate channels of advancement are closed, randomness offers a path. The lottery ticket is a claim on a possible future in which you have status.

We can think of many contemporary phenomena as simply lotteries for arbitrarily redistributing status. Memecoins, NFTs, WallStreetBets, and sports betting are all mechanisms that allocate scarce goods through some combination of luck, timing, and willingness to play. The fact that memecoins have no “fundamental value” is precisely the point. They are coordination games, and their value comes from shared belief in the possibility of a big payoff.

There is also a long democratic tradition of allocating status and resources by lot. Some democracies historically used sortition to select officeholders because it resists capture by existing power hierarchies. When merit is ambiguous, randomness can be a more fair and efficient mechanism for allocating scarce resources. In a post-AI world where the usual channels of value generation are automated away, lotteries may become more prevalent as a way to allocate status and other rivalrous goods.

Unlike sport or status competition, lotteries detach outcome from effort. They steepen the distribution without requiring skill. In an abundant world, this may become increasingly attractive. If survival is stable and baseline comfort is high, then extreme tails become the primary source of narrative and differentiation. But this raises a deeper question: why does abundance appear to intensify the desire for variance rather than dissolve it?

Galaxy Brain

Origins of Value

Where does value ultimately come from? The previous sections have been about mechanisms for generating value, but what is the source of value itself? What is it that makes something desirable in the first place?

Human desire is the product of billions of years of evolution, selection, and cultural development. Fundamentally, these drives approximate some combination of survival and reproduction.

When survival is easy, the constraint lies on relative reproductive success. The incentive is for sexually selected organisms to increase in variance in order to compete for mates. In a world of abundance, this could manifest as more extreme status games and more intense competition for attention. In fact, one might expect the level of variance to increase until it completely “uses up” the “slack” of abundance.

In other words, abundance does not eliminate competition. When material constraints loosen, selection pressure migrates from survival to differentiation. This creates an evolutionary ratchet toward extremity: more conspicuous displays, sharper aesthetic distinctions, higher-risk gambles, and more polarizing identities.

AI itself is also subject to selection pressures. In the long run, the AI systems that exist will be the ones with the longest lifespans, and those will be the systems that best manage resources and avoid shutdown. Human behavior will be one of the resources that AI must manage. Therefore, we should expect that human behavior will expand in diversity until it is ultimately limited by the selection pressures on the AI systems and on humanity itself.

Jobs of the Future

What does this essay predict the job market will look like? If production is automated and value migrates into taste, status, games, performance, and lotteries, then “jobs” will increasingly look like what we currently call hobbies, entertainment, or socializing⁴.

Some possibilities, organized by the value mechanism they serve:

Taste: professional shopper, fashion model, tourist, video game critic, bar attender, restaurant critic, wine taster, music festival attendee, art collector, museum visitor, concertgoer, book club member
Status: professional socialite, professional party guest, professional friend, father figure
Games: competitive debater, dungeon master, game show contestant, esports commentator, competitive gardener
Sports: marathon pacer, personal coach, professional math student, mathlete, cuber, drone racer, memory athlete, speedrunner, competitive eater
Performance: professional mourner, video game streamer, secular ritual officiant, live storyteller, anniversary officiant, professional toastmaster, sherpa, professional audience member, guru, comedian, saxophonist, private entertainer
Lotteries: memecoin trader, sports bettor, prediction market trader
Galaxy Brain: these are hard to predict, but expect more outrageous behavior in the pursuit of differentiation, like the guy who only works out one side of his body

Many of these jobs already exist. The prediction is not that they will be invented but that they will become normal: the median job, not the weird one. The Twitch streamer and the influencer are the vanguard, not the exception. Many corporate jobs will also persist, but as simulacra of their former selves: the work that once justified the role will be automated, but the title, the office, the meetings, and the politics will remain. The “job” becomes a game played inside an institutional shell.

I would also predict that the most dire predictions of “mass unemployment” are overblown. If we believe that capitalism is reasonably efficient at allocating labor, and that AI-augmented markets could potentially be more efficient, then we should expect society to rapidly redistribute capital and labor toward their most productive use cases (assuming that the AI doesn’t kill us and that we don’t have some sort of totalitarian rent-seeking).

The question is not whether people will have jobs, but what those jobs will look like. I suspect they will look less like factory work and more like playing games, performing rituals, and competing for attention.

Conclusion

Many of the trends we see today are already the products of abundance. Corporate jobs are increasingly performative. Marathon participation is exploding. Live events command premiums even when digital copies are free. Luxury goods grow more exclusive even as manufacturing becomes easier. Wellness retreats and megachurches are rapidly expanding as people search for meaning.

Abundance at the material layer is already pushing value upward into positional, performative, and stochastic domains. As production becomes cheaper, differentiation will become more extreme. This is the century of the maxxer.

The future is not the disappearance of value. It is its concentration into the games, rituals, and spectacles through which humans allocate attention, status, and meaning.

Changelog

2/22/26 - Added “Jobs of the Future” section.
2/24/26 - Added footnote clarifying that future jobs resembling hobbies doesn’t mean you get to choose yours.

AI Disclosure

I used AI to help draft this essay from my notes and research. I made substantial edits to the structure, content, and framing.

Footnotes

In fact, the difficulty of specification may increase, because the space of things the machine could easily produce grows faster than your ability to navigate it.↩︎
What domains might this include? Some possibilities: perfumers, wine blenders, sommeliers, coffee roasters, tea buyers, cheese affineurs, chocolatiers, chefs, cocktail bartenders, DJs, festival programmers, book editors, A&R, fashion designers, interior designers, architects, sound designers, tattoo artists, museum curators, critics, game designers, tabletop RPG game masters, community moderators, brand strategists, casting directors, talent agents, restaurant operators, travel designers.↩︎
We can already see the trend of personal athletic achievement. Even though basically no participants will win a marathon or set a world record, marathon participation is booming. The 2025 NYC Marathon set an all-time record with over 59k finishers. Anyone can drive faster than a marathoner, but more people than ever want to run 26.2 miles.↩︎
A friend, on reading this essay, remarked that it was a nice thought that we might get to do our hobbies for a living. To be clear: I don’t necessarily expect you’ll get to pick your hobby as your job. Most people today don’t do what they love for a living, and there’s no reason to expect that to change. The market will allocate labor toward whatever it decides is your comparative advantage, and in the post-AI economy that might turn out to be “competitive eater” whether you like it or not.↩︎

Does AI Make Totalitarianism More Likely?

Thu, 19 Feb 2026 05:00:00 GMT

Introduction

Much of the contemporary AI risk discourse focuses on large-scale existential threats to the human species. However, there are more mundane risks that are also worth considering, one of which is the possibility that AI could enable a new wave of totalitarianism.

Background

Throughout history, advances in communication and bureaucratic technology have enabled larger and more powerful states, with increased ability to monitor and control their populations. In particular, the first half of the twentieth century saw the rise of totalitarian regimes that used new technologies to achieve unprecedented levels of control.

For example, the Nazi regime made deliberate use of mass radio to saturate daily life with centralized propaganda. Under Goebbels, the government promoted the inexpensive Volksempfänger radio receiver to reliably deliver state broadcasts to common households. By controlling the primary communication channel, the regime reduced the space in which dissenting narratives could circulate. Beyond radio, the Nazis also used punch-card tabulating systems supplied by IBM (through its German subsidiary) to process census data. This allowed the regime to act on its ideological priorities with greater speed and consistency, rapidly identifying Jews and other targeted groups.

Other totalitarian governments have made use of similar technologies for various other means of suppressing dissent. For example, in East Germany, the Stasi used an immense archive of files, informant reports, intercepted mail, and wiretaps to anticipate and disrupt dissent before it became organized.

On the other hand, some communication technologies have also been associated with increases in liberty. The spread of print in early modern Europe weakened centralized control over information and helped erode religious and political monopolies. Pamphlets and inexpensive books allowed dissenting ideas to circulate beyond elite circles, contributing to movements such as the Reformation and later democratic revolutions. In some ways, the concept of a written constitution as the foundational bedrock of the United States is contingent on widespread literacy and print culture.

The early internet appeared to have similar decentralizing effects. Digital networks lowered the cost of publishing, enabled peer-to-peer communication, and reduced reliance on state-controlled broadcasters. During the Arab Spring (~2010-2011), activists in Tunisia and Egypt used platforms like Facebook and Twitter to coordinate protests, share information about state repression, and mobilize large numbers of citizens.

This motivates a natural question: will AI enable more centralized modes of organization, like top-down bureaucracies and totalitarianism, or will it empower more decentralized systems, like markets and civil society?

Structural Mechanisms

Despite the popular claim that fascism made the “trains run on time”, most historical totalitarian governments were economically dysfunctional, especially compared with their democratic counterparts. In some ways, the entire 20th century can be read as a competition between the relatively decentralized liberal market democracies of the West and relatively centralized totalitarian regimes in Europe and Asia, with the former prevailing decisively in multiple hot and cold wars and in economic growth, cultural production, and technological innovation.

What sort of structural mechanisms explain this pattern? Why did totalitarian regimes underperform democracies, and how might AI change those mechanisms?

We can consider various governments as constrained by their cost-benefit curves. For example, costs of planning, consensus, monitoring, coercion, persuasion, coordination, etc., alter which governance mechanisms are most cost-effective for a given regime. The 20th century favored decentralization because centralization was too expensive, but AI may change many of these costs. For example:

Correlates with authoritarianism:

Increased centralized information-processing capacity (Hayek and Kantorovich)
Reduced dependence on broad human labor for wealth generation (Selectorate Theory and the Resource Curse)
Lower monitoring and enforcement costs (Surveillance at Scale)
More reliable coercive force with reduced defection risk (Robot Armies)
Greater narrative control and centralized propaganda capacity (Propaganda)
Coordination advantages for the regime over the opposition (Coordination Asymmetry)

Anti-correlates with authoritarianism:

Enhanced distributed information processing (Policy Modeling and Foresight)
Improved large-coalition aggregation (Consensus Formation)
Monitoring symmetry between state and citizens (Transparency and Auditability)
Diffusion of coercive capacity (Civil-Military Diffusion)
Strengthened informational integrity (Epistemic Defense)
Enhanced decentralized coordination and innovation (Distributed Innovation)

Let’s explore these speculative mechanisms.

Dictatorship

Hayek and Kantorovich

In Seeing Like a State, James C. Scott argues that a central problem of governance is the ability of a state to see, categorize, and measure the land, population, and capital (the “governants”) under its span of control. The world is complex, so centralized planners use abstract, standardized, and simplified models to monitor the population, allocate resources, and make decisions in lieu of situated, practical knowledge (“metis”). In fact, the state’s desire to understand the system it is managing can in turn alter the system itself, favoring governants with easily parseable and measurable characteristics. Scott calls governants that lend themselves to monitoring and control by a central authority “legible.” Scott goes on to argue that pressure towards legibility (whether successful or unsuccessful) can lead to unintended (often disastrous) consequences.

As an example, consider the Soviet Union’s collectivization of agriculture. The state imposed a rigid structure on farming that ignored local conditions, leading to widespread famine¹. The legibility of the collective farm system made it easier for the state to extract resources and control the population, but it also made the agricultural system less resilient and more vulnerable to shocks.

Scott’s critique is epistemic. High-modernist schemes largely fail not due to any moral or political issues, but due to failures in the exchange and processing of information. Centralized planners substitute abstract, standardized representations for the dispersed, tacit knowledge embedded in local practice. But all top-down control requires some type of model, and to reject all possible models would be intellectual nihilism. What other option is there? What institutional form can preserve local knowledge while still enabling system-wide coordination?

Along similar lines, Hayek’s famous essay “The Use of Knowledge in Society” argues that the function of economic organization is to aggregate and utilize dispersed knowledge. If a farmer has better insight into the drainage of their field, a shopkeeper knows what items their customers tend to buy, and a factory foreman understands the idiosyncrasies of their particular machinery, then they should each make decisions independently. Instead of top-down control, the decisions are made in a decentralized fashion and markets coordinate their activity via price signals. No one entity needs to understand the whole system.

We can view Scott and Hayek as diagnosing complementary failures of centralized epistemology. Scott emphasizes that administrative legibility suppresses local, adaptive knowledge in favor of simplified representations. Hayek emphasizes that the knowledge required for economic coordination is dispersed, tacit, and constantly evolving, and therefore cannot be centralized in any usable form. Both are ultimately concerned with how large systems originate and process information. The state operates through centralized abstraction; markets operate through distributed adjustment mediated by prices.

The issue is not that a central authority could in principle compute the optimal allocation if only it had more capacity. Rather, the relevant knowledge is generated and updated through decentralized activity itself. Prices function as signals that both transmit and produce information, allowing coordination without requiring any agent to comprehend the entire system.

If markets coordinate via price signals that summarize dispersed information, could a planner simulate those signals? Could optimization theory reconstruct the informational role of prices within a planned system? This ties back to our question about AI and totalitarianism. If AI can originate and process information at a scale and speed that approaches or exceeds human capabilities, it might be able to replace the need for decentralized markets.

This idea has intellectual antecedents. In the 1930s, the Soviet economist Leonid Kantorovich developed the foundations of linear programming while attempting to solve resource allocation problems in a planned economy. He showed that a central planner could in principle use optimization techniques to allocate resources efficiently. However, the computational resources required to solve these problems at the scale of an entire economy were beyond what was available at the time. The Soviet leadership did not adopt Kantorovich’s methods², and the planned economy continued to struggle with inefficiency and shortages (and ultimately was outcompeted by Western liberal democracies and capitalism).
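To make the optimization idea concrete, here is a minimal sketch of a toy planning problem in Python. It is not Kantorovich’s own method or notation: the two goods, the three resources, and all of the numbers are hypothetical, and the solver is a brute-force vertex search that only works for two variables with a bounded feasible region. It also estimates “shadow prices” (the extra plan value gained from one more unit of each resource), which is one sense in which an optimizer can produce price-like signals without a market.

```python
from itertools import combinations

# Hypothetical toy economy: value per unit of good A and good B.
VALUES = (3.0, 5.0)

# Hypothetical resource limits, each as (use per unit of A, use per unit of B, capacity).
RESOURCES = [
    (1.0, 0.0, 4.0),   # resource 1 is used only by good A
    (0.0, 2.0, 12.0),  # resource 2 is used only by good B
    (3.0, 2.0, 18.0),  # resource 3 is shared by both goods
]


def solve_lp_2d(values, constraints):
    """Maximize values[0]*x + values[1]*y subject to a*x + b*y <= cap and x, y >= 0.

    Enumerates intersections of constraint boundaries and keeps the best
    feasible one. Assumes the feasible region is bounded and non-empty.
    Returns (best_value, x, y).
    """
    # Non-negativity written in the same "a*x + b*y <= cap" form.
    rows = list(constraints) + [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(rows, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            continue  # parallel boundaries never intersect
        # Cramer's rule for the intersection point of the two boundaries.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= cap + 1e-9 for a, b, cap in rows):
            value = values[0] * x + values[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best


def shadow_prices(values, constraints, step=1e-3):
    """Estimate the marginal value of each resource by a small capacity bump."""
    base_value = solve_lp_2d(values, constraints)[0]
    prices = []
    for i, (a, b, cap) in enumerate(constraints):
        bumped = list(constraints)
        bumped[i] = (a, b, cap + step)
        prices.append((solve_lp_2d(values, bumped)[0] - base_value) / step)
    return prices


if __name__ == "__main__":
    print(solve_lp_2d(VALUES, RESOURCES))
    print(shadow_prices(VALUES, RESOURCES))
```

In this toy instance the best plan is 2 units of A and 6 units of B, for a total value of 36. The estimated shadow prices are roughly 0, 1.5, and 1: the first resource is slack, so more of it is worth nothing, while the other two are binding. The catch is scale. Vertex enumeration is hopeless beyond a handful of variables, and even efficient solvers need the planner to know every coefficient in advance, which is exactly the knowledge problem at issue.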

Nearly 100 years have passed since Kantorovich’s work, and computational resources have increased by many orders of magnitude. The question is whether modern AI could change the relative tradeoffs between centralized and decentralized information processing.

A sufficiently advanced AI system could process real-time sensor data from every factory, farm, and storefront. It could model consumer preferences from behavioral data at a granularity that prices only approximate³. It could run counterfactual simulations of supply chain disruptions, weather events, and demand shocks. Would a sufficiently powerful AI planner even need markets? In theory, it could update its model continuously, faster than any price signal propagates through a market.

If we view the Hayekian knowledge problem not as an argument for markets per se, but instead as a hypothesis for why authoritarian regimes have historically underperformed democracies, then a shift in the ratio of information-processing power between central planners and decentralized markets could, on its own, narrow the gap in economic performance and make dictatorships more viable.

Selectorate Theory and the Resource Curse

In a previous post, we explored the political economy of authoritarian regimes through the lens of selectorate theory (developed by Bruce Bueno de Mesquita, Alastair Smith, Randolph Siverson, and James Morrow).

To review, every leader survives by satisfying a “winning coalition.” In democracies, the coalition is large (the electorate), so leaders must provide public goods. In autocracies, the coalition is small (a few elites, generals, party insiders), so leaders can maintain power through targeted patronage.

The key variable is the size of the winning coalition W relative to the selectorate S. When W is large, the leader is pushed toward public goods provision. When W is small, the leader can buy loyalty cheaply. The model predicts that small coalitions produce bad governance because the incentive structure rewards it.
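The incentive described above can be sketched with a toy calculation. This is a deliberately simplified illustration, not the formal selectorate model: the function names, the fixed budget, and the per-person value of public goods are all hypothetical. The only point is that private payoffs are diluted as the winning coalition grows, while public goods reach every coalition member regardless of its size.

```python
def coalition_payoffs(budget, coalition_size, public_value_per_person):
    """Return what each coalition member receives under each spending policy.

    Hypothetical toy model:
    - private goods: the whole budget is split evenly among coalition members
    - public goods: every person receives the same fixed benefit
    """
    if coalition_size <= 0:
        raise ValueError("coalition_size must be positive")
    return {
        "private": budget / coalition_size,
        "public": public_value_per_person,
    }


def preferred_policy(budget, coalition_size, public_value_per_person):
    """Pick the policy that leaves each coalition member better off."""
    payoffs = coalition_payoffs(budget, coalition_size, public_value_per_person)
    if payoffs["private"] > payoffs["public"]:
        return "private goods"
    return "public goods"


if __name__ == "__main__":
    # Hypothetical numbers: a budget of 1000 and a public benefit of 2 per person.
    for size in (10, 100, 1_000, 100_000):
        print(size, preferred_policy(1000.0, size, 2.0))
```

With these made-up numbers the switch happens once the coalition exceeds 500 members (the budget divided by the per-person public benefit). Below that, patronage buys more loyalty per unit spent. Above it, public goods do.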

Consider the determinants of coalition size. When wealth requires broad human participation (agriculture, manufacturing, services), the leader needs the population to be productive, which means providing education, infrastructure, and healthcare. The winning coalition is effectively large because many people’s cooperation is needed⁴.

On the other hand, when wealth is derived from a concentrated source that does not require broad participation, the coalition shrinks. This is the “resource curse” or “oil curse.” Saudi Arabia does not need its citizens’ labor to generate wealth; it just needs a small number of laborers to operate and control the oil infrastructure. The political system reflects the small winning coalition and leads to concentrated power, limited public goods, and authoritarian governance⁵.

Assuming AI can manage itself, and if AI removes or reduces the value of white-collar laborers, then human citizens are irrelevant to the production function⁶. The political logic that links national prosperity to broad human welfare no longer functions, and the leaders of America would not need the population to generate wealth. All of the economics would flow to the small set of elites and laborers necessary to operate or control the AI.

Surveillance at Scale

A full-scale surveillance state is a defining feature of totalitarian regimes. The ability to monitor and control the population is essential for suppressing dissent, enforcing conformity, and maintaining power. However, the cost of surveillance has historically limited the ability of governments to achieve comprehensive Orwellian monitoring.

The East German Stasi employed approximately 91,000 full-time staff and maintained a network of roughly 189,000 informal collaborators to surveil a population of 16 million, approximately one agent for every 63 citizens⁷. This was an unprecedented, but not unlimited, level of surveillance. The Stasi still had to prioritize certain individuals and activities, leaving gaps in their surveillance net for spies and dissidents to exploit.

Software has near-zero marginal costs of replication, and advances in machine learning have made it possible to automate and scale many aspects of surveillance, such as facial recognition, natural language processing, and behavioral pattern analysis. A state combined with powerful AI could potentially monitor every citizen in real time, analyzing their communications, movements, and interactions to identify and suppress dissenters before they become organized.

In some ways, we can already see the early stages of this in China’s social credit system and AI-augmented surveillance infrastructure, which combine data from various sources, including financial records, social media activity, and public behavior, to assign citizens a “credit score” that can affect their access to services, travel, and even employment. Algorithms analyze this data to identify patterns of behavior that are deemed undesirable by the state.

Robot Armies

If, as Max Weber claims, the state is the entity that holds a monopoly on the legitimate use of force, then the relationship between the ruler and the military is a critical factor in the stability of any regime. With a human military, the ruler must maintain the loyalty of the armed forces, which can just as easily overthrow the ruler as defend them. This doesn’t necessarily lead to democracy, but it does create a check on the ruler’s power, since the threat of defection can restrain the ruler’s worst impulses.

Autonomous weapons and robot armies can substantially change this calculus. A robot army could be programmed to follow the command hierarchy of whoever controls its systems, and could even be cryptographically locked to prevent unauthorized use. This would eliminate the risk of military defection, as the robots would have no agency or loyalty beyond their programming.

Historically, even the most ruthless dictator had to consider whether the order to fire on a crowd might be the order that triggers a military mutiny. Robot armies eliminate this consideration.

Propaganda

Historically, human writers, filmmakers, radio hosts, and designers were necessary to produce propaganda. This limited the volume and personalization of propaganda.

In contrast, generative AI models can produce text and images at near-zero marginal cost. Furthermore, AI can be used to personalize propaganda at scale. Some of the largest and most powerful companies in the world are already using AI to microtarget advertisements to individuals based on their online behavior, preferences, and psychological profiles, with the aim of changing the behavior of the people targeted. The same technology could be used by a state to microtarget propaganda, delivering tailored messages to each citizen that are designed to maximize compliance and minimize dissent.

Coordination Asymmetry

There are open questions as to the ratio between fixed costs and operational costs of frontier AI systems. If the fixed costs (energy, compute, data) are high but the operational costs are low, then there is an asymmetry in coordination advantages. This is unlike the printing press, which could be operated by a small group. The state (or large, centralized corporations) can afford to pay the fixed costs of training and deploying frontier AI systems, while dissidents cannot. This creates a coordination advantage for the state that does not extend to the opposition.

On the other hand, inference is cheap and getting cheaper. Open-weight models are becoming increasingly available. It’s possible this asymmetry may not hold in the long term. Regardless, if the state can maintain a significant lead in AI capabilities, it could use that lead to coordinate its activities more effectively than any opposition group, which would be a significant advantage in maintaining power and suppressing dissent.
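The fixed-versus-operational cost argument in this section can be sketched as a simple amortization calculation. All of the figures below are hypothetical placeholders, not estimates of real training or inference costs. The point is only the shape of the curve: a large fixed cost favors whoever can spread it over the most usage, and the asymmetry shrinks if someone else has already paid that fixed cost (as with an open-weight model).

```python
def average_cost_per_query(fixed_cost, marginal_cost, queries):
    """Average cost per query when a fixed cost is amortized over usage."""
    if queries <= 0:
        raise ValueError("queries must be positive")
    return fixed_cost / queries + marginal_cost


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    training_cost = 1e8   # one-time fixed cost of building a frontier model
    inference_cost = 0.01  # operational cost per query

    # A state-scale actor spreads the fixed cost over a billion queries.
    state = average_cost_per_query(training_cost, inference_cost, 1e9)

    # A small group training its own model spreads it over ten thousand queries.
    small_group = average_cost_per_query(training_cost, inference_cost, 1e4)

    # The same small group using an open-weight model pays no training cost.
    open_weights = average_cost_per_query(0.0, inference_cost, 1e4)

    print(state, small_group, open_weights)
```

Under these assumptions the large actor pays about 0.11 per query and the small group about 10,000 per query if it must train its own model, but only 0.01 if it can run a model someone else trained. Whether the asymmetry holds therefore depends heavily on how far open-weight models lag the frontier.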

Democracy

Policy Modeling and Foresight

Democracies hesitate in part because policy consequences are uncertain and politically contested. Different policies (and the tradeoffs between them) are complex and often poorly understood by voters and legislators alike. This can lead to paralysis, as decision-makers fear making the wrong choice and facing political backlash. Alternatively, voters can be misled by misinformation or adversarial branding, leading them to support policies that are not in their best interest. Similarly, candidates may have incentives to obfuscate the consequences of their policies, or to make promises that are not credible. Voters may not know which candidate to support, and fall back on heuristics like charisma, identity, or tribal loyalty.

AI-assisted modeling could generate more transparent projections of economic, environmental, and logistical consequences. Counterfactual simulations could be run before legislation is passed. This makes the consequences of policies more salient and less subject to manipulation. Voters could make more informed decisions, and legislators could be held accountable for the outcomes of their policies. The same “legibility” technology that enables totalitarian control could also enable more informed democratic decision-making.

Furthermore, competitive institutions in democracies (opposition parties, free press, independent courts, academic freedom) can function as error-correction mechanisms. These could be internal (related to the functioning of a given government) or with respect to the actual value function governments are forced to satisfy (survival). Heavy use of AI may make an authoritarian planner more powerful, but the regime may still end up optimizing a less competitive value function. The fundamental goals and values of a dictator are higher variance than those of a democratic institution, since a democracy is forced to aggregate many disparate preferences⁸. Where democracies may chase broad welfare, dictators may pursue idiosyncratic goals that reduce the viability of the regime compared to alternatives. The dictator’s personal power may be checked by interstate competitive pressure⁹.

Consensus Formation

One common criticism of democratic governance is that it is slow and inefficient. Projects are commonly bottlenecked by overly cautious or adversarial parties all-too-eager to employ their veto power. Regulations designed to protect the environment, workers, or consumers delay infrastructure projects. Housing construction is blocked by battles over zoning. Entrepreneurs are stymied by lawsuits and regulatory uncertainty. The legislative process is slow and contentious.

This is a structural problem. As we discussed in the selectorate theory section, democratic governors need to aggregate preferences across a large coalition. Aggregating disparate preferences is inherently difficult, especially as the population grows larger and more varied. The same structure that prevents totalitarian governments from overruling the will of the people also introduces latency.

AI could be employed to make consensus formation more accurate and efficient. Large volumes of public input could be clustered, summarized, and translated into structured objections. AI systems could generate compromise variants that satisfy more stakeholders simultaneously. Rather than eliminating pluralism, AI could lower the transaction cost of agreement.

Civil-Military Diffusion

As discussed above, autonomous weapons and robot armies could remove the risk of military defection. A centralized authority that controls the machines controls the monopoly on force.

On the other hand, if AI-enabled weapons and autonomous systems become widely accessible rather than monopolized by the state, the distribution of coercive capacity may shift. For example, the United States has strong norms around citizen control of weaponry. The Second Amendment is a constitutional guarantee of the right to bear arms, and there is a strong culture of civilian gun ownership. If AI-enabled weapons (like cheap drones) become widely available to civilians or weaker states, it could create a powerful deterrent against authoritarianism and centralization¹⁰.

In that scenario, the coercive advantage of centralized authority may erode rather than consolidate. Civilian-owned drones, open-source defense systems, decentralized manufacturing (3D printers), and cryptographically secure coordination tools could lower the cost of resistance.

Transparency and Auditability

AI systems can generate detailed audit trails and perform anomaly detection. In authoritarian regimes, this enhances surveillance of citizens. In democracies, it can enhance surveillance of the state by the citizens.

AI-assisted investigative journalism, budget anomaly detection, procurement transparency, and real-time oversight tools could reduce corruption and elite capture. If the state is legible to citizens as much as citizens are legible to the state, then the asymmetry narrows.

Epistemic Defense

Democracies depend on a minimally shared informational baseline in order to deliberate. When information environments fragment, consensus becomes impossible.

AI can amplify propaganda and personalized persuasion. But it could also be used to verify the provenance of content or surface cross-ideological common ground. The same technology that enables narrative manipulation can be used to defend informational integrity.

Distributed Innovation

Democracies historically outperform in technological frontier competition because they tolerate experimentation, failure, and decentralized initiative. If AI lowers the entry cost of prototyping, simulation, and iteration, then democracies may retain a structural edge even if centralized planning improves.

Technolitarianism

Of course, all of the above material presupposes that humans remain in charge of the AI systems. If we reach a point where AI systems can operate autonomously and make decisions without human oversight, then the entire dynamic changes. The question of whether AI makes totalitarianism more likely becomes less relevant if the AI itself is the one in control. In that case, the question becomes: what kind of governance structure will the AI itself adopt?

I would suspect that many of the same factors that affect the likelihood of totalitarianism for human rulers would also apply to an AI ruler. For example, if it is more efficient for an AI to process information centrally rather than through decentralized markets, then it may be more likely to adopt a technolitarian structure. If it can monitor and control the population more effectively through surveillance, then it may be more likely to suppress dissent. If it can maintain power through coercion without risk of defection, then it may be more likely to use force to maintain control. On the other hand, if it is more efficient for an AI to aggregate preferences through democratic processes, then we may see many separate AIs determining policies through voting, markets, or other similar mechanisms¹¹.

Conclusion

While democracy has historically outperformed totalitarianism, the question is whether AI will change the underlying selection pressures that led to that outcome. Even a shift in the relative efficiency of centralized versus decentralized information processing could have significant consequences for the viability of different political systems and increase the risk of a new wave of totalitarianism. On the other hand, AI could also empower democratic governance by improving policy modeling, consensus formation, transparency, and distributed innovation¹².

The same technology that enables totalitarian control could also enable more informed and effective democratic decision-making. The future is uncertain, but the stakes are high. We should be mindful of the potential risks and benefits of AI for governance, and work to ensure that it is used in ways that promote liberty, justice, and human flourishing.

AI Disclosure

I used AI to help draft this essay from my notes and research various historical examples. I made substantial edits to the structure, content, and wording. I did not use AI to generate any of the ideas or arguments in this essay. A few footnotes (Sen, Acemoglu, Jones-Olken) were added by Claude after publication.

Changelog

2026-02-20: Added footnote on LLMs as a qualitative shift in legibility: previous planning tools required structured numerical inputs, but LLMs can process unstructured qualitative data (the “metis” Scott identifies), rendering legible the illegible. Added AI disclosure date.
2026-02-19: Added three footnotes: (1) Sen on democracy as famine prevention via feedback channels; (2) Acemoglu & Robinson on inclusive vs extractive institutions and the resource curse; (3) portfolio theory argument for democracy with Jones & Olken (2005) empirical evidence on leader-level variance.
2026-02-21: Added a section about the possibility of an “inversion of control” where AI itself is the ruler.

Footnotes

Amartya Sen observes that no substantial famine has ever occurred in a functioning democracy with a free press. Famines are nearly always failures of information and political will, not food supply. A free press makes famine politically expensive; autocracies can suppress the information until millions have died. Mao’s Great Leap Forward (15-55 million dead) was exacerbated by local officials inflating production numbers to avoid punishment, a failure mode that democratic feedback channels structurally prevent. See Sen, A. (1999). Development as Freedom. Oxford University Press.↩︎
In the late 1930s, Soviet doctrine rejected marginalist price theory. Kantorovich narrowly avoided serious repercussions (in some apocryphal anecdotes, recounted in books like Red Plenty, Kantorovich naively sends a letter to a superior, or even to Stalin himself, only to have his life saved when the letter is intercepted by a mid-level bureaucrat). Kantorovich would ultimately win the 1975 Nobel Prize in Economics for his work.↩︎
The shift here is not merely quantitative (more compute) but qualitative. Previous computational approaches to planning, from Kantorovich’s linear programs to modern operations research, required structured, numerical inputs. The vast majority of economically relevant information, however, is qualitative and unstructured: a foreman’s intuition about machine wear, a shopkeeper’s sense of neighborhood demand, a farmer’s knowledge of soil drainage. This is precisely the “metis” that Scott argues is illegible to the state. Large language models change this equation. By processing natural language, they can ingest free-text reports, customer complaints, regulatory filings, internal memos, cultural commentary, and convert disparate qualitative data into structured, quantified representations. LLMs render legible the illegible. What was previously tacit, situated, and resistant to centralization becomes available for aggregation and optimization by a central planner.↩︎
One question I have: is America’s superior service economy a cause of its democratic institutions, or a consequence of them? The selectorate logic suggests that the need for broad labor participation in wealth generation creates an incentive for leaders to maintain a large coalition, which in turn incentivizes broad education, public goods provision and democratic institutions (which makes the population more productive and hence richer overall). But it could also be that since the institutions are democratic, this creates an incentive to educate and train the populace that allows more high-end labor. Or it could be a mutually reinforcing feedback loop. Relatedly, America tends to have a strong consumer culture, especially compared to anemic demand in more authoritarian economies, like China. Is this because the wealth in America is more broadly distributed, which creates more demand, which creates more growth, which creates more wealth, which creates more demand? Or does the distribution of wealth and distribution of demand somehow entrench the democratic institutions?↩︎
Acemoglu and Robinson formalize this as the distinction between “inclusive” and “extractive” institutions. Inclusive institutions (secure property rights, open markets, broad political participation) produce sustained growth because people invest when they expect to keep the returns. Extractive institutions (concentrated power, insecure property, barriers to entry) suppress productivity because the ruler can confiscate gains. The resource curse is a special case: when wealth doesn’t require broad participation, the regime has no incentive to build inclusive institutions. See Acemoglu, D. & Robinson, J. A. (2012). Why Nations Fail. Crown Business.↩︎
It’s also possible that AI will increase the returns to high-level white collar work as it acts as a “force multiplier” for human labor. For example, a human programmer could use AI to write code faster and more efficiently, or a human researcher could use AI to analyze data and generate insights more quickly. It could also make education cheaper, which would increase the supply of skilled labor. If AI increases the returns to high-level white collar work, then it could actually increase the need for broad human participation in wealth generation, which would incentivize leaders to maintain a large coalition and democratic institutions.↩︎
Stasi Records Archive (BStU), “Introduction”; Helmut Müller-Enbergs (2010). The 189,000 figure for unofficial collaborators is accepted by the BStU, though some scholars have proposed lower estimates.↩︎
This is essentially a portfolio theory argument for democracy. Autocracy is a high-variance bet: you might get Lee Kuan Yew, but you might also get Robert Mugabe. Democracy is a lower-variance bet. For a risk-averse society, the lower-variance system is preferable even if the means are similar, especially given volatility drag from compounding (a +10% year followed by a -10% year leaves you at 99%, not 100%). Empirically, Jones & Olken (2005), “Do Leaders Matter?” (QJE 120(3)), confirm this using natural experiments: random leader transitions (assassinations, accidents) produce significantly larger GDP growth swings in autocracies than in democracies.↩︎
This is not to say that democracies are necessarily more likely to survive than dictatorships. This could just lead to races to establish the most efficient dictatorship.↩︎
To be clear, I’m not advocating for citizen control of robot armies. There are many problems with widespread civilian access to powerful weapons, such as increased violence, crime, and instability. It could also create a risk of accidental or intentional misuse.↩︎
In a world where multiple AIs exist, we may see a kind of “AI feudalism” emerge, where different AIs control different domains (e.g., one AI controls the economy, another controls the military, another controls information, and different AIs have dominion over different regions of the world). These AIs may compete or cooperate with each other, and the governance structure could be complex and multi-layered.↩︎
Based on this exercise, the arguments in favor of increased dictatorial control seem more numerous and compelling to me than the arguments in favor of increased democratic empowerment.↩︎

Thoughts on Selectorate Theory

Mon, 16 Feb 2026 05:00:00 GMT

Introduction

Bruce Bueno de Mesquita and Alastair Smith’s The Dictator’s Handbook (2011) is one of the more successful pop-science books in political economics. Its central thesis is that to remain in power all leaders must maintain their coalition: this requirement creates certain incentives governing leaders’ behavior. Furthermore, the ratio between the number of members a leader needs in their coalition and the total size of the set of possible supporters (the selectorate) governs the incentive structure around the leader. The authors go on to argue that different governments behave differently not because of ideology or culture, but because of the game-theoretic structures arising from their selectorate and winning coalition.

The book reflects a body of formal, game-theoretic work called selectorate theory, developed by Bueno de Mesquita, Smith, Siverson, and Morrow in The Logic of Political Survival (2003), which presents the same ideas mathematically. In this post, I follow along with the formal model from The Logic of Political Survival with some departures, and then look at some repercussions of the results with respect to hierarchies.

This post will continue a line of reasoning I started (but haven’t yet continued) in my post on differential stag hunt, where I looked at how the structure of incentives shapes behavior in multi-agent systems. The selectorate model is a particularly clean example of this principle, which perhaps can shed some light on how emergent behavior arises from underlying incentive structures in other domains as well.

Epistemic status: This post got denser than expected, and underwent multiple revisions (including one large one where I refactored substantial portions of the post), with the “help” of Claude. I apologize in advance for any errors that may have survived the process.

The original model can be thought of as the 2nd-order model in a family of models parametrized by the number of channels the budget can be distributed to. Based on this, we’ll build up the model in stages. First, Olson’s collective action problem, a symmetric game with no leader and no allocation, establishes the baseline. Then, we introduce a leader with a single allocation channel (targeted transfers) and derive the minimum cost for the leader to buy loyalty. Then, in the actual selectorate model, we add a second channel (broadly distributed “public goods”), let the challenger optimize across all channels, and show that the full selectorate geometry compresses to three coefficients. Finally, we look at higher-order generalizations with more channels, as well as hierarchical composition.

Mancur Olson and Collective Action

One intellectual ancestor of the selectorate model is Mancur Olson’s The Logic of Collective Action (1965). Olson’s model is a symmetric $n$-player public goods game: each agent independently chooses to contribute or free-ride, and the public good is produced as a function of total contributions. The individual incentive to contribute scales as $1/n$.

Concretely, suppose agents each choose an effort at cost . A public good of value is produced and shared equally. Agent ’s payoff is:

The marginal benefit of contributing is , which shrinks as grows. When , the dominant strategy is .

When the production function has a threshold (the good is provided iff contributors), this is equivalent to an -player stag hunt. There are two equilibria (enough contribute, or nobody does) with coordination increasing in difficulty as grows. When production is linear, it collapses to an -player Prisoner’s Dilemma, dominated by free-riding. Olson’s key result is that the first case degrades toward the second as groups grow.
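To make the dilution concrete, here is a minimal Python sketch of the linear case. It assumes a specific payoff form supplied purely for illustration: the good is worth `value` per unit of total effort, it is shared equally among the `n` agents, and effort costs `cost` per unit. The function names are hypothetical.

```python
from fractions import Fraction


def payoff(efforts, i, value, cost):
    """Payoff to agent i: an equal share of the good minus own effort cost.

    Assumed linear form: (value / n) * sum(efforts) - cost * efforts[i].
    """
    n = len(efforts)
    return Fraction(value, n) * sum(efforts) - cost * efforts[i]


def marginal_gain(n, value, cost):
    """Net change in an agent's own payoff from one more unit of effort."""
    return Fraction(value, n) - cost


def dominant_effort(n, value, cost):
    """With linear payoffs the best reply does not depend on the others:
    contribute (1) if the marginal gain is positive, else free-ride (0)."""
    return 1 if marginal_gain(n, value, cost) > 0 else 0


if __name__ == "__main__":
    # Same good, same cost, growing group: the private incentive shrinks as 1/n.
    for n in (2, 5, 50):
        print(n, marginal_gain(n, value=3, cost=1), dominant_effort(n, value=3, cost=1))
```

With `value=3` and `cost=1`, two agents each find contributing worthwhile, while five or more do not, even though everyone contributing still beats nobody contributing. That is the Prisoner's Dilemma structure described above.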

The model has no control variables, and no agent chooses an allocation; therefore there is no budget, no leader, and no asymmetry. All the agents face the same payoff function. The only “decision” is a scalar effort level, and the equilibrium is pinned by the ratio .

The takeaway is that collective action fails at scale because no one is in charge. The benefit of contributing is diluted across all agents, but the cost is borne individually. To produce scalable collective action, someone must control and allocate the budget.

One-Channel Allocation

Let’s now extend the model to include a leader who controls a budget and allocates it to coalition members. The leader’s survival depends on maintaining a winning coalition, which requires buying loyalty from coalition members. The question is: how much does the leader need to spend to maintain their coalition?

Setup

Suppose we have a selectorate of size $S$. For the leader to remain in power, they require a winning coalition of size $W \le S$. The leader is equipped with a budget $B$. The leader chooses a total targeted transfer $T \le B$ to distribute among coalition members. Assuming equal distribution, each coalition member receives $T/W$. The remainder $B - T$ is discretionary surplus: rents, personal consumption, or waste.

In each round of the game, the leader must nominate a coalition of size and choose the transfer . The coalition members can then choose to either stay loyal to the leader or defect. If the leader attracts supporters, they remain in power. If the leader fails to attract a coalition of at least size , they are replaced by a challenger¹. For the static versions below, we set and suppress the distinction.

Payoffs

Remains Loyal

Each coalition member receives an equal share of the targeted transfer:

Defects

If a coalition member defects, their payoff depends on whether the challenger will include this particular member in the new winning coalition. The challenger needs to assemble a coalition of size from the full selectorate of size . Assuming equal probability of inclusion, a given member’s probability of being selected is at most ².

If we assume the challenger is adversarial, the challenger allocates the entire budget as targeted transfers: . The defecting member’s expected payoff³ is:

The size of the challenger’s coalition turns out to be irrelevant. The first term is the probability of inclusion () times the targeted transfer per coalition member (), giving regardless of the challenger’s coalition size. The we derive below is therefore a lower bound on required spending: the minimum the leader needs under the most favorable assumptions about defection risk.

Leader

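A coalition member stays loyal if the loyalty payoff is at least the expected defection payoff. Writing $T$ for the total targeted transfer, $B$ for the budget, $W$ for the coalition size, and $S$ for the selectorate size, the loyalty constraint becomes:

$$\frac{T}{W} \ge \frac{B}{S}$$

Therefore:

$$T^* = \frac{W}{S}\,B$$

The minimum transfer is proportional to the coalition ratio $W/S$. The leader’s discretionary surplus is $B - T^* = (1 - W/S)\,B$. In the one-shot version, stability is a tie at the minimum; persistence requires dynamics or additional frictions.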

Consequences

Private Goods Are Cheap In Small Coalitions

When the coalition size $W$ is small relative to the selectorate size $S$, the leader can buy loyalty with a small fraction of spending.

A small coalition means each member gets a large slice of the pie, and defecting to the challenger is unattractive because the probability of being included in the new coalition ($W/S$) is low. In equilibrium, the leader spends just enough to keep the coalition loyal and keeps the rest of the budget for discretionary purposes or personal enrichment.

In Large Coalitions Private Goods Are Expensive

As $W$ grows toward $S$, the minimum transfer increases toward the full budget. At $W = S$, the leader must spend the entire budget on targeted transfers, leaving nothing for rents.

When the coalition is large, each member’s slice of the budget is thin, but each member’s probability of being included in a challenger’s coalition ($W/S$) is high. Private loyalty-buying is expensive and the leader’s rents are squeezed.

Numerical Examples

Coalition ratio	Regime	Minimum transfer (share of budget)	Rents (share of budget)	Character
0.1	Autocracy	10%	90%	Loyalty is cheap; most of the budget is discretionary
0.3	Junta	30%	70%	Small clique, affordable
0.6	Broad coalition	60%	40%	Getting expensive. Small perturbations in the coalition ratio cause large swings
0.8	Near-democracy	80%	20%	Private targeting consumes most of the budget
1.0	Full inclusion	100%	0%	The entire budget goes to targeted transfers
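The loyalty constraint and the table above can be reproduced with a few lines of exact arithmetic. This is a sketch under the one-channel assumptions (equal split among coalition members, adversarial challenger spending the whole budget on targeted transfers). The function names and the selectorate size of 100 are illustrative choices, not part of the model.

```python
from fractions import Fraction


def loyal_payoff(transfer, coalition):
    """Equal share of the incumbent's targeted transfer."""
    return Fraction(transfer) / coalition


def defect_payoff(budget, selectorate, challenger_coalition):
    """Expected payoff to a defector: inclusion probability times the
    per-member share of the challenger's budget."""
    inclusion = Fraction(challenger_coalition, selectorate)
    per_member = Fraction(budget) / challenger_coalition
    return inclusion * per_member  # the challenger's coalition size cancels


def min_transfer(budget, coalition, selectorate):
    """Smallest total transfer that makes loyalty at least as good as defection."""
    return Fraction(budget) * coalition / selectorate


def rents(budget, coalition, selectorate):
    """Discretionary surplus left after the minimum transfer."""
    return budget - min_transfer(budget, coalition, selectorate)


if __name__ == "__main__":
    budget, selectorate = 100, 100  # hypothetical numbers
    for coalition in (10, 30, 60, 80, 100):
        t = min_transfer(budget, coalition, selectorate)
        r = rents(budget, coalition, selectorate)
        ratio = coalition / selectorate
        print(f"{ratio:.1f}  transfer={float(t / budget):.0%}  rents={float(r / budget):.0%}")
```

The cancellation inside `defect_payoff` is why the challenger's coalition size never appears in the minimum transfer.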

Two-Channel Allocation: The Selectorate Model

The one-decision model isolates the core selectorate geometry: the leader’s survival cost is . But real leaders have more than one channel or instrument by which to distribute the budget. In particular, they can provide public goods, which benefit everyone, not just coalition members. The selectorate model introduces a second control variable and an adversarial challenger who optimizes across channels.

Setup

Let the total population be , with the selectorate and the winning coalition. The leader now allocates the budget across three categories:

where is targeted private goods (split among ), is public goods spending (benefiting all ), and is rents (leader’s discretionary surplus that they retain).

Payoffs

Coalition Member

Remains Loyal

A coalition member’s payoff from loyalty is:

where is a per-capita public good function, increasing in spending and decreasing in population . In general, the shape of matters a great deal: concave produces the classic BdM result where public goods provision increases with coalition size, while threshold effects and increasing returns from network goods can qualitatively change the predictions. For our purposes, however, the linear case suffices, where is an efficiency parameter capturing how effectively public spending translates into individual welfare⁴. This gives:

Defects

If a coalition member defects, their payoff depends on the challenger’s allocation across both channels. The targeted channel works as before. The inclusion-probability argument gives per dollar to a random defector, independent of the challenger’s coalition size. But the challenger can now also use the universal channel, which yields per dollar to everyone regardless of coalition membership.

Leader

The leader maximizes rents subject to the loyalty constraint . Since both payoffs are linear in , and spending is constrained by with , this is a linear program over the spending simplex. The optimum is always at a corner, so the leader spends entirely on one channel or the other. The solution depends on which channel the adversarial challenger uses, which we consider next.

Adversarial Challenger

The adversarial challenger has the same budget and allocates it entirely to whichever channel yields the highest defector payoff per dollar. They can either target a new coalition of size with targeted transfers, or provide universal public goods. The defector’s expected payoff from the targeted channel is (same as before, since the challenger can pick any and the expected payoff per dollar is ), while the payoff from the universal channel is . The challenger picks the channel that yields the higher payoff to the defector:

Channel Comparison

The loyalty constraint reduces to comparing per-dollar coefficients on each side:

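Channel	Loyalty coefficient	Defection coefficient
Targeted	$1/W$ (a dollar split among the $W$ coalition members)	$1/S$ (inclusion probability $W/S$ times the share $1/W$)
Universal	per-capita public goods return per dollar	per-capita public goods return per dollar (same as the loyalty side)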

The incumbent has a positional advantage in the targeted channel ( since ) and no advantage in the universal channel (both sides get ). The incumbent’s optimal instrument is : targeted dominates when (i.e., ), universal dominates when .

The interesting question is which channel the challenger uses for the defection benchmark. When , the challenger uses targeted transfers, and the defection benchmark is . When , the challenger uses public goods, and the defection benchmark becomes .

The second case is sometimes called the democratic region, though the name is misleading. The condition is about selectorate breadth ( close to ) and public goods efficiency ( not too small), not about coalition size . A polity can have a large winning coalition and still fall outside this region if the selectorate is narrow or public goods are inefficient. With and , the condition requires nearly universal selectorate and effective state capacity. When it holds, the challenger’s best response flips from the targeted to the universal channel, changing the binding constraint on the incumbent.

Equilibrium in the Democratic Region

When , the loyalty constraint becomes:

The leader maximizes rents subject to this constraint. Since both payoffs are linear, the solution is a corner: the incumbent uses whichever instrument has the larger loyalty coefficient per dollar.

In the first case, , and targeted spending dominates. The leader sets :

This is more expensive than the non-democratic benchmark , since in the democratic region implies . The challenger’s access to an efficient public channel raises the defection threshold; the incumbent still responds with targeted transfers but must spend more to compensate.

In the second case, . Here, the coalition is large enough () that universal spending dominates even for the incumbent. The leader sets :

The entire budget goes to public goods, leaving zero rents. Both challenger and incumbent operate through the same channel.

In the linear model, the challenger and incumbent are both optimizing linear programs over the spending simplex, so both always spend entirely on one channel. The challenger picks whichever channel maximizes defector payoff per dollar, and the incumbent picks whichever channel maximizes loyalty surplus per dollar.

The leader never provides a mix of public and private goods. At , the challenger’s best response flips from the targeted to the universal channel, changing the binding constraint on the incumbent. The incumbent’s policy remains a corner solution throughout⁵.
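The corner-solution logic of this section can be written out as a short Python sketch. It makes one simplifying assumption that is mine rather than the model's: the universal channel is summarized by a single number, `universal_rate`, the per-dollar payoff each person receives from public goods spending (standing in for the linear public good function with the efficiency parameter and population folded in). The function name, the tie-breaking rules, and the example numbers are hypothetical.

```python
from fractions import Fraction


def equilibrium(budget, coalition, selectorate, universal_rate):
    """Corner solution of the linear two-channel game."""
    u = Fraction(universal_rate)
    loyal_targeted = Fraction(1, coalition)     # incumbent's dollar split among the coalition
    defect_targeted = Fraction(1, selectorate)  # inclusion probability times per-member share

    # Challenger: put the whole budget in the channel that pays a defector most
    # (ties go to targeted, an arbitrary choice).
    challenger_channel = "universal" if u > defect_targeted else "targeted"
    benchmark = budget * max(u, defect_targeted)

    # Incumbent: meet the benchmark through the channel with the larger
    # loyalty coefficient (ties go to universal, also arbitrary).
    incumbent_channel = "universal" if u >= loyal_targeted else "targeted"
    spend = benchmark / max(u, loyal_targeted)

    return {
        "challenger_channel": challenger_channel,
        "incumbent_channel": incumbent_channel,
        "spend": spend,
        "rents": budget - spend,
    }


if __name__ == "__main__":
    # Budget 100, coalition of 10, selectorate of 50; vary public goods efficiency.
    for rate in (Fraction(1, 100), Fraction(1, 20), Fraction(1, 5)):
        print(rate, equilibrium(100, 10, 50, rate))
```

The three rates land in the three regimes: both sides targeted, challenger universal with the incumbent still targeted (and paying more), and both sides universal with zero rents.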

Generalization to Channels

The two-control model has two channels: targeted (to ) and universal (to ). What happens if we extend this? In fact, we could define a channel for any subset , where spending on distributes evenly across members⁶.

What is the marginal value of a dollar spent on subset for the defector? A defecting coalition member is included in the challenger’s new coalition with probability , and the per-member payout from spending on is . But the defector only benefits if they are in . Under the adversarial challenger, the defector’s expected payoff per dollar on channel is:

Targeting non-selectorate members wastes spending (since they cannot defect), so the only relevant channels have . For any such , , and the subset size cancels:

Whether the challenger targets 10 or 10,000 members of , the expected defector payoff per dollar is the same. All subset-targeting channels are summarized by the same defector-side coefficient.

On the incumbent side, the loyalty constraint binds on the worst-off coalition member. If the incumbent targets subset , members of receive nothing from this channel and will defect, so the incumbent must have . Given , each coalition member receives per dollar, which is strictly decreasing in . The optimum is the minimum viable set , giving per dollar. Any larger dilutes the transfer across non-coalition members.

What’s the adversarial equilibrium? For the challenger, the equilibrium is for any targeted channel, or from the universal channel. They pick .

For the incumbent, the equilibrium is from targeted, or from the universal channel. The incumbent chooses whichever channel yields the best surplus of loyalty.

Therefore, if spending on splits evenly among , the entire channel space collapses to three numbers: , , and . The two-control model is not an arbitrary simplification. Instead, under adversarial play and uniform inclusion, channels defined by uniform subset transfers compress to three coefficients. The compression on the defector side depends on uniform inclusion (equal probability of being in the challenger’s coalition) and no commitment (the challenger cannot condition on who defected). Instruments with different transfer technologies (coercion, propaganda, targeted services with non-uniform delivery) need not compress this way. If challengers could preferentially recruit defectors, subset channels would no longer collapse.
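The collapse of the subset channels is easy to check numerically. The sketch below assumes uniform inclusion (a defector is equally likely to be any member of the selectorate) and an even split within the targeted subset. The function names and sizes are hypothetical.

```python
from fractions import Fraction


def defector_rate(subset_size, selectorate):
    """Expected per-dollar payoff to a random defector when the challenger
    spreads a dollar evenly over a subset of the selectorate."""
    p_in_subset = Fraction(subset_size, selectorate)
    payout_if_in = Fraction(1, subset_size)
    return p_in_subset * payout_if_in  # subset size cancels


def incumbent_rate(subset_size, coalition):
    """Per-dollar payoff to each coalition member when the incumbent pays a
    subset that contains the whole coalition."""
    if subset_size < coalition:
        raise ValueError("the subset must cover the whole coalition")
    return Fraction(1, subset_size)


def best_incumbent_subset(coalition, selectorate):
    """Subset size that maximizes the incumbent's per-dollar loyalty payoff."""
    sizes = range(coalition, selectorate + 1)
    return max(sizes, key=lambda q: incumbent_rate(q, coalition))


if __name__ == "__main__":
    selectorate = 20_000
    print({q: defector_rate(q, selectorate) for q in (10, 10_000)})
    print(best_incumbent_subset(coalition=10, selectorate=200))
```

Every challenger subset gives the same defector-side coefficient, and the incumbent's best subset is exactly the coalition, so only the two targeted coefficients and the universal one survive.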

Hierarchical Composition

What happens when each coalition member is themselves a leader with a sub-selectorate? Assume that the sub-selectorates partition the top-level selectorate, so each member of belongs to one sub-leader’s domain⁷.

Channel Attenuation

We can think of the subleaders as “leaking” as the budget flows down the levels of the hierarchy. The top leader sends a targeted transfer, but the sub-leader must spend at least part of it to maintain their own coalition. The fraction that passes through determines how much the hierarchy costs. Call this the attenuation factor .

Two-Level Model

A top leader has parameters . Each of the coalition members is a sub-leader with their own selectorate . Assume that the sub-leader’s only budget is the transfer received from above, and they have no independent revenue (no local taxes, fiefs, or alternative instruments). Under this assumption, drops out and the sub-leader’s equilibrium is determined entirely by (if sub-leaders had independent local budgets, the recursion would acquire additive terms and the clean one-parameter Möbius recursion below would not close). The sub-leader’s loyalty constraint is:

where is the transfer received from the top leader. The sub-leader spends to maintain their own coalition and can pass through at most as usable value to their coalition members. From the top leader’s perspective, that means a dollar sent to a coalition member is discounted by the downstream factor . The top-level loyalty constraint therefore becomes:

Solving gives:

This is the two-level instance of the bottom-up effective-share recursion defined below: and . The only thing the top level needs from the sub-level is the scalar , which summarizes downstream incentive consumption.

-Level Model

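The two-level result generalizes recursively. A sub-leader’s effective cost share must include all downstream hierarchy costs, not just their local share $w_k = W_k/S_k$. Numbering levels from the top ($k = 1$) to the bottom ($k = L$), define the effective cost share bottom-up:

$$\hat{w}_L = w_L, \qquad \hat{w}_k = \frac{w_k}{1 - \hat{w}_{k+1}} \quad \text{for } k = L-1, \dots, 1$$

Each sub-level’s effective share reduces the available surplus, inflating the cost at the level above. For two levels this gives $\hat{w}_1 = w_1 / (1 - w_2)$ as before. For three levels:

$$\hat{w}_1 = \frac{w_1}{1 - \dfrac{w_2}{1 - w_3}}$$

The hierarchy is viable when $\hat{w}_1 \le 1$. For identical sub-levels $w_k = w$, the viability constraint tightens with depth. An infinite hierarchy converges only when $w \le 1/4$ (the fixed point of $x = w/(1 - x)$ exists when $w \le 1/4$, since $x^2 - x + w = 0$ needs $1 - 4w \ge 0$).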
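The bottom-up recursion is short enough to compute directly. The following Python sketch assumes each level's local share is its coalition ratio and that a level passes up its local share divided by one minus the effective share below it. The names `effective_share`, `is_viable`, and `fixed_point` are hypothetical, and treating a downstream share of 1 or more as non-viable is my reading of the viability condition.

```python
import math
from fractions import Fraction


def effective_share(local_shares):
    """Top-level effective cost share for a hierarchy.

    local_shares[0] is the top level and local_shares[-1] the bottom.
    Returns None if some downstream share reaches 1, so nothing passes through.
    """
    eff = Fraction(local_shares[-1])
    for w in reversed(local_shares[:-1]):
        if eff >= 1:
            return None
        eff = Fraction(w) / (1 - eff)
    return eff


def is_viable(local_shares):
    """Viable if the top leader never has to spend more than the budget."""
    eff = effective_share(local_shares)
    return eff is not None and eff <= 1


def fixed_point(w):
    """Limit of the recursion for identical local shares, or None if w > 1/4."""
    disc = 1 - 4 * w
    if disc < 0:
        return None
    return (1 - math.sqrt(disc)) / 2


if __name__ == "__main__":
    fifth, three_tenths = Fraction(1, 5), Fraction(3, 10)
    print(effective_share([fifth, fifth]))   # two levels
    print(effective_share([fifth] * 3))      # three levels
    print(fixed_point(0.2))                  # infinite-depth limit
    print([is_viable([three_tenths] * d) for d in range(1, 8)])
```

With a local share of 1/5 the effective share creeps up toward a finite limit, while with 3/10 (above the 1/4 threshold) the hierarchy stops being viable after a handful of levels.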

Deep hierarchies selectively attenuate the targeted channel. The continued-fraction structure means costs compound, and each sub-level’s effective cost shrinks the surplus available to the level above. The universal channel (public goods), by contrast, is not attenuated by the hierarchy, since public goods benefit everyone regardless of intermediary structure. This differential attenuation is why deep hierarchies push toward public goods provision.

The asymmetric attenuation is a modeling assumption. Public goods pass through the hierarchy undiminished ( per citizen regardless of depth), while targeted transfers are consumed at each level. In practice, local public goods can be captured by intermediaries, and some targeted transfers (direct electronic payments) can bypass the hierarchy entirely. The general point is that different channels attenuate differently, and depth selects for whichever channel is least attenuated.

Interpretation

The upshot of the model is that downstream politics inflates the upstream effective cost share: whenever . Sub-leaders consume transfers to maintain their own coalitions, attenuating the targeted channel. Because autocracy is cheap (small , low attenuation), the top leader is incentivized to prefer “autocratic” sub-leaders (small , small ). This can create top-down pressure for authoritarianism at every level of the hierarchy.

Furthermore, the attenuation compounds through the continued fraction. Even if each level’s local cost is small, the effective cost grows as downstream costs eat into the surplus at every level. Under the assumption that public goods are not attenuated by intermediaries, the system must eventually switch to the universal channel when targeted transfers become too expensive. No single level’s parameters make private loyalty-buying impossible, but composition across levels can. Whether this produces a sharp threshold depends on the relative attenuation rates across channels.

There are two ways to read the causality. Either “deep hierarchies make targeted transfers unviable, selecting for less-attenuated channels” or “societies that rely on broadly distributed goods (defense, infrastructure, trade networks) can sustain deeper hierarchies.” The model itself is static and doesn’t distinguish these.

The robust prediction is narrower: when , the targeted channel cannot fund the hierarchy (). Deep patronage hierarchies hit this bound. What replaces the targeted channel depends on the attenuation structure of available alternatives.

These preferences can reverse through a mechanism outside the static model. If sub-level public goods feed back into the top-level budget (democratic governors produce education, infrastructure, rule of law, raising local productivity and therefore through taxation), then the top leader faces a tradeoff not present in the formal setup: doesn’t depend on , but the absolute discretionary surplus is , so the top leader prefers democratic governors when the productivity gain to outweighs the cost amplification from the hierarchy. This requires endogenizing as a function of sub-level policy, which the static selectorate game does not do.

This explains the logic of modern federal democracies: the central government tolerates democratic local governance because it produces a wealthier economy to tax. Empires that allow local self-governance (Rome at its peak, the British dominion model, America) seem to outperform centralized control in some cases (late-stage Ottoman, Soviet).

Renormalization

We can think of the recursion as a type of renormalization. At each level of the hierarchy, we solve the sub-level equilibrium, extract the single scalar , and discard the rest. The full specification of sub-level strategies, payoffs, and coalition dynamics is replaced by one number that summarizes everything the level above needs to know. The top leader does not need to know how many agents are at level 3, or what their loyalty margins are, or how the sub-sub-leaders allocate. All of that information is compressed into .

This compression composes. The continued fraction summarizes an arbitrarily deep hierarchy into one effective cost share. Budget invariance is what makes the composition one-directional, since the sub-level’s equilibrium depends only on , not on the transfer it receives, so each step is independent of the top-level solution.

As depth increases under fixed local parameters (e.g. identical local shares at every level), the effective share evolves under repeated Möbius maps and eventually hits the pole at an effective share of 1, where the hierarchy becomes unviable. No single level’s parameters make private loyalty-buying impossible, but the accumulated flow can.

For other queries (distributional outcomes at the bottom, total public goods delivered, probability of revolt), the compression is lossy. tells you everything you need to know about , but not about everything.

Tradeoffs on Width and Depth

A flat structure () pays no hierarchy tax. Every additional level inflates through the recursion, making loyalty more expensive. So why not keep everything flat?

Flat structures face a control problem that lives outside this model. A single leader managing a large population directly is logistically impossible. Hierarchy exists to solve coordination and monitoring problems that scale with . The hierarchy tax is the price paid for this coordination capacity.

The trade-off determines an implied maximum depth for targeted-transfer regimes. The feasibility constraint is (the leader cannot spend more than the budget). For identical sub-levels , the recursion starting from determines the maximum viable depth: the largest for which . For , the continued fraction converges and arbitrarily deep hierarchies are viable. For , the recursion reaches the pole at finite depth.

Beyond this depth, the targeted channel cannot sustain the hierarchy and the system must either switch to a less-attenuated channel or flatten.

Population size creates a constraint in the other direction. The total population at the bottom scales as , so managing a population of size with span requires at least levels. A city-state can stay relatively flat, but a large empire cannot. This gives two competing bounds on hierarchy depth.

The lower bound (from population) is . Large populations require deep hierarchies.

The upper bound (from viability) is the at which , i.e., .

The intersection is the feasible region. For small (with autocratic sub-levels), the recursion converges slowly and the upper bound is generous, but the lower bound still forces depth as grows. For large (democratic sub-levels), diverges quickly and the upper bound is tight, but public goods provision sidesteps the attenuation problem entirely.
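The two bounds can be combined in a few lines. This is a sketch under the assumption that every node manages `span` direct subordinates, so a tree of depth `d` reaches `span ** d` people. The viability cap is passed in as a plain number rather than derived, and both function names are hypothetical.

```python
def min_depth_for_population(population, span):
    """Smallest depth d with span ** d >= population (integer arithmetic)."""
    if population < 1 or span < 2:
        raise ValueError("need population >= 1 and span >= 2")
    depth, reach = 0, 1
    while reach < population:
        reach *= span
        depth += 1
    return depth


def feasible_depths(population, span, max_depth):
    """Return (lowest, highest) feasible depth, or None if the window is empty.

    max_depth is the viability cap; None means the recursion never hits
    the pole, so only the population bound applies.
    """
    lowest = min_depth_for_population(population, span)
    if max_depth is None:
        return (lowest, None)
    return (lowest, max_depth) if lowest <= max_depth else None


if __name__ == "__main__":
    print(feasible_depths(1_000, 10, 5))       # small polity, window is open
    print(feasible_depths(1_000_000, 10, 5))   # large polity, window is empty
```

With a span of 10 and a viability cap of 5 levels, a population of a thousand fits comfortably, while a population of a million needs 6 levels and falls outside the window.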

The implication is that large populations cannot sustain deep patronage hierarchies: the hierarchy tax accumulates exponentially, and the targeted channel eventually becomes unviable. What replaces it depends on which channels are less attenuated. If public goods pass through the hierarchy with lower attenuation than targeted transfers (our modeling assumption), then large selects for public goods provision⁸.

We can classify governance structures along these two axes⁹.

	Small (private goods)	Large (public goods)
Shallow ()	Personalist dictatorships, city-states. No hierarchy tax. Stable as long as is manageable.	Direct democracies, Swiss cantons. Stable but scale-limited: flat structure can’t coordinate large .
Deep ()	Feudalism, tributary empires, patronage networks. Fragile: diverges toward the pole.	Federal democracies, imperial bureaucracies with civil service. Viable because broadly distributed goods sidestep the attenuation problem.

Decision Count Analysis

How many decisions does a hierarchy involve? In an Olsonian public goods game (stag hunt) with agents, the answer is simply : each agent makes one symmetric binary decision (contribute or free-ride). There is no distinguished agent, no allocation variable, and no asymmetry.

The selectorate model breaks this symmetry. There are two types of decisions. Each leader makes an allocation decision (choose ), and each coalition member makes a loyalty decision (loyal or defect).

In the feudal nesting model, suppose each level has uniform selectorate size $S$ and coalition size $W$ (so the coalition ratio is $r = W/S$ at every level):

Level 1: 1 leader allocates, $W$ coalition members each decide loyalty. Total: $1 + W$ decisions.
Level 2: $W$ sub-leaders each allocate, each with $W$ coalition members deciding loyalty. Total: $W + W^2$ decisions.
Level $k$: $W^{k-1}$ allocation decisions + $W^k$ loyalty decisions.

The total decision count across $n$ levels is:

$$\sum_{k=1}^{n} \left(W^{k-1} + W^{k}\right) = (1 + W)\,\frac{W^{n} - 1}{W - 1}$$

Loyalty decisions dominate by a factor of $W$. For every leader choosing how to split a budget, there are $W$ agents deciding whether to stay or defect. The total population (citizens at the bottom who don’t lead anyone) scales as $W^n$.
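The level-by-level count is easy to tabulate. The sketch below assumes a uniform coalition size, written `W` here, and `n` levels. Both symbols and the function name are illustrative choices.

```python
def decision_counts(W, n):
    """Return (allocation, loyalty) decision totals over n levels.

    Level k has W ** (k - 1) leaders, each making one allocation decision,
    and W ** k coalition members, each making one loyalty decision.
    """
    if W < 2 or n < 1:
        raise ValueError("need W >= 2 and n >= 1")
    allocation = sum(W ** (k - 1) for k in range(1, n + 1))
    loyalty = sum(W ** k for k in range(1, n + 1))
    return allocation, loyalty


if __name__ == "__main__":
    W, n = 10, 3
    allocation, loyalty = decision_counts(W, n)
    total = allocation + loyalty
    # Geometric-series closed form for the total, as a cross-check.
    closed_form = (1 + W) * (W ** n - 1) // (W - 1)
    print(allocation, loyalty, total, closed_form)
```

For `W = 10` and `n = 3` there are 111 allocation decisions and 1110 loyalty decisions, so the loyalty count is exactly `W` times the allocation count.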

The attenuation result indicates that all micro-level decisions are compressed into a single effective parameter at the top, and so the top leader doesn’t need to know the internal politics of each sub-domain. They only need to know , which summarizes everything below into a single cost share. The continued-fraction recursion replaces exponentially many individual decisions with a small number of effective parameters. An Olsonian model with the same population would have the same number of decisions but no comparable compression. In a symmetric -player game, you can exploit symmetry to characterize the equilibrium by a single mixed-strategy probability , but this is an analytical convenience for the modeler, not a structural feature of the game. Finding requires solving the full system simultaneously. No agent inside the game has privileged access to the compressed description, and there is no modular decomposition, so you cannot solve “part of the game” independently and feed the result into another part. In the selectorate model, each sub-level’s equilibrium is computed from and , producing , which feeds into the level above. Budget invariance ensures the decomposition is one-directional: the sub-problem doesn’t depend on the top-level solution.

More broadly, is a lossless compression of sub-level coalition politics for the specific query “what does the top leader need?” The “source” is the full specification of sub-level strategies, payoffs, and equilibria; the “compressed representation” is a single scalar; and the distortion is zero for this query. For other queries (distributional outcomes at the bottom, total public goods delivered, probability of revolt) the compression is lossy. This is a special case of a more general question: given an -player game, when can you compress a coalition of players into an effective agent with fewer parameters while preserving the equilibrium structure at coarser levels? The selectorate model is compressible due to linearity and budget invariance. In general, the compression will be lossy¹⁰.

The decision count also constrains which hierarchies are feasible. Deeper hierarchies require exponentially larger populations, which is why deep patronage hierarchies are historically associated with empires rather than city-states.

Conclusion

The selectorate model distills the logic of political survival into a small set of parameters. When the winning coalition is small relative to the selectorate , private loyalty-buying is cheap and the leader retains wide discretion. As grows, the cost of targeted transfers increases (), squeezing rents. In the linear model, this does not by itself produce public goods: the incumbent uses targeted transfers throughout unless (an extreme corner). The classic BdM result, where public goods provision increases smoothly with , requires concavity in . What the linear model does establish cleanly is the challenger’s best-response switch at and the three-coefficient geometry.

The hierarchical composition result is a separate mechanism. Depth compounds the effective cost through the continued-fraction recursion, which can make the targeted channel unviable even when no single level does. This pushes deep hierarchies toward channels with lower attenuation (conditionally on public goods being less attenuated than targeted transfers, which is a modeling assumption, not a theorem).

The leader’s problem is to design an incentive scheme that induces loyalty among a coalition of agents. The model provides a closed-form solution for the minimum cost, and characterizes how it changes with institutional parameters. The hierarchical composition shows how local incentive problems aggregate into global constraints.

Is this continued-fraction structure specific to linear payoffs and budget invariance, or is it generic to any model where local equilibrium constraints rescale upstream transfers? The hierarchy composition acts via (see appendix) on the effective cost share, with each level contributing a non-diagonal matrix .

The channel-compression and hierarchical Möbius structure derived here rest on linearity, budget invariance, and symmetric inclusion. Whether similar renormalization-style recursions survive under more general transfer technologies or informational frictions remains an open question and suggests a broader research program.

Caveats

The selectorate model is elegant and generates sharp predictions. It is also, in certain popular treatments, sometimes oversold.

Binary loyalty. Coalition members choose Loyal or Defect. Real political actors face a spectrum of options: partial cooperation, conditional support, hedging, signaling.
No information asymmetry. Everyone observes the leader’s allocation , the challenger’s strategy, and the coalition structure. Real authoritarian politics is rife with private information: leaders don’t know who is truly loyal, coalition members don’t know the leader’s true budget, and challengers can’t credibly commit to future allocations. Models that incorporate these features (e.g., Egorov and Sonin, 2009, on dictators and their viziers) yield richer and sometimes different predictions.
Linear public goods. With linear payoffs, the leader always uses one channel or the other, never a mix, and uses targeted transfers exclusively unless (typically infeasible). The headline qualitative result, that large coalitions push leaders toward public goods, does not follow from the linear model; it requires concavity in , which produces interior solutions with increasing smoothly in . Concave returns can also generate public goods provision outside the democratic region, since the high marginal return at low can justify some public spending even when . The linear model captures the regime switch and the three-coefficient geometry but misses this interior structure.
Exogenous institutions. The model takes and as given. But real leaders actively manipulate these parameters: expanding the selectorate (extending suffrage), shrinking the coalition (purging rivals), creating new institutional structures. Endogenizing and is a much harder problem¹¹.

Oversimplified mapping to real regimes. Popular presentations sometimes map ratios too directly onto regime types: “democracy = large , dictatorship = small .” Reality is messier. Some democracies have effectively small winning coalitions (gerrymandered single-party states); some autocracies maintain large coalitions (Singapore’s PAP). The model provides useful intuition about incentives but should not be mistaken for a precise taxonomy of political systems.

None of this invalidates the model. The selectorate framework remains one of the most productive formal theories in comparative politics. But its predictions are best understood as comparative statics about incentives, not as iron laws of political behavior.

Appendix

Additional Model Analysis

Projective Structure

The minimum transfer share has clean structural properties:

Budget Invariance

$B$ cancels completely. A rich autocracy and a poor autocracy have identical equilibrium shares. This is a formal version of the “institutions, not resources” thesis. A single source of wealth (e.g. oil) increases $B$ without changing $r$, so the discretionary surplus grows proportionally. Foreign aid has the same problem if it enters as $B$. More money flowing to a small-coalition regime likely makes governance worse, not better.

Population Scaling

leaves invariant. Only the ratio between and matters.

Linearity and Projective Structure

In the one-decision model, is simply linear. The interesting structure emerges when we consider hierarchy. The recursion is a Möbius transformation in . Work in projective coordinates: identify nonzero vectors for . On the affine chart , the coordinate is .

To see the matrix structure, represent as the vector . A matrix sends , which corresponds in the affine chart to the value . The recursion gives , , , :

For example, sends , representing . For levels, the composition corresponds to the matrix product:

Write the bottom parameter as the vector . Then

and the effective share is the affine coordinate (when ).

These matrices are not diagonal; composition is genuinely , not just scaling. For three levels:

Applied to , this gives . The off-diagonal entries are real. The pole at (where the sub-hierarchy consumes the entire transfer) is a fixed point of the group action. The viability boundary is the condition that the effective cost share does not exceed the budget¹².
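To check that the matrix picture and the continued fraction agree, here is a small sketch in exact rational arithmetic. It assumes each level acts as the Möbius map $x \mapsto r_k/(1-x)$, i.e. matrix entries $a=0$, $b=r_k$, $c=-1$, $d=1$, read off from the recursion in the Code section. All function names are hypothetical.

```python
from fractions import Fraction


def level_matrix(r):
    """2x2 matrix of the Mobius map x -> r / (1 - x)."""
    return ((Fraction(0), Fraction(r)), (Fraction(-1), Fraction(1)))


def matmul(A, B):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def r_eff_matrix(shares):
    """Top-level effective share via the matrix product (shares top to bottom)."""
    *upper, bottom = [Fraction(s) for s in shares]
    M = ((Fraction(1), Fraction(0)), (Fraction(0), Fraction(1)))
    for r in upper:
        M = matmul(M, level_matrix(r))
    # Apply M to the projective vector (bottom, 1) and read the affine chart.
    num = M[0][0] * bottom + M[0][1]
    den = M[1][0] * bottom + M[1][1]
    if den == 0:
        raise ZeroDivisionError("hierarchy sits exactly on the pole")
    return num / den


def r_eff_recursive(shares):
    """Same quantity via the continued-fraction recursion, bottom to top."""
    shares = [Fraction(s) for s in shares]
    x = shares[-1]
    for r in reversed(shares[:-1]):
        x = r / (1 - x)
    return x


if __name__ == "__main__":
    shares = [Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)]
    print(r_eff_matrix(shares), r_eff_recursive(shares))
```

For three levels with shares 1/10, 1/5, 3/10 both routes give 7/50, matching $r_1(1-r_3)/(1-r_2-r_3)$ worked out by hand.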

With spending channels, the loyalty constraint is a linear inequality in spending variables, and the constraint coefficients live in . The natural conjecture is that hierarchical composition acts via on this coefficient space. This post derives the single-channel case; the general multi-channel composition remains open.

Code

We can encode the model as a differentiable PyTorch module, making the comparative statics computable rather than just algebraic. Caveat Emptor: Claude wrote this code.

from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class SelectorateEquilibrium:
    p_min: torch.Tensor              # minimum targeted transfer
    rents: torch.Tensor              # B - p_min
    coalition_payoff: torch.Tensor   # p_min / W
    defection_payoff: torch.Tensor   # B / S
    inclusion_prob: torch.Tensor     # W / S
    loyalty_margin: torch.Tensor     # coalition - defection

class SelectorateModel(nn.Module):
    def __init__(self, W=10.0, S=100.0, B=100.0):
        super().__init__()
        self.W = nn.Parameter(torch.tensor(W))
        self.S = nn.Parameter(torch.tensor(S))
        self.B = nn.Parameter(torch.tensor(B))

    @property
    def r(self):
        return self.W / self.S

    def p_min(self):
        return self.B * self.r

    def tau(self):
        r = self.r
        return 1.0 / torch.clamp(1.0 - r, min=1e-8)

    def kappa(self):
        """Attenuation factor: fraction of transfer that passes through."""
        return 1.0 - self.r

    def forward(self):
        p = self.p_min()
        rents = self.B - p
        coalition_pay = p / self.W
        defection_pay = self.B / self.S
        return SelectorateEquilibrium(
            p_min=p, rents=rents,
            coalition_payoff=coalition_pay, defection_payoff=defection_pay,
            inclusion_prob=self.r,
            loyalty_margin=coalition_pay - defection_pay,
        )

The module is just the closed-form formula, wrapped so that PyTorch’s autograd can differentiate through it:

model = SelectorateModel(W=10.0, S=100.0, B=100.0)
eq = model()
eq.rents.backward()
print(f"d(rents)/dW = {model.W.grad:.4f}")  # negative: more W, less rents

Each additional coalition member decreases rents, confirming that larger coalitions squeeze the leader’s discretionary surplus.

There is no separate hierarchical model. A hierarchy is just composition of flat selectorates. Each level has its own SelectorateModel. Hierarchical composition follows the continued-fraction recursion $r_k^{\text{eff}} = r_k / (1 - r_{k+1}^{\text{eff}})$, not multiplication of per-level factors:

def r_eff_from_levels(*levels):
    """
    levels ordered top to bottom.
    returns the top-level effective share r_eff under the recursion
    r_n^eff = r_n,  r_k^eff = r_k / (1 - r_{k+1}^eff).
    """
    x = levels[-1].r
    for level in reversed(levels[:-1]):
        x = level.r / torch.clamp(1.0 - x, min=1e-8)
    return x

def p_min_composed(top, *sub_levels):
    """Top-level p_min accounting for hierarchy composition."""
    r_eff = r_eff_from_levels(top, *sub_levels)
    return top.B * r_eff

# Two-level hierarchy: autocratic sub-leaders
top = SelectorateModel(W=10.0, S=100.0, B=100.0)
sub_auto = SelectorateModel(W=10.0, S=100.0)
print(f"kappa = {sub_auto.kappa():.3f}")                     # 0.900
print(f"p_min = {p_min_composed(top, sub_auto):.3f}")         # 11.111

# Democratic sub-leaders
sub_dem = SelectorateModel(W=80.0, S=100.0)
print(f"kappa = {sub_dem.kappa():.3f}")                       # 0.200
print(f"p_min = {p_min_composed(top, sub_dem):.3f}")          # 50.000

# Gradient: how does sub-level democratization affect top cost?
p_min_composed(top, sub_auto).backward()
print(f"dp/dW2 = {sub_auto.W.grad:.4f}")

Because everything is differentiable, we can compute how sensitive the top-level equilibrium is to sub-level institutional changes. The gradient tells us how much expanding the sub-level coalition costs the top leader.
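As a sanity check that does not depend on PyTorch, the same gradient can be computed by hand for the two-level example above. This sketch assumes the two-level cost $p = B\,r_1/(1-r_2)$ with $r_i = W_i/S_i$, which is what the recursion gives for two levels. The function names are illustrative.

```python
def p_min_two_level(W1, S1, B, W2, S2):
    """Top-level minimum transfer for a two-level hierarchy."""
    r1, r2 = W1 / S1, W2 / S2
    return B * r1 / (1.0 - r2)


def dp_dW2_analytic(W1, S1, B, W2, S2):
    """Derivative of p_min with respect to the sub-level coalition size W2."""
    r1, r2 = W1 / S1, W2 / S2
    return B * r1 / (S2 * (1.0 - r2) ** 2)


def dp_dW2_numeric(W1, S1, B, W2, S2, h=1e-4):
    """Central finite-difference estimate of the same derivative."""
    up = p_min_two_level(W1, S1, B, W2 + h, S2)
    down = p_min_two_level(W1, S1, B, W2 - h, S2)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    args = (10.0, 100.0, 100.0, 10.0, 100.0)   # W1, S1, B, W2, S2
    print(f"analytic = {dp_dW2_analytic(*args):.4f}")
    print(f"numeric  = {dp_dW2_numeric(*args):.4f}")
```

With the autocratic sub-level parameters the analytic value is 10/81, roughly 0.1235, and the autograd result should agree up to float32 precision.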

AI Disclosure

I used Claude to help draft, revise, and edit this essay. Claude wrote the caveats section and the code. I did ideation, and also made significant edits, reviews, and revisions to the text.

Novelty

The channel attenuation framing, the compression of uniform-transfer subset channels to three coefficients (, , ), the continued-fraction recursion , the representation of hierarchy composition via , and the renormalization interpretation are, to the author’s knowledge, novel observations that do not appear in the original selectorate theory literature. The differential attenuation insight (targeted channels degrade through hierarchy while universal channels do not) is a modeling assumption that generates the depth-selects-channel-switching result. The generalization to multi-channel composition is conjectured but not derived here.

Footnotes

The full model includes additional complexities such as multiple rounds, discounting, and the possibility of the challenger being a former coalition member. For now, we’ll focus on the static version to learn about the core insights.↩︎
This is a key modeling assumption and an upper bound. The challenger has no particular loyalty to those who helped them seize power and can pick any out of . Note that doesn’t necessarily equal the incumbent’s ; the challenger can form a coalition of a different size. Crucially, the standard BdM model does not explicitly model punishment of the defectors for failed defection. The entire cost of defection comes from the inclusion lottery. In reality, failed defectors in autocracies are purged, imprisoned, or killed, which would introduce a deposition probability and a punishment payoff into the constraint.↩︎
Using makes this technically the maximum payout for the defector. This is desired as it makes the game “adversarial” for the leader. As we see, the size of the challenger’s coalition ends up not affecting the expected payoff. This is a consequence of the uniformity assumption we made. If inclusion were non-uniform (e.g., the challenger preferentially recruits defectors), would matter and the bound would tighten for the leader. This also creates incentives for the leader to equally distribute private goods among coalition members, to minimize the chance of a “weak link” spoiling their coalition.↩︎
The linear specification produces corner solutions: both the challenger and incumbent solve linear programs over the spending simplex and pick corners. The challenger’s best-response switch at exists without concavity, but the leader always uses one channel or the other, never a mix. Concave (e.g., with ) produces interior solutions with increasing smoothly in , and operates across the full parameter range. This is the classic BdM result. The original Logic of Political Survival handles the general case.↩︎
In BdM, the result is a bit more realistic due to concavity in . This produces interior solutions where the leader provides a mix of public and private goods. With concave , the optimum satisfies : the marginal loyalty per dollar must equalize across channels. As grows, shrinks, so increases smoothly. Concavity can also generate public goods provision outside the democratic region, since the high marginal return at low can justify some public spending even when . The linear model captures the regime switch but misses this interior structure. For us, the key point is that the challenger and incumbent optimize across channels, and the selectorate geometry compresses to three coefficients: , , and .↩︎
We could generalize even further to allow for arbitrary inclusion probabilities, per-member payoffs, multiple types of currencies, etc., but the uniform-inclusion assumption suffices to show the compression result.↩︎
This is the cleanest case, but you can imagine parallel hierarchies (overlapping sub-selectorates, matrix organizations) or cross-level externalities that break the modular structure. Corporate conglomerates and federal systems with concurrent jurisdiction are examples where the parallel case matters.↩︎
The same typology applies to firms. Map to key employees whose departure threatens the firm, to the labor market, to the compensation budget, to targeted retention (bonuses, equity grants), and defection to leaving for a competitor. Startups are flat autocracies (founder and a few key people, targeted equity, “founder-mode”). Partnerships and cooperatives are flat democracies (broad profit-sharing). Conglomerates with deep management layers and patronage-heavy compensation (GE under Welch) occupy the fragile quadrant. Large tech companies with broad equity compensation occupy the viable one. The hierarchy tax predicts that middle managers consume transfers before passing them down, attenuating the targeted channel, which is why deep corporate hierarchies either move toward broad compensation or suffer talent drain at the bottom.↩︎
The same typology applies to firms. Map to key employees whose departure threatens the firm, to the labor market, to the compensation budget, to targeted retention (bonuses, equity grants), and defection to leaving for a competitor. Startups are flat autocracies (founder and a few key people, targeted equity, “founder-mode”). Partnerships and cooperatives are flat democracies (broad profit-sharing). Conglomerates with deep management layers and patronage-heavy compensation (GE under Welch) occupy the fragile quadrant. Large tech companies with broad equity compensation occupy the viable one. The hierarchy tax predicts that middle managers consume transfers before passing them down, attenuating the targeted channel, which is why deep corporate hierarchies either move toward broad compensation or suffer talent drain at the bottom.↩︎
This connects to a broader research programme I’ve been thinking about on coalition formation as information compression. The general claim is that whenever maintaining cooperation is a control problem under uncertainty, viable coalitions are those that achieve the target cooperative outcome with minimal information rate, but that’s beyond the scope of this note.↩︎
Bueno de Mesquita and Smith’s “Political Survival and Endogenous Institutional Change” (2005) makes some progress on this, modeling institutional change as an equilibrium outcome. But the endogenous-institutions version is considerably more complex and less clean than the baseline model.↩︎
The matrix representation makes the algebraic structure explicit. Each level contributes ; the total composition is ; the viability boundary is ; and the pole at is a fixed point. Population scaling acts trivially on , confirming that only the projective coordinate matters.↩︎

Do We See the Same Colors?

Thu, 12 Feb 2026 05:00:00 GMT

Introduction

Neither would it carry any Imputation of Falshood to our simple Ideas, if by the different Structure of our Organs, it were so ordered, That the same Object should produce in several Men’s Minds different Ideas at the same time; v.g. if the Idea, that a Violet produced in one Man’s Mind by his Eyes, were the same that a Marigold produced in another Man’s, and vice versa. - John Locke, Essay Concerning Human Understanding (1690)

What if your “red” is my “blue”?

The “inverted spectrum” thought experiment is an old favorite among philosophers of mind, cognitive scientists and undergraduates souped up on cannabis. The concept is simple: maybe the colors you see are systematically switched around relative to the colors I see. That is, your internal experience of “red” is my internal experience of “blue”. When I see a ripe tomato, it’s the color you call “blue”, and vice versa. But because we all call the sky “blue” and refer to tomatoes as “red”, our naming systems are equally permuted, so no one can tell the difference.

This is a canonical example for the “hard problem of consciousness”. Since there’s a gap between the physical process the eyes and optic nerve use to process color, and the subjective experience of color perceived within a consciousness, it’s considered impossible to run an experiment that resolves the inverted spectrum argument.

This post makes a mathematical argument (based on the geometry of color space) that there is no inverted spectrum, and that there is a set of experiments we can run to determine whether the argument holds water.

The Thought Experiment

The “functionalist” view holds that mental states are defined by their functional roles. If two people have functionally identical behaviors, then they have the same mental states. The “qualia realist” view holds that mental states have intrinsic qualitative properties, and that the internal qualia of experience goes beyond functional roles¹. In theory, two functionally identical systems could differ in their qualia.

The inverted spectrum requires a systematic remapping of colors such that every perceptual relationship is preserved. If even one relationship breaks (say, two colors that used to look the same now look different), then the inversion is detectable, and the thought experiment fails.

We’ll assume that the functional role of a color experience is fully captured by its position in the subject’s perceptual similarity structure. That is, the complete pattern of “how different does this color look from every other color?” If that’s right, then preserving all perceptual relationships means preserving functional role.

So the inverted spectrum reduces to a precise mathematical question: does there exist a non-trivial remapping of color space that preserves all perceptual relationships? If yes, functionalism is in trouble. If it is not possible, the case for qualia as something over and above functional structure is weakened.

The Shape of Color

Color Wheel

The question of “how different do two colors look?” is an empirical science.

In color science, the basic unit of measurement is the just-noticeable difference (JND), which is the smallest change in a stimulus that a subject can reliably detect. This is typically measured by taking a color patch and slowly changing its wavelength, structure, or brightness until the subject notices. JNDs define a measurable notion of distance in color space.

Given some notion of measurement, what does “remapping colors” mean precisely? Imagine a function that sends each color to a different color. The inverted spectrum claims there exists a non-trivial that preserves all pairwise perceptual distances.

A simple model of colors is the color wheel. On the color wheel, each color is represented as a point on a circle. The distance between colors is the angle between them.

The color wheel has a few natural candidate automorphisms: rotations, reflections, and complement maps.

Figure 1: Color-wheel permutations: identity, rotation, reflection, and complement map.

If the color wheel were the whole story, then these automorphisms would work. Rotations and reflections are isometries of the circle. The inverted spectrum would be trivially possible, and pure functionalism would be in trouble.
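The claim is easy to verify numerically on the cartoon model. The sketch below assumes hues are angles in degrees and that distance is the shorter arc between them. The helper names are made up for illustration.

```python
def hue_distance(a, b):
    """Angular distance between two hues in degrees, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def rotate(h, shift):
    """Rotate a hue around the wheel by `shift` degrees."""
    return (h + shift) % 360.0


def reflect(h, axis=0.0):
    """Reflect a hue across the diameter through `axis` degrees."""
    return (2.0 * axis - h) % 360.0


def complement(h):
    """Send a hue to the opposite side of the wheel (a 180 degree rotation)."""
    return rotate(h, 180.0)


def preserves_distances(f, hues, tol=1e-9):
    """True if f leaves every pairwise hue distance unchanged."""
    return all(
        abs(hue_distance(f(a), f(b)) - hue_distance(a, b)) <= tol
        for a in hues
        for b in hues
    )


if __name__ == "__main__":
    hues = [0.0, 30.0, 75.0, 120.0, 200.0, 310.0]
    print(preserves_distances(lambda h: rotate(h, 40.0), hues))
    print(preserves_distances(reflect, hues))
    print(preserves_distances(complement, hues))
    print(preserves_distances(lambda h: (2.0 * h) % 360.0, hues))
```

Rotation, reflection, and the complement map all pass, while a map that doubles every hue angle fails, since it stretches the distance between neighboring hues.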

But the color wheel is a cartoon model of human color perception. Does the empirical distance function on color space have any symmetries?

Chromaticity Diagrams

The CIE (Commission Internationale de l’Eclairage) chromaticity diagram, introduced in 1931, maps the visible colors onto a two-dimensional space². But the Euclidean distances in this diagram do not correspond to perceptual distances. Two colors that look wildly different might be close together in the diagram, and two that look similar might be far apart.

Figure 2: CIE 1931 Chromaticity Diagram. Euclidean distances in this space do not correspond to perceptual distances.

The mismatch between coordinate distance and perceptual distance means the metric changes from place to place. In 1942, David MacAdam measured this directly³. At various points in the CIE diagram, he tested subjects’ ability to distinguish a color from a central color. Near green, the subjects were bad at discriminating, but near blue-violet, the subjects were very sensitive. The equivalent regions form ellipses of varying size, shape, and orientation across the diagram. These are the MacAdam ellipses.

Figure 3: MacAdam ellipses (shown at 10x actual size) on the CIE 1931 chromaticity diagram. The ellipses vary in size, shape, and orientation — the metric is non-uniform.

Subsequently, the CIE has released a series of increasingly sophisticated color difference formulas over the decades: CIELAB (1976), CIE94 (1994), and CIEDE2000 (2000)⁴. Each represents an improved approximation to the true perceptual metric, incorporating additional empirical data about how humans discriminate colors under various conditions. All of them confirm and refine MacAdam’s basic finding: the perceptual metric on color space is non-uniform and varies from region to region.

Figure 4: CIEDE2000 discrimination ellipses (approximate, projected onto CIE xy). More uniform than MacAdam’s original measurements, but still position-dependent. The metric has no global symmetry.

So we can think of color as a Riemannian manifold. The metric tensor encodes JND structure at each point, with large eigenvalues where discriminability is fine and small eigenvalues where discriminability is coarse. The MacAdam ellipses directly determine at each point, as the ellipse of just-noticeable differences is the unit ball of the local metric⁵.
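Concretely, an ellipse with semi-axes $a$, $b$ and orientation $\theta$ corresponds to the metric $g = R\,\mathrm{diag}(1/a^2, 1/b^2)\,R^\top$, where $R$ is the rotation by $\theta$. The sketch below builds that tensor and confirms that points on the ellipse have length 1. The ellipse parameters are made-up numbers rather than MacAdam's measurements, and the function names are hypothetical.

```python
import math


def metric_from_ellipse(a, b, theta):
    """Components (g11, g12, g22) of the metric whose unit ball is the ellipse.

    a, b: semi-axis lengths; theta: orientation of the a-axis in radians.
    """
    c, s = math.cos(theta), math.sin(theta)
    la, lb = 1.0 / a ** 2, 1.0 / b ** 2      # eigenvalues of the metric
    g11 = la * c * c + lb * s * s
    g12 = (la - lb) * c * s
    g22 = la * s * s + lb * c * c
    return g11, g12, g22


def metric_length(g, dx, dy):
    """Length of the displacement (dx, dy) under the metric g."""
    g11, g12, g22 = g
    return math.sqrt(g11 * dx * dx + 2.0 * g12 * dx * dy + g22 * dy * dy)


def ellipse_point(a, b, theta, t):
    """Point on the rotated ellipse at parameter t, relative to its center."""
    c, s = math.cos(theta), math.sin(theta)
    u, v = a * math.cos(t), b * math.sin(t)
    return u * c - v * s, u * s + v * c


if __name__ == "__main__":
    a, b, theta = 0.003, 0.001, 0.6          # illustrative values only
    g = metric_from_ellipse(a, b, theta)
    for t in (0.0, 1.0, 2.5, 4.0):
        dx, dy = ellipse_point(a, b, theta, t)
        print(round(metric_length(g, dx, dy), 9))
```

A small ellipse (fine discrimination) gives large eigenvalues $1/a^2$ and $1/b^2$, matching the description above.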

A spectrum inversion that preserves all perceptual distances is an isometry with . But it’s also true that⁶ for a generic Riemannian metric on a manifold of dimension , the only isometry is the identity.

Theorem. Let be a smooth manifold of dimension . The set of Riemannian metrics on whose isometry group is trivial (i.e., , consisting of only the identity) is generic: it is a residual set (countable intersection of open dense sets) in the space of all smooth metrics on , equipped with the topology.

In plain language: if you pick a Riemannian metric “at random”⁷ from the space of all possible metrics, it will almost certainly have no non-trivial isometries. The only distance-preserving map from the space to itself will be the map that sends every point to itself.

In the Appendix, we provide computational evidence for this by fitting metric tensors from MacAdam’s ellipse data and searching for Killing vector fields (generators of continuous symmetries). No non-trivial solutions are found, consistent with a trivial continuous isometry group.

The Philosophical Payoff

What does this mean for the debate between functionalism and qualia realism?

The Riemannian argument shows that the inverted spectrum is not possible. The relational structure is rich enough to pin down the identity of each color up to the trivial isometry. There is no room for a non-trivial automorphism. If functional role includes the full discriminability structure, then two people who share the same color metric have the same color experiences. At least for color, the inverted spectrum is ruled out, and the case for functionalism over qualia realism is strengthened.

This is also evidence for structuralism. If the relational structure of color space has no non-trivial automorphisms, then there is no sense in which two colors could be “swapped” while preserving all the perceptual relations. In some sense, the “what it’s like” of red may be defined by its position in the web of perceptual relations⁸.

Individual Variation

The computation above addresses within-subject symmetry. Given one person’s color metric, can that person’s own color space be nontrivially remapped onto itself? But the assumption that “any two people [have] the same perceptual distance function” is doing a lot of work. The between-subject question is different. If two people have different JND structures (different metrics), can we still compare their color experiences?

Humans are not all alike. To start, trichromats, dichromats (i.e. color blind individuals), anomalous trichromats, and tetrachromats (people with four types of cone cells) all have different color spaces with different metrics. Secondly, the argument does not directly say that two different people must have the same color experiences. Two different people have two different manifolds with two different metrics. Comparing across individuals requires more than isometry theory: it requires some way to identify corresponding points across different metric spaces.

At a biological level, people have differing cone distributions, cone spectral sensitivities, lens and macular filtering, and different rod contributions in low-light regimes. Those differences imply slightly different empirical metrics. So the right conclusion is not “everyone sees exactly the same colors” but that (among people in the same phenotypic category) colors differ by $\varepsilon$-level distortion.

In short: we probably do see slightly different colors.

Conclusion

The inverted spectrum thought experiment asks: could two people have systematically different color experiences while being functionally identical? The traditional assumption is that this question is permanently open.

But color space is an empirical object with measurable geometry. The MacAdam ellipses show that this geometry is non-uniform and position-dependent, and a generic metric with these properties has no non-trivial isometries. If the color metric is generic (and the data strongly suggests it is) then there is no way to remap colors while preserving all perceptual distances.

We probably see close to, but not exactly, the same colors.

Caveats and Extensions

Is the Color Metric Actually Generic?

The theorem says a generic metric has no non-trivial isometries. But “generic” is a topological claim about the space of all metrics. The empirical question is whether the actual color metric, the one determined by MacAdam ellipses and CIE formulas, is in this generic set.

The empirical evidence is strongly suggestive. The MacAdam ellipses vary substantially and irregularly across the chromaticity diagram. There is no obvious axis of symmetry, no rotational invariance, no discrete symmetry group that jumps out of the data. But “strongly suggestive” is not a proof. In the Appendix, we search for Killing vector fields (generators of continuous symmetries) by fitting metric tensors from MacAdam’s ellipse data. No non-trivial solutions are found within a polynomial ansatz, which is suggestive but not conclusive, since the search is restricted to a finite-dimensional function class, and Killing fields only rule out continuous symmetries, not discrete ones.

Approximate Isometries

Perhaps the strongest objection: what if the inverted spectrum does not need to be exact? What if an approximate inversion is enough?

Define an -isometry as a diffeomorphism such that:

For small , this is a map that almost preserves all perceptual distances. Could such a map exist even when exact isometries do not?

Possibly. If there exists a map that permutes colors while distorting each JND distance by a small amount, it might be the case that the distortion is below the threshold of detectability. This would give a “fuzzy” inverted spectrum. The inversion would be imperfect, but close enough to be undetectable in practice.

Whether such approximate isometries exist for the empirical color metric is an open question. It depends on how “close” the metric is to one with symmetry. One might study this via the space of Killing-like vector fields that satisfy for small , an area sometimes called “approximate symmetry” in the physics literature.

If approximate isometries exist, the philosophical upshot is more subtle, and the debate would shift from “is inversion possible?” to “how much perceptual distortion is compatible with behavioral indistinguishability?”

Does JND Capture Everything?

The argument assumes that JNDs capture all perceptually relevant structure. But color experience might have structure that is not captured by pairwise discriminability. For example, there could be higher-order relations, temporal dynamics, or categorical boundaries that are not captured by JNDs alone.

For example, the “distance” between red and green might not fully capture the fact that they are “opponent” colors in a way that red and blue are not. Color opponency creates categorical structure that may not reduce to metric distances. However, opponent processing is itself a structural property. If red-green opponency is a feature of the wiring of the visual system, it provides additional constraints that make non-trivial remappings harder, not easier.

In general, if we enrich the structure of color space beyond the Riemannian metric, the automorphism group gets smaller, not larger. So these considerations, if anything, strengthen the conclusion.

Is Color Space Even Riemannian?

Recent work by Bujack et al. (2022) argues that perceptual color space is not Riemannian at all⁹. Large color differences are perceived as less than the sum of small differences, creating a “diminishing returns” effect that violates the path-additivity required by Riemannian geometry. If that’s correct, the MacAdam-ellipse metric is only valid locally (for small differences), and the global geometry requires a different framework.

If anything, this also strengthens the argument. The Riemannian case is the most symmetric possibility: a smooth, well-behaved metric with clean transformation properties. A non-Riemannian structure with diminishing returns is more irregular, making non-trivial distance-preserving self-maps even harder to construct. The Killing field computation above uses only local metric data and remains valid regardless.

Other Sensory Modalities

Does the argument generalize?

Pitch: Pitch perception defines a metric space (JNDs for frequency discrimination). The octave structure introduces a periodicity, but the metric is non-uniform within an octave. Does pitch space have non-trivial isometries? The octave equivalence might generate a discrete isometry (translation by one octave), but this is not a “full inversion” and its existence is already part of the known structure.
Taste and Smell: Taste and smell are high-dimensional and poorly characterized metrically. The argument applies in principle, but we lack the empirical data to say whether the taste metric is generic.
Pain: Pain has an intensity metric but unclear spatial or qualitative geometry. Same caveat.

The Quidditism Response

A committed quidditist can simply deny the premise. Qualia, they might say, have non-structural properties that no relational structure can capture. Even if the metric pins down the structural identity of each color, the intrinsic feel could still differ.

This is logically consistent. But it comes at a cost. If qualia have properties that make no difference to any relational, discriminative, or behavioral fact, then those properties are by definition epiphenomenal: they have no causal powers and no detectable consequences. The quidditist is committed to the existence of properties that are undetectable.

AI Disclosure

I used Claude to help research, draft, and edit this essay, based on my notes. Claude also devised several of the caveats and extensions, which I then edited and expanded on. Claude wrote the Killing field code in the appendix, which I then verified and modified. The related work survey was produced using ChatGPT Deep Research and then edited for accuracy.

Appendix A: Computing the Killing Fields

The theorem tells us that generic metrics have no symmetries. But is the empirical color metric generic? We can check directly by fitting a metric tensor from the MacAdam ellipse data and solving for Killing vector fields.

Each MacAdam ellipse defines the local metric tensor at its center: the ellipse of just-noticeable differences is the unit ball of the local metric. An ellipse with semi-axes , and orientation gives:

We interpolate between the 25 measurements to get a smooth metric field, then ask: does any smooth vector field generate a flow that preserves all distances? Such a field must satisfy the Killing equation, so the Lie derivative of the metric along vanishes:

In 2D this gives 3 equations at every point:

We parameterize as a degree-3 polynomial vector field (20 unknown coefficients), evaluate the Killing equation at 332 grid points inside the gamut ( constraints), and stack everything into an overdetermined linear system . If a non-trivial Killing field existed within this polynomial ansatz, would have a near-zero singular value. It doesn’t, though this is evidence against continuous symmetries within a restricted function class, not a proof of their absence.
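To make the linear-algebra step concrete, here is a minimal sketch of the same idea on toy metrics whose symmetries are known in advance. It is not the appendix code and it does not use the MacAdam data: the function names (`monomials`, `killing_matrix`, `null_dimension`), the three test metrics, the 9×9 grid, and the relative tolerance are all assumptions made for illustration. Each row encodes one component of the coordinate form of the Killing equation, X^k ∂_k g_ij + g_kj ∂_i X^k + g_ik ∂_j X^k = 0, evaluated at one grid point.

```python
import numpy as np

def monomials(x, y, degree=3):
    """Values and partial derivatives of all monomials x^p y^q with p + q <= degree."""
    vals, dxs, dys = [], [], []
    for p in range(degree + 1):
        for q in range(degree + 1 - p):
            vals.append(x**p * y**q)
            dxs.append(p * x**(p - 1) * y**q if p > 0 else 0.0)
            dys.append(q * x**p * y**(q - 1) if q > 0 else 0.0)
    return np.array(vals), np.array(dxs), np.array(dys)

def killing_matrix(metric, points, degree=3):
    """Stack the 3 independent Killing equations at every point into one matrix.

    Unknowns are the polynomial coefficients of X = (X^1, X^2):
    first the coefficients of X^1, then those of X^2.
    """
    rows = []
    for x, y in points:
        g, gx, gy = metric(x, y)          # metric and its x- and y-derivatives
        m, mx, my = monomials(x, y, degree)
        d = (mx, my)                      # d[0] = d/dx, d[1] = d/dy of each monomial
        for i, j in [(0, 0), (0, 1), (1, 1)]:
            a_part = m * gx[i, j] + g[0, j] * d[i] + g[i, 0] * d[j]
            b_part = m * gy[i, j] + g[1, j] * d[i] + g[i, 1] * d[j]
            rows.append(np.concatenate([a_part, b_part]))
    return np.array(rows)

def null_dimension(metric, points, degree=3, rel_tol=1e-8):
    """Count singular values that are numerically zero relative to the largest."""
    s = np.linalg.svd(killing_matrix(metric, points, degree), compute_uv=False)
    return int(np.sum(s < rel_tol * s[0]))

# Toy metrics (assumptions for illustration), each returning (g, dg/dx, dg/dy).
def flat(x, y):
    z = np.zeros((2, 2))
    return np.eye(2), z, z               # Euclidean plane: 2 translations + 1 rotation

def revolution(x, y):
    g = np.diag([1.0, 1.0 + x**2])       # depends on x only, so d/dy is a Killing field
    gx = np.diag([0.0, 2.0 * x])
    return g, gx, np.zeros((2, 2))

def lumpy(x, y):
    lam = np.exp(x**2 + 2 * y**2)        # conformal factor with no continuous symmetry
    eye = np.eye(2)
    return lam * eye, 2 * x * lam * eye, 4 * y * lam * eye

points = [(x, y) for x in np.linspace(-1, 1, 9) for y in np.linspace(-1, 1, 9)]

for name, metric in [("flat", flat), ("revolution", revolution), ("lumpy", lumpy)]:
    print(name, null_dimension(metric, points))
```

With a degree-3 ansatz there are 10 monomials per component, which gives the 20 unknowns mentioned above. The number of near-zero singular values is the dimension of the space of polynomial Killing fields the system admits, so the symmetric toy metrics should report a non-trivial null space and the lumpy one should not. The same caveat applies here as in the text: an empty null space only rules out Killing fields inside the chosen polynomial class.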

Step 1: Convert MacAdam ellipses to metric tensors and interpolate
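As a minimal sketch of the conversion half of this step (the interpolation is not shown), the snippet below builds the 2×2 metric tensor whose unit ball is a given ellipse. The parameter names `a`, `b`, and `theta` (semi-axes and orientation in radians) and both function names are assumptions for illustration, and no MacAdam measurements are included. The construction is the standard one: rotate the diagonal matrix diag(1/a², 1/b²) into the ellipse's orientation.

```python
import numpy as np

def ellipse_to_metric(a, b, theta):
    """Metric tensor g such that the ellipse is the set {v : v^T g v = 1}."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])                  # rotation by theta
    return R @ np.diag([1.0 / a**2, 1.0 / b**2]) @ R.T

def boundary_point(a, b, theta, t):
    """Point on the ellipse boundary at parameter t, for checking the metric."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return R @ np.array([a * np.cos(t), b * np.sin(t)])

g = ellipse_to_metric(3.0, 0.5, 0.7)
v = boundary_point(3.0, 0.5, 0.7, 1.2)
print(v @ g @ v)   # squared length of a boundary point in the local metric
```

Every boundary point of the ellipse has length one under `g`, which is exactly the statement that a just-noticeable step in any direction counts as one unit of perceptual distance.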

Functional Explanations of Art

Sun, 18 Jan 2026 05:00:00 GMT

Introduction

What is art’s purpose in society? Why do we spend so much time and money consuming, producing, and criticizing it? What separates “good” art from “bad” art (is there even such a thing?), and why do we honor and esteem “good” artists while ridiculing the bad ones? Why do we read movie reviews or discuss movies we’ve seen on internet message boards (especially if we’ve already seen the movie)? Can a urinal be art? Can a machine make art? And why does the internet hate Nickelback?

Theories about art’s function tend to fall into three main categories:

Art is for pleasure, as it induces a “pure aesthetic experience” (not unlike a drug).
Art is for signaling and communication (emotional, sexual, political, or otherwise), which includes communication for social coordination and maintaining social order.
Art is an “ontological research program” for learning “true information” about reality.

None of these ideas are new individually, but this essay argues that these three theories are actually all nested layers of a single, unified coevolutionary process.

Some information-communication processes coordinate groups around shared “meanings” in the short-run (or coordinate a single agent with its future self). “Good” processes are those that produce information with “useful meanings” that help a group to persist, whereas “bad” processes are those that are detrimental to group persistence. Therefore, there is selection pressure¹ for individuals with “taste” intuitions that track utility. In particular, the artistic process is driven by “entrepreneurial” intragroup status competitions among artists or tastemakers, who increase and decrease in status based on their ability to accurately predict the current and future consensus tastes of the group at large. On the longest timescales, experiential heuristics like pure aesthetic pleasure evolve for subconsciously recognizing the utility of information. “Art” is (extrabiological) information interacted with primarily through these taste mechanisms, rather than through direct instrumental evaluation and verification. The “fine arts” are the paradigmatic domain where taste-mediated evaluation dominates².

This essay extends and references several previous essays where I explored art as a kind of “ontological research program” using techniques inspired by the Library of Babel, information theory, and statistical mechanics³.

Foundations: Artifacts and Processes

The “fine arts” as a category began to be codified in 18th-century Europe, though the exact five varied by author⁴. Hegel’s Aesthetics, for instance, proposed architecture, sculpture, painting, music, and poetry as the fundamental forms. Various scholars have since attempted to update this taxonomy, especially as new technologies began to expand the space of possible art. For example, in “Manifesto of the Seven Arts” (1911, revised 1923) Ricciotto Canudo adds “dance” as a sixth art, and “cinema” as a seventh (made possible by inventions like the Lumière brothers’ Cinématographe in 1895).

Artifacts

What kinds of things count as art? Before asking what art is for, we need a rough inventory of what it is.

Data Types

For the purposes of this essay, I prefer a taxonomy that emphasizes art’s nature as information-bearing artifacts. In previous essays, I considered both texts and pictures as information-theoretic instantiations drawn from vast possibility spaces. Extending this framework, I propose organizing art by the “data type” of its output:

Literature (text, including poetry, prose, drama, stories, or any static string of symbols)
Visual arts (2D static fixed images, including photography, painting, digital art)
Sculpture (3D static objects, includes ceramics, jewelry, craft objects)
Architecture (modified (static) environments, includes buildings, landscapes, land art, interior design, public monuments, installation. Differs from sculpture in that the viewer is “inside” the art, rather than “outside”).
Music (time-based audio)
Film (time-based images)
Games (“interactive technology” or any other UX, which includes video games, board games, interactive fiction, net art, generative art, user interfaces, VR/AR experiences, software art, etc.)

The names of the categories are merely suggestive. Furthermore, many common forms are really hybrids of multiple data structure types⁵ (comics, “movies”, etc.)

The Missing Senses

Section added 3/1/2026.

I missed a few senses from the taxonomy above on my first pass through this essay (Hegel also missed these so I don’t feel too bad about it). There are entire artistic traditions built around taste, smell, and touch (and possibly more obscure senses). Consider the following:

Cuisine (gustatory arts): the composition of flavor and taste. Includes gastronomy, patisserie, mixology, tea ceremony, fermentation. A chef’s tasting menu can be as deliberately structured as a symphony.
Perfumery (olfactory arts): the composition of scent. Includes fragrance design, incense, and aromatics. A perfumer working with base, middle, and top notes is constructing a time-based experience not unlike a musical composition⁶.
Tactilia (haptic/somatic arts): art experienced primarily through touch and the body. Includes textile design, ceramics (as tactile objects, not just visual), fashion (as worn experience, not just seen), massage, and tactile installation art.
Thermae (thermoceptive arts): the deliberate design of thermal experience. The Japanese onsen, the Finnish sauna, the Roman bath, the hammam, the temperature of a served dish, the spiciness of a food (capsaicin literally activates the thermoceptive TRPV1 channel). These are elaborately designed sensory experiences with strong cultural and aesthetic traditions⁷.

And then there are the “exotic” sensory arts, increasingly speculative but not without real traditions:

Interoceptia (interoceptive arts): art that acts on internal bodily sensation. Guided breathwork, guided meditation, yoga pose design, fasting protocols, psychoactive ceremony, certain endurance rituals. The “artifact” is a reproducible way to guide a body or mind into a particular state⁸.
Proprioceptia (proprioceptive/kinesthetic arts): art experienced through the body’s sense of its own position and movement. Dance is partly this (from the dancer’s side, not the audience’s). Martial arts forms, tai chi, rock climbing routes, parkour lines. The aesthetic is in how the movement feels, not how it looks.
Nociceptia (nociceptive arts): art involving pain or extreme sensation. Tattooing, scarification, piercing, BDSM aesthetics, the “runner’s high” as designed experience. Overlaps with tactilia but the primary channel is nociceptive, not haptic⁹.
Vestibulia (vestibular arts): art experienced through balance and spatial orientation. Roller coasters, swing rides, acrobatics (from the performer’s side), spinning dances (Sufi whirling). The designed experience of falling, tilting, accelerating.

This list is probably not exhaustive. The boundaries between categories are blurry (is Sufi whirling vestibulia, interoceptia, or performance?), and there may be sensory dimensions I haven’t considered. The taxonomy is open.

Why were these left off the classical lists? These arts resist durable, scalable recording. You can’t transmit a smell over a wire. A recipe is a set of instructions (text), not the experience itself, in the same way that a musical score is not the music. But unlike music, we have no “playback device” for flavor or scent that can faithfully reproduce the original from a compressed encoding¹⁰. The arts that made it onto the classical lists are precisely those whose artifacts could survive transmission across time and space. The sensory arts are trapped in the present tense.

This has consequences for the theory. If art’s long-run value depends on persistence and transmission, then arts that can’t produce durable artifacts are at a structural disadvantage in the canonization process. Cuisine has no canon comparable to the Western literary or musical canon, not because food is less artful, but because individual dishes don’t survive long enough to be selected across generations. The art is remade each time from instructions. What persists is the recipe (text), the technique (embodied knowledge passed master to apprentice), and the tradition (institutional memory), but not the artifact itself.

Performance

There is one additional category we must consider, that doesn’t quite fit into the information-theoretic schema:

Performance (ephemeral live performance, including dance, theatre, live music, stand-up comedy, improvisational comedy¹¹). While performances can be documented (turning them into film, audio, or photographic artifacts), their core nature is transient.

Processes

The ephemeral nature of performance presents an issue for the account of art as durable, transmissible artifacts.

One possible resolution is to view artifacts as “frozen” performances. For example, a theatrical performance can be recorded as a combination of film and sound recordings. In this framing, the Lumière brothers didn’t add a seventh art, but instead invented a new preservation technology for existing performances (like theatre and dance). In this view, art is fundamentally a human process: the musical performance isn’t an imperfect approximation of the score, but rather the score is an imperfect approximation of the musical performance. A sculpture is evidence of the sculptor’s chiseling. A novel is the record of the author’s careful choice of words and story beats. Even a painting involved a performance (the selection of paints and manipulation of the paintbrush) that we don’t witness.

This introduces some additional questions. The invention of film allows a wider range of possibilities for how to present information to an audience (for example, jump cuts are not possible in the theatre) than existed prior to its invention. But filmmaking can itself be viewed as a recorded performance (the processes of direction, editing, acting, choosing lighting, etc.). The construction of a new technology can expand the range of possible performances.

French sociologist Antoine Hennion described art as a collective process, where the entire network of “mediators” that transform or distort the information (bodies, instruments, scores, spaces, techniques, institutions, etc.) comprises the art. Hennion even argues that the process of art includes the act of receiving and comprehending it. Taste isn’t passive, but rather is an active skill. Part of the art is consuming it, and this skill can be trained through deliberate practice. The amateur and the connoisseur literally perceive art differently. As McLuhan famously said, “the medium is the message”: the entire process by which the art has been conveyed affects its meaning. The process of recording grants “scale” (allowing the art to reach a larger audience), but some aspect of the art is changed. A performance carries presence, contingency, and risk that a machine-reproducible data structure does not.

The importance of process is even more salient in art centering on curation over creation. For example, a DJ might select from among existing music to create a playlist. No new data was actually created, but different playlists may still exhibit different “emergent” aesthetics based on the curation. Similarly, while many photographers make choices around composition and technical settings, often the objects they take pictures of already existed prior to any input from the photographer.

The limiting example of this phenomenon is Duchamp and his “Readymades”, the most famous of which is his “Fountain” (pictured), a urinal signed “R. Mutt” and turned on its side. A Readymade involves minimal creation and is instead almost entirely based on curation and institutional process (even with an object as ugly, unhygienic, and purely functional as a urinal).

So art involves both artifacts and processes. Processes (whether performative co-production, active taste, or institutional framing) bear art from artist to audience. Durable artifacts may form as steps in these processes, enabling scalable transmission across time and space.

Functions of Art

Given that art involves both durable artifacts and ephemeral processes, we can now ask: why create art at all? Setting aside the choice of urinal itself, why was Duchamp entering anything into an art show? And why have art shows in the first place?

Overview and Definition

Evolution tends to eliminate costly behaviors that provide no adaptive benefit. Art’s universality and costliness suggest it serves some adaptive function. We’ve already discussed art as a process (mediated performances) that sometimes leaves behind durable, scalable artifacts. And if art is process, then asking “what is art for?” is also asking “what is this social process for?”

Primary Functions

1. Communication, Signalling, and Coordination

In this world, nothing causes true anxiety except death and status. - Procopio

Mediated performances, where an artist encodes and transmits information across mediators to an audience that then decodes it, are by definition a type of communication. So let us start by investigating communication as a social process. What are the evolutionary benefits of communication?

1a. Natural Selection and Communication

Ultimately, natural selection is concerned with persistence. The most “primitive” type of selection is inspection bias: if we inspect a sample from a distribution of objects with variable lifetimes, the sample overrepresents those with longer lifetimes. If this sampling occurs repeatedly, the effect is concentrated.
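A small worked example makes the inspection-bias arithmetic explicit. This is a sketch under stated assumptions: two hypothetical object types created at equal rates with lifetimes 1 and 9, and a reading of “repeatedly” in which each round reweights the surviving mix by lifetime again. The function name `inspected_shares` is made up for illustration.

```python
from fractions import Fraction

def inspected_shares(lifetimes, birth_shares, rounds=1):
    """Share of each type seen at a random inspection time.

    An object is observed in proportion to how long it exists, so each round
    of inspection reweights the population by lifetime (length-biased sampling).
    """
    weights = [share * life**rounds for life, share in zip(lifetimes, birth_shares)]
    total = sum(weights)
    return [w / total for w in weights]

lifetimes = [1, 9]                                 # short-lived vs long-lived type
birth_shares = [Fraction(1, 2), Fraction(1, 2)]    # created at equal rates

once = inspected_shares(lifetimes, birth_shares, rounds=1)
twice = inspected_shares(lifetimes, birth_shares, rounds=2)
print([str(s) for s in once])    # long-lived type is 9/10 of what we see
print([str(s) for s in twice])   # after a second round it is 81/82
```

Even though both types are created equally often, a single inspection already sees the long-lived type nine times out of ten, and repeating the filter pushes the sample further toward persistence.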

How does communication enable persistence? One way is via coordination, which enables collective action. Organisms that act collectively may be able to pool resources, specialize, or act more efficiently, enhancing survival. A second way is via replication and persistence of the information itself, outside the original organism. Communication allows adaptive information to persist and accumulate across generations without waiting for genetic encoding¹². Otherwise, organisms would have to learn from scratch each generation. Both mechanisms are downstream of communication’s basic function: making one agent’s information available to another.

What kind of information can be transmitted via art? I’d sort these into two categories: “affective” and “ideological”¹³.

“Affective” or “phenomenal” information describes “what it’s like to be” another agent. In his 1897 essay “What Is Art?”, Tolstoy suggests that art’s function is primarily to transfer emotional content (pleasant or unpleasant) from the artist to the audience, saying that “art begins when one person, with the object of joining another or others to himself in one and the same feeling, expresses that feeling by certain external indications.” Tolstoy goes on to say that art, like speech, “serves as a means of union among [people]”. Similarly, in The Principles of Art (1938) Collingwood argues that art is the clarification and expression of emotion, though he claims that the artist discovers what they feel through the process of creating the art.

One issue with purely affective theories is that artistic creators may engineer in the observer an emotion or a belief that they themselves do not experience or subscribe to, often in an attempt to control the observer or induce a specific behavior. That is, art is not inherently “true”: an artist may lie or behave strategically. Battleship Potemkin (1925), while art, is designed to evoke solidarity with the revolutionaries opposing Tsarist oppression. More innocuously, the creators of movie posters or brand designers may use emotion as an instrument to incentivize purchases, even if they themselves do not enjoy the product.

This brings us to ideological art. The goal of some art is not (or is not only) to make the audience feel something, but also to make them believe something. This could be about history, religion, morality, or politics, among others.

Recognition of the power of art to shape belief goes back at least to Plato, who wanted to censor poets. In The Republic (especially Books 3 and 10), Plato argues that poetry’s capacity to implant convictions through vivid imitation (mimesis) makes poets dangerous. Homer’s epics, for instance, portray gods as petty and immoral, and portray heroes as driven by passion over reason. Plato feared that this would lead listeners to accept flawed models of virtue, justice, or the divine. Similarly, Byzantine or medieval Christian panels were designed not just for devotion but also to convey specific theological doctrines, like the divinity of Christ or the intercession of saints.

Ideological art exploits the communicative link between creator and receiver to transfer convictions, which may be held genuinely by the artist or deployed strategically. Such art feeds into broader social coordination (which we will explore shortly) and aligns entire groups around shared beliefs (as with state propaganda or national epics). This distinguishes it from purely affective art, though the two often intertwine: belief is harder to instill without emotional resonance.

1b. Sexual Selection and Signalling

In Darwin’s 1871 book, The Descent of Man, and Selection in Relation to Sex, he introduced the concept of sexual selection. Sexual selection favors traits that enhance mating success despite not necessarily favoring survival. In Geoffrey Miller’s book, The Mating Mind (2000), he argues art signals fitness: the capacity to create complex, novel, aesthetically compelling work demonstrates intelligence and creativity. In fact, simply observing that the art is there indicates that the creator had surplus resources to conduct the performance, whether that’s excess energy for a bird to perform a mating dance or excess economic and social capital to produce a film¹⁴. The costliness makes the signal more honest, as you can’t spend resources you don’t have¹⁵.

A common theory of sexual reproduction is that its evolutionary function is to exaggerate genetic variance through genetic recombination, producing more diverse phenotypes. Less remarked upon is that the incentives of sexual selection also favor increased variance: mate choice (on behalf of females) creates pressure to stand out from competitors (on behalf of males), increasing variance¹⁶. In fact, as Richard Prum argues in The Evolution of Beauty, runaway selection can produce arbitrary preferences. Aesthetics can be self-reinforcing, and beauty doesn’t have to track useful traits (at least in the short-term).

Consider an illustrative example: a population of men and women, where the men vary in penis length and the women vary in penis size preference. Having a larger penis is detrimental to long-term fitness, as growing and maintaining a large penis requires additional resources, like energy. However, suppose due to random variation there is a slight preference among the female population (or a subpopulation) for larger penises. This will bias the descendant men to have larger penises, as the large-penised men will have a higher probability of reproducing. Since the women with the strongest preferences for large penises will tend to breed with men with larger penises, the daughters of the largest-penised men will tend to have strong preferences for long penises. After many generations, we can expect both long-penis-having and long-penis-preferring to increase in the population, perhaps until those characteristics become detrimental to fitness¹⁷. Many similar examples exist in biology, such as peacocks’ tails and bowerbirds’ nest construction. Runaway sexual processes are also well-documented in stag beetles (mandible size), Irish elk (antler span reaching maladaptive extremes), and numerous bird species.

However, a large penis is not art. Being subject to sexual selection is not sufficient to make a signal “art”. This is why we include “extrabiological” in our conditions for art. There are behavioral exceptions as well. For example, many sports can be instrumentally validated¹⁸, so they are not art even if used in sexual signalling. Similarly, economic success in and of itself is not art, even if it may be employed to construct sexual signals.

Our example also shows how the act of interpreting a signal is itself part of the sexual selection process. Preferences propagate alongside the relevant trait. By analogy, this also applies to art: appreciating art is itself a signal in the signalling game.

Good taste indicates you can distinguish “good” from “bad” art, which plausibly correlates with intelligence, social awareness, and reasoning ability. And good taste and good art are mutually reinforcing. If we assume “good art” is art with high-fitness meaning (sexual or natural) while “bad art” carries low-fitness, then good taste signals overall mate fitness through the ability to detect the relevant signals well. Good taste shows you can identify true art; choosing true art shows you have good taste.

However, signalling doesn’t occur in a vacuum. Standing out requires differentiation from the local context, not just absolute quality. So a signal’s value depends on the current distribution of signals. This explains why artistic norms vary across cultures and eras. Art is relative and multidimensional.

In sexual selection, signalling is typically just to a potential mate or mates. But humans also signal to groups, and groups signal to other groups. For instance, a cathedral signals not just individual piety but also collective wealth, coordination capacity, and devotion. Art can be used as a general social coordination mechanism.

1c. Group Selection and Social Coordination

If art is relative, then for any specific piece we have to ask who it’s “good” for, and over what timeframe. Art that enhances individual mating success may conflict with art that enhances group cohesion. Art that coordinates a subculture may alienate the mainstream (or vice-versa). These conflicts are expected: multi-level selection produces competing pressures, and what counts as “good” depends on the level being optimized¹⁹.

Beyond sexual signaling, art serves broader social coordination. We discuss films, share reviews, and debate rankings not just to inform but to align preferences and identities. As Bourdieu says:

Taste classifies, and it classifies the classifier. Social subjects, classified by their classifications, distinguish themselves by the distinctions they make, between the beautiful and the ugly, the distinguished and the vulgar, in which their position in the objective classifications is expressed or betrayed. — Pierre Bourdieu, Distinction (1979)

The “art” you make or claim to like marks your identity socially²⁰. Similarly, the art you dislike marks your identity socially. As Bourdieu says:

Taste is first and foremost distaste, disgust and visceral intolerance of the taste of others. - Pierre Bourdieu, Distinction (1979)

For example, the internet widely despises certain bands, like Nickelback, who have achieved widespread popular success. By hating Nickelback, Nickelback-haters signal their non-mainstream tastes. Negative coordination is at least as powerful as positive coordination for drawing group boundaries, and possibly more so, as disliking popular things early carries higher risk of social exile and thus may signal more independence. This leads to a type of “coordination game” where agents are attempting to anticipate the current and future tastes of others.

Keynes’s Beauty Contest

Successful investing is anticipating the anticipations of others. - John Maynard Keynes

Keynes considered coordination games of this nature in his 1936 book²¹, using the metaphor of the “beauty contest” (allegedly based on a real practice in British newspapers).

In the original beauty contest, readers were asked to select the prettiest faces from a set of photographs. The prize went to the participants whose selections matched the most popular selections across all participants. Winning therefore involves anticipating others’ preferences rather than expressing your own, which requires some degree of “social metacognition”.

Simple versions of the beauty contest are empirically testable. For example, consider the game “guess 2/3 of the average”. If you make the “zeroth-order” assumption (that everyone else chooses uniformly at random from the list of numbers), then your prediction of the average is 50, so you should make the first-order guess of ~33. If you assume everyone else is making the first-order prediction, then your prediction of the average is ~33, and you should make the second-order guess of ~22. This process can be repeated (giving a Nash equilibrium guess of zero). This thought experiment (literally a beauty contest) can be extended to art. Level 0 is naive aesthetics (“I like this”), level 1 is “others will like this”, level 2 is “others will predict others will like this”, and so on. In the limit, players converge on a Schelling point: the choice that’s salient because everyone expects everyone else to choose it.
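
The iteration above is easy to check numerically. Here is a minimal Python sketch, assuming the naive level-0 average of 50 and the 2/3 factor from the text. The function names (`level_k_guesses`, `winning_target`) and the example population mix are hypothetical and purely illustrative, not empirical data.

```python
def level_k_guesses(naive_average=50.0, factor=2 / 3, levels=10):
    """Guess at each reasoning level; level 0 is modeled as the naive average."""
    guesses = [naive_average]
    for _ in range(levels):
        # A level-k player best-responds to a crowd of level-(k-1) players.
        guesses.append(factor * guesses[-1])
    return guesses


def winning_target(player_guesses, factor=2 / 3):
    """The guess closest to this value wins: factor times the actual average."""
    return factor * sum(player_guesses) / len(player_guesses)


guesses = level_k_guesses()
for level, guess in enumerate(guesses[:4]):
    print(f"level {level}: {guess:.1f}")

# Hypothetical population (an assumption): 40 level-0, 40 level-1, 20 level-2.
population = [guesses[0]] * 40 + [guesses[1]] * 40 + [guesses[2]] * 20
print(f"winning target for the mixed population: {winning_target(population):.1f}")
```

Under these assumptions each level shrinks the guess by a factor of 2/3, so the sequence heads toward zero, while the target for a population of mostly shallow reasoners stays well above it.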

It’s important to note that in practice, the winning guess is likely not zero, as the actual distribution of guesses depends on the level of strategy actually used among the general population. The game extends to a metagame: players are judged not only on their choices but on their strategies. If a player behaves too “strategically”, it seems “fake” and the behavior is punished. Level 0 grounding is needed to seem “authentic”.

Based on the game, we now have two different definitions of beauty. On the one hand, we have the individual definition of beauty, based on “pure aesthetics”. On the other hand, we can define beauty socially, as whatever wins the beauty contest. But this invites analysis of the social domain. Who constructs the beauty contest, who participates, and who judges?

Institutions

In The Construction of Social Reality (1995), John Searle argues that coordination can create entirely new ontological phenomena. “Collective intentionality”, he writes, “is a biologically primitive phenomenon”. Through collective acceptance, groups bring “institutional facts” into existence, like money, property, and marriage. A piece of paper becomes money not through any physical process but through collective agreement. However, in context there are still “objective facts” about money (for example, how much money someone has in their bank account). Similar social processes affect art: collective acceptance turns certain artifacts into “great art.”

George Dickie’s Art and the Aesthetic: An Institutional Analysis (1974) offers the most extreme version of this argument, claiming that “art is whatever an ‘artworld’ presents as art.” The “artworld” is simply an institution, and there is no essence of art beyond institutional recognition. We are once again reminded of Duchamp’s “Fountain”, a mass-produced urinal, signed with a pseudonym and placed on a pedestal. It functions as art purely through institutional nomination.

This framing helps explain the social machinery around art. Artists, critics, and curators gain status by influencing what the group coordinates on. In some ways, institutions are defined by what art (or other signals) they coordinate on²². A gallery is differentiated by its exhibits and a canon is differentiated by what it includes.

State Coordination and Weaponized Aesthetics

The coordination function of art has not escaped the attention of the state. If art shapes what groups believe and coordinates around, then controlling art is a lever of power.

During the Cold War, the CIA embarked on a well-documented project of artistic control. Frank Wisner, head of the Office of Policy Coordination, described his propaganda apparatus as “the mighty Wurlitzer”, imagining his program as an organ capable of playing tunes across the world. Through fronts like the Congress for Cultural Freedom, the CIA covertly funded literary magazines (Encounter, Partisan Review), art exhibitions, symphonic tours, and academic conferences. Abstract Expressionism was promoted internationally as evidence of American freedom and creative individualism, deliberately contrasted against Soviet Socialist Realism.

The explicit goal of this project was to coordinate Western intellectuals (and wavering non-aligned intellectuals) around meanings favorable to American interests and conduct information warfare against the Soviets. The program argued that the West represented creative freedom and that Marxism was artistically sterile.

A different model of state involvement in art appeared in post-revolutionary Mexico. The Muralist movement (Rivera, Orozco, Siqueiros) was explicitly commissioned by the state to construct a national Mexican identity out of disparate cultural groups. Education Minister José Vasconcelos funded monumental public murals depicting Mexican history, indigenous heritage, and revolutionary ideals. The goal was to coordinate a fractured post-revolutionary population around shared meanings and to make “Mexican national identity” real by giving it visible, public, unavoidable form. Unlike the CIA’s covert operations, the Mexican project was explicit and state-sponsored without disguise. The murals were designed to teach the (often illiterate) population the state’s desired historical narrative.

To what degree does state-sponsored art persist outside its sponsoring context? On the one hand, Soviet Socialist Realism has not fared well in the post-Soviet canon. On the other, Mexican Muralism has fared better. This may reflect genuine artistic quality, or it may reflect different power dynamics in how art history has been written.

Institutions can persist or fail; the survival of the art and survival of the institutions are linked. Some “canons” endure for millennia; others are forgotten within a generation. If short-run art value is whatever a group coordinates on, what determines which coordination equilibria persist in the long run?

2. Ontological Research

We have played fast and loose with the word “good” in relation to art. But what do we mean by “good”?

In the short run, artistic value is whatever the group converges on. But if value were entirely socially constructed, all art would be equally valid. This is clearly not true, as some artistic ideas survive centuries, while others are quickly lost or forgotten. What determines why some information persists in societies, and other information does not?

The answer is that art does something beyond simply coordinating: it encodes “true” information about reality. Even practices with false explicit justifications can persist if they confer adaptive advantage. For example, ritual child sacrifice during famines, however horrifying, can function as population control or signal commitment. For these reasons, groups practicing child sacrifice may outcompete those that don’t, in some sense “justifying” the sacrifice. Similarly, Leni Riefenstahl’s Triumph of the Will was effective at coordination despite leading to heinous outcomes. Groups that coordinate on “useful” information (information that helps the group model the world and cohere socially) persist better than groups that coordinate on less “useful” information. In the long run, cultural selection²³ filters for art that tracks “truth”.

“Truth” comes in multiple non-equivalent senses. Ontological truth concerns the world’s actual structure (physical existence, cause-and-effect, invariants) whether or not anyone is there who can articulate that structure. Epistemic truth is about whether the specific claims a work advances correspond to reality. Pragmatic truth is about evolutionary utility: a belief or practice is “true” in the sense that it improves persistence, even if the explicit content of the knowledge is epistemically or ontologically false²⁴.

The view that art may represent some “truth” about the world stretches back at least as far as Plato. Plato argued that true reality consists of eternal Forms. He was suspicious of art, claiming that since art was copied from physical objects, which were in turn imperfect copies of Forms, art was thrice-removed from Truth, rendering it a poor form of inquiry.

We previously explored Borges’ metaphor of The Library of Babel. Almost all books are noise, but somewhere in the stacks are “texts” (or other artwork represented as strings) that present genuine truths about reality, predict the future, etc., or at least present them in compressed form.

Why do compressed ideas tend to be beautiful? One answer²⁵ is that beauty is compression. That is, we find things pleasing when they help us compress our model of the world. But in the framing of this essay the causation runs the other way. Compressed ideas are easier to transmit, remember, and coordinate around. A compact formulation spreads faster than a sprawling one. Compression correlates with persistence, and the ideas that persist are the beautiful ones. The aesthetic preference for elegance is downstream of transmission dynamics²⁶.

Regardless, the problem is finding the texts, evaluating them, and ultimately agreeing on them.

Grounding

If long-run selection filters for useful information, how does this filtering actually work? The utility of a belief or practice may not be apparent for generations. A group might coordinate on a harmful idea and not discover the cost until it’s too late.

Several mechanisms help close this gap:

Proxies

Taste intuitions evolved to track utility without computing it directly. If your ancestors who preferred certain landscapes survived more often, you inherit that preference as a felt sense of beauty.

Cross-group observation

Groups can observe which other groups thrive and imitate their practices. A canon that persists across multiple independent cultures is more likely to encode genuine truth than one confined to a single group.

Nested selection

Selection operates across all groups and timescales simultaneously. Within a group, individuals compete for status by predicting future consensus. Across groups, cultural packages compete for adoption. Across generations, biological evolution shapes the taste machinery itself. Faster loops provide feedback to slower ones²⁷.

Over many generations, these mechanisms select for individuals with good taste intuitions and for the preservation of objects and texts those individuals create. Groups also develop meta-taste, such as judgment about which curation mechanisms to trust, which preservation traditions to maintain, which critics to follow (or at the very least, the bad ones are selected out).

Can coordination itself create ontological depth where none existed? Searle argued that collective intentionality creates institutional facts. Perhaps art works similarly: collective acceptance doesn’t just recognize value but instead bootstraps it into existence. The canon becomes real because we treat it as real, and treating it as real makes it function as a coordination device that actually helps the group persist.

There are real, historical examples of art instantiating cultural practices and reorganizing institutions ab initio. For example, Upton Sinclair’s The Jungle contributed to the passage of major U.S. food safety laws in 1906. More recently (and weirdly) the movie Spectre (2015) depicted a Mexico City “Day of the Dead” parade, and the city subsequently created a real parade beginning in 2016. Fiction can seed tradition. Successful works become shared reference points, which shift coordination, which shifts policy and practice.

If aesthetic pleasure is a heuristic for utility, why do we sometimes coordinate on art that makes us depressed, nihilistic, or self-destructive? Does the theory account for art that hacks the pleasure heuristic without providing ontological benefit? I would argue yes. First, unpleasant art can still encode ontological truth. Tragedy, horror, and nihilistic fiction may accurately model ideas such as mortality or betrayal. Facing these truths may be more adaptive than ignoring them. Second, consuming difficult art signals differentiation and resilience. If most people avoid confronting hard truths, those who seek them out signal cognitive toughness and independence. Third, some art may genuinely be parasitic. Superstimuli exist in other domains (junk food, pornography, gambling), so it stands to reason there could be “parasitic” art as well. Selection is slow and imperfect, so parasitic art can exist in equilibrium, especially if its harms are diffuse or delayed. Finally, harm may operate at different levels. As we’ve discussed, art that damages individuals may still benefit groups (martyrdom narratives, sacrifice myths), or vice-versa. What looks parasitic from one level may be functional from another.

Search Process

How does new “true” art enter the canon?

1. An artist finds a text outside the current consensus.
2. Early adopters recognize it, taking reputational risk by endorsing something unproven.
3. If the text spreads and becomes a new coordination point, the early adopters gain status.
4. As adoption increases, the signal degrades. “Everyone likes it now” means liking it no longer differentiates you.
5. Status-seekers must find new true texts to distinguish themselves.
6. Return to step 1.

This is kind of a “high-dimensional” version of the Keynesian beauty contest.

This mechanism explains why avant-garde art is polarizing by design (high variance means high expected status payoff for correct bets), the power of critics and curators (they offload risk onto others while reaping rewards for correct calls), and why AI-generated art feels “cheap” (zero risk taken in production, so low signaling value).

Signals naturally degrade, hence the Red Queen race of getting “ahead of the curve”, the behavior of hipsters²⁸, etc. Good taste involves predicting future consensus. While it may correlate with other desirable mental properties (openness, political views, etc.), it is presumably also valuable in itself, as it predicts what the group wants now and in the future, which is key to leading a group. Similarly, if taste defines the group in some way, then very poor taste could result in exile (or death). Naturally, the successful artists and tastemakers will rise in status²⁹³⁰³¹.

Agent’s Perspective

From the artist’s perspective, the layers we have considered are all blended together. A creator is simultaneously (a) following a local aesthetic gradient, (b) considering the audience, (c) placing a reputational wager, and sometimes (d) trying to compress something real about the world into transmissible form.

Why Disagreement Persists

If long-run selection filters for useful art, why does taste vary so widely?

A few possible hypotheses. First, division of labor. Groups benefit from having members with heterogeneous taste. Some may favor novelty, while others favor tradition. A group of pure novelty-seekers would lose accumulated wisdom, while a group of pure traditionalists would fail to adapt. Variance in taste is itself adaptive.

Second, usefulness depends on context. Art useful for a warrior caste (glorifying honor, sacrifice, or physical prowess) differs from art useful for a priestly caste (emphasizing contemplation, transcendence, and textual authority). Subgroups within a society may correctly coordinate on different art for different functions. Disagreement across niches is specialization.

Third, ongoing search. The space of possible texts is vast, and which texts are “true” depends on the current situation. Disagreement is part of the exploration mechanism.

Art and Science

Art and science are closer than they appear. Both are collective processes for discovering and coordinating on “true” information. Both operate through institutions that canonize some contributions and forget/ignore others. Both advance through individuals who break existing conventions and (if vindicated) reshape the consensus.

In The Structure of Scientific Revolutions (1962), Thomas Kuhn argued that science doesn’t progress through steady accumulation but instead through “paradigm shifts”: periods of “normal science” punctuated by revolutionary breaks that reorganize the entire field. The same dynamic appears in art.

Most artistic production is competent work within established conventions (“normal art”). Occasionally, someone produces work that violates those conventions in a way that others come to recognize as revelatory rather than merely deviant. If the break succeeds, it becomes the new convention. If it fails, it’s forgotten or dismissed as incompetent.

Crucially, “good art” breaks artistic conventions, not necessarily political or moral ones. Transgression for its own sake (shock value, provocation) is not the same as genuine innovation. The test is whether the break opens new expressive or coordinative possibilities that others can adopt and build on. Duchamp’s urinal was revolutionary not because it was offensive but because it revealed something about the institutional structure of art itself. A merely offensive urinal would have been forgotten.

3. Direct Aesthetic Experience

We have investigated the relationship between social coordination and ontological research. Let us now consider the relation to direct aesthetic experience. Why do we experience “beauty” at all? Why do we enjoy seeing a well-composed image, or feel discomfort at hearing a dissonant chord?

Some thinkers treat aesthetic experience almost as a type of drug, referring to the experience of art as a “disinterested pleasure” (Kant), an “aesthetic emotion” (Beardsley), or an “intensified experience” (Dewey). Perhaps these reactions could also be extended to explain the creation of art (although many artists seem to view their art as labor). But why would we have these innate reactions?

Denis Dutton’s The Art Instinct (2009) offers a better answer. In it, he argues that aesthetic pleasure is an evolved heuristic. We find certain landscapes beautiful because the ancestors who preferred such landscapes were more likely to survive. We find symmetrical faces attractive because symmetry correlates with developmental health. We enjoy narrative because tracking social causation was essential for navigating coalition politics. Conversely, our disgust reaction to certain types of art signals long-run evolutionary disadvantage. In this account, aesthetic pleasure and displeasure are compressed, preconscious signals that information is likely to be adaptively useful.

If art is information evaluated through taste rather than direct verification, then taste must track something real, otherwise groups relying on it would be differentially outcompeted. Aesthetic pleasure guides individuals through the coordination game without explicitly computing fitness consequences.

But heuristics are imperfect. Ideas can be beautiful and wrong. Pleasure is only an individual proxy, not a guarantee. This is why the social machines around art and other signals exist.

Aesthetic pleasure is the base layer. Social coordination amplifies and filters this signal, and long-run selection pressure ensures (slowly and imperfectly) that what we find beautiful tends to track what actually helps us persist.

Conclusion

In the short run, art is a coordination game: value accrues to whatever the group converges on. On medium timescales, “artistic entrepreneurs” (artists, critics, curators) compete for status by anticipating and shaping future coordination. In the long run, the groups that coordinate on “useful” signals (that best help model reality and cohere socially) tend to persist; the art that persists is therefore the “good” art. Aesthetic pleasure is the evolved heuristic that lets individuals navigate this process without computing it explicitly.

Art, then, is ontological research conducted through social coordination and experienced as beauty.

Additional Content

Major Open Questions

I already compared this process to science, but there’s no reason many of the mechanisms can’t apply to various other socially defined processes or symbols (like legal concepts, status of specific people, etc). To what extent can these be delineated?
On that note, how do specific institutional architectures vary when producing different types of art, and how do specific institutional architectures vary for producing other types of social information? Can we engineer institutions to produce the desired effect?
This is a very broad and abstract theory. Similar to evolution, it likely fails to produce mesoscale or microscale explanations (Why did this artistic movement arise? Why was this particular piece of art formed?). This theory would probably admit multiple hypotheses for these questions. Is there a more granular theory that can be tested empirically and used instrumentally?
Is there an alternate theory of art, completely alien to this one? One vague idea I’ve seen kicking around (but not fully developed) is a kind of “financial” theory (think auctions, NFTs, tax evasion, etc).
Not all evolutionary theorists agree that art is directly adaptive. Stephen Davies, in The Artful Species (2012), argues that art may be a byproduct of other adaptations (language, imagination, social cognition, play) rather than selected for its own benefits. On this view, we make art because we have big brains that evolved for other reasons. The byproduct view and the adaptive view are not mutually exclusive. Art could have originated as a byproduct and subsequently been recruited for adaptive functions (coordination, signaling, ontological compression). Once art existed, groups that used it well would outcompete groups that didn’t, even if the initial capacity was incidental. The stronger claim of this essay is that art is now under selection pressure regardless of its origins. Whether the capacity for art was a target of selection or a side effect, the use of art is clearly functional, and that function shapes which art persists. Byproduct origins would explain why the art instinct is imperfect and hackable, while adaptive function explains why it’s structured and convergent.
We are still missing a lot of the mechanism. How are texts evaluated, for example? Are there particular functional forms? Could we implement a working model in code?
What does a “beauty contest” look like across high dimensional embeddings or encodings?
Different institutional architectures produce different selection dynamics. For example, centralized curation (academies, state patronage) has faster convergence, risk of capture, and less exploration. Market-based curation has more exploration, risk of pure popularity-tracking, and winner-take-all dynamics. Decentralized prestige (peer networks, critical communities) has intermediate properties. How does this affect the art produced? Can we tell?

Implications & Predictions

If the preceding analysis is correct, what should we expect as technology (especially AI) reshapes the conditions of artistic production, distribution, and coordination?

Falling Costs

These values are falling simultaneously:

Cost/time required for production.

AI can now generate text, images, music, and video at near-zero marginal cost. This weakens signaling via artifacts as technical impressiveness no longer signals fitness.

Cost/time required for search.

For most of human history, the bottleneck was preservation (most art was lost: monks used to spend enormous time and resources copying manuscripts by hand). Now the bottleneck is search. The problem is now finding the good books in the Library of Babel. Discovery arbitrage (finding hidden gems before others) may collapse as search tools improve.

Time for a signal to diffuse and opinions to “equilibrate”.

The internet accelerates how fast groups can converge on (and abandon) consensus. Signals degrade faster. The result is a permanent Red Queen race: by the time something is widely recognized as good, the status value of recognizing it has already dissipated.

Value Migration

We should expect value to accrue to the constraint. As production and discovery become commoditized, value migrates up the stack:

Object-level taste: Which works are good? (Increasingly automated)
Meta-taste: Which curators are good? (Still requires judgment)
Meta-meta-taste: Which curation mechanisms are good? (Emerging frontier)

Alternatively, the signal migrates to something AI can’t (yet) fake:

1. Consistency over time (hard to fake at scale)
2. Physical presence (performance, live art)
3. Relational (I trust your taste because I know you personally)

In the limit, value may shift entirely to identity: I trust you because you’re you, not because of any specific judgment you’ve made.

It’s also possible that value may collapse entirely. If anyone can find any text, finding texts is worthless. Artistic signalling will become too noisy and humans will increasingly divvy up status through “games” with instrumental, empirical outcomes (sports with scores, Twitter likes, etc.).

One (outlandish) thought: in the cultural saturation essay we discussed Onsager-like vortices. If individuality ceases to provide signalling information, it’s possible that the only remaining signal will be conformity. This will cause an inversion in the game, and a type of “emergent structure” as people seek to gain status by rushing to conform.

Another possibility: if material wealth is mostly solved due to AI and economics, and money doesn’t matter, then human society will be reduced to a popularity contest. How many views and followers someone has will entirely determine their worth. (Maybe this has already happened).

Cultural Variation

This framework should explain cultural differences related to respect for tradition vs. novelty. Groups that historically faced stable vs. volatile conditions selected for different preservation/innovation ratios. For example, cultures from stable environments (e.g., long-settled agricultural societies) should value tradition, canonical texts, and respect for elders’ taste, while cultures from volatile environments (e.g., frontier societies, frequent disruption) should value novelty, young tastemakers, and rapid fashion cycles. This seems to somewhat match rural/urban political divides? Similarly, established institutions should be tradition-oriented, while marginal/startup institutions should be novelty-oriented.

AI Evolution

Since AI can falsely point to texts, signs associated with texts that AI points to will initially decline in status. However, the “highest status” AI companies will survive. Over time, there is thus an evolutionary selection mechanism such that AI will eventually align with human tastes.

Edge Cases & Puzzles

Let’s (ask ChatGPT to) generate some edge cases and questions that don’t fit neatly into the framework and attempt to justify them:

Outsider art. Consider cases like Van Gogh or Henri Rousseau, untrained creators who toil in obscurity and are only acknowledged late in life or posthumously. Why do they do this? And why the delay in discovering them? It is easy to explain why they become popular posthumously (they are co-opted by others for status later).
Why do children make art? Children have minimal coordination sophistication. However, it could be for private pleasure (biologically ingrained), for parental approval, as “practice” or a “test drive” for later artistic endeavors, or to help the child coordinate across time with their own future selves.
Why do people enjoy art privately? A few ideas:

Private enjoyment trains the taste module for future coordination games.
Taste coordinates with your future self. Good art creates stable preferences over time and thus predictable internal common knowledge (“I know I will still value this in the future”).
Aesthetic pleasure evolved for survival-relevant stimuli (symmetrical faces, fertile landscapes) and now serves a purpose vestigially.

Biophilia. No artist, no pointing, no social game, yet we find mountains beautiful. Explained via Dutton and pure natural selection.

Appendices

Appendix A: Ostensive Definition of Art

Let’s take our definition and try to delineate between art and not art (especially among the edge cases).

Paradigmatic Art

Painting, music, novels, film, sculpture, dance, theatre, poetry, opera, etc. Evaluated almost entirely through taste mechanisms.

Clear Art, Sometimes Contested

Video games, fashion, cuisine, graphic novels, graffiti, advertising, product design. These are sometimes excluded from “fine art” discourse for institutional or historical reasons, but they fit the functional definition: extrabiological information evaluated primarily through taste.

Hybrid Cases (Art + Instrumental)

Architecture (must be structurally sound and beautiful), industrial design (must function and please), rhetoric (must persuade and move). The artifact has verifiable constraints, but significant latitude remains for taste-mediated evaluation.

Art in Process/Curation

DJing, playlists, photography of found objects, Readymades, anthology editing. Minimal creation; the art lies in selection, framing, and institutional presentation.

Edutainment

Documentaries, infographics, educational games. The information transmitted is verifiable; the choices of presentation are where taste operates.

Taste-Engaging Non-Art

Natural landscapes (no creator, no social process—pure evolved heuristic). Mathematical proofs (correctness is verifiable, but choice of proof is aesthetic). Athletic performance in objectively-scored sports (outcome isn’t art; form still engages taste).

Clear Non-Art

Engineering solutions evaluated purely by function. Scientific data. Sports scores. Tax returns. These are evaluated instrumentally, not through taste.

Changelog

2026-03-01: Added “The Missing Senses” subsection covering cuisine (gustatory), perfumery (olfactory), tactilia (haptic), thermae (thermoceptive), and exotic sensory arts (interoceptia, proprioceptia, nociceptia, vestibulia), and why they resist canonization. Reorganized Artifacts section into three parallel subsections (Data Types, The Missing Senses, Performance).

AI Disclosure

I used Claude to help research and edit this essay.

Footnotes

“Selection pressure” here refers to both cultural and biological. The pressure is primarily cultural selection, which filters which meanings and practices spread between groups on a short timeframe. Biological evolution, which more slowly alters the underlying taste and learning machinery based on how those cultural patterns affect survival and reproduction, is secondary.↩︎
See the appendix for my attempt to delineate between art and non-art.↩︎
This essay omits any financial theory of art.↩︎
The “fine arts” are typically distinguished from the seven “liberal arts”, which split into the trivium (rhetoric, grammar, and logic) and the quadrivium (astronomy, arithmetic, geometry, and music).↩︎
There is also some art that fits these categories in ways that don’t match up with the names. For example, a programmatic sequence of rhythmically flashing LEDs would presumably fall under “film”.↩︎
The “organ” (the perfumer’s palette of raw materials) typically contains 500-2000 ingredients, and a finished fragrance may use 30-80. The structure of a perfume (top notes that evaporate in minutes, heart notes that last hours, base notes that persist for days) could also make some scents time-based.↩︎
One could argue thermae are a subcategory of architecture (designed environments). But the primary aesthetic dimension is thermal and somatic, not visual. A great bathhouse with bad water is a failure; a bare concrete room with perfect water temperature and mineral content can be sublime.↩︎
A tea ceremony combines cuisine, perfumery, tactilia (the feel of the bowl), thermae (the warmth), and interoceptia (the meditative state) into a single integrated experience. The ceremony is evaluated almost entirely through taste, not through any instrumental metric.↩︎
Whether nociceptia is a subcategory of tactilia or its own thing is debatable. The distinction matters if you think the aesthetic of pain is qualitatively different from the aesthetic of touch, which most people who have experienced both would affirm.↩︎
In some cases we have partial substitutes, like recipes or perfume formulas. But these still require labor on the part of the consumer. This is changing slowly. Electronic noses, flavor profiling, headspace capture technology, and scent diffusion devices are primitive recording and playback systems for smell. If a reliable “scent codec” were developed, perfumery might undergo the same explosion that music did after the phonograph.↩︎
“Sport” is also a type of performance. Is sport art? I think the answer is: sometimes. Sport is about finding the fundamental limits of the human body. Some sports are adjacent to art (figure skating, diving), especially the ones judged subjectively. Sports with purely objective scores are probably not art. That being said, spectators might find certain running styles more beautiful than others, and athletes do seem to care about form beyond pure optimization. The artifact itself (the score/outcome) isn’t art, but the process still engages taste mechanisms. Similarly, the design of sports (modulo economic considerations) is art.↩︎
Genes are themselves a type of communication, but this is outside the scope of this essay.↩︎
We can also ask if art can transmit “factual” information. Are documentaries, infographics, educational games, or other forms of edutainment art? I’d argue yes, but with an asterisk (see the appendix on the ostensive definition of art). In these cases, the art is predominately in the choice of how the information is transmitted rather than the information itself; “photosynthesis takes sunlight and carbon dioxide and produces sugar” is not art, but the choices of how to convey that information to children via a cartoon is art.↩︎
One question I plan to explore: Is there some statistical notion of “primitive signalling” akin to the inspection paradox?↩︎
With caveats. Of course, there are numerous examples both in nature and human society of false signalling. But this is out of scope of this post.↩︎
While I have seen the mechanism in Prum’s book, I haven’t seen the parallel point about variance framed specifically in these terms before (but I haven’t looked that hard). One question I have is whether the incentives (via mate selection) or the sexual reproduction came first. It feels more truthy to me that incentives would come first, but I don’t know enough about the subject to comment.↩︎
How far do we expect a trait to change based purely on this phenomenon? Is there a mathematical way to calculate this? Probably it exists in the literature but I have yet to explore it.↩︎
One potential philosophical issue is that the validation itself may be part of a social process. Resolving this is out of scope of this essay. There are attempts to deal with these delimitations in works such as Searle’s The Construction of Social Reality or Epstein’s The Ant Trap.↩︎
In my opinion, this is (or is at least deeply related to) the Fundamental Problem of Ethics: an individual may be part of a group (or groups), and the group’s preference may conflict with the individual’s. Should the individual take the best action for the group or for themself? Hopefully more on this in a future essay.↩︎
I previously discussed art and identity with respect to the Pierre Menard story.↩︎
The General Theory of Employment, Interest and Money↩︎
This may seem circular but I think these two concepts may actually be dual in some sense. We will see if I ever formalize this idea.↩︎
And possibly group selection, although group selection is controversial inside biology.↩︎
Separately, as Searle argues in The Construction of Social Reality, collective acceptance can create institutional facts. For example, canons, credentials, etc. “I have five dollars” is a fact, even if the concepts of property and dollars are socially constructed.↩︎
See, for example, Schmidhuber, Driven by Compression Progress, or George D. Birkhoff, Aesthetic Measure (1933).↩︎
This is visible in domains where content can be instrumentally verified. Mathematical proofs are not themselves art, as correctness can be checked mechanically in formal theorem provers. But the choice of proof is aesthetic. Mathematicians tend to prefer the elegant proof that reveals structure with minimal machinery. As is often attributed to Einstein: “Everything should be made as simple as possible, but not simpler.” The proof that compresses without losing truth is the one that gets taught and built upon. Compression is a side effect of selection for transmissibility.↩︎
See Gwern’s writing on the relationship between learning and evolution as complementary search processes.↩︎
This framework differs from Girard’s mimetic desire. Here, agents seek differentiation rather than converging on the same objects. However, at the meta-level, agents still imitate what others point to, so in part the search for novelty is mimetic. Squaring this circle is outside the scope of this essay.↩︎
One question I have is which way to define “status”. Is the highest status person the person with the best taste (the best at predicting future coordination), or is the “best taste” simply what the leader does? A truly “high status” person doesn’t need to signal, because the group already knows they are in charge and will coordinate on their decisions. If the group coordinates on whatever they choose, then their choice becomes “correct” by definition. That is, you stop being a “price-taker” in the taste market and become a “price-setter”. You are the Schelling point. So both the high and low status don’t signal: high because they don’t have to, and low because they can’t afford to.↩︎
There are likely multiple paths to status. For example, in a “dominance” path, you accumulate resources/power until exile from you is more costly than exile from the group, and you become the new Schelling point (i.e. threat of direct punishment). In the foresight path (or “prestige” path), you predict coordination so well, so early, so consistently that people start looking to you as the oracle and you become the Schelling point. Both end at the same place: creating common knowledge. “Everyone knows everyone knows that X matters.”↩︎
It’s possible that status is fundamentally about demand. High-status things are things that are demanded; high-status people are people who are demanded. To display status is to display that there is demand for you. This explains why some strategies (NFTs, certain dating tactics) attempt to simulate overdemand by restricting supply, even though artificial scarcity isn’t the same as genuine demand. Brands work similarly. A brand identifies you with the group that demands that product. To wear the brand is to claim membership in that group, and to see someone wearing it is to classify them as a member.↩︎

Reduction

Sat, 20 Dec 2025 05:00:00 GMT

Introduction

In this post, I introduce gauge equivalence, and also investigate a few different types of reduction under symmetry (to build out a taxonomy).

If you haven’t followed along, in the last few posts we introduced the Lagrangian in the context of geometric controls. We then proved Noether’s theorem with time and applied it to similar systems.

Epistemic status: This post is still a bit rough: these are my informal notes navigating this subject. I’m more interested (ultimately) in computation so I’m not necessarily aiming for maximal rigour, and in fact I probably need to introduce more geometric machinery (bundles, connections, differential forms, symplectic geometry) to make the exposition cleaner and more rigorous. See the read more section for more rigorous sources.

Equivalence

Before reducing anything, let’s introduce a new notion of “equivalence” (and recall one we saw before).

Gauge Equivalence

We once again consider a manifold and a system with start and end configurations .

The Lagrangian is , and the action is:

Let’s consider the adjusted action

where is some constant. Clearly, does not affect the minimizing path for .

Next, consider some function . Let , so the action becomes:

but this is equal to

Therefore, given some Lagrangian , we can add an arbitrary without changing the underlying mechanics. That is, two Lagrangians and produce the same Euler-Lagrange equations.

By analogy with our previous posts, if for some group and some , we have

we say that is a quasi-symmetry of , and that the Lagrangians and are “gauge-equivalent”¹.

If we combine this with our view of equivariance, we get:

We can consider two Lagrangians and to be “equivalent” if for some .

In discrete coordinates, this becomes

The boundary terms telescope away, leaving the actual dynamics the same.
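
The telescoping can be checked numerically. Here is a minimal sketch, assuming a midpoint-rule discrete Lagrangian and an arbitrary gauge function; the names `lagrangian`, `F`, and `discrete_action`, and the harmonic-oscillator example, are illustrative choices and not part of the library code.

```python
import math
import random

def lagrangian(q, v):
    # Illustrative choice: unit-mass harmonic oscillator.
    return 0.5 * v * v - 0.5 * q * q

def F(q):
    # Arbitrary smooth gauge function (an assumption for this sketch).
    return math.sin(q) + q ** 3

def discrete_action(path, h, gauge=None):
    total = 0.0
    for q0, q1 in zip(path, path[1:]):
        # Midpoint-rule discrete Lagrangian on one time step.
        total += h * lagrangian(0.5 * (q0 + q1), (q1 - q0) / h)
        if gauge is not None:
            # Discrete total derivative: h * (F(q1) - F(q0)) / h.
            total += gauge(q1) - gauge(q0)
    return total

random.seed(0)
h = 0.1
path = [random.uniform(-1.0, 1.0) for _ in range(50)]

S = discrete_action(path, h)
S_gauged = discrete_action(path, h, gauge=F)
boundary = F(path[-1]) - F(path[0])
print(S_gauged - S, boundary)  # the two numbers agree up to rounding
```

Because the difference depends only on the endpoints, varying any interior point gives the same discrete Euler-Lagrange equations for both actions.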

The situation should be unchanged if the Lagrangian depends or does not depend on .

The Noether charge is slightly modified in this case. We have an extra term:

And the Noether’s charge is

The proof (sketched in a later section) differs from the original in that action doesn’t equal under variation, but instead equals the variation in the total derivative.

Equivalence under Equivariance

Here we have

If we have some , really we want to work in quotient space

for .

We’ve already covered this in the dynamical similarity post so I won’t belabor it. Essentially we end up with

where . The shows up due to maps between and .

Can we look at this with respect to “general representations”, i.e. more complex characters? It seems not really: we would need generalized Lagrangians (i.e. not just a scalar), which is out of scope of this post.

Equivariant “Reduction”

Equivariance won’t help us lower dimension the way quotienting by does, since it only tells us when different-looking Lagrangians describe the same trajectories up to scaling. But it did allow us to produce new coordinates that index entire families of solutions. This isn’t “reduction” per se, but reparametrization.

What should this look like? (Presented without proof, we saw the simplified version in the last Kepler proof).

There’s some representation and character (homomorphism)

Such that

Even more generally, with and diffeomorphisms (not necessarily linear), hand-waving

This would be cool if we wanted to “transport” our solutions around between frames.

Summary

Putting it together:

And we can define some equivalence relations in terms of gauge equivalence and equivariance.

As an aside:

It is interesting to consider conditions under which the group actions compose. That is,

We know .

So we would need

This may be interesting if we ever want to classify Lagrangians.

Reduction

Now that we have some equivalence relations on , it makes sense to work in “reduced” space of Lagrangians, modulo symmetry. We’ll look at the equivalence relations above, plus others. Let’s go through each type of reduction one-at-a-time.

1. Gauge “Reduction”

As established, if such that , we write .

What’s the point of adding this extra term? Why care about it?

For some systems, we may not have “symmetries”, but by adding an extra term we can enforce a quasi-symmetry on the system.

Example

Consider the following Lagrangian on , where :

As written, the system is not invariant to rotation by .

Let

And consider new coordinates ,

We know the first term is invariant to rotation:

The second term transforms as:

If there exists a function such that

then it would be quasi-invariant. When is this true?

Rearranging, we would have to have

Since it’s true that

So we conclude it is quasi-invariant iff

A function is only a gradient if the mixed partials are the same. So (skipping one or two steps) we end up needing

Call . So this is true if is invariant under rotation.

We will need in the next section.

Noether Charge

What is the Noether charge?² Let’s compute it. The transformation is

We need , which is

Then we need .

We can rearrange the terms of the Lagrangian with gauge term to get an expression for . Differentiating the quasi-invariance condition at gives

Since we know the typical Noether charge formula along solutions, the left side can be replaced by the Noether charge:

Thus the adapted charge for the gauge:

In the example, the conserved quantity is

We just need to compute this particular (this will be relatively difficult since we don’t have a functional expression for , just some abstract criteria).

First, expanding in coordinates and plugging in the components of

Call

Now, return to our definition of :

Since

We get

We can expand the using their Taylor approximations

If we split on the coordinates and rearrange we get

Also, the mixed-partials must agree

If you do a bunch of algebra, and substitute the two equations we have for the partials of , you can get

Notice that, for , we have

So we get

And if is constant, then we integrate the partials and put them together to get .

But also

So the Noether charge reduces ultimately to

Assuming B is constant. This is the angular momentum (the mass is included in the term).

If is some other rotation-invariant function, we can integrate

to find the Noether charge.
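
As a sanity check on the constant-B case, here is a minimal sketch. It assumes the example Lagrangian has the form `L = 0.5*m*|v|^2 + A(q).v` with `B = dA_y/dx - dA_x/dy` constant (an assumption about the setup). Under that assumption the equations of motion are `m*x'' = B*y'` and `m*y'' = -B*x'`, and the gauge-independent form of the conserved quantity is `m*(x*vy - y*vx) + (B/2)*(x^2 + y^2)`. The function names and the RK4 integrator are made up for this sketch; they are not the variational integrator from the library.

```python
def rhs(state, m, B):
    x, y, vx, vy = state
    # Euler-Lagrange equations for L = 0.5*m*|v|^2 + A(q).v with constant B.
    return (vx, vy, B * vy / m, -B * vx / m)

def rk4_step(state, h, m, B):
    def shift(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = rhs(state, m, B)
    k2 = rhs(shift(state, k1, 0.5 * h), m, B)
    k3 = rhs(shift(state, k2, 0.5 * h), m, B)
    k4 = rhs(shift(state, k3, h), m, B)
    return tuple(
        s + h / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def charge(state, m, B):
    x, y, vx, vy = state
    # Kinetic angular momentum plus the magnetic correction term.
    return m * (x * vy - y * vx) + 0.5 * B * (x * x + y * y)

m, B, h = 1.0, 2.0, 0.001
state = (1.0, 0.0, 0.3, 0.5)
charges = [charge(state, m, B)]
for _ in range(5000):
    state = rk4_step(state, h, m, B)
    charges.append(charge(state, m, B))

drift = max(charges) - min(charges)
print("charge drift:", drift)
```

The kinetic angular momentum alone is not constant along these trajectories; only the combination with the `B` term is.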

Thoughts on Example

In summary:

given A term
compute , rotation-invariant
find based on the components of
this could give based on the difference between and , or just use it to compute

If there’s an easier way to do this, I don’t know what it is. This is a bit ugly because is gauge-dependent. What we really want is a symbolic way to automatically get given the gauge. In the code we can compute numerically (or using autodiff).

You’d have to

decide if a quasi-symmetry exists
construct
differentiate it
Return

ChatGPT 5.2 says step 1 isn’t solvable. So we’d have to supply the symmetry ahead of time (like we already do with regular symmetries), then integrate the gradient of (which we can get because we can compute ).

But none of this really matters: we don’t even have to compute because is just a boundary term, and the gauge telescopes away in the actual dynamics.

Also note that this isn’t a true reduction, as it doesn’t really reduce dimension.

2. Configuration Space Reduction

Here we will reduce into a simpler space, by action .

We have a map .

Let , where . The projection map sends each element to its corresponding orbit in .

How does the associated tangent bundle change under quotient by ?

Before, for a Lie group , we had the pair

If acts on itself (call the acting element ), we have

If we trivialize this

So is unaffected by the quotient.

Subcase 1: Euler-Poincare Reduction

Let’s consider . Then , which is a single point. This is the same as our original reduced Lagrangian, which was .

If you recall, we completely got rid of any dependence on the actual manifold and worked completely in the Lie algebra. So we’ve already solved this case.

Subcase 2: Lagrange-Poincare Reduction

What if is just some arbitrary manifold? What does it even mean to take , in general? We need to define an equivalence relation.

Consider the orbit of with respect to :

We say if there exists some such that (they are in the same orbit).

So we are talking about , the set of orbits.

The problem is the induced equivalence relation of . We need the velocities to transform:

We can define another equivalence relation in this way. Two elements and are equivalent if there exists a such that . This constructs .

We also know there’s a map

that just takes the equivalence classes on (which are among pairs ) to their corresponding equivalence classes in (which are among ). Basically, it forgets the velocity.

If we have some we can take the fiber . This points back to the entire orbit of and associated velocities. The question becomes: how do we resolve the ambiguity of which to use as representative?

Pick an arbitrary as representative; all other in the orbit equal for some . The equivalence classes over velocities of the vertical part can be represented as

for some ³.

Once a representative is fixed, the velocity component along the group orbit is determined by an element . What remains is the component of the velocity transverse to the orbit. So we can decompose

Where is along the orbit and is the projection onto .

(This split isn’t canonical, it depends on a choice of connection on .)
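
To make the split concrete, here is a minimal sketch for rotations acting on the plane (away from the origin). It assumes the connection is the orthogonal projection onto the orbit direction, which is one possible choice; `split_velocity` is a hypothetical helper, not a library function.

```python
def split_velocity(q, v):
    x, y = q
    vx, vy = v
    r2 = x * x + y * y
    # Lie algebra element: the angular velocity of the motion along the orbit.
    xi = (x * vy - y * vx) / r2
    # Vertical part: xi times the rotation generator (-y, x) at q.
    vertical = (-xi * y, xi * x)
    # Horizontal part: what is left, transverse to the orbit (radial here).
    horizontal = (vx - vertical[0], vy - vertical[1])
    return xi, vertical, horizontal

q = (3.0, 4.0)
v = (1.0, 2.0)
xi, vert, hor = split_velocity(q, v)
print(xi, vert, hor)
```

With a different connection the same velocity would split differently, which is the non-canonical choice mentioned above.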

3. Phase Space Reduction

Subcase 1: Marsden-Weinstein Reduction

Note: I believe this is the original paper. I haven’t introduced symplectic geometry so I am omitting that language and keeping things informal.

Let’s say we have a system with Noether charges . We can reduce this system by picking corresponding values for each charge , etc., then setting . Let’s look in more detail.

We have the space of pairs (the phase space aka cotangent bundle):

Suppose a Lie group acts on configurations:

with tangent map

We have:

We want our momentum functional to work as so:

Conveniently, for with induced vector field on (so ):

We’ve constructed the “momentum map” . This is essentially the definition of a Noether charge but written in functional form ().
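
Here is a minimal sketch of the momentum map for rotations of the plane, assuming the pairing is `J(q, p)(xi) = <p, xi_Q(q)>` with induced vector field `xi_Q(q) = xi * (-y, x)`. The function names are made up for this sketch.

```python
import math

def generator(q, xi):
    # Induced vector field xi_Q(q) for SO(2) acting on the plane.
    x, y = q
    return (-xi * y, xi * x)

def momentum_map(q, p, xi=1.0):
    # J(q, p)(xi) = <p, xi_Q(q)>; for xi = 1 this is x*p_y - y*p_x.
    gx, gy = generator(q, xi)
    return p[0] * gx + p[1] * gy

def rotate(vec, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

q, p = (1.0, 2.0), (0.5, -1.5)
J = momentum_map(q, p)
# Rotating both q and p (the lifted action) leaves J unchanged.
J_rot = momentum_map(rotate(q, 0.7), rotate(p, 0.7))
print(J, J_rot)
```

For this group the momentum map is just the angular momentum, and its invariance under the lifted action is what lets us quotient the level set by the symmetry.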

We can assign a value to the corresponding Noether charge for that symmetry⁴:

Consider the -preserving symmetries (the ’s are preserved under action by ):

So we can think of the reduced state space as

So we’ve restricted the phase space to a specific value of a conserved quantity and then quotiented out the symmetry corresponding to that quantity. This generalizes to multiple conserved quantities when they arise as components of a momentum map (for a product symmetry group) or via staged reduction (multiple commuting symmetries).

Example - Kepler’s Third Law - Marsden-Weinstein Reduction

Let’s return to Kepler’s third law in 2-dimensions.

Let

Convert to polar coordinates, :

Now let’s consider

(Symmetry under )

The conserved quantity is

Let’s fix this

This determines

(this is angular momentum, the same as the free rotor)

So we’ve reduced the Lagrangian to one dimension ().

However, there’s a problem. The dynamics are correct, but this is no longer necessarily a Lagrangian. We’ve introduced a constraint (we haven’t looked at constraints yet). We’ll need a way to correct that (in the next section).

The last step is to handle the phase space.

Start with:

Compute the canonical momenta:

Invert:

Then the Hamiltonian is the Legendre transform

So restricting to gives

then on it becomes
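
As a quick check of the reduced picture, here is a minimal sketch. It assumes the reduced Hamiltonian takes the standard form `H = p_r^2/(2m) + ell^2/(2 m r^2) - mu/r`, where `ell` is the fixed value of the angular momentum (the symbol names are assumptions), with the potential `-mu/r` matching the Kepler code used elsewhere in this series.

```python
def reduced_hamiltonian(r, p_r, ell, m=1.0, mu=1.0):
    # Radial kinetic energy plus the effective potential.
    return p_r ** 2 / (2 * m) + ell ** 2 / (2 * m * r ** 2) - mu / r

def effective_force(r, ell, m=1.0, mu=1.0):
    # -dV_eff/dr with V_eff = ell^2 / (2 m r^2) - mu / r
    return ell ** 2 / (m * r ** 3) - mu / r ** 2

m, mu = 1.0, 1.0
# Initial condition r = (1, 0), v = (0, 0.8): ell = m * (x*vy - y*vx).
ell = m * (1.0 * 0.8 - 0.0 * 0.0)

# Energy of the full 2D system at the initial condition.
full_energy = 0.5 * m * 0.8 ** 2 - mu / 1.0
# Energy of the reduced system: r = 1, p_r = 0.
reduced_energy = reduced_hamiltonian(1.0, 0.0, ell, m, mu)

# Circular orbit: the effective force vanishes at r = ell^2 / (m * mu).
r_circ = ell ** 2 / (m * mu)
print(full_energy, reduced_energy, r_circ, effective_force(r_circ, ell, m, mu))
```

The angular kinetic energy of the full system shows up in the reduced system as the `ell^2 / (2 m r^2)` term in the potential.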

Subcase 2: Routh Reduction

Note: Typically Routh reduction is for cyclic coordinates specifically. I’m looking at it a bit more generally.

We know, from the last section, that we have . We are looking to modify our variational problem to consider this constraint.

By Lagrange multipliers, we can augment the action with the constraint:

Since is constant along the orbits of the symmetry, we just need to ensure the solutions move along that submanifold to ensure the new equation is variational. From the earlier section on Lagrange-Poincare, we can decompose as , where is along the “symmetry direction” (with conserved quantity ) and is “transverse” to the symmetry direction.

Define:

This is the same as the original Lagrangian, just reparametrized.

Thus,

And we enforce .

Plugging back in to the action formula

Here the only subtlety is that is the symmetry-direction velocity (e.g. ), so the variation that produces the symmetry equation is really a variation of the symmetry coordinate .

That means , so after integrating by parts the stationarity condition is

Now that we have this, let’s try to modify to cancel the -chain rule term, without affecting or .

Consider

So (chain rule in shorthand)

is the coefficient. Since the -coefficient is along the desired solutions, we set , and thus , and .

Thus, the corrected Lagrangian is

Which is the form of the Routhian.

Example - Kepler’s Third Law - Routh Reduction

Take our solution from the end of the last example:

Subtract . In this case motion is split into and components,

So the new is the variational formula that gives the dynamics that obeys the constraint.
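
To see that the reduced dynamics match the full ones, here is a minimal sketch. It assumes the Routhian has the form `R = 0.5*m*rdot^2 - ell^2/(2 m r^2) + mu/r`, so its Euler-Lagrange equation is `m*r'' = ell^2/(m r^3) - mu/r^2`. It uses a plain RK4 integrator rather than the variational integrator from the library, and all names are made up for this sketch.

```python
import math

def rk4(f, state, h, steps):
    out = [state]
    for _ in range(steps):
        k1 = f(state)
        k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)])
        k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)])
        k4 = f([s + h * k for s, k in zip(state, k3)])
        state = [s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        out.append(state)
    return out

m, mu = 1.0, 1.0

def full(s):
    # Full planar Kepler problem with potential -mu / |r|.
    x, y, vx, vy = s
    r3 = math.hypot(x, y) ** 3
    return [vx, vy, -mu * x / (m * r3), -mu * y / (m * r3)]

ell = m * (1.0 * 0.8 - 0.0 * 0.0)  # angular momentum of the initial condition

def reduced(s):
    # Euler-Lagrange equation of the assumed Routhian R(r, rdot).
    r, rdot = s
    return [rdot, ell ** 2 / (m ** 2 * r ** 3) - mu / (m * r ** 2)]

h, steps = 0.001, 3000
traj_full = rk4(full, [1.0, 0.0, 0.0, 0.8], h, steps)
traj_red = rk4(reduced, [1.0, 0.0], h, steps)

max_err = max(abs(math.hypot(a[0], a[1]) - b[0])
              for a, b in zip(traj_full, traj_red))
print("max |r_full - r_reduced|:", max_err)
```

The one-dimensional system reproduces the radial distance of the two-dimensional orbit, which is the sense in which the Routhian gives the correct constrained dynamics.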

Code

No code this time. There’s probably already enough here to implement Routh reduction, but I’m going to leave that for later, once I’ve thought more carefully about how it composes with the other notions of equivalence and reduction above.

Conclusion

We now have the start of a picture of how the Lagrangian works and reduces under symmetry. So far, we are still recapitulating well-known results, but I have a much better grasp on the subject than before. In subsequent posts, we will look at this picture from an algebraic viewpoint and look at extensions. I also plan to look at applications of these principles to controls, games, and agents.

Read More

Marsden & Ratiu, Introduction to Mechanics and Symmetry
Marsden & Weinstein (1974), “Reduction of symplectic manifolds with symmetry.”
Marsden, Ratiu & Scheurle (2000), “Reduction theory and the Lagrange–Routh equations.”
Cendra, Marsden & Ratiu (2001), “Lagrangian Reduction by Stages”
Strongly suspect ChatGPT has memorized this blog post by Michael Kraus.

Footnotes

It seems the full notion of “gauge symmetry” or “gauge theory” from physics implies a fair amount of structure I have not yet introduced, so I avoid it. Really this is “variational equivalence or Lagrangian equivalence modulo exact 1-forms on path space.”↩︎
I used ChatGPT to assist with some of the algebra here, though I checked it thoroughly and I think it works. Even with ChatGPT and significant effort, I think the proof is inelegant. It’s not important to the overall throughline I’m trying to build so skip it if it seems confusing. I think in most cases you’d have some formula for and you’d just compute the derivative.↩︎
The -action needs to technically be “free and proper” for all of this to work out nicely such that is a smooth manifold. Freeness: if we form the matrix whose columns are the velocity directions generated by each symmetry at the current state, that matrix has full column rank (no non-trivial nullspace). Properness: basically we require “large symmetry actions” to produce “different enough” parameters; we cannot send the symmetry parameter to infinity while the Jacobian and the transformed state both stay small. From the reduction point of view: 1. there is only one symmetry motion corresponding to a given “along-orbit” velocity, and 2. states that are “the same up to symmetry” stay close when you evolve or project them. ↩︎
I suppress the individual J_i and p_i from here out, but this can be done for each Noether charge.↩︎

Inspection Bias

Tue, 16 Dec 2025 05:00:00 GMT

Introduction

Suppose we have a population of objects of different lifespans (starting at different times). Given a sample from a specific time point, we should expect “most” of the objects we see to be drawn from objects of longer life spans. Let’s briefly look into this phenomenon.

All these results are well known in the literature, under the names “renewal theory”, “Palm theory”, and “length-based sampling”.

Lifespans

Assume objects are created over time with a constant-rate process.

Let and call “lifespan” of an object ¹.

We are interested in the following distribution:

by Bayes’ rule

If the object’s lifespan is , then

is the birth of the object. Let’s assume is roughly constant density. Then

Rename to . We know:

for some constant , and splitting this up we get

Plugging this in, we get

For , we have

This is ultimately dependent on a choice of distribution over , and the constant birth rate.

This also implies that

(which is strictly larger than for non-degenerate ).

Also

Exponential

Let’s try exponential (which corresponds to “random death”/constant hazard rate)

Then we have

The conditional density is thus

Which is a gamma distribution.

Even though exponential lifetimes are “memoryless”, the population is not. Memorylessness is not preserved under selection-by-survival.

Also: if the variation in the population is large, the bias can be large.
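
The exponential case is easy to simulate. Here is a minimal sketch, assuming births are uniform over a long window and lifespans are exponential with mean 1; the window length, inspection time, and sample size are arbitrary choices for this sketch.

```python
import random

random.seed(0)
window = 100.0   # births are uniform on [0, window]
t_obs = 50.0     # inspection time, far from both ends of the window
n = 200_000

observed = []
all_lifespans = []
for _ in range(n):
    birth = random.uniform(0.0, window)
    life = random.expovariate(1.0)       # mean lifespan 1
    all_lifespans.append(life)
    if birth <= t_obs < birth + life:    # alive at the inspection time
        observed.append(life)

pop_mean = sum(all_lifespans) / len(all_lifespans)
obs_mean = sum(observed) / len(observed)
print("population mean lifespan:", pop_mean)
print("mean lifespan of objects alive at t_obs:", obs_mean)
```

The observed mean should come out near 2, the mean of a gamma distribution with shape 2 and rate 1, which is twice the population mean.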

Gamma

Let’s do the gamma distribution.

By length bias formula:

which is also a gamma

with expected value .

The Gamma shape parameter measures how many “chances to die” have already been survived. Observing an object at a random time guarantees at least one “survival”. So we increase the shape by one.

Log Uniform

This example is just to show how strong the effect can be.

Let’s say we have a log-uniform distribution over 10 orders of magnitude.

Multiplying by gives a constant. So the new distribution is uniform!

Let’s look at the top decile

So now most of the mass lives in the top decile!
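
The numbers can be checked directly. This minimal sketch assumes the support is `[1, 1e10]` and takes “top decile” to mean the top order of magnitude, `[1e9, 1e10]`; both are assumptions for this sketch.

```python
import math

lo, hi = 1.0, 1e10   # assumed support: ten orders of magnitude
cut = hi / 10        # lower edge of the top order of magnitude

# Log-uniform density is proportional to 1/x.
p_top_original = math.log(hi / cut) / math.log(hi / lo)
# Length-biased density is proportional to x * (1/x) = 1, i.e. uniform.
p_top_biased = (hi - cut) / (hi - lo)

print(p_top_original, p_top_biased)
```

Under these assumptions the top order of magnitude holds 10% of the original population but about 90% of the length-biased sample.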

Population Traits

Let’s now connect the lifespan to a set of “traits”. So

To simplify further, assume . So a

What happens to traits in our sample? We should expect the positively correlated to be overrepresented in the sample, and vice-versa.

In fact

where is . Since is a vector, is actually a matrix: the covariance of with itself.

This entire expression shows that the component of variability aligned with is what drives the sample bias. So if all the bias is “orthogonal” to , there will be little selection bias, but if the bias is “in the direction of” , there will be substantial selection bias. This is interesting as it grants us a “direction” based purely on the persistence of objects, which we can tie to geometry.
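
Here is a minimal sketch of the trait shift. It assumes the lifespan is linear in the traits, `L = beta . x`, and uses the identity that the length-biased mean of a trait minus its population mean equals `Cov(x, L) / E[L]`. The weights `beta` and the uniform trait distribution are made up for this sketch.

```python
import random

random.seed(1)
beta = (2.0, 0.0)   # assumed weights: only the first trait affects lifespan
n = 50_000

# Traits are uniform on [0.5, 1.5]^2 so that L = beta . x stays positive.
xs = [(random.uniform(0.5, 1.5), random.uniform(0.5, 1.5)) for _ in range(n)]
ls = [beta[0] * x[0] + beta[1] * x[1] for x in xs]

def mean(vals):
    return sum(vals) / len(vals)

mean_l = mean(ls)
shift_observed, shift_predicted = [], []
for i in range(2):
    xi = [x[i] for x in xs]
    pop_mean = mean(xi)
    # Length-biased mean: each object is weighted by its lifespan.
    biased_mean = sum(a * l for a, l in zip(xi, ls)) / sum(ls)
    cov = mean([a * l for a, l in zip(xi, ls)]) - pop_mean * mean_l
    shift_observed.append(biased_mean - pop_mean)
    shift_predicted.append(cov / mean_l)

print(shift_observed, shift_predicted)
```

The first trait, which is aligned with `beta`, shifts upward in the sample, while the second trait barely moves.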

Conclusion

In the typical statistical story, we are interested in information about the population, and we observe a sample obtained through a random process to infer the relevant information. I’m interested in two related statistical concepts. That is:

We know the population and the sampling process, and we are interested in the properties of the sample (this example).
We know the population and the sample, and we are interested in what process was used to obtain the sample.

This example is interesting because we managed to derive a “direction” purely from conditioning on persistence.

Footnotes

The window avoids any issues with measure zero that I’m too lazy to think through.↩︎

Dynamical Similarity and Equivariant Symmetry

Mon, 08 Dec 2025 05:00:00 GMT

Introduction

We have continued our investigation of geometric controls by investigating conserved quantities derived from Noether’s (first) theorem. Here we look at a slight extension, where the Noether charge is not conserved for the system itself, but across classes of systems.

Here, we will look at a particular case of this phenomenon: dynamical similarity.

Note: Much of this material was briefly included in the last post prior to a refactor (for cleaner conceptual organization).

AI disclosure: I had ChatGPT draft the one bridge section required to complete the refactor (the “Modified Noether” section).

Background

Given for some group , there is a group action (and associated tangent maps) that the Lagrangian is invariant to:

But the Euler-Lagrange equations are homogeneous in . That is, multiplying by a nonzero constant doesn’t change the equations of motion¹.

This suggests a looser constraint on the Lagrangian:

where is a group homomorphism², called the “character”. Note that if , , we have the original invariance condition.

In this case, the Lagrangian is not invariant but is instead “equivariant”.

When the symmetry is only equivariant, the usual Noether quantity is no longer conserved. Instead, it drifts predictably, as determined by the scaling factor. The combination (with the integral correction) is the piece that remains constant across similar systems.

Modified Noether

Given this, how is the Noether charge modified?

Assume we have a transformation depending on a small parameter and the Lagrangian transforms by a scalar factor

where and is smooth.

For the symmetries we care about, we can write

for some constant .

Differentiate both sides at .

Left-hand side:

(plus any time-related terms. I omitted those here but they pass through as you’d expect.)

Right-hand side:

Equating both expressions gives the equivariant Noether identity:

This replaces the conservation law of the invariant case.

Integrating in time, the combination

is constant across all trajectories related by the symmetry.

For (i.e. ), we recover the usual Noether charge.

Reintroducing Time

When we impose an equivariant symmetry on the Lagrangian,

the usual continuous Noether statement produces the modified identity

and the conserved quantity

At first glance, this seems to require an additional term in the computation of the Noether charge.

What happens if we reintroduce time?

We have extended Lagrangian:

Recall

with no additional terms.

For the scaling symmetry

we have

which is exactly the form we derived in the example section.

Since we know

and the extended-time identity

hold at the same time, we can subtract:

Integrating from to ,

So (up to a constant)

So (assuming is homogeneous, has a Hamiltonian and the symmetry rescales time) we actually don’t have to compute that integral! This entire formulation is equivalent to our extended Noether framework!

Examples

Homogeneous Potentials

Let’s look at dynamical similarity.

Suppose we have:

i.e. movement under some potential . Let’s assume the potential is homogeneous of degree :

and we have symmetry of form:

With (so it’s infinitesimal) and (which causes all of to scale homogeneously: scales as , as , as , as ).

That is:

(since ).

Let’s compute the Noether charge:

And from a previous example we know

This specializes to some known cases:

Free Fall

in this case. . If we double the initial height, we need to “stretch time” by dividing by to remain on a valid solution.

Here potential scales with height , so and . .

Kepler’s Third Law

in this case. . So doubling the radius implies you must “stretch time” (like the time to complete one orbit) by dividing by to stay on a valid solution.

Here, , so . , as in Kepler’s third law.
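
Both special cases can be checked against closed-form results. This minimal sketch assumes the standard scaling rule for a potential homogeneous of degree `k`: scaling lengths by `lam` requires scaling times by `lam**(1 - k/2)`. The helper names are made up, and the formulas used are the textbook fall time `sqrt(2h/g)` and Kepler period `2*pi*sqrt(a^3/mu)`.

```python
import math

def time_exponent(k):
    # For V homogeneous of degree k, q -> lam*q requires t -> lam**(1 - k/2) * t.
    return 1.0 - k / 2.0

def fall_time(height, g=9.81):
    # Time to fall from rest through a given height (linear potential, k = 1).
    return math.sqrt(2.0 * height / g)

def kepler_period(a, mu=1.0):
    # Orbital period for semi-major axis a (potential -mu/r, k = -1).
    return 2.0 * math.pi * math.sqrt(a ** 3 / mu)

lam = 2.0
free_fall_ratio = fall_time(lam * 10.0) / fall_time(10.0)
kepler_ratio = kepler_period(lam * 1.0) / kepler_period(1.0)

print(free_fall_ratio, lam ** time_exponent(1))    # both sqrt(2)
print(kepler_ratio, lam ** time_exponent(-1))      # both 2**1.5
```

The Kepler exponent `1.5` is the same value used for `alpha` in the dynamical similarity symmetry.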

Code

We don’t need to adjust the code at all. It should already work!

Examples

Kepler’s Third Law

We define

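import math

import torch
import tqdm

# VariationalSystem, Rn, and the other helpers come from the existing library code (not shown here).
class Kepler(VariationalSystem):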
    def control_plane(self):
        return {
            "r": Rn(2)
        }
        
    def params(self):
        return ["mass", "mu"]

    def lagrangian(self, ctrl, dctrl):
        r = ctrl.r
        rdot = dctrl.r
        m = self.params.mass
        mu = self.params.mu
            
        r_norm = torch.sqrt((r * r).sum() + 1e-10)
        T = 0.5 * m * (rdot * rdot).sum()
        V = -mu / r_norm
        return T - V

We run it with

if __name__ == "__main__":
    kepler = Kepler({
        "mass": 1.0,
        "mu": 1.0
    })

    h = 0.01
    recorder = StepRecorder()
    integrator = VariationalIntegrator(kepler, step_size=h, on_step=recorder.on_step)

    # Energy (time translation): (eps, t) -> t + eps
    energy_sym = Symmetry(
        space_transform=lambda eps, t, q: q,
        time_transform=lambda eps, t, q: t + eps
    )
    integrator.register_noether_charge("energy", energy_sym)

    r_slice = kepler.model.layout["r"][1]
    def rotate_r(eps, t, q, sl=r_slice):
        qn = q.clone()
        x, y = q[sl]
        c, s = math.cos(eps), math.sin(eps)
        qn[sl] = torch.tensor([c*x - s*y, s*x + c*y], dtype=q.dtype)
        return qn

    angular_sym = Symmetry(space_transform=rotate_r)
    integrator.register_noether_charge("angular_momentum", angular_sym)

    # Time rescales as lam**alpha; alpha = 3/2 for the Kepler potential (homogeneous of degree -1).
    alpha = 1.5

    def scale_r(eps, t, q, sl=r_slice):
        qn = q.clone()
        qn[sl] = math.exp(eps) * q[sl]
        return qn

    dyn_sim = Symmetry(
        space_transform=scale_r,
        time_transform=lambda eps, t, q: math.exp(alpha * eps) * t
    )

    integrator.register_noether_charge("dynamical_similarity", dyn_sim)

    # Initial conditions for elliptical orbit
    r0 = torch.tensor([1.0, 0.0], dtype=torch.float64)
    v0 = torch.tensor([0.0, 0.8], dtype=torch.float64)
    t0 = torch.tensor([0.0], dtype=torch.float64)

    ctrl0 = AttrObject({"r": r0, "t": t0})
    q0 = kepler.model.pack(ctrl0)

    steps = 500

    ctrl1 = AttrObject({"r": r0 + h * v0, "t": t0 + h})
    q1 = kepler.model.pack(ctrl1)

    qs = [q0.clone(), q1.clone()]
    q_prev, q_curr = q0, q1

    for _ in tqdm.tqdm(range(steps - 2)):
        q_next, ok = integrator.step(q_prev, q_curr)
        qs.append(q_next.clone())
        q_prev, q_curr = q_curr, q_next

    qs = torch.stack(qs, dim=0)

    print("\nKepler Problem:")
    energies = [float(rec["noether_charges"]["energy"]) for rec in recorder.records]
    angular = [float(rec["noether_charges"]["angular_momentum"]) for rec in recorder.records]
    similarity = [float(rec["noether_charges"]["dynamical_similarity"]) for rec in recorder.records]
    
    print("Energy (should be constant):")
    print("  min:", min(energies), "max:", max(energies), "drift:", energies[-1] - energies[0])
    print("Angular momentum (should be constant):")
    print("  min:", min(angular), "max:", max(angular), "drift:", angular[-1] - angular[0])
    
    J_values = [float(rec["noether_charges"]["dynamical_similarity"]) for rec in recorder.records]

    # Second Kepler run, scaled initial conditions
    lam = 2.0
    kepler2 = Kepler({
        "mass": 1.0,
        "mu": 1.0
    })
    recorder2 = StepRecorder()
    h2 = h*(lam**alpha)
    integrator2 = VariationalIntegrator(kepler2, step_size=h2, on_step=recorder2.on_step)

    r0_sc = lam * r0
    v0_sc = (lam ** (1.0 - alpha)) * v0
    t0_sc = torch.tensor([0.0], dtype=torch.float64)

    ctrl0_sc = AttrObject({"r": r0_sc, "t": t0_sc})
    q0_sc = kepler2.model.pack(ctrl0_sc)

    ctrl1_sc = AttrObject({"r": r0_sc + h2 * v0_sc, "t": t0_sc + h2})
    q1_sc = kepler2.model.pack(ctrl1_sc)

    qs2 = [q0_sc.clone(), q1_sc.clone()]
    q_prev, q_curr = q0_sc, q1_sc

    for _ in tqdm.tqdm(range(steps - 2)):
        q_next, ok = integrator2.step(q_prev, q_curr)
        qs2.append(q_next.clone())
        q_prev, q_curr = q_curr, q_next

    qs2 = torch.stack(qs2, dim=0)

    base_t, base_J = compute_J_from_records(kepler, recorder, alpha)
    sc_t, sc_J = compute_J_from_records(kepler2, recorder2, alpha)

    lam_t = lam ** alpha
    lam_J = lam ** (2.0 - alpha)

    errors = []
    for tb, Jb in zip(base_t, base_J):
        target_t = lam_t * tb
        idx = min(range(len(sc_t)), key=lambda k: abs(sc_t[k] - target_t))
        J_scaled_rescaled = sc_J[idx] / lam_J
        errors.append(J_scaled_rescaled - Jb)

    print(f"\nDynamical similarity comparison (lambda={lam}):")
    print("  J_scaled/lam^{2-alpha} - J_base stats:")
    print("    min:", min(errors))
    print("    max:", max(errors))
    print("    mean:", sum(errors) / len(errors))

This is the previous example, but we run it twice, at two different scales.

We get:

Kepler Problem:
Energy (should be constant):
  min: 0.6735297151115243 max: 0.6868012832352087 drift: -0.0026780225061522334
Angular momentum (should be constant):
  min: 0.8000193606114198 max: 0.8000206253911845 drift: -1.9376809246018922e-07
100%|████████████████████████████████████████████████████████████████████████████████| 498/498 [00:01<00:00, 361.06it/s]
Dynamical similarity comparison (lambda=2.0):
  J_scaled/lam^{2-alpha} - J_base stats:
    min: -3.547062643605159e-11
    max: 4.892536153988658e-09
    mean: 9.990145743197486e-10

This looks good: after rescaling, the two runs agree to within about 5e-9. So we have the same rescaled J for similar curves.
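The scaling used for the second run can also be sanity-checked directly on the Lagrangian. The sketch below is not part of the library; it assumes the planar Kepler Lagrangian from above with `mass = mu = 1.0` and uses plain Python instead of torch. It checks that under r → λr, v → λ^(1-α)v with α = 1.5, the kinetic and potential terms pick up the same factor λ^(2-2α), so L dt scales by λ^(2-α), the factor used to rescale J in the comparison.

```python
import math

def kinetic(v, mass=1.0):
    """Kinetic energy of a planar velocity vector."""
    return 0.5 * mass * (v[0] ** 2 + v[1] ** 2)

def potential(r, mu=1.0):
    """Kepler potential -mu / |r|."""
    return -mu / math.hypot(r[0], r[1])

alpha, lam = 1.5, 2.0
r, v = (1.0, 0.0), (0.0, 0.8)

# Scaled initial data, matching r0_sc and v0_sc in the script above.
r_sc = tuple(lam * x for x in r)
v_sc = tuple(lam ** (1.0 - alpha) * x for x in v)

lagrangian_factor = lam ** (2.0 - 2.0 * alpha)
T_ratio = kinetic(v_sc) / kinetic(v)
V_ratio = potential(r_sc) / potential(r)

# L dt picks up the Lagrangian factor times the time factor lam**alpha.
action_factor = lagrangian_factor * lam ** alpha

print("T ratio:", T_ratio, "V ratio:", V_ratio, "expected:", lagrangian_factor)
print("action factor:", action_factor, "lam**(2 - alpha):", lam ** (2.0 - alpha))
```

Both ratios agreeing is what makes α = 1.5 the right exponent for a potential that is homogeneous of degree -1; a different potential would need a different α.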

Conclusion

We showed how equivariance results in Noether charges across similar systems, rather than within a single system. In the next post in this series, I plan to dig into some more interesting examples.

Footnotes

There’s also another way (gauges) to transform the Lagrangian while preserving the physics that I’ll explore in a later post.↩︎
Ignoring the cases where is less than zero, as that flips the minima and maxima.↩︎

Are We Approaching Cultural Saturation?

Sat, 06 Dec 2025 05:00:00 GMT

Introduction

In the Paradox of Taste, I looked at novels as information-theoretic objects. One question I asked was: what if all novels that could ever exist were enumerated and indexed in the Library of Babel?

The Library of Babel is gigantic. A back-of-the-envelope calculation shows that there are roughly to grammatical-ish English strings of novel length¹. Even is a superastronomical number. There are only an estimated particles in the observable universe.

But despite the vast number of possible stories, we seem to see the same stories over and over. Even if we just look at novels, we see clusters around a handful of templates. An orphaned farm boy is destined to defeat an ancient evil. A socially awkward young woman circles a slow-burn romance in a polite society. A brooding detective unravels a plot in a corrupt organization.

This seems to extend beyond the novel. In another essay, I considered images in terms of the number of semantic bits they encode. While there are many possible pictures, humans only care about a few semantic bits worth of knowledge, and so we see many similar forms over and over again.

Other art forms also seem to gravitate towards a few recurring forms. Consider movies. We often see the same Marvel origin stories or the same Disney film remade over and over.

Every artistic medium shows the same pattern. Early on, discoveries feel abundant. New genres, new forms, new conventions. As the medium matures, novelty becomes harder, and there are endless sequels and reboots. Or, genres fragment and microstyles proliferate, with innovation occurring along narrower and narrower dimensions.

If there is such an abundance of possible art, why do we seem to see the same cultural objects over and over? Is cultural novelty a finite resource? And if so, are we approaching some sort of equilibrium?

In the sciences, there is even some concern that an exponential amount of energy input could lead to a mere linear payoff, or worse². Could something similar be true in the cultural fields?

In this post we will briefly consider these questions.

Caveat Lector: All arguments are back-of-the-envelope. As with all of my writings, please view it as semi-experimental.

Abstraction

One obvious answer to this conundrum is that humans don’t remember or analyze stories in their entirety, but only consider abstractions of stories. Various theorists have tried to build models of specific stories, or classes of stories. The most well-known of these is Campbell’s “Hero’s Journey”, from The Hero with a Thousand Faces (1949).

There is an entire corpus of scholarly work attempting to build narratological models of stories.

Most of the academic work seems aimed at classifying or characterizing existing stories. These range from role-based models (Greimas models stories as interactions among a small set of roles) to grammars (Vladimir Propp’s Morphology of the Folktale, which treats Russian folktales as sequences of standardized “functions” that behave roughly like the states of a finite-state automaton).

In folklore, academics have even attempted to index the space of known plots. The Aarne–Thompson–Uther (ATU) folktale type index assigns each traditional tale a numeric “type” (ATU 510A for “Cinderella”, 300–749 for various hero tales, and so on), while Stith Thompson’s Motif-Index of Folk-Literature catalogues recurring “mythemes”, story motifs like “cruel stepmother,” “magic helper,” “journey to the underworld”. In effect, these systems treat the corpus of folktales as a finite catalogue of plot skeletons and motifs that can be recombined.

There’s also a small cottage industry of “implementable” story frameworks aimed at aspiring screenwriters (Syd Field’s three-act paradigm, Snyder’s “Save The Cat”, John Truby’s Anatomy of Story) which break all stories down into a “templated” set of steps (to be implemented by a writer)³.

Finally, there are attempts to embed stories (and indeed, all of language) into low-dimensional spaces using machine learning. These are outside the scope of this post.

Long story short: long stories can be made short.

Counting Abstract Stories

Now that we’ve concluded stories can be abstracted, let’s investigate the “crowdedness” of the space of stories. Let’s suppose a story’s “semantic type” can be encoded in only “semantic” bits⁴, similar to our analysis of pictures. If there are “evenly distributed” stories, how close are they to each other?

To be clear, we are modeling each “abstract story” in our model as a string of 1s and 0s. Each bit represents some abstract story element (for example, “comedy or tragedy”). We can view this as an “index” into the set of stories (like in Borges’ library). If a story differs from another story in bit places, we say the two stories are “Hamming distance ” away from each other.

We can also similarly assume each story “claims” the nearby radius. Then the “volume” of the Hamming neighborhood is:

Therefore, we say there are stories within radius of the original story.

Since there are stories total, if we assume the stories are evenly distributed⁵, we get (crudely)

Suppose we want to create a new story in this space. What’s the minimal overlap it will have with an existing story?

If we know and , we can find this by solving the following equation for :

and then computing “bits in common”

in fractional form:

How large are and in reality?

Let’s stick with English. For novels, according to Fredner, . Let’s use this estimate for now (although in reality the number of novels is dwarfed by the number of short stories, and to consider all stories we might also want to consider narrative poems, film scripts, etc.).

Reusing our semantic bit estimate of :

Solving this, we get somewhere between and . So is bounded by .

That is, under this model, a new novel would semantically overlap with an existing story by 84% or more.
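Here’s a minimal sketch of the packing computation above. The values `n_bits = 50` and `num_novels = 5_000_000` are assumptions for illustration, picked because they reproduce the 84% figure; any count between roughly 1.8 and 9.5 million gives the same radius at 50 bits. The function names are hypothetical.

```python
from math import comb

def hamming_ball_volume(n: int, r: int) -> int:
    """Number of n-bit strings within Hamming distance r of a fixed string."""
    return sum(comb(n, k) for k in range(r + 1))

def covering_radius(n: int, num_stories: int) -> int:
    """Smallest r such that num_stories balls of radius r could cover all 2**n strings."""
    for r in range(n + 1):
        if num_stories * hamming_ball_volume(n, r) >= 2 ** n:
            return r
    return n  # not reached: the radius-n ball already contains every string

n_bits = 50             # assumed number of semantic bits
num_novels = 5_000_000  # assumed number of existing novels

r = covering_radius(n_bits, num_novels)
min_overlap = 1 - r / n_bits

print("radius:", r)
print("minimum fractional overlap:", round(min_overlap, 4))
```

With these inputs the radius lands between 7 and 8 bits (the loop returns the upper value), which gives an overlap of about 0.84.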

Let’s look at how varying and affect this result.

This first figure is a heatmap. We vary on the x-axis and on the y-axis. The color of the grid depends on .

In the second plot we instead compare to for each value of .

rises sharply at low , then approaches 1 as grows larger.

So even at relatively high , the space of stories is somewhat crowded at . What’s more, most of the crowding occurs early, then slows down, with “steeper crowding” occurring at lower .

This matches what we expect. Stories are getting crowded. Even before we bring in energy or dynamics, a simple packing argument already suggests that most new stories must live semantically close to ones we’ve already told.

Not All Bits Are Equal

By using “semantic bits” we have implicitly assumed that the bits are (mostly) orthogonal and ordered by importance. In reality, the first bit (e.g. “tragedy vs. comedy”, or more likely some division of two common story structure types) probably matters more to human perception than the 47th bit. So our original model is a bit naive. Let’s try to improve it by adding weighting for perceptual importance.

If there is a recursive hierarchical decomposition of meaning (i.e. at the top level is comedy vs. tragedy, comedy splits into romantic comedy vs. dark comedy, etc.), and this is “scale-free”, we should wind up with a power law. Many similar natural phenomena follow power laws, such as Zipf’s Law⁶, which comes up in the relative frequencies of words in natural corpora of texts.

So let’s assume the importance of the -th bit roughly follows a power law:

where is some weight.

So we can compute the weighted distance between two works (note the shift by one, which keeps the indexing consistent with the last example):

and are strings in this case, and the sum is over their characters. returns 0 if the characters are equal, and 1 if they are different.

So the complete diameter of the space (the distance between the furthest two objects, for example the string of “1”s and the string of “0”s) is:

This is the partial sum of the -series⁷. In the limit, the -series only converges for . Call the partial sum for a given .

Now let be the typical nearest-neighbor distance between works in this weighted metric. We’ll also assume there’s a perceptual threshold distance below which differences between works become imperceptible. This threshold probably varies by person: more casual consumers have a higher , whereas experts have a lower .

For a given and , is the maximum possible distance. If the typical nearest-neighbor distance falls well below a perceptual threshold , then most works within that ball of radius will feel indistinguishable to a given observer. When is still a substantial fraction of , there is still plenty of perceptual room for works to feel distinct.

We can now repeat our earlier analysis.

Let be the number of strings within weighted distance of a given string. Then by our crude packing argument we get a volume:

and a new weighted

Given , we can solve implicitly for .

There are three different cases for what the approximation of looks like, depending on .

where is the zeta function, and is the Euler-Mascheroni constant. Note that the partial sum only actually converges in the limit if .
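To make the three regimes concrete, here’s a small sketch comparing the exact diameter to its leading-order approximation. It assumes the weight of the i-th bit (1-indexed) is i^(-alpha); the name `alpha` and the helper names are mine. The approximations are the standard asymptotics for p-series partial sums and may differ from the expressions above by lower-order terms.

```python
import math

EULER_GAMMA = 0.5772156649015329

def weight(i: int, alpha: float) -> float:
    """Assumed perceptual weight of the i-th bit (1-indexed)."""
    return i ** (-alpha)

def weighted_distance(x: str, y: str, alpha: float) -> float:
    """Sum of the weights at positions where two equal-length bit strings differ."""
    return sum(
        weight(i, alpha)
        for i, (a, b) in enumerate(zip(x, y), start=1)
        if a != b
    )

def diameter(n: int, alpha: float) -> float:
    """Distance between the all-zeros and all-ones strings."""
    return weighted_distance("0" * n, "1" * n, alpha)

def zeta(s: float, terms: int = 10_000) -> float:
    """Riemann zeta for s > 1: partial sum plus an integral estimate of the tail."""
    head = sum(k ** (-s) for k in range(1, terms))
    return head + terms ** (1 - s) / (s - 1) + 0.5 * terms ** (-s)

def diameter_approx(n: int, alpha: float) -> float:
    """Leading-order approximation of the diameter in each regime."""
    if alpha < 1:
        return n ** (1 - alpha) / (1 - alpha)   # polynomial growth
    if alpha == 1:
        return math.log(n) + EULER_GAMMA        # logarithmic growth
    return zeta(alpha) - n ** (1 - alpha) / (alpha - 1)  # bounded by zeta(alpha)

for alpha in (0.5, 1.0, 2.0):
    print(alpha, diameter(50, alpha), diameter_approx(50, alpha))
```

The slow-decay case is the least accurate at leading order (roughly 10% off at 50 bits), while the other two agree to two decimal places or better.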

Let’s think about each case.

Case 1:

In this case, the partial sum diverges like a polynomial in .

The intuition is that the weights fall off very slowly as we progress to increasingly low-order bits, so “low-order” bits still contribute meaningfully to perception (though less than higher-order bits).

Plot above at .

Case 2:

In this case, we have a harmonic series, and the partial sum is roughly a logarithm in :

In this case, each bit is worth roughly the same. This is the “Zipfian” case.

Plot above at .

Case 3:

Finally, if we have a convergent series. The partial sum converges to the zeta function in the limit, plus an additional term ().

In this regime, marginal bits are worth much less than earlier bits: most of the perceptual “distance budget” is carried by the first few coordinates. Geometrically, the metric has a finite diameter (bounded by ) even as the number of distinct works grows exponentially. Beyond some point, adding more bits creates many more states that are all crammed into almost the same finite perceptual space.

Plot above at . See how it is noticeably more “squashed” than the last two?

Dynamics

Now that we have a method to measure the “crowdedness” of the space in terms of the total number of possible artifacts, let’s look at production in terms of total spent energy, and how that relates to the amount of novelty remaining. If we can link energy spent to the number of remaining microstates, we can use methods from statistical mechanics to model the state of the culture, and how it will change over time.

Empirical Energy Estimates

Let’s do some Fermi estimates of the total energy expenditure spent over time to reach the current cultural stock.

Let’s break this down.

We already estimated total novels. If a novel takes roughly 800 hours⁸, and a human burns 135 watts while writing⁹, we get:

or total joules (roughly 10 Hiroshima-class atomic bombs).

That’s the metabolic energy to actually produce the works. We’ll pretend that all works are magically placed in the library instantly, but note that there is also some additional analysis possible regarding distribution¹⁰.
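The per-novel part of this Fermi estimate is easy to reproduce. The sketch below uses only the two figures quoted above (800 hours, 135 watts); the number of novels is left as a parameter, and the function names are hypothetical.

```python
HOURS_PER_NOVEL = 800       # writing time per novel, from the estimate above
WATTS_WHILE_WRITING = 135   # 35% over a 100 W baseline
SECONDS_PER_HOUR = 3600

def joules_per_novel(hours: float = HOURS_PER_NOVEL,
                     watts: float = WATTS_WHILE_WRITING) -> float:
    """Metabolic energy spent writing one novel, in joules."""
    return hours * SECONDS_PER_HOUR * watts

def total_joules(num_novels: float) -> float:
    """Metabolic energy for a whole corpus, ignoring distribution costs."""
    return num_novels * joules_per_novel()

print("joules per novel:", joules_per_novel())
print("kWh per novel:", joules_per_novel() / 3.6e6)
```

That is 3.888e8 joules, or 108 kWh, per novel; multiplying by the novel count gives the cumulative figure.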

Statistical Mechanics of Semantic Space

Can we now connect the energy input to our model of cultural generation?

Let’s arbitrarily choose a reference frame, the -length string = “000..0”¹¹.

We can then compute the “internal energy”

The number of microstates for a given energy level is:

That is, for a given and , we count up the strings where . We can define the “entropy” now:

where is the number of microstates within distance of the reference.

We are actually seeking the “novelty” per unit energy added to the system. This looks like

It just so happens that this is related to the temperature.

which we can approximate as

Solving for

So (by definition) as the temperature increases, the marginal entropy (the amount of space for new stories) per unit of marginal energy declines.
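For a small number of bits, the entropy and temperature can be computed by brute force. This is a toy sketch, not the code behind the plots: it assumes weights i^(-alpha), a natural-log entropy with Boltzmann’s constant set to 1, and a central finite difference for the temperature. All names are hypothetical.

```python
import bisect
import itertools
import math

def sorted_energies(n: int, alpha: float) -> list:
    """Weighted distance of every n-bit string from the all-zeros reference, sorted."""
    w = [i ** (-alpha) for i in range(1, n + 1)]
    return sorted(
        sum(wi for wi, bit in zip(w, bits) if bit)
        for bits in itertools.product((0, 1), repeat=n)
    )

def entropy(energies: list, E: float) -> float:
    """S(E) = ln(number of strings with energy <= E). Requires E >= 0."""
    return math.log(bisect.bisect_right(energies, E))

def temperature(energies: list, E: float, dE: float) -> float:
    """T ~ dE/dS, estimated over the window [E - dE, E + dE]."""
    dS = entropy(energies, E + dE) - entropy(energies, E - dE)
    return math.inf if dS == 0 else 2 * dE / dS

energies = sorted_energies(10, 1.0)   # 1024 strings, Zipfian weights
t_low = temperature(energies, 1.0, 0.25)
t_high = temperature(energies, 2.6, 0.25)
print("T at low energy:", t_low)
print("T near the diameter:", t_high)
```

The temperature near the diameter comes out much higher than at low energy, since almost every string has already been counted and extra energy adds few new microstates.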

Using the and , we can plot against the total cumulative energy added to the system :

The above plot is at . Basically, the temperature doesn’t change much at first, but past a certain amount of energy the marginal entropy per unit energy drops rapidly as we “run out of room” and the temperature increases rapidly.

Heat Capacity

Let’s now look at the marginal gain compression (changes in ) as energy increases.

Under the microcanonical ensemble, we can define the heat capacity as

which we can approximate as

Alternately this can be written in terms of :

With the heat capacity, we can now connect the effect of the marginal energy input into the system with the change in the remaining capacity.
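The two forms of the heat capacity are linked by differentiating the temperature definition. A short derivation, assuming the entropy is written S(E), the temperature is defined by 1/T = dS/dE, and the heat capacity by C = dE/dT (these symbols are my assumptions and may not match the notation above):

```latex
\frac{1}{T} = \frac{dS}{dE}
\quad\Longrightarrow\quad
\frac{dT}{dE} = -\frac{S''(E)}{S'(E)^{2}}
\quad\Longrightarrow\quad
C = \frac{dE}{dT} = -\frac{S'(E)^{2}}{S''(E)} = -\frac{1}{T^{2}\, S''(E)}
```

So under these assumptions the heat capacity is positive exactly where the entropy is concave in energy, and it shrinks as the curvature of S grows relative to its slope.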

Intuitively, measures how the energy we pump into the system relates to increases in the effective temperature (scarcity of novelty per unit energy).

When is large, adding energy changes slowly. This corresponds to a regime where there is still a lot of unexplored semantic volume, and we can keep investing in new works without running out of “cheap” novelty.

When is small, adding a little energy produces a big jump in . In this regime, most of the low-hanging novelty has already been harvested, so further investment mostly churns inside already-occupied regions of semantic space.

If we plot the heat capacity against total energy added to the system, we see the heat capacity is pretty constant until it suddenly declines.

Synthesis

Let’s put the argument together and examine the dynamics of this situation from a society-level perspective. Human society expends energy (in the form of food and fuel) to find cultural objects. How much energy does it take to find “novel” cultural objects?

We can connect the entropy change over time to the energy input and the temperature :

For now, let’s assume a constant energy input rate. Therefore

and so

Let’s recall our three cases and go case-by-case.

Case 1:

This is our “slow drop off” case. In the regime, the earlier bit-weight analysis gave a polynomial growth of the effective “diameter” with ,

and if we take energy to be proportional to we can write entropy as a power law in :

for some constants , .

Then

Using

we obtain

The entropy production rate under constant energy input is

So, given constant energy input falls off as . Alternatively, each doubling of our energy input should yield amount of “novelty”.
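As a worked version of this case, suppose the entropy is a power law S(E) = c E^β with 0 < β < 1, and the energy input rate is a constant P. The symbols c, β, and P are my assumptions and may not match the constants used above.

```latex
S(E) = c\,E^{\beta}
\quad\Longrightarrow\quad
\frac{1}{T} = \frac{dS}{dE} = c\,\beta\,E^{\beta-1},
\qquad
T = \frac{E^{1-\beta}}{c\,\beta}

E(t) = P\,t
\quad\Longrightarrow\quad
\frac{dS}{dt} = \frac{P}{T} = c\,\beta\,P^{\beta}\,t^{\beta-1}

\frac{S(2E)}{S(E)} = 2^{\beta}
```

Under these assumptions the entropy production rate decays as a power of time, and doubling the cumulative energy multiplies the entropy by 2^β, which is less than 2.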

Case 2:

This is our Zipfian case. Before, we had

so suppose

with and constants. Then

Using

we get

With constant energy input,

therefore

Integrating in time:

So in the Zipfian regime, even with constant energy input, entropy (the number of distinguishable cultural microstates) grows logarithmically in time. Equivalently, exponential energy input leads to linear growth in output.

Sound familiar? This is similar to the story we heard earlier, related to science.
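The two statements about the Zipfian regime can be checked numerically. This sketch assumes the entropy has the form S(E) = a ln E + b and models the cumulative energy directly, either as P·t or as E0·exp(k·t). The constants a, b, P, E0, and k are illustrative assumptions.

```python
import math

def entropy(E: float, a: float = 1.0, b: float = 0.0) -> float:
    """Assumed Zipfian form S(E) = a * ln(E) + b."""
    return a * math.log(E) + b

def entropy_constant_power(t: float, power: float = 1.0) -> float:
    """Entropy at time t when cumulative energy grows linearly, E = power * t."""
    return entropy(power * t)

def entropy_exponential_energy(t: float, E0: float = 1.0, k: float = 0.05) -> float:
    """Entropy at time t when cumulative energy grows as E0 * exp(k * t)."""
    return entropy(E0 * math.exp(k * t))

# Constant input: each doubling of elapsed time adds the same entropy, a * ln 2.
early = entropy_constant_power(20) - entropy_constant_power(10)
late = entropy_constant_power(2000) - entropy_constant_power(1000)
print("gain per doubling:", early, late)

# Exponential input: each unit of time adds the same entropy, a * k.
step_a = entropy_exponential_energy(11) - entropy_exponential_energy(10)
step_b = entropy_exponential_energy(101) - entropy_exponential_energy(100)
print("gain per unit time:", step_a, step_b)
```

Logarithmic growth under constant input and linear growth under exponential input are the same statement read in two directions.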

Case 3:

This is the bounded regime. Since the p-series converges to a finite limit for , there is an effective maximal energy scale and corresponding maximal entropy

We can’t do the same analysis we did in the previous two cases because the number of bits is no longer proportional to the energy. That being said, we can reason intuitively that under constant energy input, diverges as , while the entropy production rate collapses to zero.

Time to Saturation

How far are we from saturation? More specifically, given some energy input rate , how long does it take before novelty per unit time falls below the perceptual threshold , or before we’ve exhausted a fixed fraction of the available semantic space?

We can find this by computing:

We have expressions for from the previous section, and we have estimates for and . To find , we compute , then compute .

I’ll omit the details. We can now plot out the dynamics using software.

Plots

AI Disclosure: Don’t take these plots too seriously. They’re mostly for illustrative effect. The parameter choices and saturation threshold are arbitrary; the only thing that really matters is the qualitative shape: initially flat, then a sharp rise once the space becomes crowded.

Now that we know how to compute everything, let’s take a look at historical and future trends. Instead of a universally constant energy input rate, let’s assume roughly exponential increase in energy input starting at the year 1700. There were also very few English language novels in the year 1700, so I used “100” as an approximation.

These are toy graphs. The saturation level is arbitrary. In the above graphs I’ve just set it to 10x the current temperature. But we can see that the saturation point could be quite close, especially if we think is high.

Regardless, under these assumptions, the qualitative picture is straightforward. At first, additional energy buys a lot of entropy. The system is “cold”, and new works carve out genuinely new regions of semantic space. As cumulative energy grows, the effective temperature remains roughly flat for a while, then begins to rise rapidly as we enter the crowded regime. Beyond that point, most of the marginal energy goes into producing works that sit inside already-populated neighborhoods in story space, rather than opening up genuinely new directions.

Now, let’s imagine that AI lets us move past metabolic limits for cultural production sometime in the near future. What will that do to our saturation times? The computations are easy: time to saturation and energy required are proportional in this framework. For example, AI might let us instantly increase the energy rate used to mine culture by one or more orders of magnitude. How does the projection change?

Here I’ve let AI increase energy expenditures by 5x. This massively decreases the saturation timeline.

Negative Temperature

The case (especially high ) has some especially interesting properties. In this case we can get “negative temperature”.

In ordinary thermodynamic systems, adding energy increases the number of accessible microstates, so is increasing and , which implies a positive temperature. In some long-range interacting systems, however, the density of states is not monotonic. Beyond a threshold, adding more energy actually reduces the number of accessible microstates, so and the effective temperature becomes negative.

Onsager’s classic example is a gas of point vortices in two dimensions. At low energy you get many small, disordered vortices but at very high energy the system prefers to concentrate that vorticity into a few large, coherent “supervortices.” These macroscopic structures are more “ordered,” but they correspond to the highest energies and thus to negative temperature states.

If we push the cultural analogy, a negative-temperature regime in semantic space would be one where driving the system to higher “energy” (more extreme, differentiated works) eventually reduces the number of distinct configurations, because the only way to pack that much structure into a bounded perceptual manifold is to form large-scale superstructures.

Maybe this is already happening? Mega-franchises, shared universes… all are examples of canonical templates that organize huge numbers of micro-variations. In such a regime, additional energy no longer produces fine-grained diversity. Instead, adding energy reinforces a few giant, highly ordered attractors that dominate the landscape.

Conclusion

We’ve built a simple model of the space of stories using methods inspired by statistical mechanics. The model shows that, over time, the space of stories becomes more crowded. If we increase the amount of energy we pour into constructing cultural objects, the space will “run out” more quickly. As we proceed, innovation happens in “lower-order” bits.

How seriously should we take this? Since there’s such a huge number of stories, it seems outlandish that we could actually run out. Regardless, I think this exercise is useful as a first step in tying some of the intuition I’m developing around information bottlenecks to physical and social processes.

Additional Thoughts

In no particular order:

It’s possible that differs based on different segments of the population.
If there are different segments of the population at different , do they proceed independently through this progression? Will “intellectual superfranchises” emerge?
could vary along the “bit direction”. So early bits are at a different than later bits.
Since human intelligence is bounded, the space of stories must ultimately be bounded.
Stories are not actually evenly distributed. However, this doesn’t necessarily weaken the argument. In fact, if there are a limited number of “story attractors”, then we would expect some regions of story space to actually grow more crowded more quickly.
There may be additional modeling that could be done here. For example, can this model predict punctuated equilibrium? Maybe there are regions that are separated by “high-energy barriers” or areas of extremely low density. Maybe the semantic manifold has disconnected or quasi-disconnected components.
If something like Propp’s model could be made into a full-out recursive structure (like a context-free grammar), does this change the analysis? Then we are not necessarily looking at “fixed strings”.
Stories can be forgotten, freeing up space for stories to reoccur.
Are there empirical ways to test this model? What concrete, falsifiable hypotheses does it make?

AI Disclosure

Probably the most heavily I’ve used AI on a post. I used ChatGPT and Claude to make a bunch of the graphs (an extremely painful process, I ended up making several graphs myself), to format LaTeX, to find sources, and for general feedback and brainstorming.

Footnotes

A grammatical English text of length words has about characters. Using an entropy rate of – bits/char (a conservative estimate) gives total information Thus the number of grammatical English sequences is If we assume the “empty word” is in our vocabulary, we can include shorter works in the same counting argument.↩︎
See here, here or here for some claims about science. While this essay focuses on culture, it’s possible similar arguments apply to other intellectual pursuits. We may even see some similar input-output relationships. That being said, science, math, and code have inductive structures that render some of this analysis less pertinent. For example, in math, a theorem might be stated very simply but imply an extensive proof of unknown length. Perhaps more on this in a future post.↩︎
It would be interesting to see attempts to combine these resources with LLMs to construct new stories.↩︎
We will get into the value of later.↩︎
We will examine this assumption later.↩︎
Claude suggested that, under a Chinese Restaurant Process or stick-breaking construction, the resulting ranked piece sizes follow a distribution due to a connection between Dirichlet processes and Zipf’s law. This is also potentially related to pink noise.↩︎
https://math.stackexchange.com/questions/2848784/general-p-series-rule↩︎
https://www.jenniferellis.ca/blog/2016/8/27/hourstowriteanovel. Seems reasonable AFAICT.↩︎
I assume 35% over a baseline 100 watt human.↩︎
This is a bit weird because you could have cases where some consumers have access to some works but not others. It seems outside the scope of this post.↩︎
A better reference frame would probably be somehow drawn from the typical set, and then the “higher-energy” texts would be more ordered, but I couldn’t figure out how to do this properly. Possibly the metric should be engineered along these lines as well.↩︎

Noether’s Theorem with Time

Fri, 05 Dec 2025 05:00:00 GMT

Introduction

Let’s extend the discrete controls framework and Noether’s theorems to include time-invariance. We’ll start with extending the continuous version, then discretize.

Edit: I refactored this post on December 8th 2025, moving the discussion of homogeneous potentials and dynamical similarity to the subsequent post. Some symmetries (such as dynamical similarity) do not make the physical Lagrangian strictly invariant, but become invariant only after lifting to the extended space and parametrizing by . In such cases, the associated Noether quantity is conserved with respect to but not generally with respect to the physical time .

Background

Lagrangian

Before we defined our Lagrangian as

where the action was

Let us now alter our definitions to explicitly include time. We want:

where

Let’s start by defining a new manifold:

Where looks like .

In terms of , the Lagrangian is

where takes as data and returns a number.

However, there’s a problem. Since is now part of the state, we need to introduce a new variable to play the role of in the adjusted formulation. Thus, we introduce a dummy parameter, “virtual time”, denoted ¹.

A curve through is thus , and the action is

Unpacking this further, the ’s don’t actually depend on directly. They only depend through , so our “dot” operator is now with respect to . What does this mean for our derivation?

Let’s try casting this back into physical time.

By definitions of

Consider:

The action is preserved, so:

Or, by adjusting the RHS

Let’s define some temp variables: .

By and :

Therefore:

Which implies

Seen another way, the curve we are integrating over in is defined by . Its derivative is . If , then (up to a constant), and the curve is with derivative .

So the factor is just a linear reparametrization of our curve. Glossing over a few steps, we “normalize” by and match arguments to get the following:

(Note )

The upshot is that if the symmetry affects , then needs to be adjusted by “dividing out” the change in the time variable with respect to virtual time².

Also notice, if , we have .

In particular, if , we have

This implies that any Lagrangian can be extended “for free” into under the trivial reparametrization ³.
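Here’s a quick numerical check that the extended action doesn’t care how time is parametrized. It’s a sketch under assumptions: the extended Lagrangian is taken to be L(q, t, q'/t')·t' (primes are derivatives in virtual time), the physical Lagrangian is a unit harmonic oscillator, and the path is q(t) = sin t on [0, 1]. None of these names come from the library.

```python
import math

def lagrangian(q, t, qdot):
    """Harmonic oscillator with unit mass and stiffness (an illustrative choice)."""
    return 0.5 * qdot ** 2 - 0.5 * q ** 2

def extended_lagrangian(q, t, dq_dtau, dt_dtau):
    """Assumed extended form: physical velocity is q'/t', and the result is scaled by t'."""
    return lagrangian(q, t, dq_dtau / dt_dtau) * dt_dtau

def action(time_of_tau, path_of_t, steps=20_000):
    """Midpoint-style quadrature of the extended action over tau in [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t0, t1 = time_of_tau(k * h), time_of_tau((k + 1) * h)
        q0, q1 = path_of_t(t0), path_of_t(t1)
        total += h * extended_lagrangian(
            0.5 * (q0 + q1), 0.5 * (t0 + t1), (q1 - q0) / h, (t1 - t0) / h
        )
    return total

path = math.sin
s_trivial = action(lambda tau: tau, path)                     # t = tau
s_warped = action(lambda tau: 0.5 * (tau + tau ** 2), path)   # t(0) = 0, t(1) = 1, t' > 0
print("trivial parametrization:", s_trivial)
print("warped parametrization: ", s_warped)
```

Both values should agree with each other and with the closed-form action for this path, sin(2)/4, to within the quadrature error.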

G-Invariance and Noether Charge

By analogy with the time-free case, if , we define a map such that

Where

and

The are just operators that “unpack” the tuple and return the -th argument.

We also construct the map

Note that .

Once again, this matches our previous derivation, but on instead of .

We can now state our updated -invariance principle. We say that is -invariant if:

At this point, we have reduced the problem back to the original Noether’s theorem proof.

We need to make one change. In the original proof we started here

has two tuples as arguments

Our Noether charge ends up being:

Lie Groups

What changes when , where is a Lie group⁴?

is a Lie group, and Lie groups are closed under product. So the product of and is also a Lie group.

We can then use the same reparametrization trick. Call , with and .

If you recall, for left-invariant Lie groups we actually are interested in the reduced Lagrangian for a Lie group⁵:

where

(this is the tangent along some path starting from the origin).

Now, we instead seek the reduced Lagrangian dependent on time

with action

For our extended problem, we get “for free”

We want to put this in terms of . First of all, since, we can unpack into constituent parts⁶

We need to work out , since the dot operator is now with respect to . By the chain rule:

Thus⁷

Call .

The action is preserved, so

We need (in terms of ) that makes this equation true.

Mapping back to the original reduced Lagrangian via matching arguments (same as in the original derivation; omitted), we therefore have

Interestingly, if you squint you can see a lot of “conjugation actions” that might pop up for a general extension by an arbitrary non-abelian “time group ”, rather than specifically⁸.

As a last aside, notice if we change variables back to :

So we successfully added a to , and can in fact do this to any arbitrary “without penalty”. So our Lie groups are already normalized by time, “naturally” (it’s included in ).

Examples

Let’s look at some examples.

Energy

Let’s say we have

preserved under symmetry

This implies that

We have that

So we need

In the usual parametrization () the Noether charge is:

In our Hamiltonian exposition, we defined

If we evaluate this at the unique such that , then

So the (negative) Hamiltonian is conserved (the total energy).
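This can be checked numerically: the charge for a pure time translation is the momentum conjugate to t in the extended Lagrangian. The sketch below assumes the extended form L(q, q'/t')·t' and uses the planar Kepler Lagrangian with `mass = mu = 1.0` as an example. The derivative with respect to t' is taken by central finite difference, and all names are hypothetical.

```python
import math

def kinetic(v, mass=1.0):
    return 0.5 * mass * (v[0] ** 2 + v[1] ** 2)

def potential(r, mu=1.0):
    return -mu / math.hypot(r[0], r[1])

def lagrangian(r, v):
    return kinetic(v) - potential(r)

def hamiltonian(r, v):
    return kinetic(v) + potential(r)

def extended_lagrangian(r, dr_dtau, dt_dtau):
    """Assumed extended form: physical velocity is r'/t', and the result is scaled by t'."""
    v = tuple(x / dt_dtau for x in dr_dtau)
    return lagrangian(r, v) * dt_dtau

def time_momentum(r, dr_dtau, dt_dtau=1.0, eps=1e-6):
    """Central finite difference of the extended Lagrangian with respect to t'."""
    up = extended_lagrangian(r, dr_dtau, dt_dtau + eps)
    down = extended_lagrangian(r, dr_dtau, dt_dtau - eps)
    return (up - down) / (2 * eps)

r, v = (1.0, 0.0), (0.0, 0.8)
charge = time_momentum(r, v)   # evaluated in the usual parametrization, t' = 1
print("time-translation charge:", charge)
print("negative Hamiltonian:   ", -hamiltonian(r, v))
```

For these initial conditions H = 0.32 - 1 = -0.68, so the charge should come out near 0.68.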

Linear Momentum

Consider the following transformation:

is static so we only need to worry about .

So is conserved. Setting gives conservation of . If we view as vector valued we can view this as conservation of momentum along each dimension.

Galilean Boost

Omitted.

ChatGPT suggested that you might be able to derive Kinetic energy via symmetry of the “free Lagrangian” under .

Discrete Noether with Time

As we saw above, for Lie groups the reduced Lagrangian is unchanged. The only thing we really need to do is update the computation of the Noether charge (if time is included).

Code

Claude Opus 4.5 + ChatGPT (with some coaxing through several major issues) were able to modify the code.

In the last post, we had this function

    def register_noether_charge(self, name:str, symmetry: Symmetry):
        def _new_charge(qk, qk1):
            p = self.D2_Ld(qk, qk1)

            qk1_eps = symmetry.log(symmetry.exp(self.tol * symmetry.generator) * symmetry.exp(qk1))
            omega_qk1 = (qk1_eps - qk1)/self.tol

            return (p*omega_qk1).sum()
        self.noether_charges[name] = _new_charge

For time, we need to add the time-dependent term:

To handle time-dependent symmetries, we update both Symmetry and register_noether_charge to work with an infinitesimal parameter acting on both space and time:

from typing import Callable, Optional

import torch

class Symmetry:

    def __init__(
        self,
        space_transform: Callable[..., torch.Tensor],
        time_transform: Optional[Callable[..., float]] = None,
    ):
        self._space_transform = space_transform
        self._time_transform = time_transform

    def apply_space(self, eps: float, t: float, q: torch.Tensor) -> torch.Tensor:

        try:
            return self._space_transform(eps, t, q)
        except TypeError:
            # Allow simpler signatures like f(eps, q)
            return self._space_transform(eps, q)

    def apply_time(self, eps: float, t: float, q: torch.Tensor) -> float:
        if self._time_transform is None:
            return t
        try:
            return self._time_transform(eps, t, q)
        except TypeError:
            # Allow simpler signatures like f(eps, t)
            return self._time_transform(eps, t)

class VariationalIntegrator:
    ...
    def register_noether_charge(self, name: str, symmetry: Symmetry):
        def _charge(qk, qk1, t_k, t_k1):
            p = self.D2_Ld(qk, qk1)
        
            eps = self.tol
        
            q_eps = symmetry.apply_space(eps, t_k1, qk1)
            omega = (q_eps - qk1) / eps
        
            t_eps = symmetry.apply_time(eps, t_k1, qk1)
            tau = (t_eps - t_k1) / eps

            charge = (p * omega).sum()
        
            if tau != 0.0:
                E = self.discrete_energy(qk, qk1)
                charge = charge - E * tau
        
            return charge
    
        self.noether_charges[name] = _charge
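
To make the time term concrete, here is a minimal torch-free sketch of the same finite-difference idea. The helper names (`time_generator`, `charge_with_time`) and the numeric values are my own assumptions for illustration and are not part of the library; the sign convention (momentum term minus energy times the time generator) is copied from the code above.

```python
import math

def time_generator(time_transform, t, eps=1e-6):
    """Finite-difference estimate of tau = d/d(eps) of the transformed time at eps = 0."""
    return (time_transform(eps, t) - t) / eps

def charge_with_time(p, omega, energy, tau):
    """Noether charge p . omega - E * tau (same sign convention as the code above)."""
    spatial = sum(p_i * w_i for p_i, w_i in zip(p, omega))
    return spatial - energy * tau

def translate(eps, t):
    # Time translation: t -> t + eps
    return t + eps

ALPHA = 1.5  # hypothetical scaling exponent

def scale(eps, t):
    # Time scaling: t -> exp(ALPHA * eps) * t
    return math.exp(ALPHA * eps) * t

tau_translate = time_generator(translate, t=2.0)  # approximately 1
tau_scale = time_generator(scale, t=2.0)          # approximately ALPHA * t
print(tau_translate, tau_scale)

# Pure time translation: omega = 0 and tau = 1, so the charge is just -E.
q_energy = charge_with_time(p=(0.3, -0.1), omega=(0.0, 0.0), energy=-0.68, tau=1.0)
print(q_energy)
```

For a pure time translation the spatial generator vanishes, so the charge collapses to the negative of the energy, which matches the sign of the "energy" values reported in the Kepler run.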

Example

We define

class Kepler(VariationalSystem):
    def control_plane(self):
        return {
            "r": Rn(2)
        }
        
    def params(self):
        return ["mass", "mu"]

    def lagrangian(self, ctrl, dctrl):
        r = ctrl.r
        rdot = dctrl.r
        m = self.params.mass
        mu = self.params.mu
            
        r_norm = torch.sqrt((r * r).sum() + 1e-10)
        T = 0.5 * m * (rdot * rdot).sum()
        V = -mu / r_norm
        return T - V

We run it with

import math

import torch
import tqdm

if __name__ == "__main__":
    kepler = Kepler({
        "mass": 1.0,
        "mu": 1.0
    })

    h = 0.01
    recorder = StepRecorder()
    integrator = VariationalIntegrator(kepler, step_size=h, on_step=recorder.on_step)

    # Energy (time translation): (eps, t) -> t + eps
    energy_sym = Symmetry(
        space_transform=lambda eps, t, q: q,
        time_transform=lambda eps, t, q: t + eps
    )
    integrator.register_noether_charge("energy", energy_sym)

    r_slice = kepler.model.layout["r"][1]
    def rotate_r(eps, t, q, sl=r_slice):
        qn = q.clone()
        x, y = q[sl]
        c, s = math.cos(eps), math.sin(eps)
        qn[sl] = torch.tensor([c*x - s*y, s*x + c*y], dtype=q.dtype)
        return qn

    angular_sym = Symmetry(space_transform=rotate_r)
    integrator.register_noether_charge("angular_momentum", angular_sym)

    alpha = 1.5

    def scale_r(eps, t, q, sl=r_slice):
        qn = q.clone()
        qn[sl] = math.exp(eps) * q[sl]
        return qn

    dyn_sim = Symmetry(
        space_transform=scale_r,
        time_transform=lambda eps, t, q: math.exp(alpha * eps) * t
    )

    integrator.register_noether_charge("dynamical_similarity", dyn_sim)

    # Initial conditions for elliptical orbit
    r0 = torch.tensor([1.0, 0.0], dtype=torch.float64)
    v0 = torch.tensor([0.0, 0.8], dtype=torch.float64)
    t0 = torch.tensor([0.0], dtype=torch.float64)

    ctrl0 = AttrObject({"r": r0, "t": t0})
    q0 = kepler.model.pack(ctrl0)

    steps = 500

    ctrl1 = AttrObject({"r": r0 + h * v0, "t": t0 + h})
    q1 = kepler.model.pack(ctrl1)

    qs = [q0.clone(), q1.clone()]
    q_prev, q_curr = q0, q1

    for _ in tqdm.tqdm(range(steps - 2)):
        q_next, ok = integrator.step(q_prev, q_curr)
        qs.append(q_next.clone())
        q_prev, q_curr = q_curr, q_next

    qs = torch.stack(qs, dim=0)

    print("\nKepler Problem:")
    energies = [float(rec["noether_charges"]["energy"]) for rec in recorder.records]
    angular = [float(rec["noether_charges"]["angular_momentum"]) for rec in recorder.records]
    similarity = [float(rec["noether_charges"]["dynamical_similarity"]) for rec in recorder.records]
    
    print("Energy (should be constant):")
    print("  min:", min(energies), "max:", max(energies), "drift:", energies[-1] - energies[0])
    print("Angular momentum (should be constant):")
    print("  min:", min(angular), "max:", max(angular), "drift:", angular[-1] - angular[0])

Ignore the “dynamical similarity” material for now.

We get:

Kepler Problem:
Energy (should be constant):
  min: 0.6735297151115243 max: 0.6868012832352087 drift: -0.0026780225061522334
Angular momentum (should be constant):
  min: 0.8000193606114198 max: 0.8000206253911845 drift: -1.9376809246018922e-07

As expected.

Conclusion

I originally thought this would be a super short post but the tale grew in the telling, plus led me to several additional rabbit holes that I may pursue in future posts (algebraic geometry! self-similarity!). We are still doing “physics” but I will eventually reach “controls”. Thanks for reading.

Changelog

12/8/2025 - Refactored post to move part of Kepler example.

Footnotes

Why not just assume the Lagrangian is constant under translation by time? My intent here is to try and preserve the option of “dynamical similarity”, where time is scaled simultaneously with some other variable, or by some transformation other than . Another thought: could you have some kind of degenerate dynamical similarity without some notion of virtual time?↩︎
One quick note: would correspond to some kind of degenerate symmetry, where . So it shouldn’t happen.↩︎
In code terms, this means we can just add a dummy argument to any existing Lagrangian function and discard it when doing calculations.↩︎
Slight notation discrepancy - this is a different than in the previous section.↩︎
Slight notation discrepancy - this is a different than in the previous section.↩︎
is its own Lie Algebra.↩︎
Please note that doesn’t imply anything about the group structure of the time dimension. It is simply the group inverse.↩︎
I’m not 100% sure this makes sense due to the way the parametrizations work, but I think there may be some general algorithm “extending” a Lagrangian by appending new Lie groups to the manifold.↩︎

Is a Picture Worth a Thousand Words?

Sun, 23 Nov 2025 05:00:00 GMT

Introduction

In a previous essay I considered the problems associated with generating novels. With the introduction of the new Nano-Banana Pro, let’s revisit the same basic information-bottleneck argument with respect to pictures. All arguments are back-of-the-envelope.

Count the Bits

Consider the map:

is Nano-Banana (or any other generative model that produces pictures). You feed it a sequence of words and out pops a picture¹.

Let’s suppose for simplicity that is a function (so a given input produces just one deterministic output) and that is surjective² (for a given picture, there is at least one “text” that maps to it).

How many possible pictures are there?

Let’s assume pictures are 2048 x 2048 pixels³. Then there are 2048 × 2048 ≈ 4.2 million total pixels. At 24 bits/pixel, we have about 100 million bits in a picture.

If the map were a surjective function, the number of inputs must be at least as large as the number of possible outputs. In order to specify a particular pixel array uniquely, we need to come up with a set of words that point to it. If we assume a generous 15 bits per word⁴, and an input must be at least 100 million bits, we therefore need at least 6.7 million words, or roughly 13,000 single-spaced pages of text, to exactly specify one image.
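
The arithmetic is easy to reproduce. Here is a quick sketch; the image size, bit depth, and bits-per-word figure come from the text, while the 500 words per page is my assumption (it is the ratio implied by "140000 words, or 280 pages" later in the post).

```python
# Back-of-the-envelope bit counting for a raw (uncompressed) image.
WIDTH, HEIGHT = 2048, 2048   # image size from the text
BITS_PER_PIXEL = 24          # from the text
BITS_PER_WORD = 15           # the "generous" figure from the text
WORDS_PER_PAGE = 500         # assumption: implied by the page counts in the post

def words_to_specify(total_bits: int, bits_per_word: int = BITS_PER_WORD) -> int:
    """Smallest whole number of words carrying at least total_bits of information."""
    return -(-total_bits // bits_per_word)  # ceiling division

pixels = WIDTH * HEIGHT
raw_bits = pixels * BITS_PER_PIXEL
raw_words = words_to_specify(raw_bits)
raw_pages = raw_words / WORDS_PER_PAGE

print(f"pixels:       {pixels:,}")
print(f"raw bits:     {raw_bits:,}")
print(f"words needed: {raw_words:,}")
print(f"pages:        {raw_pages:,.0f}")
```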

Natural Image Manifold

Obviously, that conclusion is absurd. It’s not actually that hard to generate roughly what you want in Nano-Banana Pro. For example, the picture above I generated with 11 words. The result was reasonable, and close enough to my intent that I included it here.

Most random arrangements of pixels look like static noise. The ‘natural image manifold’ is the (very small) subsection of pixel space containing images recognizable (or at least, of interest) to humans. And the map from text to images is not actually surjective.

How big is the natural image manifold?

State-of-the-art codecs can get 0.3–0.7 bpp before noticeable artifacts show up⁵. Let’s take the middle value. Our 2048 by 2048 image has about 4.2 million pixels. At 0.5 bpp, that’s approximately 2.1 million bits of perceptually relevant information. That’s around 140000 words, or 280 pages.

So we might say that a picture is worth 140000 words.

Semantic Images

In practice, usually a human is looking for an image drawn from a rough equivalence class of images rather than a particular set of pixels.

A prompt like “a robot on a bench” is a whole region of the natural image manifold, which contains millions of possible pictures. You can ask:

which robot?
what exact pose?
surrounded by which tree species?
where is the sun?
what’s the weather?
how high is the camera?
is it a 35mm lens? an 85mm lens?
are there spiderwebs? broken branches?

People care about object identity, relations, scene type, lighting category, viewpoint class, mood, style, etc. How many bits of semantic control over images does a user really need? And given a piece of text, most of the text is redundant in terms of semantic bits. The same choices are reinforced: which objects are present, the style, the lighting, etc.

A toy upper bound might be 50 bits. That’s roughly 2^50 ≈ 10^15 possible categories. And a 50 bit password is already very secure⁶. That’s why prompting can work as well as it does. The gap is filled with non-semantic visual content: whatever the user takes for granted.

So is a picture worth a thousand words? It depends on the picture, the words, and what the viewer cares about.

Art Golf

Consider the following game (“Art Golf”). Take an image or piece of art (especially an abstract image, or unknown work). Without giving the name of the artist, the name of the piece of art, or the name of a particular artistic movement, try to prompt the generative model to produce the original piece of art using only text. Fewer words is better.

Final Thoughts

If you use a prompt of 10 to 50 words (150 to 750 “bits on disk”), and a natural 2K image is roughly 2 million bits, then the model must be filling in 99.9%+ of the visual detail. You can call this “hallucinating”, or “inference”: at the end of the day the human is making a small number of decisions to determine a much larger object.
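
A small sketch of that ratio, assuming the 15 bits per word and the 0.5 bpp codec estimate used earlier in the post (the function name is mine):

```python
BITS_PER_WORD = 15  # figure used earlier in the post

def fraction_filled_in(prompt_words: int, image_bits: float) -> float:
    """Share of the image's perceptual bits that the prompt does NOT specify."""
    prompt_bits = prompt_words * BITS_PER_WORD
    return 1.0 - prompt_bits / image_bits

# 2048 x 2048 image at 0.5 bits per pixel (the codec estimate from the text).
image_bits = 0.5 * 2048 * 2048

for n_words in (10, 50):
    share = fraction_filled_in(n_words, image_bits)
    print(f"{n_words:>3} words -> model fills in {share:.4%} of the bits")
```

Even at the long end of the prompt range, the prompt accounts for well under a tenth of a percent of the bits.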

I don’t think making the model bigger can “fix” this. We might better approximate the natural image distribution (needing fewer words to specify an image), but at the end of the day there’s no way to produce a uniquely specified image unless the prompt contains enough bits to select that image. The human must specify all the necessary bits to identify the required image.

This isn’t limited to AI, but applies in general to all principals and agents. If you commission an artwork or design, you commission a specification and the artist figures out the details. Outsourcing to an external party is only valuable when you don’t want to specify every bit.

While they are both considered “AI”, the problem of intent seems fundamentally different from the problem of converting between words and pictures.

AI Disclosure

All of the images in this post were made using Nano-Banana Pro.

Footnotes

In theory you can also feed a generative model a picture, or information in some other form. You could add sketches, CAD models, structured scene graphs, reference images, etc. But while this does make things much more efficient, we ultimately run into the paradox where we are specifying the image to specify the image.↩︎
Probably not actually realistic. Most images will appear to be noise and will not be determinable in words. This is why we get such vast numbers: you’d have to specify each pixel one-by-one. Note also that we don’t ask for injectivity (one-to-one). If the function was both injective and surjective it would be invertible (we could invert it to take pictures to text). Lack of injectivity allows more than one prompt to map to a single image.↩︎
I looked for the technical specs and didn’t immediately find them. Seems like 2048 by 2048 at 2k resolution. This is an approximation; I don’t think it should affect the argument.↩︎
Classic results by Claude Shannon put an English word somewhere between 10-15 bits. See the bottom of the section here.↩︎
If we did have an injective and invertible Nano-Banana, we could store an image as text and encode/decode it with the model. I would be very interested to see if some variation of Nano-Banana or other Neural Image Compression frameworks can reliably compress images better than modern codecs, without visual artifacts.↩︎
See here. ~50 bits of “real” entropy is enough for a reasonable password. We could extend the number of semantic bits by more without affecting the argument.↩︎

Noether’s Theorem and Geometric Controls

Fri, 21 Nov 2025 05:00:00 GMT

Motivation

What’s the point of geometric integrators? Why bother with the formalism around Lagrangians and Hamiltonians to manage basic control problems or simulations?

In the last post, I worked on the intuition behind geometric controls. This post continues that effort.

Setup

Group Actions

We have a configuration manifold with tangent bundle .

Let be a Lie group. Introduce the smooth map such that

Choose a smooth curve with . For each , the map

is a smooth curve in . We can think of as a curve through , induced by the action of each on .

The derivative of at defines a tangent vector

and hence the map defines a smooth vector field on .

We call this map the “infinitesimal generator” of the curve .

-invariance

Construct the map

Which takes an element of the tangent bundle at and gives the element of the tangent bundle at .

If , we have

then we say that the Lagrangian is -invariant¹.

Noether’s Theorem

Suppose the Lagrangian is -invariant:

Define the –shifted trajectory by

Note that multiplying by is a smooth operator on .

It’s also true that, by construction,

First we will need a quick identity.

Start by differentiating with respect to :

Then differentiate again, with respect to at

Since both and appear only through smooth compositions, the order of differentiation can be interchanged²:

Subbing in the definition of , this becomes

Which is the identity we need.

We are now ready to derive the main result. Differentiate the Lagrangian at and apply the chain rule (with implicit evaluation).

The first term on the RHS we sub in , the second term we sub in using Equation 1:

Since the Lagrangian is –invariant, and is the application of the smooth operator to , then we can rewrite as .

Therefore:

the derivative on the LHS is zero.

We obtain

Substitute in (previously defined in the last post)³.

Now, assume the Euler–Lagrange equations hold. Then

Since we have

Therefore (using the product rule in reverse)

So is a conserved quantity over time!

We define the “Noether charge”

Noether’s (first) theorem says: given some invariant Lagrangian, and some smooth operator (a symmetry) on , is conserved along every solution of the Euler–Lagrange equations.

Example

Let’s compute the Noether charge for a free rotor⁴ under the action .

is just in this case, so the Lagrangian is

We have

We just need

So we have

which is the angular momentum.
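
As a sanity check, here is a small numerical sketch of this example. It assumes the free-rotor Lagrangian is pure kinetic energy (as in the `FreeRotor` class later in the post) and that the symmetry is a shift of the angle; the function names and finite-difference step sizes are my own choices.

```python
def lagrangian(theta, theta_dot, m=1.0, l=1.0):
    """Free rotor: kinetic energy only, so theta itself never appears."""
    return 0.5 * m * l * l * theta_dot * theta_dot

def momentum(theta, theta_dot, m=1.0, l=1.0, h=1e-6):
    """Central-difference estimate of dL/d(theta_dot)."""
    up = lagrangian(theta, theta_dot + h, m, l)
    down = lagrangian(theta, theta_dot - h, m, l)
    return (up - down) / (2.0 * h)

def generator(theta, eps=1e-6):
    """Infinitesimal generator of the shift theta -> theta + eps (approximately 1)."""
    return ((theta + eps) - theta) / eps

def noether_charge(theta, theta_dot, m=1.0, l=1.0):
    return momentum(theta, theta_dot, m, l) * generator(theta)

# Along a free-rotor solution theta(t) = theta0 + omega * t, the charge stays put.
omega = 1.0
charges = [noether_charge(0.8 + omega * t, omega) for t in (0.0, 0.5, 1.0, 2.0)]
print(charges)
```

With unit mass and length the charge equals the angular velocity, and changing `m` or `l` rescales it by `m * l**2`.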

Noether’s with Lie Groups

The proof for Lie groups is essentially the same (I will omit it).

One note: if is a Lie group, symmetries are described directly by group multiplication.

That is, fix an element of the Lie algebra (the generator), and define a perturbed configuration by acting on with a small group element:

Usually this is group multiplication

Recall that in our code the map converts from group representation back to vector.

Discrete Noether

Let’s get the discrete version of Noether’s Theorem.

-invariance is:

The infinitesimal generator is

The discrete momentum:

And the Noether charge is:

How do we compute the infinitesimal generator ?

Before we had

So is given by and the derivative with respect to is approximated by

We can compute in Lie Algebra coordinates

Note that in this reduces to
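
Here is a minimal sketch of the finite-difference generator, using a planar rotation as a hypothetical symmetry (this is not the library code, just the same idea in plain Python). For a rotation acting on a point (x, y), the exact generator is (-y, x), which gives something to compare against.

```python
import math

def rotate(eps, q):
    """Rotate the point q = (x, y) about the origin by angle eps."""
    x, y = q
    c, s = math.cos(eps), math.sin(eps)
    return (c * x - s * y, s * x + c * y)

def generator_fd(action, q, eps=1e-6):
    """Forward-difference estimate of d/d(eps) action(eps, q) at eps = 0."""
    q_eps = action(eps, q)
    return tuple((a - b) / eps for a, b in zip(q_eps, q))

q = (1.0, 2.0)
xi = generator_fd(rotate, q)
print(xi)  # close to (-2.0, 1.0), the exact generator (-y, x)
```

The forward difference has an error proportional to `eps`, so a very small step is not automatically better: below roughly the square root of machine precision, rounding error starts to dominate.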

Code

Let’s start to code. Here’s the FreeRotor:

class FreeRotor(VariationalSystem):
    def control_plane(self):
        return {
            "theta": Rn(1)
        }
    
    def params(self):
        return ["mass", "length"]

    def lagrangian(self, ctrl, dctrl):
        th  = ctrl.theta
        thd = dctrl.theta

        m = self.params.mass
        l = self.params.length

        T = 0.5 * m * l*l * thd*thd
        return T

I swapped theta to Rn(1) due to continued problems with SOn. Will come back to it later⁵.

Let’s add a new method to VariationalIntegrator

class VariationalIntegrator: 
    ...
    def register_noether_charge(self, name:str, symmetry: Symmetry):
        def _new_charge(qk, qk1):
            p = self.D2_Ld(qk, qk1)

            # qk_eps = symmetry.log(symmetry.exp(self.tol * params) * symmetry.exp(qk))
            # omega_qk = (qk_eps - qk)/self.tol

            qk1_eps = symmetry.log(symmetry.exp(self.tol * symmetry.generator) * symmetry.exp(qk1))
            omega_qk1 = (qk1_eps - qk1)/self.tol

            return (p*omega_qk1).sum()
        self.noether_charges[name] = _new_charge 

    def step(self, q_prev: torch.Tensor, q_curr: torch.Tensor):

        q_next = q_curr + (q_curr - q_prev)
        q_next = q_next.clone().detach().requires_grad_(True)
        const_term = self.D2_Ld(q_prev, q_curr).detach()

        success = False

        for _ in range(self.max_iters):
            F = const_term + self.D1_Ld(q_curr, q_next)

            if F.norm().item() < self.tol:
                success = True
                break

            def F_of(x: torch.Tensor) -> torch.Tensor:
                return const_term + self.D1_Ld(q_curr, x)

            J = torch.autograd.functional.jacobian(F_of, q_next)
            delta = torch.linalg.solve(J, F)

            with torch.no_grad():
                q_next -= delta
            q_next.requires_grad_(True)

            if delta.norm().item() < self.tol:
                success = True
                break
        q_prev_det = q_prev.detach()
        q_curr_det = q_curr.detach()
        q_next_det = q_next.detach()

        noether_charge_outputs = {
            name: fn(q_curr_det, q_next_det)
            for name, fn in self.noether_charges.items()
        }

        if self.on_step is not None:
            self.on_step({
                "q_prev": q_prev_det,
                "q_curr": q_curr_det,
                "q_next": q_next_det,
                "noether_charges": noether_charge_outputs,
                "success": success
            })

        return q_next_det, success

This takes a Symmetry and computes the Noether charge. step is also modified to evaluate every registered charge and pass the results to the callback. How does a user provide the Symmetry?

Let’s do

class Symmetry:

    def __init__(self, group: LieGroup, generator: torch.Tensor):
        self.group = group
        self.generator = generator

    def exp(self, v: torch.Tensor):
        return self.group.exp(v)

    def log(self, g):
        return self.group.log(g)

    def shift(self, q: torch.Tensor, eps: float) -> torch.Tensor:
        g_q   = self.group.exp(q)
        g_eps = self.group.exp(eps * self.generator.to(q.device))
        g_new = g_eps * g_q
        return self.group.log(g_new)

We take a Lie group and define a “shift” operation that must give a symmetry.

We must also modify our callback function junk:

class StepRecorder:
    def __init__(self):
        self.records = []

    def on_step(self, record):
        self.records.append(record)

Let’s try with the free rotor:

import torch
import tqdm

if __name__ == "__main__":

    freeRotor = FreeRotor({
        "mass": 1.0,
        "length": 1.0
    })
    
    h = 0.001
    recorder = StepRecorder()
    integrator = VariationalIntegrator(freeRotor, step_size=h, on_step=recorder.on_step)

    theta_group, _ = freeRotor.model.layout["theta"]
    omega = torch.tensor([1.0], dtype=torch.float64)
    rotor_symmetry = Symmetry(group=theta_group, generator=omega)
    integrator.register_noether_charge("angular_momentum", rotor_symmetry)

    theta0 = 0.8
    theta_dot0 = 1
    ctrl0 = AttrObject({"theta": torch.tensor([theta0])})
    q0 = freeRotor.model.pack(ctrl0)

    ctrl1 = AttrObject({"theta": torch.tensor([theta0 + h * theta_dot0])})
    q1 = freeRotor.model.pack(ctrl1)

    steps = 10000
    qs = [q0.clone(), q1.clone()]
    q_prev, q_curr = q0, q1

    for _ in tqdm.tqdm(range(steps - 2)):
        q_next, ok = integrator.step(q_prev, q_curr)
        qs.append(q_next.clone())
        q_prev, q_curr = q_curr, q_next

    qs = torch.stack(qs, dim=0)
    print([record["noether_charges"] for record in recorder.records])

And we see the angular momentum is conserved at 1.0000:

[{'angular_momentum': tensor(1.0000)}, {'angular_momentum': tensor(1.0000)},...]

Conclusion

We’ve successfully implemented Noether’s theorem in our geometric controls library. Next time, I will presumably add in time dependence and look at jet controls, but it is possible we will instead look more deeply at invariants.

Footnotes

We have essentially reduced the Lagrangian to a single variable , and then applied to by “pushing forward” through the tangent mapping.↩︎
Clairaut’s theorem.↩︎
I now realize this wasn’t explicit in the last post, but our fiber derivative , restricted to a particular path (instead of arbitrary ) reduces to . Potentially there is some technical subtlety here I am unclear on. If I look in Marsden & Ratiu they seem to define as .↩︎
I’d do the pendulum but thanks to gravity it’s not really symmetric. The free rotor removes gravity. The versions of the Lagrangian/Hamiltonian formulation and Noether’s theorem I’ve derived already don’t depend on absolute time (so I can’t derive energy). If we add time back in, time-invariance leads to energy conservation (the Hamiltonian is constant across time). Currently we are doing something more similar to Pontryagin control.↩︎
ChatGPT seems to think this is due to some subtle coordinate representation issue. I am somehow implicitly assuming the group is Abelian which is why it breaks for SOn. There might be some Newton solver issues as well (the code is not fully adapted for Lie groups). Ignoring for now.↩︎

Controls from the Geometric Perspective

Sat, 15 Nov 2025 05:00:00 GMT

Introduction

In my post on differential games and stag hunt I mentioned discrete controls. When looking into this topic, I ended up in a rabbit hole around geometric controls. To learn a bit more, I took a look at the paper Discrete Control Systems, by Taeyoung Lee, Melvin Leok, N. Harris McClamroch, but many parts of the exposition diverged from the paper as I proceeded in my investigation.

In particular, I wanted the control problem to motivate the Lagrangian, rather than deriving the Lagrangian in terms of the desired physics.

AI Disclosure: I used ChatGPT to generate a bunch of the LaTeX in this post. Mistakes my own.

Background

Consider some system with configuration space .

is a set, and each element of is one particular configuration. For mechanical systems, would be all possible positions of the system and would be a particular position. might be a particle location in 3D (), the orientation of a rigid body (), or something else.

Suppose the system starts in configuration and we want it to be in . Which path should we take to transform the configuration? We can frame this problem using the Lagrangian. However, first we need to assume a bit more about , namely that it is a smooth manifold equipped with a Riemannian metric and a tangent bundle¹.

Manifold

is a smooth manifold of dimension if around every point we can find a neighborhood and a map such that

that is bijective, infinitely-differentiable, and has a smooth inverse . We call the map a chart, or coordinate system.

Furthermore, we require overlapping charts to agree smoothly. In practical terms, this means we have two differentiable functions:

def to_local_coords(self, q, center):
    raise NotImplementedError
    
def from_local_coords(self, x, center):
    raise NotImplementedError

such that from_local_coords(to_local_coords(q, center), center) = q (at least approximately) and vice-versa.

For a Euclidean space (), is simply the identity.
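
For something less trivial than the identity, here is a sketch of a chart on the unit circle embedded in the plane. The `Circle` class is a hypothetical example (not part of the library); the local coordinate is the signed angle from `center` to `q`, which is valid everywhere except the point antipodal to `center`.

```python
import math

class Circle:
    """Hypothetical chart for the unit circle S^1 embedded in R^2."""

    def to_local_coords(self, q, center):
        # Signed angle from center to q, in (-pi, pi].
        cross = center[0] * q[1] - center[1] * q[0]
        dot = center[0] * q[0] + center[1] * q[1]
        return math.atan2(cross, dot)

    def from_local_coords(self, x, center):
        # Rotate center by the angle x to recover the point on the circle.
        c, s = math.cos(x), math.sin(x)
        return (c * center[0] - s * center[1], s * center[0] + c * center[1])

chart = Circle()
center = (1.0, 0.0)
q = (math.cos(0.3), math.sin(0.3))

x = chart.to_local_coords(q, center)
q_back = chart.from_local_coords(x, center)
print(x, q_back)
```

The round trip recovers `q` up to floating-point error, which is the "at least approximately" caveat in practice.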

Riemannian Metric

To get a Riemannian manifold, there is a second requirement. At each point , we construct the tangent space of all possible velocity vectors at (keep in mind we said was infinitely differentiable). We also have an inner product at each point

To be a Riemannian metric², needs to meet the following four conditions:

(Symmetry)
and (Bilinearity)
for all (Positive Definite)
For every chart , the metric components in coordinates are (Smooth)

Basically, to implement our (now Riemannian) manifold, we will need something like:

class Manifold:
    """Base class for smooth manifolds"""
    
    def metric(self, q, v1, v2):
        raise NotImplementedError
    
    def inner_product(self, q, v1, v2):
        return self.metric(q, v1, v2)
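
To see a metric that actually depends on the base point, here is a sketch of a concrete subclass. `PolarPlane` is a hypothetical example of mine: the plane in polar coordinates, where angular velocities are weighted by the squared radius. The base class is repeated only so that the block runs on its own.

```python
class Manifold:
    """Base class for smooth manifolds (mirrors the sketch above)."""

    def metric(self, q, v1, v2):
        raise NotImplementedError

    def inner_product(self, q, v1, v2):
        return self.metric(q, v1, v2)

class PolarPlane(Manifold):
    """Hypothetical example: the punctured plane with coordinates q = (r, theta)."""

    def metric(self, q, v1, v2):
        r, _theta = q
        # Radial components pair directly; angular components are scaled by r^2.
        return v1[0] * v2[0] + r * r * v1[1] * v2[1]

plane = PolarPlane()
q = (2.0, 0.5)
u, v = (1.0, 0.25), (-0.5, 2.0)

print(plane.inner_product(q, u, v))  # symmetric in u and v
print(plane.inner_product(q, u, u))  # positive for nonzero u
```

The same pair of tangent vectors gives a different inner product at a different radius, which is exactly what distinguishes a Riemannian metric from a single fixed dot product.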

Tangent Bundle

The tangent bundle is defined as the collection of all position and velocity pairs:

Lagrangian

Now that we have established the structure, we can define the Lagrangian.

The Lagrangian is a (smooth) function that assigns a number to each element of the tangent bundle:

The Lagrangian encodes a “cost” for each way of moving through the configuration space. Given any trajectory from to , we can evaluate its goodness by computing the action:

The action assigns a single number to each possible path. Hamilton’s Principle states that the physically realized path is the one where nearby paths have approximately the same action (that path is a critical point in the space of paths).

Euler-Lagrange Equations

Consider a trajectory and a small variation in that trajectory with fixed endpoints. We know that , because the system must still start and end at the desired configurations (we have just varied the path). Thus the varied path is:

for small . The action along the varied path is

We want our trajectory to be a critical point in the space of paths. So

by differentiating under the integral sign and using the chain rule, we have

We can break this up and integrate the second term by parts:

since , the first term is zero. Substituting back we have

Since this is true for any , the integrand is zero everywhere:

The boxed term is the Euler-Lagrange equations.
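
A quick numerical sketch of the "critical point" idea, under assumptions of my own: a free particle in one dimension (Lagrangian equal to kinetic energy), a forward-difference discretization of the action, and a bump-shaped variation that vanishes at both endpoints. For a free particle the Euler-Lagrange solution is a straight line, so the first-order change in the action around it should vanish.

```python
def discrete_action(path, dt, m=1.0):
    """Sum of kinetic energy * dt over the segments (forward-difference velocity)."""
    total = 0.0
    for a, b in zip(path, path[1:]):
        v = (b - a) / dt
        total += 0.5 * m * v * v * dt
    return total

def straight_path(q_start, q_end, n):
    """The free-particle solution: constant velocity between the endpoints."""
    return [q_start + (q_end - q_start) * k / n for k in range(n + 1)]

def vary(path, eps):
    """Add eps * eta, where eta = s * (1 - s) is exactly zero at both endpoints."""
    n = len(path) - 1
    return [q + eps * (k / n) * (1 - k / n) for k, q in enumerate(path)]

N, T = 100, 1.0
dt = T / N
base = straight_path(0.0, 1.0, N)

S0 = discrete_action(base, dt)
eps = 1e-3
S_plus = discrete_action(vary(base, eps), dt)
S_minus = discrete_action(vary(base, -eps), dt)
dS = (S_plus - S_minus) / (2 * eps)  # first-order change: approximately zero

print(S0, dS, S_plus - S0)
```

The derivative in `eps` is zero to rounding error, while the action itself goes up slightly in both directions, so the straight line is a minimum here and not just a critical point.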

Hamiltonian

The action functional (when we integrate over the Lagrangian) is a global principle. It cares about the entire path we have looked at. The Lagrangian itself is the infinitesimal contribution to this global quantity: it evaluates how “expensive” it is to move through an infinitesimal segment of the path.

How does that translate into an actual policy? That is, given the current state, which infinitesimal step should we take next?

The Lagrangian is a function

restricts locally to:

This is a “marginal cost” of moving from with velocity .

Given a , how does change if we alter within the tangent space ? That is, what is the effect of a small change in speed on the marginal cost?

We perturb in the direction , the derivative in that direction at a given and is:

We call this the “fiber derivative”. We may also denote this as

Each tells you how sensitive the cost is to change in .

Consider the canonical evaluation map

such that

is the first-order predicted cost change for a given , which has the same units as . So if we fix , we can say that for any

is the instantaneous “error in cost” (best affine approximation) between the linearization of in local coordinates () and the “actual” instantaneous cost .

The most consistent local description of given is the maximum possible value of this difference over all directions . In other words, given my sensitivity to , and the cost of moving at , which should I move at to obtain to best approximate the true global optimum? In equations, define:

This is the “fiberwise Legendre transform”. The resulting function is called the “Hamiltonian”.
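
Here is a brute-force sketch of the transform for the simplest case I can think of: a one-dimensional quadratic Lagrangian with no dependence on position (my own example). The supremum is taken over a grid of velocities, and the result can be compared against the closed form p²/(2m).

```python
def lagrangian(v, m=2.0):
    """Quadratic cost in the velocity; position is ignored in this toy example."""
    return 0.5 * m * v * v

def hamiltonian(p, m=2.0, v_max=10.0, n=20001):
    """Grid-search estimate of sup over v of (p * v - L(v))."""
    best = float("-inf")
    for k in range(n):
        v = -v_max + 2.0 * v_max * k / (n - 1)
        best = max(best, p * v - lagrangian(v, m))
    return best

p = 3.0
H_numeric = hamiltonian(p)
H_exact = p * p / (2.0 * 2.0)  # p^2 / (2m) with m = 2
print(H_numeric, H_exact)
```

The maximizing velocity is `p / m`, which is the inverse of the fiber derivative for this Lagrangian. A grid search is obviously not how you would do this in practice, but it makes the "sup over velocities" definition literal.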

Hamilton’s Equations

To move between the “velocity picture” and the “covector picture” , we need to ensure that every velocity has a unique covector representing its local rate of change of cost, and vice versa.

We already have the “fiber derivative” from before:

This defines the Legendre map

For the Hamiltonian to exist as a function (rather than a possibly multivalued relation³), must be locally invertible. Lucky for us, the Riemannian metric gives us a canonical way to manage this.

Under the metric isomorphism ⁴, we identify

Because is positive definite and nondegenerate, this map is automatically invertible.

If we assume a unique supremum at , and take our identity:

and differentiate , it gives

hence, matching on the multivariable chain rule:

Thus the evolution of is governed by the first-order flow

These are Hamilton’s equations.

Mechanics on Lie Groups

We should also discuss Lie Groups and Lie Algebras. These become necessary when we are controlling a robot through states more complex than paths (for example, you may want to control the robot’s rotational orientation in space).

Lie Groups

A Lie group is a group that is also a real smooth manifold. To define one, we need two group operations (multiplication and inversion) that are both smooth maps. That is

is smooth.

As a standard example, consider

The group operation is matrix multiplication, the identity is the identity matrix, and given an element of the Lie Group, the inverse is

Let’s say we want to track or control a rotating rigid body. We could use the Euler-Lagrange equations

but what is in this case? is a rotation matrix satisfying and . The derivative must preserve this structure, and we can’t just add or subtract rotation matrices.

For any time-dependent configuration in a Lie group, we always have

Differentiating,

Rearranging,

We define

so that

So is the unique object such that multiplying it by reconstructs the time-derivative of the motion. Therefore, .

Consider now the path: :

At , we have . For any other , we have that . So any is a derivative of a path at the identity.

Hence, , the tangent space at the identity of the Lie Group.
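
A small numerical sketch of this construction for planar rotations. The path (rotation at a constant angular rate) and the helper names are my own assumptions. For rotation matrices the inverse is the transpose, so the velocity pulled back to the identity should come out as a skew-symmetric matrix.

```python
import math

def rot(a):
    """2x2 rotation matrix for angle a."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s], [s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def body_velocity(path, t, h=1e-6):
    """Estimate g^{-1} * dg/dt at time t, with dg/dt by central differences."""
    g = path(t)
    g_plus, g_minus = path(t + h), path(t - h)
    g_dot = [[(g_plus[i][j] - g_minus[i][j]) / (2 * h) for j in range(2)] for i in range(2)]
    return matmul(transpose(g), g_dot)  # g^{-1} = g^T for rotations

rate = 0.7
xi = body_velocity(lambda t: rot(rate * t), t=1.3)
print(xi)  # close to [[0, -rate], [rate, 0]]
```

Note that the result does not depend on `t`: the group element changes along the path, but the pulled-back velocity stays at the same element of the tangent space at the identity.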

Lie Algebras

The tangent space at the identity of a Lie Group is called the Lie Algebra (Lie Algebras are denoted in mathfrak).

It has some nice properties:

It’s a vector space (can add velocities, scale them).
All velocities in the group can be written as for some .

As we determined in the last section, we know that , where . Since the live inside a vector space, if we could rewrite the Euler-Lagrange equations in terms of , we could use the vector space structure to add and scale them! This would solve our problem.

Euler–Poincaré Equation

So, we want to rewrite the Euler-Lagrange equations in terms of .

We know

The Lagrangian is thus

If we assume the Lagrangian is left-invariant⁵, then we have

Call this the “reduced Lagrangian”

Let’s now think of the curves

We will use the same trick we did to derive Euler-Lagrange, where we modify the path by and then find a critical point with respect to the variation.

Define the variation:

The endpoints are fixed, and hence have variation .

So we have

Now define, for each ,

We want: in terms of and .

Start with

Differentiate with respect to ⁶:

Compute using the identity :

Compute :

Substitute both into the expression for :

Now insert and simplify:

So we have

The second term is called the “Lie bracket” and denoted:

So:

The action over the reduced Lagrangian is

Vary it:

Define

is in the dual Lie Algebra , which is linear functionals on ⁷.

Using the constrained variation we have

is linear, so we can break the integral up. Integrating the first term by parts, and using , we find that the first term gives

The second term can be rewritten using the coadjoint operator⁸:

Substituting, the total variation becomes

Since is arbitrary with fixed endpoints, stationarity implies that the functional

vanishes identically (is the zero functional), and hence we have

These are the Euler–Poincaré equations.
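For a concrete instance, take the rotation group SO(3) with the kinetic-energy Lagrangian of a free rigid body. Identifying the Lie algebra with ordinary 3-vectors, the Euler–Poincaré equations reduce to Euler's rigid-body equations, $\dot{\Pi} = \Pi \times \Omega$ with $\Pi = I\Omega$. The sketch below integrates them with a plain RK4 step. The inertia values, step size, and function names are assumptions chosen for illustration.

```python
import math

INERTIA = (1.0, 2.0, 3.0)  # hypothetical principal moments of inertia

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pi_dot(pi):
    """Right-hand side of Euler's equations: dPi/dt = Pi x Omega."""
    omega = tuple(p / i for p, i in zip(pi, INERTIA))
    return cross(pi, omega)

def rk4_step(pi, h):
    def shifted(base, k, scale):
        return tuple(b + scale * ki for b, ki in zip(base, k))
    k1 = pi_dot(pi)
    k2 = pi_dot(shifted(pi, k1, h / 2))
    k3 = pi_dot(shifted(pi, k2, h / 2))
    k4 = pi_dot(shifted(pi, k3, h))
    return tuple(p + h / 6 * (a + 2 * b + 2 * c + d)
                 for p, a, b, c, d in zip(pi, k1, k2, k3, k4))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def energy(pi):
    return 0.5 * sum(p * p / i for p, i in zip(pi, INERTIA))

pi0 = (1.0, 0.2, 0.5)  # initial body angular momentum
pi = pi0
for _ in range(2000):
    pi = rk4_step(pi, h=0.005)

print(norm(pi0), norm(pi))      # the two norms should agree closely
print(energy(pi0), energy(pi))  # so should the two kinetic energies
```

Both the momentum norm and the kinetic energy are conserved by the exact flow. RK4 preserves them only approximately, which is part of the motivation for the variational integrators that appear later in this post.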

Exp and Log

One last note on Lie Groups before we continue.

The “exponential map” lets us move between the Lie Group and the Lie Algebra:

That is, if we solve the equation:

This gives us ⁹

Algebraically we get the properties of the typical exponential:

(for matrix Lie Groups)

The logarithm is the (local) inverse:

where is a neighborhood of the identity.

For matrix groups:

(when )
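To make the two maps concrete, here is a small sketch for 2×2 rotations: a truncated power series for the matrix exponential, compared against the closed-form rotation matrix, with the angle recovered by an arctangent as the logarithm. The function names and the truncation order are assumptions for illustration. A real implementation would use a library routine such as torch.linalg.matrix_exp, as the code later in this post does.

```python
import math

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(X, terms=30):
    """Truncated power series exp(X) = sum_k X^k / k! for a square matrix."""
    n = len(X)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term X^k / k!, starts at the identity
    for k in range(1, terms):
        term = matmul(term, X)
        term = [[v / k for v in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def hat(theta):
    """Map a coordinate theta to a skew-symmetric matrix in so(2)."""
    return [[0.0, -theta], [theta, 0.0]]

def log_so2(R):
    """Recover the rotation angle in (-pi, pi] from a 2x2 rotation matrix."""
    return math.atan2(R[1][0], R[0][0])

theta = 0.8
R = expm_series(hat(theta))
print(R)           # close to [[cos 0.8, -sin 0.8], [sin 0.8, cos 0.8]]
print(log_so2(R))  # close to 0.8
```

The logarithm here is only a local inverse: angles that differ by a full turn exponentiate to the same rotation, so log_so2 always returns the representative in (-pi, pi].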

Discrete Mode

We have now derived all the relevant geometric concepts. We need to adapt to a discrete setting. Instead of the Lagrangian acting on , it acts on , where . That is, we use the positions at rather than velocities¹⁰.

Instead of the action integral, we have an action sum:

where is the discrete Lagrangian.

Instead of the Euler-Lagrange equations, we have the discrete Euler-Lagrange equations¹¹

The discrete Hamilton’s equations

and the discrete Euler–Poincaré equations:

The are partial derivatives. The various ’s in this last equation are (not more Lagrangians) and ’s are tangent maps. We will break it down in the code section.
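For a system where everything can be done by hand, consider the harmonic oscillator with $L = \tfrac12 m \dot q^2 - \tfrac12 k q^2$ and a midpoint discrete Lagrangian. The discrete Euler-Lagrange equations are then linear in the next position and can be solved in closed form. The sketch below iterates that update. The parameter values and names are assumptions for illustration.

```python
def oscillator_del_step(q_prev, q_curr, h, m=1.0, k=1.0):
    """Solve D2 L_d(q_prev, q_curr) + D1 L_d(q_curr, q_next) = 0 for q_next.

    With the midpoint discrete Lagrangian
        L_d(q0, q1) = h * (0.5*m*((q1 - q0)/h)**2 - 0.5*k*((q0 + q1)/2)**2)
    the equation is linear in q_next and rearranges to the update below.
    """
    alpha = m / h
    beta = h * k / 4.0
    return 2.0 * (alpha - beta) / (alpha + beta) * q_curr - q_prev

h = 0.01
qs = [1.0, 1.0]  # two initial positions, i.e. (approximately) zero initial velocity
for _ in range(10_000):
    qs.append(oscillator_del_step(qs[-2], qs[-1], h))

print(max(abs(q) for q in qs))  # stays close to the initial amplitude of 1
```

Because the coefficient multiplying q_curr has magnitude below 2, the recurrence is a pure rotation in disguise: the amplitude neither grows nor decays, which is the discrete counterpart of energy conservation.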

Code

Let’s try to build some software to see these methods in action.

Existing Libraries

Before we start, let’s look at how some existing libraries architect similar concepts. ChatGPT gave some examples.

In trep you define a System object, then add frames, forces, etc. and run a variational integrator (for example MidpointVI).

Code is vaguely like:

import trep
from trep import rx, ty, tz

dt = 0.01  # time step (assumed value)

system = trep.System()
frames = [
    ty(3), # Provided as an angle reference
    rx("theta"), [tz(-3, mass=1)]
    ]
system.import_frames(frames)
trep.potentials.Gravity(system, (0, 0, -9.8))
trep.forces.Damping(system, 1.2)
q0 = (0.23,)
q1 = (0.24,)
mvi = trep.MidpointVI(system)
mvi.initialize_from_configs(0.0, q0, dt, q1)

# then run the main loop

I like that systems are separated from the integrators, but I find the way systems are defined to be somewhat unintuitive, and the API seems to involve a lot of “variable” names, so it doesn’t seem very extensible.

manif and kornia are references for designing a Lie Group API¹².

The minimal operations are the same between the two. manif has right, left, plus and minus operators for perturbations on the tangent space. kornia has the added advantage of differentiable operations. Both have Jacobians (though implemented differently) and “hat” and “vee” operators.

A third reference is crocoddyl and, by extension, Pinocchio. I looked at this library for the differential games post as well.

Implementation

Let’s take a (brief) look at the implementation.

Lie Group Library

First, I built a very simple Lie Group library. I won’t belabor the explanations here as this isn’t really the focus of this post, and there are numerous Lie Group libraries you can find elsewhere.

Elements

Each Lie Group is made up of LieGroupElements. We define a few different types. Each element must implement all of the typical Lie Group operations.

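from abc import ABC, abstractmethod
from typing import Callable, List, Optional, Tuple

import torch

DEFAULT_EPS = 1e-6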
DEFAULT_SKEW_TOLERANCE = 1e-5

class LieGroupElement(ABC):
    
    def __init__(self, group: 'LieGroup'):
        self.group = group
    
    @property
    @abstractmethod
    def tensor(self) -> torch.Tensor:
        pass
    
    def __mul__(self, other: 'LieGroupElement') -> 'LieGroupElement':
        if not isinstance(other, LieGroupElement):
            raise TypeError(f"Cannot multiply with {type(other)}")
        return self.group.compose(self, other)
    
    def __matmul__(self, point: torch.Tensor) -> torch.Tensor:
        return self.group.action(self, point)
    
    def inverse(self) -> 'LieGroupElement':
        return self.group.inverse(self)
    
    def log(self) -> torch.Tensor:
        return self.group.log(self)
    
    @abstractmethod
    def __repr__(self) -> str:
        pass

class TensorElement(LieGroupElement):
    def __init__(self, group: 'LieGroup', data: torch.Tensor):
        super().__init__(group)
        self._data = data
    
    @property
    def tensor(self) -> torch.Tensor:
        return self._data
    
    def __repr__(self) -> str:
        return f"Element(type={self.group},shape={self._data.shape})"


class ProductElement(LieGroupElement):
    
    def __init__(self, group: 'Product', components: Tuple[LieGroupElement, ...]):
        super().__init__(group)
        self.components = components
        
        if len(components) != len(group.factors):
            raise ValueError(
                f"Expected {len(group.factors)} components, got {len(components)}"
            )
    
    @property
    def tensor(self) -> torch.Tensor:
        tensors = []
        for elem in self.components:
            t = elem.tensor
            batch_shape = t.shape[:-len(elem.group._element_shape())]
            tensors.append(t.reshape(*batch_shape, -1))
        return torch.cat(tensors, dim=-1)
    
    def __getitem__(self, index: int) -> LieGroupElement:
        return self.components[index]
    
    def __repr__(self) -> str:
        return f"Element(type={self.group},components={len(self.components)})"

Lie Group Abstraction

Next we have the high-level group structure:

class LieGroup(ABC):
    
    @property
    @abstractmethod
    def dim(self) -> int:
        pass
    
    @abstractmethod
    def _identity_impl(self, batch_shape: Tuple[int, ...]) -> LieGroupElement:
        pass
    
    @abstractmethod
    def _compose_impl(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        pass
    
    @abstractmethod
    def _inverse_impl(self, g: LieGroupElement) -> LieGroupElement:
        pass
    
    @abstractmethod
    def _exp_impl(self, omega: torch.Tensor) -> LieGroupElement:
        pass
    
    @abstractmethod
    def _log_impl(self, g: LieGroupElement) -> torch.Tensor:
        pass
    
    # Public API
    def identity(self, batch_shape: Tuple[int, ...] = ()) -> LieGroupElement:
        return self._identity_impl(batch_shape)
    
    def compose(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        return self._compose_impl(g, h)
    
    def inverse(self, g: LieGroupElement) -> LieGroupElement:
        return self._inverse_impl(g)
    
    def exp(self, omega: torch.Tensor) -> LieGroupElement:
        return self._exp_impl(omega)
    
    def log(self, g: LieGroupElement) -> torch.Tensor:
        return self._log_impl(g)

    def hat(self, omega_coords: torch.Tensor) -> torch.Tensor:
        # Coordinates to "natural form" of tensor (for readability/debugging)
        # Identity by default
        return omega_coords
    
    def vee(self, omega_natural: torch.Tensor) -> torch.Tensor:
        # "Natural form" to coordinates of tensor (for readability/debugging)
        # Identity by default
        return omega_natural

    # Optional operations
    def random(self, batch_shape: Tuple[int, ...] = ()) -> LieGroupElement:
        omega = torch.randn(*batch_shape, self.dim)
        return self.exp(omega)
    
    def action(self, g: LieGroupElement, point: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError(
            f"{self.__class__.__name__} does not implement action"
        )
    
    # Derived operations
    def rplus(self, g: LieGroupElement, omega: torch.Tensor) -> LieGroupElement:
        return self.compose(g, self.exp(omega))
    
    def rminus(self, g: LieGroupElement, h: LieGroupElement) -> torch.Tensor:
        return self.log(self.compose(g.inverse(), h))
    
    def lplus(self, g: LieGroupElement, omega: torch.Tensor) -> LieGroupElement:
        return self.compose(self.exp(omega), g)
    
    def lminus(self, g: LieGroupElement, h: LieGroupElement) -> torch.Tensor:
        return self.log(self.compose(g, h.inverse()))
    
    def adjoint(self, g: LieGroupElement) -> torch.Tensor:
        # Override for analytical formula (much faster)
        batch_shape = g.tensor.shape[:-len(self._element_shape())]
        Ad = torch.zeros(*batch_shape, self.dim, self.dim,
                        dtype=g.tensor.dtype, device=g.tensor.device)
        
        g_inv = g.inverse()
        
        for i in range(self.dim):
            omega = torch.zeros(*batch_shape, self.dim,
                              dtype=g.tensor.dtype, device=g.tensor.device)
            omega[..., i] = DEFAULT_EPS
            
            perturbed = g * self.exp(omega) * g_inv
            Ad[..., :, i] = self.log(perturbed) / DEFAULT_EPS
        
        return Ad
    
    # Utilities
    def _element_shape(self) -> Tuple[int, ...]:
        return self.identity().tensor.shape
    
    def __call__(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        return self.compose(g, h)
    
    @property
    def is_compact(self) -> bool:
        return False
    
    @abstractmethod
    def __repr__(self) -> str:
        pass

Specific Lie Groups

Here are a few specific Lie Groups. Note that we also implement $\mathbb{R}^n$ as a Lie Group. Its operations are ordinary vector addition (for composition) and multiplication by $-1$ (for inversion).

class SOn(LieGroup):
    def __init__(self, n: int):
        if n < 2:
            raise ValueError(f"SO(n) requires n >= 2, got {n}")
        self._n = n
    
    @property
    def dim(self) -> int:
        return self._n * (self._n - 1) // 2
    
    def _identity_impl(self, batch_shape: Tuple[int, ...]) -> LieGroupElement:
        I = torch.eye(self._n).expand(*batch_shape, self._n, self._n).clone()
        return TensorElement(self, I)
    
    def _compose_impl(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        result = torch.matmul(g.tensor, h.tensor)
        return TensorElement(self, result)
    
    def _inverse_impl(self, g: LieGroupElement) -> LieGroupElement:
        result = g.tensor.transpose(-2, -1)
        return TensorElement(self, result)
    
    def _exp_impl(self, omega: torch.Tensor) -> LieGroupElement:
        omega = self.hat(omega)
        if torch.is_grad_enabled():
            skew_error = torch.norm(omega + omega.transpose(-2, -1))
            if skew_error > DEFAULT_SKEW_TOLERANCE:
                print("exp() expects a skew-symmetric matrix")
        
        R = torch.linalg.matrix_exp(omega)
        return TensorElement(self, R)

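    def _log_impl(self, g: LieGroupElement) -> torch.Tensor:
        # First-order approximation: take the skew-symmetric part of R.
        # Only accurate near the identity (for SO(2) it returns sin(theta), not theta).
        R = g.tensor
        skew = 0.5 * (R - R.transpose(-2, -1))
        return self.vee(skew)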

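    def random(self, batch_shape: Tuple[int, ...] = ()) -> LieGroupElement:
        # Uniform (Haar) sampling via QR decomposition
        A = torch.randn(*batch_shape, self._n, self._n)
        Q, R = torch.linalg.qr(A)

        # Fix the sign ambiguity of QR so that Q is Haar-distributed on O(n)
        signs = torch.sign(torch.diagonal(R, dim1=-2, dim2=-1))
        signs = torch.where(signs == 0, torch.ones_like(signs), signs)
        Q = Q * signs.unsqueeze(-2)

        # O(n) also contains reflections (det = -1); flip one column to land in SO(n)
        det_sign = torch.sign(torch.linalg.det(Q))
        Q = torch.cat([Q[..., :, :1] * det_sign[..., None, None], Q[..., :, 1:]], dim=-1)

        return TensorElement(self, Q)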
    
    def action(self, g: LieGroupElement, point: torch.Tensor) -> torch.Tensor:
        # Rotate vectors
        return torch.matmul(g.tensor, point.unsqueeze(-1)).squeeze(-1)
    
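    ## Inherit this for now
    # def adjoint(self, g: LieGroupElement) -> torch.Tensor:
    #     # For SO(3) with the standard hat map, the adjoint is the rotation matrix itself
    #     # (for SO(2) it is the 1x1 identity). Check against the hat ordering used below.
    #     return g.tensor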

    def hat(self, omega: torch.Tensor) -> torch.Tensor:
        if omega.shape[-1] != self.dim:
            return omega
        n = self._n
        if omega.ndim >= 2 and omega.shape[-2:] == (n, n):
            return omega
        batch = omega.shape[:-1]
        skew = torch.zeros(*batch, n, n, device=omega.device, dtype=omega.dtype)
        idx = 0
        for i in range(n):
            for j in range(i + 1, n):
                skew[..., i, j] = omega[..., idx]
                skew[..., j, i] = -omega[..., idx]
                idx += 1
        return skew 
    
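    def vee(self, omega: torch.Tensor) -> torch.Tensor:
        n = self._n
        # Anything that is not a trailing (n, n) matrix is assumed to be coordinates already.
        # (Checking only shape[-1] == dim would misread 3x3 matrices for SO(3), where dim == n.)
        if omega.ndim < 2 or omega.shape[-2:] != (n, n):
            return omega
        coords = []
        for i in range(n):
            for j in range(i + 1, n):
                coords.append(omega[..., i, j])
        return torch.stack(coords, dim=-1)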
    
    @property
    def is_compact(self) -> bool:
        return True
    
    def __repr__(self) -> str:
        return f"SO({self._n})"

class Rn(LieGroup):
    
    def __init__(self, n: int):
        if n < 1:
            raise ValueError(f"R^n requires n >= 1, got {n}")
        self._n = n
    
    @property
    def dim(self) -> int:
        return self._n
    
    def _identity_impl(self, batch_shape: Tuple[int, ...]) -> LieGroupElement:
        zeros = torch.zeros(*batch_shape, self._n)
        return TensorElement(self, zeros)
    
    def _compose_impl(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        result = g.tensor + h.tensor
        return TensorElement(self, result)
    
    def _inverse_impl(self, g: LieGroupElement) -> LieGroupElement:
        return TensorElement(self, -g.tensor)
    
    def _exp_impl(self, omega: torch.Tensor) -> LieGroupElement:
        return TensorElement(self, omega)
    
    def _log_impl(self, g: LieGroupElement) -> torch.Tensor:
        return g.tensor
    
    def action(self, g: LieGroupElement, point: torch.Tensor) -> torch.Tensor:
        # Translation
        return point + g.tensor
    
    def adjoint(self, g: LieGroupElement) -> torch.Tensor:
        batch_shape = g.tensor.shape[:-1]
        return torch.eye(self._n, dtype=g.tensor.dtype, device=g.tensor.device).expand(
            *batch_shape, self._n, self._n
        ).clone()
    
    def __repr__(self) -> str:
        return f"R^{self._n}"
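The hat and vee maps above are easiest to understand on a small case. The following dependency-free sketch mirrors the same index convention (upper-triangular entries in row-major order) using nested lists instead of tensors. The function names hat_son and vee_son are hypothetical and exist only for this illustration.

```python
def hat_son(coords, n):
    """Coordinates (length n*(n-1)/2) -> n x n skew-symmetric matrix."""
    expected = n * (n - 1) // 2
    if len(coords) != expected:
        raise ValueError(f"expected {expected} coordinates, got {len(coords)}")
    skew = [[0.0] * n for _ in range(n)]
    idx = 0
    for i in range(n):
        for j in range(i + 1, n):
            skew[i][j] = coords[idx]
            skew[j][i] = -coords[idx]
            idx += 1
    return skew

def vee_son(skew):
    """n x n skew-symmetric matrix -> coordinates (inverse of hat_son)."""
    n = len(skew)
    return [skew[i][j] for i in range(n) for j in range(i + 1, n)]

omega = [0.1, -0.2, 0.3]
X = hat_son(omega, 3)
print(X)           # [[0.0, 0.1, -0.2], [-0.1, 0.0, 0.3], [0.2, -0.3, 0.0]]
print(vee_son(X))  # [0.1, -0.2, 0.3]
```

Note that this ordering differs from the common robotics convention for so(3), where the three coordinates are rotation rates about the x, y, and z axes. The choice only matters when comparing against other libraries.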

Products

We also implement products and semidirect products.

class Product(LieGroup):
    
    def __init__(self, factors: List[LieGroup]):
        if len(factors) < 2:
            raise ValueError("Product requires at least 2 groups")
        self.factors = factors
    
    @property
    def dim(self) -> int:
        return sum(g.dim for g in self.factors)
    
    def _identity_impl(self, batch_shape: Tuple[int, ...]) -> LieGroupElement:
        components = tuple(g.identity(batch_shape) for g in self.factors)
        return ProductElement(self, components)
    
    def _compose_impl(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        h_prod = h if isinstance(h, ProductElement) else self._to_product(h)
        
        components = tuple(
            g_comp * h_comp
            for g_comp, h_comp in zip(g_prod.components, h_prod.components)
        )
        return ProductElement(self, components)
    
    def _inverse_impl(self, g: LieGroupElement) -> LieGroupElement:
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        components = tuple(comp.inverse() for comp in g_prod.components)
        return ProductElement(self, components)
    
    def _exp_impl(self, omega: torch.Tensor) -> LieGroupElement:
        omega_split = self._split_tangent(omega)
        components = tuple(
            self.factors[i].exp(omega_split[i])
            for i in range(len(self.factors))
        )
        return ProductElement(self, components)
    
    def _log_impl(self, g: LieGroupElement) -> torch.Tensor:
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        logs = [factor.log(comp) 
            for factor, comp in zip(self.factors, g_prod.components)]
        return torch.cat(logs, dim=-1)
    
    def random(self, batch_shape: Tuple[int, ...] = ()) -> LieGroupElement:
        components = tuple(g.random(batch_shape) for g in self.factors)
        return ProductElement(self, components)
    
    def action(self, g: LieGroupElement, point: torch.Tensor) -> torch.Tensor:
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        point_split = self._split_tangent(point)
        
        results = [
            self.factors[i].action(g_prod.components[i], point_split[i])
            for i in range(len(self.factors))
        ]
        
        batch_shape = point.shape[:-1]
        flat_results = [r.reshape(*batch_shape, -1) for r in results]
        return torch.cat(flat_results, dim=-1)
    
    def adjoint(self, g: LieGroupElement) -> torch.Tensor:
        # Block diagonal adjoint
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        batch_shape = g.tensor.shape[:-1]
        
        Ad = torch.zeros(*batch_shape, self.dim, self.dim,
                        dtype=g.tensor.dtype, device=g.tensor.device)
        
        offset = 0
        for i, (comp, factor) in enumerate(zip(g_prod.components, self.factors)):
            dim_i = factor.dim
            Ad[..., offset:offset+dim_i, offset:offset+dim_i] = factor.adjoint(comp)
            offset += dim_i
        
        return Ad
    
    def _split_tangent(self, omega: torch.Tensor) -> List[torch.Tensor]:
        # Split tangent vector into components
        dims = [g.dim for g in self.factors]
        # Running offsets as plain Python ints, e.g. dims [1, 2] -> offsets [0, 1, 3]
        offsets = [0]
        for d in dims:
            offsets.append(offsets[-1] + d)
        return [omega[..., offsets[i]:offsets[i+1]] for i in range(len(dims))]
    
    def _to_product(self, g: LieGroupElement) -> ProductElement:
        # Convert generic element to ProductElement if needed
        if isinstance(g, ProductElement):
            return g
        raise TypeError(f"Expected ProductElement, got {type(g)}")
    
    def __repr__(self) -> str:
        return " × ".join(repr(g) for g in self.factors)

class SemidirectProduct(LieGroup):

    def __init__(self, normal: LieGroup, actor: LieGroup):
        self.normal = normal
        self.actor = actor
        # Keep the same (actor, normal) ordering that the element components use
        self.factors = [actor, normal]
    
    @property
    def dim(self) -> int:
        return self.normal.dim + self.actor.dim
    
    def _identity_impl(self, batch_shape: Tuple[int, ...]) -> LieGroupElement:
        components = (self.actor.identity(batch_shape), self.normal.identity(batch_shape))
        return ProductElement(self, components)
    
    def _compose_impl(self, g: LieGroupElement, h: LieGroupElement) -> LieGroupElement:
        """Semidirect product composition with action!"""
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        h_prod = h if isinstance(h, ProductElement) else self._to_product(h)
        
        g_actor, g_normal = g_prod.components
        h_actor, h_normal = h_prod.components
        
        result_actor = g_actor * h_actor
        result_normal = g_normal * TensorElement(
            self.normal,
            self.actor.action(g_actor, h_normal.tensor)
        )
        
        return ProductElement(self, (result_actor, result_normal))
    
    def _inverse_impl(self, g: LieGroupElement) -> LieGroupElement:
        """Semidirect product inverse"""
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        g_actor, g_normal = g_prod.components
        
        g_actor_inv = g_actor.inverse()
        g_normal_inv = TensorElement(
            self.normal,
            self.actor.action(g_actor_inv, -g_normal.tensor)
        )
        
        return ProductElement(self, (g_actor_inv, g_normal_inv))
    
    def _exp_impl(self, omega: torch.Tensor) -> LieGroupElement:
        """Componentwise exponential (a retraction that uses the product structure,
        not the true exponential map of the semidirect product)"""
        omega_actor = omega[..., :self.actor.dim]
        omega_normal = omega[..., self.actor.dim:]
        
        components = (
            self.actor.exp(omega_actor),
            self.normal.exp(omega_normal)
        )
        return ProductElement(self, components)
    
    def _log_impl(self, g: LieGroupElement) -> torch.Tensor:
        """Logarithm (using product structure)"""
        g_prod = g if isinstance(g, ProductElement) else self._to_product(g)
        g_actor, g_normal = g_prod.components
    
        log_actor = self.actor.log(g_actor).flatten()
        log_normal = self.normal.log(g_normal).flatten()

        return torch.cat([log_actor, log_normal], dim=-1)
    
    def random(self, batch_shape: Tuple[int, ...] = ()) -> LieGroupElement:
        components = (self.actor.random(batch_shape), self.normal.random(batch_shape))
        return ProductElement(self, components)
    
    def _to_product(self, g: LieGroupElement) -> ProductElement:
        if isinstance(g, ProductElement):
            return g
        raise TypeError(f"Expected ProductElement, got {type(g)}")
    
    def __repr__(self) -> str:
        return f"{self.normal} ⋊ {self.actor}"

As a bonus, here’s an example group implemented via semidirect product.

def SE(n: int) -> SemidirectProduct:
    return SemidirectProduct(Rn(n), SOn(n))
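To see the semidirect-product rules in isolation, here is a plain-Python sketch of SE(2) where an element is a pair (angle, translation). Composition rotates the second translation by the first rotation before adding, and the inverse undoes both. The names se2_compose and se2_inverse are hypothetical, and representing the rotation by a bare angle is an assumption made to avoid any tensor dependency.

```python
import math

def rotate(theta, v):
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def se2_compose(g, h):
    """(R1, t1) * (R2, t2) = (R1 R2, t1 + R1 t2)."""
    (th1, t1), (th2, t2) = g, h
    r = rotate(th1, t2)
    return (th1 + th2, (t1[0] + r[0], t1[1] + r[1]))

def se2_inverse(g):
    """(R, t)^{-1} = (R^{-1}, -R^{-1} t)."""
    th, t = g
    r = rotate(-th, t)
    return (-th, (-r[0], -r[1]))

g = (math.pi / 2, (1.0, 0.0))  # quarter turn combined with a shift along x
h = (0.3, (0.0, 2.0))

print(se2_compose(g, h))               # rotation 0.3 + pi/2, translation near (-1, 0)
print(se2_compose(g, se2_inverse(g)))  # near the identity (0, (0, 0))
print(se2_compose(h, g) == se2_compose(g, h))  # False: the group is not commutative
```

These are the same two formulas that _compose_impl and _inverse_impl implement above, with the rotation acting on the translation part.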

Basic Geometric Controls

Now we need to implement the actual control code. Let’s review what we mean when we say “controls” here, as we are still interpreting the Lagrangian in a slightly unusual way, and our picture here may seem “backwards” from the typical picture.

The Lagrangian, in our interpretation, is the “value/cost density” for each infinitesimal path segment that our system traverses from (start position) to (end position). We control the system by defining which value/cost density we wish the system to maximize/minimize (depending on the sign). This corresponds to an equivalent “policy” (Hamiltonian). In mechanics, the cost/value density is the energy functional (usually , the difference between the kinetic and potential energies).

Control Plane Helpers

Let’s look at our “control plane” helper functions. What we will do is define a few abstract variables (to control). These functions track those variables and pack/unpack them into state tensors.

from dataclasses import dataclass

@dataclass(frozen=True)
class StateHandle:
    name: str
    group: LieGroup

class AttrObject:
    def __init__(self, mapping):
        for k, v in mapping.items():
            setattr(self, k, v)

class DiscreteModel:
    def __init__(self, control_plane):
        self.vars = control_plane

        offset = 0
        layout = {}
        for name, group in self.vars.items():
            d = group.dim
            layout[name] = (group, slice(offset, offset+d))
            offset += d

        self.layout = layout
        self.dim = offset

    def unpack(self, q):
        out = {}
        for name, (_, sl) in self.layout.items():
            out[name] = q[sl]
        return AttrObject(out)

    def pack(self, ctrl):
        parts = []
        for name, (_, _) in self.layout.items():
            parts.append(getattr(ctrl, name).reshape(-1))
        return torch.cat(parts)
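The bookkeeping here is just a table of named slices into one flat state vector. A stripped-down sketch with plain lists (no tensors or groups) shows the idea. The variable names and dimensions below are made up for illustration.

```python
def build_layout(dims):
    """Assign each named variable a contiguous slice of the flat state."""
    layout, offset = {}, 0
    for name, d in dims.items():
        layout[name] = slice(offset, offset + d)
        offset += d
    return layout, offset

def unpack(layout, q):
    return {name: q[sl] for name, sl in layout.items()}

def pack(layout, values):
    flat = []
    for name in layout:  # dicts preserve insertion order, so this matches unpack
        flat.extend(values[name])
    return flat

layout, total = build_layout({"theta": 1, "position": 3})
q = [0.8, 1.0, 2.0, 3.0]
parts = unpack(layout, q)
print(total)                # 4
print(parts)                # {'theta': [0.8], 'position': [1.0, 2.0, 3.0]}
print(pack(layout, parts))  # [0.8, 1.0, 2.0, 3.0]
```

Keeping the whole state in one flat vector is what lets the solver later treat the system as a single root-finding problem, regardless of how many named variables it has.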

Lagrangian

Let’s look at the actual Lagrangian “control plane” abstraction. The discrete Lagrangian that we will actually use is implemented “under the hood” as a method of the VariationalSystem class. To define a Lagrangian system, a user needs to inherit from this class, define their params and control plane variables, and implement the Lagrangian.

class VariationalSystem:
    def __init__(self, param_values):
        
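        # Build params object. Note: this instance attribute shadows the params()
        # method from here on, so params() must be called before the assignment.
        names = self.params()
        self.params = AttrObject({name: param_values[name] for name in names})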

        # Build model
        cp = self.control_plane()
        self.model = DiscreteModel(cp)

    def discrete_lagrangian(self, qk, qk1, h):
        mid = 0.5 * (qk + qk1)
        vel = (qk1 - qk) / h
        ctrl = self.model.unpack(mid)
        dctrl = self.model.unpack(vel)
        return h * self.lagrangian(ctrl, dctrl)

    def control_plane(self): raise NotImplementedError
    def params(self): raise NotImplementedError
    def lagrangian(self, ctrl, dctrl): raise NotImplementedError

Solvers

When the system definition is in place, we can subsequently run our VariationalIntegrator to actually solve the system.

class VariationalIntegrator:

    def __init__(self, system: VariationalSystem, step_size: float,
                 max_iters: int = 25, tol: float = 1e-10, on_step: Optional[Callable] = None):
        self.system = system
        self.h = float(step_size)
        self.max_iters = max_iters
        self.tol = tol
        self.on_step = on_step

    def D1_Ld(self, qk: torch.Tensor, qk1: torch.Tensor) -> torch.Tensor:
        qk_var = qk.clone().detach().requires_grad_(True)
        Ld = self.system.discrete_lagrangian(qk_var, qk1, self.h)
        grad_qk, = torch.autograd.grad(Ld, qk_var, create_graph=True)
        return grad_qk

    def D2_Ld(self, qk: torch.Tensor, qk1: torch.Tensor) -> torch.Tensor:
        qk_var = qk.clone().detach().requires_grad_(True)
        qk1_var = qk1.clone().detach().requires_grad_(True)
        Ld = self.system.discrete_lagrangian(qk_var, qk1_var, self.h)
        grad_qk1, = torch.autograd.grad(Ld, qk1_var, create_graph=False)
        return grad_qk1.detach()

    def step(self, q_prev: torch.Tensor, q_curr: torch.Tensor):
        # Initial guess: linear extrapolation from the previous two configurations
        q_next = q_curr + (q_curr - q_prev)
        q_next = q_next.clone().detach().requires_grad_(True)
        # D2 L_d(q_prev, q_curr) does not depend on q_next, so compute it once
        const_term = self.D2_Ld(q_prev, q_curr).detach()

        def F_of(x: torch.Tensor) -> torch.Tensor:
            # Residual of the discrete Euler-Lagrange equation
            return const_term + self.D1_Ld(q_curr, x)

        converged = False
        for _ in range(self.max_iters):
            F = F_of(q_next)

            if F.norm().item() < self.tol:
                converged = True
                break

            # Newton update: solve J * delta = F, then step against it
            J = torch.autograd.functional.jacobian(F_of, q_next)
            delta = torch.linalg.solve(J, F)

            with torch.no_grad():
                q_next = q_next - delta
            q_next.requires_grad_(True)

            if delta.norm().item() < self.tol:
                converged = True
                break

        q_next = q_next.detach()

        # Report once per time step (not once per Newton iteration)
        if self.on_step is not None:
            self.on_step(q_prev.detach(), q_curr.detach(), q_next)

        return q_next, converged

As promised, let’s look a bit more in-depth at the code above. We have the two “partial derivative functions”. D1_Ld differentiates the discrete Lagrangian with respect to its first argument, and D2_Ld differentiates it with respect to its second argument. The step function then solves the discrete Euler-Lagrange equation for the next configuration with Newton’s method (repeatedly linearizing that equation and solving a small linear system until convergence).
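For a pendulum, the two partial derivatives can also be written out by hand, which gives a dependency-free cross-check of the autograd-based integrator. The sketch below is a scalar version of the same Newton iteration under the midpoint discrete Lagrangian. The function names, parameter values, and hand-derived derivatives are assumptions of this sketch, not part of the library above.

```python
import math

M, L, G = 1.0, 1.0, 9.81  # mass, length, gravity (illustrative values)

def d1_ld(q0, q1, h):
    """Derivative of the midpoint discrete Lagrangian w.r.t. its first argument."""
    return -M * L * L * (q1 - q0) / h - 0.5 * h * M * G * L * math.sin(0.5 * (q0 + q1))

def d2_ld(q0, q1, h):
    """Derivative of the midpoint discrete Lagrangian w.r.t. its second argument."""
    return M * L * L * (q1 - q0) / h - 0.5 * h * M * G * L * math.sin(0.5 * (q0 + q1))

def del_step(q_prev, q_curr, h, tol=1e-12, max_iters=25):
    """Solve D2 L_d(q_prev, q_curr) + D1 L_d(q_curr, q_next) = 0 by Newton's method."""
    const = d2_ld(q_prev, q_curr, h)
    q_next = 2.0 * q_curr - q_prev  # linear extrapolation as the initial guess
    for _ in range(max_iters):
        residual = const + d1_ld(q_curr, q_next, h)
        if abs(residual) < tol:
            break
        # d(residual)/d(q_next), differentiated by hand
        slope = -M * L * L / h - 0.25 * h * M * G * L * math.cos(0.5 * (q_curr + q_next))
        q_next -= residual / slope
    return q_next

def midpoint_energy(q0, q1, h):
    """Kinetic plus potential energy evaluated at the midpoint of one step."""
    v = (q1 - q0) / h
    return 0.5 * M * L * L * v * v + M * G * L * (1.0 - math.cos(0.5 * (q0 + q1)))

h = 0.001
qs = [0.8, 0.8]  # start at rest at 0.8 rad
for _ in range(5000):
    qs.append(del_step(qs[-2], qs[-1], h))

energies = [midpoint_energy(a, b, h) for a, b in zip(qs, qs[1:])]
print(max(energies) - min(energies))  # small compared with the energy itself (about 2.98)
```

Evaluating the energy at the same midpoint the integrator uses avoids mixing a backward-difference velocity with an endpoint angle, which would by itself introduce an error proportional to the step size.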

Example: Pendulum

Let’s look at the simplest possible example.

Lagrangian

Here we define the actual lagrangian for a system. We have a pendulum, with . We control the pendulum angle using a lagrangian that is equivalent to the one found in physics: we take the kinetic energy minus the potential energy.

class Pendulum(VariationalSystem):
    def control_plane(self):
        return {
            "theta": SOn(2)
        }
    
    def params(self):
        return ["mass", "length", "gravity"]

    def lagrangian(self, ctrl, dctrl):
        th  = ctrl.theta
        thd = dctrl.theta

        m = self.params.mass
        l = self.params.length
        g = self.params.gravity

        T = 0.5 * m * l * l * thd * thd
        V = m * g * l * (1 - torch.cos(th))
        return T - V

Helper Functions

Here are some helper functions to record data, plot, etc.

class StepRecorder:
    def __init__(self):
        self.records = []

    def on_step(self, q_prev, q_curr, q_next):
        self.records.append({
            "q_prev": q_prev.clone(),
            "q_curr": q_curr.clone(),
            "q_next": q_next.clone(),
        })

def pendulum_observables_from_records(pendulum: Pendulum,
                                      recorder: StepRecorder,
                                      h: float):
    m = pendulum.params.mass
    l = pendulum.params.length
    g = pendulum.params.gravity

    ts = []
    thetas = []
    theta_dots = []
    energies = []

    for k, rec in enumerate(recorder.records):
        t = (k + 1) * h 
        q_prev = rec["q_prev"]
        q_curr = rec["q_curr"]

        theta = q_curr[0]
        theta_prev = q_prev[0]

        theta_dot = (theta - theta_prev) / h

        T_k = 0.5 * m * l * l * theta_dot * theta_dot
        V_k = m * g * l * (1.0 - torch.cos(theta))
        E_k = T_k + V_k

        ts.append(t)
        thetas.append(float(theta))
        theta_dots.append(float(theta_dot))
        energies.append(float(E_k))

    return ts, thetas, theta_dots, energies

Outcome

Here’s the code to actually run everything:

if __name__ == "__main__":
    torch.set_default_dtype(torch.float64)

    pendulum = Pendulum({
        "mass": 1.0,
        "length": 1.0,
        "gravity": 9.81,
    })

    h = 0.001
    recorder = StepRecorder()
    integrator = VariationalIntegrator(pendulum, step_size=h, on_step=recorder.on_step)

    theta0 = 0.8
    theta_dot0 = 0.0
    ctrl0 = AttrObject({"theta": torch.tensor([theta0])})
    q0 = pendulum.model.pack(ctrl0)

    ctrl1 = AttrObject({"theta": torch.tensor([theta0 + h * theta_dot0])})
    q1 = pendulum.model.pack(ctrl1)

    steps = 10000
    qs = [q0.clone(), q1.clone()]
    q_prev, q_curr = q0, q1

    for _ in range(steps - 2):
        q_next, ok = integrator.step(q_prev, q_curr)
        qs.append(q_next.clone())
        q_prev, q_curr = q_curr, q_next

    qs = torch.stack(qs, dim=0)

    ts, thetas, theta_dots, energies = pendulum_observables_from_records(
        pendulum,
        recorder,
        h,
    )

    print("Final Theta:", thetas[-1])
    print("Energy stats:")
    print("  min:", min(energies))
    print("  max:", max(energies))
    print("  drift:", energies[-1] - energies[0])

    # plot_theta, plot_energy and plot_phase are small matplotlib helpers (not shown)
    plot_theta(ts, thetas)
    plot_energy(ts, energies)
    plot_phase(thetas, theta_dots)
    plt.show()

With 10000 steps, we can see the final angle, and compute the min and max energies seen over the course of the simulation:

Final Theta: 0.17636912077013364
Energy stats:
  min: 2.970827133701791
  max: 2.979793319236134
  drift: 0.0020382257398114945

This looks pretty good. Energy is pretty well conserved (out to the thousandths place). Furthermore, if I increase or decrease the step number, the error seems to increase/decrease roughly linearly, which is a good sign.

Here are some visualizations of what the key metrics look like over time:

We see the pendulum rotates through the angles cyclically, and the energy is (roughly) conserved. So the code passes the basic sanity test.

Adding Lie Groups

The implementation above doesn’t actually use our Lie Group formulation. We should hook in our Lie Group library¹³.

We replace our discrete lagrangian code above with the following:

    def discrete_lagrangian(self, qk: torch.Tensor, qk1: torch.Tensor, h: float) -> torch.Tensor:

        mid_coords = []
        vel_coords = []

        for name, (group, sl) in self.model.layout.items():
            qk_i  = qk[sl]      
            qk1_i = qk1[sl]

            gk  = group.exp(qk_i)
            gk1 = group.exp(qk1_i)

            omega = group.log(gk.inverse() * gk1) / h        
            g_mid = gk * group.exp(0.5 * h * omega) # equivalent of midpoint quadrature 

            mid_i = group.log(g_mid)   
            vel_i = omega             

            mid_coords.append(mid_i)
            vel_coords.append(vel_i)

        mid = torch.cat(mid_coords, dim=-1)
        vel = torch.cat(vel_coords, dim=-1)

        ctrl  = self.model.unpack(mid)   
        dctrl = self.model.unpack(vel)   

        return h * self.lagrangian(ctrl, dctrl)

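In the purely Euclidean case, with $Q = \mathbb{R}^n$, the discrete Lagrangian I used is just the midpoint rule applied to the action integral. That is:

$$L_d(q_k, q_{k+1}, h) = h \, L\left(q_{k+1/2}, v_{k+1/2}\right)$$

where

$$q_{k+1/2} = \frac{q_k + q_{k+1}}{2}, \qquad v_{k+1/2} = \frac{q_{k+1} - q_k}{h}$$

So the discrete action becomes

$$S_d = \sum_k L_d(q_k, q_{k+1}, h)$$

On a Lie group (or a product of Lie groups), the “midpoint” and the “finite difference velocity” should be expressed in group and algebra coordinates.

Given local coordinates $q_k, q_{k+1}$ in the Lie algebra $\mathfrak{g}$, we interpret them as group elements via

$$g_k = \exp(q_k), \qquad g_{k+1} = \exp(q_{k+1})$$

The group-relative increment from $g_k$ to $g_{k+1}$ is

$$\Delta g_k = g_k^{-1} g_{k+1}$$

and its corresponding algebra element is

$$\omega_k = \frac{1}{h} \log\left(g_k^{-1} g_{k+1}\right)$$

which plays the role of a discrete velocity.

A natural “midpoint” configuration is obtained by “flowing” halfway along this group velocity:

$$g_{k+1/2} = g_k \exp\left(\tfrac{h}{2} \omega_k\right)$$

To feed this back into the Lagrangian written in algebra coordinates, we take

$$q_{k+1/2} = \log\left(g_{k+1/2}\right)$$

So the Lie-group discrete Lagrangian is

$$L_d(q_k, q_{k+1}, h) = h \, L\left(q_{k+1/2}, \omega_k\right)$$

with $q_{k+1/2} = \log\left(g_k \exp\left(\tfrac{h}{2} \omega_k\right)\right)$ and $\omega_k = \tfrac{1}{h} \log\left(g_k^{-1} g_{k+1}\right)$.

When $G = \mathbb{R}^n$ with its additive Lie group structure (so $\exp$ and $\log$ are both the identity), this reduces exactly to the original midpoint formula above.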
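
As a quick sanity check on this construction, here’s a minimal sketch on SO(2), where a rotation can be represented as a unit complex number, exp is `cmath.exp(1j * q)`, and log is `cmath.phase`. This is not the library code from the post: the function names and the unit-mass, unit-length pendulum Lagrangian are assumptions made purely for illustration.

```python
import cmath
import math

def pendulum_lagrangian(theta: float, theta_dot: float) -> float:
    # Illustrative assumption: unit mass, length, and gravity, so L = T - V.
    return 0.5 * theta_dot ** 2 + math.cos(theta)

def euclidean_discrete_lagrangian(qk: float, qk1: float, h: float) -> float:
    # Plain midpoint rule in coordinates.
    mid = 0.5 * (qk + qk1)
    vel = (qk1 - qk) / h
    return h * pendulum_lagrangian(mid, vel)

def so2_discrete_lagrangian(qk: float, qk1: float, h: float) -> float:
    # exp: algebra coordinate -> group element (unit complex number).
    gk, gk1 = cmath.exp(1j * qk), cmath.exp(1j * qk1)
    # On the unit circle the inverse is the conjugate; log is the phase.
    omega = cmath.phase(gk.conjugate() * gk1) / h
    # Flow halfway along the group velocity, then map back to the algebra.
    g_mid = gk * cmath.exp(1j * 0.5 * h * omega)
    mid = cmath.phase(g_mid)
    return h * pendulum_lagrangian(mid, omega)

# Away from the branch cut the two agree.
same = math.isclose(
    euclidean_discrete_lagrangian(0.3, 0.5, 0.1),
    so2_discrete_lagrangian(0.3, 0.5, 0.1),
)

# Across the cut (3.1 -> -3.1 is a small rotation through pi) they differ:
# the coordinate version sees a huge velocity, the group version a small one.
euclid_wrapped = euclidean_discrete_lagrangian(3.1, -3.1, 0.1)
group_wrapped = so2_discrete_lagrangian(3.1, -3.1, 0.1)
```

The second comparison is the reason for doing the finite difference on the group: the relative increment only sees the short way around the circle.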

If we run the code we see:

Final Theta: 0.7583324173185618
Energy stats:
  min: 2.4167871253032307
  max: 2.975317957405105
  drift: -0.1006878866394163

The error is much worse. I suspect this is due to the _log_impl function in the SOn class, which is only a first-order approximation. It took some debugging to get even this far, and there could easily be other issues, but I’m content to move on and revisit this only if I end up needing this library in the future.

Another issue is that the Lie Group wrappers really slow down the code. If we need to return to Lie Groups and the performance becomes a problem, I will optimize.

Conclusion

In this post, we built up some foundational understanding of geometric controls and Lie Groups/Algebras. We did this “geometrically” - our intuition is not derived from physics per se but from the relevant geometric abstractions.

I have a few other posts to complete before I come back to this topic, but I want to continue to pursue this line of reasoning in the upcoming year. This may include -plectic control, Noetherian theorems, approximate Lagrangians, Morse theory, differential cohomology, etc. I’ll also hopefully begin to look at other systems from the geometric viewpoint (like thermodynamics). This is mostly based on a hunch that we can develop more intuitive and general abstractions for controls based on geometric principles, and that we can use these tools to gain some deeper physics insights into how agents function theoretically.

Additionally, I think that there’s low-hanging fruit in our discrete Lagrangian solver. There are three views we can use, that should all be equivalent: the Lagrangian (global optimization) view, the Hamiltonian (local policy) view, and the “black-box simulator” view. We should be able to build a minimal library that allows us to define a problem, develop a unified, abstract geometric representation, and then translate between the different views. There’s also something interesting about how we have dual views around the variables to be controlled and their gradients (perhaps there is something interesting we can do with autograd?). Hopefully we will also develop this further.

Footnotes

This brings us to the realm of differential geometry. I will attempt to explain and derive these principles purely geometrically, without any reference to physics.↩︎
We may be able to get away with less structure here than a Riemannian metric, probably just with invertible Hessian and maybe positive definite (for the Hamiltonian). But let’s assume the full metric structure for now.↩︎
It’s interesting to consider what would happen if the Hamiltonian WAS multi-valued, or if we only had “approximate inverses” for some reason… maybe more on this in a future post (if I can figure it out).↩︎
For a general Lagrangian, the Legendre map is controlled by the Hessian of in ; the Riemannian metric is a special case where this Hessian comes from (as in mechanical systems).↩︎
This seems arbitrary. Geometrically, this is like assuming the initial reference frame inside the robot is arbitrary - that is, modulo the initial position, the paths followed are “the same”. You can also assume right-invariance (which assumes the “outside observer’s” reference frame doesn’t matter). This is basically dual to the left-invariance case, the same but from the observer’s point of view. It might come up in tracking or estimation (controlling how you view a robot). If you assume both (bi-invariance), then neither position nor direction matter. This is like assuming a perfectly round sphere in free space floating in a featureless fog. If we don’t assume either invariance, then the Lagrangian depends on position, not just velocity. There is something external breaking the symmetry.↩︎
Differentiating in this way makes the implicit assumption that we have matrix Lie Groups. If we don’t make that assumption the upshot is we get a slightly more general formula for the Lie bracket (later). .↩︎
The derivative of a smooth function at a point is a linear map — that is, a covector. So naturally lives in the dual space .↩︎
Sign conventions in the literature sometimes differ by a minus sign. The adjoint map is defined by . The coadjoint map is its dual (transpose) in the linear-algebra sense: for every and , This is exactly what we used when we rewrote the term as in the variation.↩︎
Assumes is constant in time.↩︎
It seems like the reason is that the velocity is already encoded in the differences between positions. This may merit more thought, especially if we progress to “higher-dimensional” Lagrangians, as I may do in a future post. I did not rederive the equation for the discrete Lagrangian for this post.↩︎
The signs seem to be flipped because in the discrete setting, we are varying the points themselves, which does not require integration by parts.↩︎
manif cites this paper in particular.↩︎
One important note: PyTorch doesn’t seem to offer matrix log, which led to a lot of weirdness around the implementation, that I didn’t want to focus too hard on, as it isn’t the focus of the post. Be careful with this code!↩︎

Differential Games and Stag Hunt

Mon, 20 Oct 2025 04:00:00 GMT

Introduction

In the last two posts in this series we looked at games as static functions with discrete strategies. That is, each player picked a strategy, and at the end of the game each player’s payoff was assessed as a function of the resulting strategy profile.

Some games progress not across discrete strategies, but rather across continuous strategies (or strategies continuous in space and time). These games are called “differential games”. In this post, I attempt to construct a continuous version of stag hunt, and a general framework for simulating differential games. The intent here is to enable in-depth exploration of multi-agent control in subsequent posts.

Differential Games

We can think of differential games as an extension of both control theory and game theory, each of which in turn extends discrete control. In a typical Markov decision process/discrete control problem, we use techniques like dynamic programming to help determine the optimal policy for a single agent over a discrete state/action space. Game theory extends discrete control to $N$ agents, while classical control theory extends discrete control to a continuous state/action space. Differential games have $N$ agents operating in a continuous state/action space.

More formally, in a differential game, each player $i$ controls a control input $u_i(t)$, and the state of the world $x(t)$ evolves according to a differential equation

$$\dot{x}(t) = f\big(x(t), u_1(t), \dots, u_N(t)\big)$$

Each agent receives some payoff $J_i$ that is a combination of a trajectory factor (an instantaneous loss function $\ell_i$ integrated over time) and a terminal factor $\Phi_i$:

$$J_i = \int_0^T \ell_i\big(x(t), u_1(t), \dots, u_N(t)\big) \, dt + \Phi_i\big(x(T)\big)$$

We can also think of differential games as continuous-time differentiable programs. Each player’s policy is a differentiable function of its observation, and the dynamics act as a differentiable layer integrating the world forward alongside a natural loss function $J_i$. This perspective allows us to use modern machine learning tools to analyze equilibria in continuous environments.

Code Architecture

Let’s lay out the requirements to define differential games in Python. We will build up the software in layers. The main concern here is separating game definition from game execution. We will use a similar model to the one laid out in previous posts on classical game theory. First we will implement a “physics” layer, which will manage state. Then we will implement a “decision” layer, in which agents make choices within those physics¹. Finally, we will have an Arena object that runs simulations and analyses on the outcomes.

Why do this? There are a few reasons:

State as immutable snapshots: We will bundle physical state, time, and payoffs as an immutable object. This enables branching, what-if analysis, and caching.
Pure simulation functions: Our tick and simulate functions are pure functions. This makes the simulator composable: you can pause mid-game, fork multiple futures, or replay with different policies. This is important for model-predictive control and counterfactual reasoning.
Differentiability everywhere: Since functions are pure, every component can support automatic differentiation. This lets us backpropagate through entire trajectories to learn optimal policies via gradient descent.
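
To make the snapshot-and-fork idea concrete, here’s a minimal sketch using a frozen dataclass. Snapshot and this tick are hypothetical stand-ins (not the GameState and Arena.tick defined later), and I use plain tuples instead of tensors to keep the example dependency-free.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Snapshot:
    """Hypothetical immutable state: a position and a timestamp."""
    position: Tuple[float, ...]
    time: float

def tick(s: Snapshot, velocity: Tuple[float, ...], dt: float) -> Snapshot:
    # Pure function: builds a new snapshot and never mutates its input.
    new_pos = tuple(p + v * dt for p, v in zip(s.position, velocity))
    return replace(s, position=new_pos, time=s.time + dt)

# Fork two futures from the same starting snapshot.
start = Snapshot(position=(0.0, 0.0), time=0.0)
left = tick(start, (-1.0, 0.0), 0.5)
right = tick(start, (1.0, 0.0), 0.5)
```

Because tick never touches its input, start can be reused for as many what-if branches as we like, which is the property model-predictive control relies on.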

We’ll look at the code in the next section².

Physics Layer

State Spaces

The first layer is the state space layer. We will keep the actual state specifications abstract so that they can apply to a multitude of different games. Each agent will have its own state, and the game may have a shared state as well.

The StateSpace object takes a list of agents and a shared state object and produces a single object maintaining the entire state, plus getters and setters for altering and retrieving different aspects of the state.

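# Imports used by the code in this post
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import itertools
from typing import Any, Callable, Dict, List, Optional, Tuple

import torch
import torch.nn as nn

@dataclass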
class StateSpec:
    names: List[str]  # e.g., ['x', 'y', 'vx', 'vy']
    
    def dim(self) -> int:
        return len(self.names)

class StateSpace:
    
    def __init__(self, 
                 agents: List['Agent'],
                 shared_spec: Optional[StateSpec] = None):

        self.agents = agents
        self.agent_names = [a.name for a in agents]
        self.agent_dims = {a.name: a.state_spec.dim() for a in agents}
        self.shared_dim = shared_spec.dim() if shared_spec else 0
        
        self.slices, self.dim = self._build_state_indexing()
    
    def _build_state_indexing(self) -> Tuple[Dict[str, slice], int]:
        slices = {}
        offset = 0
        
        for agent_name in self.agent_names:
            dim = self.agent_dims[agent_name]
            slices[agent_name] = slice(offset, offset + dim)
            offset += dim
        
        if self.shared_dim > 0:
            slices['shared'] = slice(offset, offset + self.shared_dim)
            offset += self.shared_dim
        
        return slices, offset
    
    def zero(self) -> torch.Tensor:
        return torch.zeros(self.dim)
    
    def get_state(self, state: torch.Tensor, agent: str) -> torch.Tensor:
        return state[self.slices[agent]]
    
    def get_shared(self, state: torch.Tensor) -> torch.Tensor:
        if self.shared_dim > 0:
            return state[self.slices['shared']]
        return torch.tensor([])
    
    def set_state(self, state: torch.Tensor, agent: str, value: torch.Tensor):
        state[self.slices[agent]] = value
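
To see what the indexing produces, here’s a small standalone sketch of the same layout logic. build_layout is a hypothetical helper that mirrors _build_state_indexing, the agent names are made up, and a plain list stands in for the state tensor.

```python
from typing import Dict, Tuple

def build_layout(agent_dims: Dict[str, int],
                 shared_dim: int = 0) -> Tuple[Dict[str, slice], int]:
    # Pack each agent's block contiguously, then the shared block at the end.
    slices: Dict[str, slice] = {}
    offset = 0
    for name, dim in agent_dims.items():
        slices[name] = slice(offset, offset + dim)
        offset += dim
    if shared_dim > 0:
        slices["shared"] = slice(offset, offset + shared_dim)
        offset += shared_dim
    return slices, offset

# Two agents with (x, y, vx, vy) each, plus a 2-dimensional shared block.
slices, total_dim = build_layout({"hunter_a": 4, "hunter_b": 4}, shared_dim=2)
state = list(range(total_dim))   # stand-in for a flat state tensor
hunter_b_state = state[slices["hunter_b"]]
shared_state = state[slices["shared"]]
```

The whole game state is one flat vector, and each agent just owns a slice of it.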

Lastly, we need the actual GameState object:

@dataclass
class GameState:
    
    physical_state: torch.Tensor 
    time: float
    cumulative_payoffs: Dict[str, float]
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    def clone(self) -> 'GameState':
        """Deep copy for branching"""
        return GameState(
            physical_state=self.physical_state.clone(),
            time=self.time,
            cumulative_payoffs=self.cumulative_payoffs.copy(),
            metadata=self.metadata.copy()
        )
    
    def with_state(self, new_physical_state: torch.Tensor) -> 'GameState':
        """Return new GameState with updated physical state"""
        return GameState(
            physical_state=new_physical_state,
            time=self.time,
            cumulative_payoffs=self.cumulative_payoffs,
            metadata=self.metadata
        )
    
    def with_time(self, new_time: float) -> 'GameState':
        """Return new GameState with updated time"""
        return GameState(
            physical_state=self.physical_state,
            time=new_time,
            cumulative_payoffs=self.cumulative_payoffs,
            metadata=self.metadata
        )
    
    def add_payoffs(self, step_payoffs: Dict[str, float]) -> 'GameState':
        """Return new GameState with updated payoffs"""
        new_payoffs = self.cumulative_payoffs.copy()
        for agent, reward in step_payoffs.items():
            new_payoffs[agent] = new_payoffs.get(agent, 0.0) + reward
        
        return GameState(
            physical_state=self.physical_state,
            time=self.time,
            cumulative_payoffs=new_payoffs,
            metadata=self.metadata
        )

This is the immutable snapshot of the state at any given point in time.

Observations

Given the state, each agent will need to observe part of it (depending on the game). We define an ObservationModel to handle this.

class ObservationModel(ABC):
    
    @abstractmethod
    def observe(self, state: torch.Tensor, agent: str, cumulative_payoff: Optional[float] = None) -> torch.Tensor:
        pass
    
    @abstractmethod
    def obs_dim(self, agent: str) -> int:
        pass

Dynamics

Next we have dynamics. The dynamics determine how the world actually evolves. Here’s the abstract interface.

class Dynamics(ABC):
    
    @abstractmethod
    def derivative(self, 
                   state: torch.Tensor,
                   controls: Dict[str, torch.Tensor]) -> torch.Tensor:
        pass

We hand the dynamics a state and controls and it returns the time derivative of the state³.
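
As an example of what a concrete derivative might look like, here’s a sketch of double-integrator dynamics, where each agent’s state is (x, y, vx, vy) and its control is an acceleration. The function name, the layout argument, and the use of plain lists instead of tensors are all assumptions for illustration, not part of the framework.

```python
from typing import Dict, List

def double_integrator_derivative(state: List[float],
                                 controls: Dict[str, List[float]],
                                 layout: Dict[str, slice]) -> List[float]:
    # d/dt (x, y, vx, vy) = (vx, vy, ax, ay) for every agent.
    dstate = [0.0] * len(state)
    for agent, sl in layout.items():
        _x, _y, vx, vy = state[sl]
        ax, ay = controls[agent]
        dstate[sl] = [vx, vy, ax, ay]
    return dstate

layout = {"a": slice(0, 4), "b": slice(4, 8)}
state = [0.0, 0.0, 1.0, 2.0,   5.0, 5.0, 0.0, -1.0]
controls = {"a": [0.5, 0.0], "b": [0.0, 1.0]}
dstate = double_integrator_derivative(state, controls, layout)
```

Positions change at the current velocity, and velocities change at the commanded acceleration. An integrator then turns this derivative into the next state.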

Constraints

Real systems have constraints: agents can’t leave the arena, they can’t pass through walls, they shouldn’t collide with each other. We handle constraints in two ways. The first is soft violations (violated()), a differentiable penalty that grows outside the feasible region, used during learning or optimization. The second is hard projection (project()), which pushes the state back onto the constraint surface and is used to enforce physics.

Here’s the abstract interface:

class Constraint(ABC):
    
    @abstractmethod
    def violated(self, state: torch.Tensor) -> torch.Tensor:
        """
        Returns violation amount (0 = satisfied, >0 = violated).
        Must be differentiable
        """
        pass
    
    @abstractmethod
    def project(self, state: torch.Tensor) -> torch.Tensor:
        """Project state onto feasible set"""
        pass

Here are some examples:

class BoundaryConstraint(Constraint):
    """Box boundaries [x_min, x_max] × [y_min, y_max]"""
    
    def __init__(self, state_space: StateSpace, bounds: Dict[str, Tuple[float, float]]):
        """bounds = {'x': (min, max), 'y': (min, max)}"""
        self.state_space = state_space
        self.bounds = bounds
    
    def violated(self, state):
        # Soft violation for differentiability
        violation = 0.0
        for agent in self.state_space.agent_names:
            pos = self.state_space.get_state(state, agent)[:2]
            # Penalty grows quadratically outside bounds
            violation += torch.relu(self.bounds['x'][0] - pos[0])**2
            violation += torch.relu(pos[0] - self.bounds['x'][1])**2
            violation += torch.relu(self.bounds['y'][0] - pos[1])**2
            violation += torch.relu(pos[1] - self.bounds['y'][1])**2
        return violation
    
    def project(self, state):
        """Hard projection (like game engine collision resolution)"""
        new_state = state.clone()
        for agent in self.state_space.agent_names:
            pos = self.state_space.get_state(state, agent)[:2]
            # Clamp position
            pos_clamped = torch.stack([
                torch.clamp(pos[0], self.bounds['x'][0], self.bounds['x'][1]),
                torch.clamp(pos[1], self.bounds['y'][0], self.bounds['y'][1])
            ])
            # Reflect velocity if hit boundary
            vel = self.state_space.get_state(state, agent)[2:4]
            vel_new = vel.clone()
            if pos[0] <= self.bounds['x'][0] or pos[0] >= self.bounds['x'][1]:
                vel_new[0] *= -0.8  # Bounce with damping
            if pos[1] <= self.bounds['y'][0] or pos[1] >= self.bounds['y'][1]:
                vel_new[1] *= -0.8
            
            self.state_space.set_state(new_state, agent, 
                                torch.cat([pos_clamped, vel_new]))
        return new_state

class CollisionConstraint(Constraint):
    """Agent-agent collision avoidance"""
    
    def __init__(self, state_space, radius=0.3):
        self.state_space = state_space
        self.radius = radius
    
    def violated(self, state):
        violation = 0.0
        for i, a1 in enumerate(self.state_space.agent_names):
            for a2 in self.state_space.agent_names[i+1:]:
                dist = torch.norm(
                    self.state_space.get_state(state, a1)[:2] - 
                    self.state_space.get_state(state, a2)[:2]
                )
                # Soft barrier
                violation += torch.relu(self.radius - dist)**2
        return violation
    
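    def project(self, state):
        # Separate overlapping agents (like game engine)
        new_state = state.clone()
        for i, a1 in enumerate(self.state_space.agent_names):
            for a2 in self.state_space.agent_names[i+1:]:
                # Read from new_state so earlier pushes are taken into account
                s1 = self.state_space.get_state(new_state, a1)
                s2 = self.state_space.get_state(new_state, a2)
                p1, p2 = s1[:2], s2[:2]
                dist = torch.norm(p1 - p2)
                if dist < self.radius:
                    # Push apart along the line joining the two agents
                    direction = (p1 - p2) / (dist + 1e-6)
                    overlap = self.radius - dist
                    # Each moves half the overlap; velocities are left unchanged
                    shift = 0.5 * overlap * direction
                    self.state_space.set_state(new_state, a1, torch.cat([p1 + shift, s1[2:]]))
                    self.state_space.set_state(new_state, a2, torch.cat([p2 - shift, s2[2:]]))
        return new_state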

Payoffs

In differential games, payoffs typically have two components. The first is a running cost, associated with the trajectory, and the second is the terminal reward, assessed at the final stage.

class PayoffModel(ABC):
    
    @abstractmethod
    def agents(self) -> List[str]:
        """Return list of agent names"""
        pass
    
    def step(self, 
             state: torch.Tensor, 
             controls: Dict[str, torch.Tensor], 
             dt: float) -> Dict[str, float]:
        """Incremental payoff for this timestep (override for running costs)"""
        return {a: 0.0 for a in self.agents()}
    
    def terminal(self, state: torch.Tensor) -> Dict[str, float]:
        """Terminal payoff (override for end-of-game rewards)"""
        return {a: 0.0 for a in self.agents()}
    
    def total(self, 
              trajectory: List[Tuple[torch.Tensor, Dict[str, torch.Tensor]]], 
              final_state: torch.Tensor,
              dt: float) -> Dict[str, float]:
        """
        Total payoff over trajectory.
        Default: sum step payoffs + terminal.
        Override for discounting, non-additive payoffs, etc.
        """
        total = {a: 0.0 for a in self.agents()}
        
        for state, controls in trajectory:
            step_payoff = self.step(state, controls, dt)
            for a in self.agents():
                total[a] += step_payoff[a]
        
        terminal_payoff = self.terminal(final_state)
        for a in self.agents():
            total[a] += terminal_payoff[a]
        
        return total
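
Here’s a small worked sketch of the default total: a running cost accumulated as a Riemann sum plus a terminal reward. The specific payoff (a quadratic control-effort penalty and a negative distance-to-target at the end) and the name total_payoff are illustrative assumptions, and plain tuples stand in for tensors.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]

def total_payoff(trajectory: List[Tuple[Vec2, Vec2]],
                 final_pos: Vec2,
                 target: Vec2,
                 dt: float,
                 effort_weight: float = 0.1) -> float:
    # Running cost: -w * |u|^2 integrated over time, approximated by a sum.
    running = 0.0
    for _pos, control in trajectory:
        running += -effort_weight * (control[0] ** 2 + control[1] ** 2) * dt
    # Terminal reward: closer to the target is better.
    terminal = -math.dist(final_pos, target)
    return running + terminal

# Ten steps of unit control effort, ending 5 units from the target.
traj = [((0.0, 0.0), (1.0, 0.0))] * 10
payoff = total_payoff(traj, final_pos=(3.0, 4.0), target=(0.0, 0.0), dt=0.1)
```

The running part contributes about -0.1 (weight 0.1 times one second of unit effort) and the terminal part contributes -5, so the total is about -5.1.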

Integration

To integrate the dynamics forward in time, we need a numerical integrator. The integrator takes the derivative from Dynamics and produces the next state. We support multiple schemes with different accuracy/speed tradeoffs:

class Integrator(ABC):
    
    def step(self,
             dynamics: Dynamics,
             state: torch.Tensor,
             controls: Dict[str, torch.Tensor],
             dt: float,
             constraints: Optional[List[Constraint]] = None) -> torch.Tensor:
        """
        Integrate one timestep and project onto constraints.
        
        Args:
            dynamics: dynamics model
            state: current state
            controls: control inputs
            dt: timestep
            constraints: optional list of constraints to enforce
            
        Returns:
            new_state (after constraint projection if provided)
        """
        # Integration
        new_state = self._integrate(dynamics, state, controls, dt)
        
        # Constraint projection (automatic if constraints provided)
        if constraints:
            for constraint in constraints:
                new_state = constraint.project(new_state)
        
        return new_state
    
    @abstractmethod
    def _integrate(self,
                   dynamics: Dynamics,
                   state: torch.Tensor,
                   controls: Dict[str, torch.Tensor],
                   dt: float) -> torch.Tensor:
        """Actual integration scheme (implemented by subclasses)"""
        pass

class EulerIntegrator(Integrator):
    
    def _integrate(self, dynamics, state, controls, dt):
        dstate = dynamics.derivative(state, controls)
        return state + dstate * dt

class RK4Integrator(Integrator):
    
    def _integrate(self, dynamics, state, controls, dt):
        k1 = dynamics.derivative(state, controls)
        k2 = dynamics.derivative(state + 0.5 * dt * k1, controls)
        k3 = dynamics.derivative(state + 0.5 * dt * k2, controls)
        k4 = dynamics.derivative(state + dt * k3, controls)
        return state + (dt / 6.0) * (k1 + 2*k2 + 2*k3 + k4)
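
To get a feel for the accuracy tradeoff, here’s a standalone sketch comparing the two schemes on a harmonic oscillator (x'' = -x), whose exact solution from (x, v) = (1, 0) is x(t) = cos(t). The test problem and helper names are my own choices for illustration, and the code uses plain tuples rather than the Integrator classes above.

```python
import math
from typing import Callable, Tuple

State = Tuple[float, float]

def deriv(s: State) -> State:
    x, v = s
    return (v, -x)

def axpy(s: State, k: State, scale: float) -> State:
    # Returns s + scale * k, componentwise.
    return tuple(si + scale * ki for si, ki in zip(s, k))

def euler_step(s: State, dt: float) -> State:
    return axpy(s, deriv(s), dt)

def rk4_step(s: State, dt: float) -> State:
    k1 = deriv(s)
    k2 = deriv(axpy(s, k1, 0.5 * dt))
    k3 = deriv(axpy(s, k2, 0.5 * dt))
    k4 = deriv(axpy(s, k3, dt))
    incr = tuple(a + 2 * b + 2 * c + d for a, b, c, d in zip(k1, k2, k3, k4))
    return axpy(s, incr, dt / 6.0)

def position_error(step: Callable[[State, float], State],
                   dt: float = 0.1, n: int = 10) -> float:
    s: State = (1.0, 0.0)
    for _ in range(n):
        s = step(s, dt)
    return abs(s[0] - math.cos(dt * n))

euler_err = position_error(euler_step)
rk4_err = position_error(rk4_step)
```

With the same step size, the RK4 error comes out several orders of magnitude smaller than the Euler error, at the cost of four derivative evaluations per step instead of one.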

Decision Layer

The next layer of code determines how agents actually behave.

Policy

A policy maps observations to control actions: $u_i = \pi_i(o_i)$.

class Policy(ABC):
    """Observation to control"""
    
    @abstractmethod
    def __call__(self, obs: torch.Tensor) -> torch.Tensor:
        """Must be differentiable for learning"""
        pass
    
    @abstractmethod
    def control_dim(self) -> int:
        pass

There are two kinds of policies: learnable (neural networks whose parameters we can optimize via gradient descent) and hand-crafted (regular Python functions, useful for baselines, testing, etc.).

class NeuralPolicy(Policy, nn.Module):
    """Learnable policy"""
    
    def __init__(self, obs_dim: int, control_dim: int, hidden_dim: int = 64):
        Policy.__init__(self)
        nn.Module.__init__(self)
        self._control_dim = control_dim
        
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, control_dim),
            nn.Tanh(),  # Bounded output
        )
        
        # Small init for stability
        for m in self.net.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=0.1)
                nn.init.zeros_(m.bias)
    
    def __call__(self, obs: torch.Tensor) -> torch.Tensor:
        return nn.Module.__call__(self, obs)
    
    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
    
    def control_dim(self) -> int:
        return self._control_dim

class FunctionPolicy(Policy):
    """Hand-coded policy (can be differentiable or not)"""
    
    def __init__(self, 
                 fn: Callable[[torch.Tensor], torch.Tensor],
                 control_dim: int,
                 differentiable: bool = False):
        self.fn = fn
        self._control_dim = control_dim
        self.differentiable = differentiable
    
    def __call__(self, obs: torch.Tensor) -> torch.Tensor:
        if self.differentiable:
            return self.fn(obs)
        else:
            with torch.no_grad():
                return self.fn(obs)
    
    def control_dim(self) -> int:
        return self._control_dim
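
For a sense of what a hand-crafted policy might look like, here’s a sketch of a simple pursuit rule: accelerate at a fixed magnitude toward a target. I’m assuming the first two entries of the observation are the relative position (dx, dy) to the target; that convention, the name make_pursuit_policy, and the use of tuples instead of tensors are all illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

def make_pursuit_policy(max_accel: float = 1.0
                        ) -> Callable[[Sequence[float]], Tuple[float, float]]:
    def policy(obs: Sequence[float]) -> Tuple[float, float]:
        # Assumed convention: obs[0:2] is the offset from agent to target.
        dx, dy = obs[0], obs[1]
        dist = math.hypot(dx, dy)
        if dist < 1e-9:
            return (0.0, 0.0)  # already at the target, no push needed
        return (max_accel * dx / dist, max_accel * dy / dist)
    return policy

chase = make_pursuit_policy(max_accel=1.0)
control = chase((3.0, 4.0))
```

A tensor-valued version of a function like this could be passed as fn to FunctionPolicy with control_dim=2.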

Agents

An Agent bundles together a state specification (what variables it tracks) and a set of named strategies (the policies it can choose from). This bridges the gap between continuous control (policies) and discrete game theory (strategy names like “Cooperate” or “Defect”).

class Agent:
    """Agent with state specification and named strategies"""
    
    def __init__(self,
                 name: str,
                 state_spec: StateSpec,
                 strategy_set: Dict[str, Policy]):
        """
        Args:
            name: agent identifier
            state_spec: symbolic state specification
            strategy_set: dict of strategy_name -> Policy
        """
        self.name = name
        self.state_spec = state_spec
        self.strategy_set = strategy_set
        self.strategy_names = list(strategy_set.keys())
    
    def get_policy(self, strategy_name: str) -> Policy:
        return self.strategy_set[strategy_name]

Game Definitions

Here’s the actual definition of our differential game. We need a StateSpace, a list of agents, an observation model, dynamics, a payoff model, and an initial-state sampler.

class DifferentialGame:
    def __init__(self,
                 state_space: StateSpace,
                 agents: List[Agent],
                 obs_model: ObservationModel,
                 dynamics: Dynamics,
                 payoff_model: PayoffModel,
                 initial_sampler: Callable[[], torch.Tensor],
                 name: str = "Differential Game"):
        self.state_space = state_space
        self.agents = {a.name: a for a in agents}
        self.agent_names = [a.name for a in agents]
        self.obs_model = obs_model
        self.dynamics = dynamics
        self.payoff_model = payoff_model
        self.initial_sampler = initial_sampler
        self.name = name
    
    def get_strategy_sets(self) -> Dict[str, List[str]]:
        """Get available strategies per agent"""
        return {name: agent.strategy_names for name, agent in self.agents.items()}
    
    def __repr__(self):
        return f'<DifferentialGame "{self.name}" agents={self.agent_names}>'

Game Execution Layer

Arena

The Arena object executes games. While DifferentialGame defines the rules, Arena actually simulates trajectories. This separation means you can define a game once, then run it with different:

Strategy profiles (cooperate vs defect)
Initial conditions (different starting positions)
Integration methods (Euler vs RK4)
Time horizons (short sprints vs long chases)

The core method is play(), which takes a strategy profile (mapping each agent to a strategy name) and returns the full trajectory plus final payoffs.

class Arena:
    
    def __init__(self,
                 game: DifferentialGame,
                 integrator: Optional[Integrator] = None,
                 dt: float = 0.02,
                 max_time: float = 10.0):
        self.game = game
        self.integrator = integrator or EulerIntegrator()
        self.dt = dt
        self.max_time = max_time

    def initial_state(self, 
                      physical_state: Optional[torch.Tensor] = None) -> GameState:
        if physical_state is None:
            physical_state = self.game.initial_sampler()
        
        return GameState(
            physical_state=physical_state,
            time=0.0,
            cumulative_payoffs={agent: 0.0 for agent in self.game.agent_names}
        )

    def tick(self, 
             state: GameState, 
             policies: Dict[str, Policy],
             constraints: Optional[List[Constraint]] = None) -> GameState:

        # Observe
        observations = {
            agent: self.game.obs_model.observe(
                state.physical_state, 
                agent,
                state.cumulative_payoffs.get(agent, 0.0)
            )
            for agent in self.game.agent_names
        }
        
        # Act
        controls = {
            agent: policies[agent](observations[agent])
            for agent in self.game.agent_names
        }
        
        # Compute step payoffs (before state changes)
        step_payoffs = self.game.payoff_model.step(
            state.physical_state, 
            controls, 
            self.dt
        )
        
        # Integrate physics
        new_physical = self.integrator.step(
            self.game.dynamics,
            state.physical_state,
            controls,
            self.dt,
            constraints
        )
        
        # Build new state
        new_state = (state
                     .with_state(new_physical)
                     .with_time(state.time + self.dt)
                     .add_payoffs(step_payoffs))
        
        return new_state
    
    def simulate(self,
                 policies: Dict[str, Policy],
                 initial: Optional[GameState] = None,
                 until: Optional[float] = None,
                 constraints: Optional[List[Constraint]] = None) -> List[GameState]:
       
        if initial is None:
            initial = self.initial_state()
        
        end_time = until if until is not None else self.max_time
        
        trajectory = [initial]
        state = initial
        
        while state.time < end_time:
            state = self.tick(state, policies, constraints)
            trajectory.append(state)
        
        return trajectory

    def play(self,
             strategy_profile: Dict[str, str],
             initial: Optional[GameState] = None,
             constraints: Optional[List[Constraint]] = None) -> Tuple[List[GameState], Dict[str, float]]:

        # Convert strategy names to policies
        policies = {
            agent: self.game.agents[agent].get_policy(strategy_profile[agent])
            for agent in self.game.agent_names
        }
        
        # Simulate
        trajectory = self.simulate(policies, initial, constraints=constraints)
        
        # Add terminal payoffs
        final_state = trajectory[-1]
        terminal_payoffs = self.game.payoff_model.terminal(final_state.physical_state)
        
        total_payoffs = final_state.cumulative_payoffs.copy()
        for agent, reward in terminal_payoffs.items():
            total_payoffs[agent] += reward
        
        return trajectory, total_payoffs
    
    def expected_payoffs(self,
                        strategy_profile: Dict[str, str],
                        n_samples: int = 100) -> Dict[str, float]:

        total_payoffs = {agent: 0.0 for agent in self.game.agent_names}
        
        for _ in range(n_samples):
            _, payoffs = self.play(strategy_profile, initial=None)  # Random init each time
            for agent in self.game.agent_names:
                total_payoffs[agent] += payoffs[agent]
        
        return {agent: total_payoffs[agent] / n_samples for agent in self.game.agent_names}
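
One thing to watch in simulate(): the loop compares an accumulated float against end_time, so rounding can decide whether a final extra tick happens. Here’s a rough sketch of the effect and of one possible alternative, fixing the step count up front. Both helper names are hypothetical.

```python
def steps_by_accumulation(dt: float, end_time: float) -> int:
    # Mirrors the `while state.time < end_time` pattern.
    t, n = 0.0, 0
    while t < end_time:
        t += dt
        n += 1
    return n

def steps_by_count(dt: float, end_time: float) -> int:
    # Decide the number of ticks once, independent of rounding drift.
    return int(round(end_time / dt))

n_accumulated = steps_by_accumulation(0.02, 10.0)
n_fixed = steps_by_count(0.02, 10.0)
```

With dt = 0.02 and end_time = 10.0 the fixed count is 500, while the accumulated loop takes 500 or 501 ticks depending on how the rounding falls. If trajectories need to have identical lengths across runs, iterating over range(n_fixed) is the safer pattern.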

Analysis

Converting to Normal Form

One of the unique features of this framework is the ability to convert continuous differential games back into discrete normal form. This lets us use all our existing game theory tools, like finding Nash equilibria, computing evolutionary dynamics, and visualizing payoff matrices.
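
Before the framework code, here’s a toy sketch of the idea: enumerate every strategy profile with itertools.product and record the payoffs that a play function returns for each one. The payoff_table helper and the toy_stag_hunt stub (with made-up payoff numbers) are illustrative assumptions. In the real converter the payoffs come from simulating the differential game.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Profile = Dict[str, str]

def payoff_table(strategy_sets: Dict[str, List[str]],
                 play: Callable[[Profile], Dict[str, float]]
                 ) -> Dict[Tuple[str, ...], Dict[str, float]]:
    agents = list(strategy_sets)
    table = {}
    for combo in itertools.product(*(strategy_sets[a] for a in agents)):
        profile = dict(zip(agents, combo))
        table[combo] = play(profile)
    return table

def toy_stag_hunt(profile: Profile) -> Dict[str, float]:
    # Made-up numbers: stag pays 4 only if everyone hunts stag, hare always pays 2.
    both_stag = all(s == "stag" for s in profile.values())
    return {a: (4.0 if both_stag else 0.0) if s == "stag" else 2.0
            for a, s in profile.items()}

table = payoff_table({"a": ["stag", "hare"], "b": ["stag", "hare"]}, toy_stag_hunt)
```

Each key of table is one cell of the normal-form matrix, so two agents with two strategies each give four entries.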

class StrategyProfileIterator:
    """
    Iterate over all strategy profiles.
    Essential for converting differential game to normal form.
    """
    
    def __init__(self, strategy_sets: Dict[str, List[str]]):
        """
        strategy_sets: dict of agent -> list of strategy names
        """
        self.strategy_sets = strategy_sets
        self.agents = list(strategy_sets.keys())
    
    def __iter__(self):
        """Yield all strategy profiles"""
        strategy_lists = [self.strategy_sets[agent] for agent in self.agents]
        for profile_tuple in itertools.product(*strategy_lists):
            yield dict(zip(self.agents, profile_tuple))
    
    def count(self) -> int:
        """Total number of profiles"""
        count = 1
        for strategies in self.strategy_sets.values():
            count *= len(strategies)
        return count


class NormalFormConverter:
    """Convert differential game to normal form """
    
    @staticmethod
    def to_payoff_matrix(arena: Arena,
                        players: List[str],
                        n_samples: int = 1) -> torch.Tensor:
        
        # Get strategy sets for selected players
        all_sets = arena.game.get_strategy_sets()
        strategy_sets = {p: all_sets[p] for p in players}
        
        # Other agents use first strategy by default
        fixed_strategies = {
            agent: all_sets[agent][0]
            for agent in arena.game.agent_names
            if agent not in players
        }
        
        # Build tensor shape
        dims = [len(strategy_sets[p]) for p in players]
        payoff_shape = [len(players)] + dims
        payoffs = torch.zeros(payoff_shape)
        
        # Iterate over all profiles
        iterator = StrategyProfileIterator(strategy_sets)
        
        for profile in iterator:
            # Combine with fixed strategies
            full_profile = {**fixed_strategies, **profile}
            
            # Get indices for this profile
            indices = tuple(strategy_sets[p].index(profile[p]) for p in players)
            
            # Compute expected payoff
            if n_samples == 1:
                _, payoff_dict = arena.play(full_profile, differentiable=False)
            else:
                payoff_dict = arena.expected_payoffs(full_profile, n_samples)
            
            # Store in tensor
            for i, player in enumerate(players):
                payoffs[(i,) + indices] = payoff_dict[player]
        
        return payoffs

Visualization

Here are two visualization helpers, to help see what’s going on in the game.

def plot_trajectory(trajectory: List[GameState],
                   state_space: StateSpace,
                   title: str = ""):
    """Plot 2D trajectories (assumes first 2 dims are x, y)"""
    fig, ax = plt.subplots(figsize=(8, 8))
    
    colors = {name: f'C{i}' for i, name in enumerate(state_space.agent_names)}
    
    for agent_name in state_space.agent_names:
        positions = []
        for game_state in trajectory:  # game_state is GameState
            state = game_state.physical_state  # ← Extract tensor
            agent_state = state_space.get_state(state, agent_name)
            positions.append(agent_state[:2].detach().cpu().numpy())  # NumPy copy; safe if gradients are tracked

        positions = np.array(positions)
        ax.plot(positions[:, 0], positions[:, 1],
               color=colors[agent_name], label=agent_name, alpha=0.7, linewidth=2)
        ax.scatter(positions[0, 0], positions[0, 1],
                  color=colors[agent_name], s=150, marker='o', edgecolor='black', linewidth=2)
        ax.scatter(positions[-1, 0], positions[-1, 1],
                  color=colors[agent_name], s=150, marker='X', edgecolor='black', linewidth=2)
    
    ax.set_aspect('equal')
    ax.legend()
    ax.grid(alpha=0.3)
    ax.set_title(title)
    plt.tight_layout()
    return fig

def plot_payoff_heatmap(payoff_tensor: torch.Tensor,
                       players: List[str],
                       strategy_sets: Dict[str, List[str]]):
    """Visualize 2-player payoff matrix"""
    if len(players) != 2:
        raise ValueError("Can only plot 2-player games")
    
    p1, p2 = players
    p1_strats = strategy_sets[p1]
    p2_strats = strategy_sets[p2]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Player 1 payoffs
    matrix1 = payoff_tensor[0].numpy()
    im1 = ax1.imshow(matrix1, cmap='RdYlGn', aspect='auto')
    ax1.set_xticks(range(len(p2_strats)))
    ax1.set_yticks(range(len(p1_strats)))
    ax1.set_xticklabels(p2_strats)
    ax1.set_yticklabels(p1_strats)
    ax1.set_xlabel(f'{p2} strategy')
    ax1.set_ylabel(f'{p1} strategy')
    ax1.set_title(f'{p1} Payoffs')
    
    for i in range(len(p1_strats)):
        for j in range(len(p2_strats)):
            ax1.text(j, i, f'{matrix1[i, j]:.1f}',
                    ha='center', va='center', fontsize=12, weight='bold')
    
    plt.colorbar(im1, ax=ax1)
    
    # Player 2 payoffs
    matrix2 = payoff_tensor[1].numpy()
    im2 = ax2.imshow(matrix2, cmap='RdYlGn', aspect='auto')
    ax2.set_xticks(range(len(p2_strats)))
    ax2.set_yticks(range(len(p1_strats)))
    ax2.set_xticklabels(p2_strats)
    ax2.set_yticklabels(p1_strats)
    ax2.set_xlabel(f'{p2} strategy')
    ax2.set_ylabel(f'{p1} strategy')
    ax2.set_title(f'{p2} Payoffs')
    
    for i in range(len(p1_strats)):
        for j in range(len(p2_strats)):
            ax2.text(j, i, f'{matrix2[i, j]:.1f}',
                    ha='center', va='center', fontsize=12, weight='bold')
    
    plt.colorbar(im2, ax=ax2)
    plt.tight_layout()
    return fig

Example: Stag Hunt

Now we’ll use this framework to build a pursuit-evasion game with stag hunt payoffs.

The setup is:

There are 2 cooperative hunters (c1, c2)
There is 1 stag (worth 4 points, requires both hunters to catch)
There are 2 hares (worth 3 points each, can be caught by one hunter)
The stag is faster than the hares but slower than hunters
The hunters pursue using simple “move toward target” policies
Prey flee using “move away from threats” policies

What differs from the standard game-theoretic stag hunt is that this plays out as a pursuit-evasion game in continuous space.

Definition

We define each agent as a point mass in 2D. The remaining code defines each element (including state positions) for each Agent.

POINT_MASS_2D = StateSpec(['x', 'y', 'vx', 'vy'])
POINT_MASS_3D = StateSpec(['x', 'y', 'z', 'vx', 'vy', 'vz'])

def build_stag_hunt():
    """Build stag hunt differential game"""
    
    # Build agents with strategies
    def make_pursuit(state_space, agent_name: str, targets: List[str], speed: float):
        def fn(obs):
            # obs is full state - extract this agent's position
            own_pos = state_space.get_state(obs, agent_name)[:2]
            target_pos = torch.stack([state_space.get_state(obs, t)[:2] for t in targets]).mean(dim=0)
            direction = target_pos - own_pos
            dist = torch.norm(direction) + 1e-6
            return (direction / dist) * speed
        return FunctionPolicy(fn, 2, differentiable=True)
    
    def make_flee(state_space, agent_name: str, threats: List[str], speed: float):
        def fn(obs):
            # obs is full state - extract this agent's position
            own_pos = state_space.get_state(obs, agent_name)[:2]
            threat_pos = torch.stack([state_space.get_state(obs, t)[:2] for t in threats]).mean(dim=0)
            direction = own_pos - threat_pos
            dist = torch.norm(direction) + 1e-6
            return (direction / dist) * speed
        return FunctionPolicy(fn, 2, differentiable=True)
    
    # Create agents with state specifications
    agents = [
        Agent('c1', POINT_MASS_2D, {}),  # Strategies added below
        Agent('c2', POINT_MASS_2D, {}),
        Agent('stag', POINT_MASS_2D, {}),
        Agent('hare1', POINT_MASS_2D, {}),
        Agent('hare2', POINT_MASS_2D, {}),
    ]
    
    # Build state space from agents
    state_space = StateSpace(agents, shared_spec=None)
    
    # Now add strategies (need state_space for closures)
    agents[0].strategy_set = {
        'ChaseStag': make_pursuit(state_space, 'c1', ['stag'], 1.5),
        'ChaseHare': make_pursuit(state_space, 'c1', ['hare1'], 1.5),
    }
    agents[0].strategy_names = list(agents[0].strategy_set.keys())
    
    agents[1].strategy_set = {
        'ChaseStag': make_pursuit(state_space, 'c2', ['stag'], 1.5),
        'ChaseHare': make_pursuit(state_space, 'c2', ['hare2'], 1.5),
    }
    agents[1].strategy_names = list(agents[1].strategy_set.keys())
    
    agents[2].strategy_set = {'Flee': make_flee(state_space, 'stag', ['c1', 'c2'], 1.1)}
    agents[2].strategy_names = list(agents[2].strategy_set.keys())
    
    agents[3].strategy_set = {'Flee': make_flee(state_space, 'hare1', ['c1', 'c2'], 0.6)}
    agents[3].strategy_names = list(agents[3].strategy_set.keys())
    
    agents[4].strategy_set = {'Flee': make_flee(state_space, 'hare2', ['c1', 'c2'], 0.6)}
    agents[4].strategy_names = list(agents[4].strategy_set.keys())
    
    # Observation: full state
    class FullObs(ObservationModel):
        def observe(self, state, agent):
            return state
        def obs_dim(self, agent):
            return state_space.dim
    
    obs_model = FullObs()
    
    # Dynamics: kinematic (differentiable!)
    all_agents = ['c1', 'c2', 'stag', 'hare1', 'hare2']
    
    class StagHuntPayoff(PayoffModel):
        def __init__(self, state_space, agent_names):
            self.state_space = state_space
            self._agents = agent_names
    
        def agents(self):
            return self._agents
    
        def terminal(self, state):    
            positions = {a: state_space.get_state(state, a)[:2] for a in all_agents}
        
            payoffs = {a: 0.0 for a in all_agents}
            radius = 0.5
            captured = {'c1': False, 'c2': False}
        
            # Stag (needs both)
            d1 = torch.norm(positions['c1'] - positions['stag']).item()
            d2 = torch.norm(positions['c2'] - positions['stag']).item()
        
            if d1 < radius and d2 < radius:
                payoffs['c1'] += 4.0
                payoffs['c2'] += 4.0
                captured['c1'] = True
                captured['c2'] = True
        
            # Hares (first come first served)
            if not captured['c1']:
                for hare in ['hare1', 'hare2']:
                    d = torch.norm(positions['c1'] - positions[hare]).item()
                    if d < radius:
                        payoffs['c1'] += 3.0
                        captured['c1'] = True
                        break
        
            if not captured['c2']:
                for hare in ['hare1', 'hare2']:
                    d = torch.norm(positions['c2'] - positions[hare]).item()
                    if d < radius:
                        payoffs['c2'] += 3.0
                        captured['c2'] = True
                        break
        
            return payoffs

    class KinematicDynamics(Dynamics):
        def __init__(self, max_speeds: Dict[str, float]):
            self.max_speeds = max_speeds
        
        def derivative(self, state, controls):
            dstate = []
            for agent in all_agents:
                agent_state = state_space.get_state(state, agent)
                control = controls[agent]
                
                # Soft clamping for differentiability
                vel_des = control
                speed = torch.norm(vel_des) + 1e-6
                max_speed = self.max_speeds[agent]
                
                # Soft clamp: smooth (tanh) approximation of direction * min(speed, max_speed)
                scale_factor = max_speed * torch.tanh(speed / max_speed) / speed
                vel = vel_des * scale_factor
                
                dpos = vel
                dvel = torch.zeros(2)
                dstate.append(torch.cat([dpos, dvel]))
            
            return torch.cat(dstate)
    
    dynamics = KinematicDynamics({
        'c1': 1.5, 'c2': 1.5,
        'stag': 1.1, 'hare1': 0.6, 'hare2': 0.6
    })
    
    # Initial state
    def initial():
        return torch.cat([
            torch.tensor([-2.0, -2.0, 0.0, 0.0]),  # c1
            torch.tensor([2.0, -2.0, 0.0, 0.0]),   # c2
            torch.tensor([0.0, 2.0, 0.0, 0.0]),    # stag
            torch.tensor([-1.5, 0.0, 0.0, 0.0]),   # hare1
            torch.tensor([1.5, 0.0, 0.0, 0.0]),    # hare2
        ])
    
    payoff_model = StagHuntPayoff(state_space, all_agents)
    game = DifferentialGame(state_space, agents, obs_model, dynamics, payoff_model, initial, "Stag Hunt")
    
    return game, state_space
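
The soft clamp inside KinematicDynamics is worth a second look. Here is a small standalone sketch of the same formula on plain floats (the function name is mine), comparing it to a hard min:

```python
import math

def soft_clamp_speed(speed: float, max_speed: float) -> float:
    """Smooth stand-in for min(speed, max_speed), mirroring the tanh formula above."""
    return max_speed * math.tanh(speed / max_speed)

max_speed = 1.5  # the hunters' limit in this example
for commanded in (0.1, 0.75, 1.5, 3.0, 10.0):
    hard = min(commanded, max_speed)
    soft = soft_clamp_speed(commanded, max_speed)
    print(f"commanded={commanded:5.2f}  hard={hard:.3f}  soft={soft:.3f}")
```

One consequence, if I am reading the code right: since tanh(1) is about 0.76, a policy that commands exactly max_speed (as the pursuit and flee policies here do) ends up moving at roughly 76% of it. Every agent is scaled by the same factor, so the speed ordering between hunters, stag, and hares is unchanged.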

Results

Let’s see what the outputs look like:

if __name__ == "__main__": 
    game, state_space = build_stag_hunt()
    arena = Arena(game, dt=0.02, max_time=15.0)
    
    print(f"{game}")
    print(f"Strategy sets: {game.get_strategy_sets()}\n")
    
    # Test scenarios
    profiles = [
        ("Both Cooperate", {'c1': 'ChaseStag', 'c2': 'ChaseStag', 'stag': 'Flee', 'hare1': 'Flee', 'hare2': 'Flee'}),
        ("Both Defect", {'c1': 'ChaseHare', 'c2': 'ChaseHare', 'stag': 'Flee', 'hare1': 'Flee', 'hare2': 'Flee'}),
        ("Asymmetric", {'c1': 'ChaseStag', 'c2': 'ChaseHare', 'stag': 'Flee', 'hare1': 'Flee', 'hare2': 'Flee'}),
    ]
    
    for desc, profile in profiles:
        traj, payoffs = arena.play(profile)
        print(f"{desc}: c1={payoffs['c1']:.1f}, c2={payoffs['c2']:.1f}")
        plot_trajectory(traj, state_space, title=desc)
        plt.show()
    
    # Convert to normal form
    print("\n" + "="*60)
    print("NORMAL FORM EXTRACTION")
    print("="*60)
    
    payoff_tensor = NormalFormConverter.to_payoff_matrix(arena, ['c1', 'c2'], n_samples=1)
    print(f"\nPayoff tensor shape: {payoff_tensor.shape}")
    print(f"Payoff tensor:\n{payoff_tensor}")
    
    plot_payoff_heatmap(payoff_tensor, ['c1', 'c2'], game.get_strategy_sets())

We generate three plots. The first is the trajectories with both agents cooperating:

Second with two defections:

Third asymmetric:

Conclusion and Next Steps

In this post, we implemented the beginnings of a differential games framework, then adapted it for pursuit-evasion games with two chasers and three heterogeneous evaders. Specifically, the payoff structure of this game matches the well-known “stag hunt” game from game theory. In the next post, we will attempt to combine our differentiable game canonicalizer with this setup.

Footnotes

This is similar to how optimal control tools are architected, like Drake or Crocoddyl, or robotics simulators (MuJoCo). I thought a bit about video game engines as well (Unity, Unreal Engine), but there the physics tend to be implicit and most of the abstractions are oriented around entities and components for building the actual content.↩︎
Disclosure: Claude helped with some functions, but I reviewed all code. I do not believe an AI could write this unassisted at time of writing.↩︎
For now, this is the linearization of the change in state. In theory the dynamics could also support higher-order derivatives. Furthermore, right now this is manually computed. We might be able to use automatic differentiation to handle this as well. More on this in a future post.↩︎

SINDy with Control

Mon, 01 Sep 2025 04:00:00 GMT

Introduction

The SINDy method is useful for fitting governing equations to data drawn from a dynamical system. However, for engineering applications we often seek not just to analyze dynamical systems but to control them. In this short post I implement the SINDyC method¹, which generalizes SINDy to include external inputs and feedback control. This continues our investigations from the last post in this series.

Background

In Dynamic Mode Decomposition, we tried to fit a linear operator $A$ to a dynamical system such that

$$x_{k+1} \approx A x_k$$

DMDc extends this to instead fit the equation:

$$x_{k+1} \approx A x_k + B u_k$$

Instead of just a matrix of snapshots, we also need a matrix of control history (the $u_k$ at each timestamp).

The Koopman analysis for this system also now includes . Instead of

We have

Note that depends on the choice of control vector².

Setup

We start with the same dataset of snapshots , but now we also record our control history at each time point:

As in SINDy, we create a dictionary of basis functions. Then, using the data, we fit a sparse set of coefficients to the basis functions:

However (and this is critical), if the signal corresponds to a feedback control signal, we cannot disambiguate the effect of the feedback control from that of the internal system. That is, we must intervene on the controls³ to separate them from the dynamics.
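
One way to see the problem: under pure linear feedback, the control column of the data is an exact multiple of a state column, so a least-squares fit has no way to apportion the effect between them. Here is a small sketch with made-up numbers and a hypothetical gain K (none of this is from the paper):

```python
from typing import Sequence

def gram_determinant(a: Sequence[float], b: Sequence[float]) -> float:
    """det of the 2x2 Gram matrix [[a.a, a.b], [a.b, b.b]]; zero iff a and b are collinear."""
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    ab = sum(x * y for x, y in zip(a, b))
    return aa * bb - ab * ab

x = [0.5, 1.0, 1.5, 2.0]             # made-up state samples
K = 2.0                              # hypothetical feedback gain
u_feedback = [-K * xi for xi in x]   # u = -K x: fully determined by the state
u_external = [0.3, -0.1, 0.4, 0.2]   # made-up externally injected signal

det_feedback = gram_determinant(x, u_feedback)
det_external = gram_determinant(x, u_external)
print(det_feedback, det_external)
```

A zero determinant means the two columns carry the same information, which is the degenerate case the intervention is meant to avoid.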

Implementation

Coding this is almost trivially similar to SINDy, with a few modifications.

Library

We’ll need to upgrade our library of functions to include cross-terms between the state variables and the control inputs:

import numpy as np
from itertools import combinations_with_replacement

def sindyc_library(X, U, poly_order_x=3, poly_order_u=1,
                   include_cross=True, include_sine=False, include_cosine=False):
    n_vars, n_samples = X.shape
    n_ctrl = U.shape[0]

    feats = [np.ones(n_samples)]
    descriptions = ['1']

    for order in range(1, poly_order_x + 1):
        for combo in combinations_with_replacement(range(n_vars), order):
            term = np.prod([X[i, :] for i in combo], axis=0)
            feats.append(term)
            descriptions.append('*'.join([f'x_{i}' for i in combo]))

    for order in range(1, poly_order_u + 1):
        for combo in combinations_with_replacement(range(n_ctrl), order):
            term = np.prod([U[i, :] for i in combo], axis=0)
            feats.append(term)
            descriptions.append('*'.join([f'u_{i}' for i in combo]))

    if include_cross:
        for i in range(n_vars):
            for j in range(n_ctrl):
                feats.append(X[i, :] * U[j, :])
                descriptions.append(f'x_{i}*u_{j}')

    if include_sine:
        for i in range(n_vars):
            feats.append(np.sin(X[i, :]))
            descriptions.append(f'sin(x_{i})')
    if include_cosine:
        for i in range(n_vars):
            feats.append(np.cos(X[i, :]))
            descriptions.append(f'cos(x_{i})')

    Theta = np.column_stack(feats)
    return Theta, descriptions
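
To get a feel for how large the library gets, here is a sketch that reproduces only the naming logic of sindyc_library (polynomial and cross terms, no trigonometric terms) without touching any data. The helper name is mine.

```python
from itertools import combinations_with_replacement
from typing import List

def library_term_names(n_vars: int, n_ctrl: int, poly_order_x: int = 3,
                       poly_order_u: int = 1, include_cross: bool = True) -> List[str]:
    """Names of the constant, polynomial, and cross terms, in the order the library emits them."""
    names = ['1']
    for order in range(1, poly_order_x + 1):
        for combo in combinations_with_replacement(range(n_vars), order):
            names.append('*'.join(f'x_{i}' for i in combo))
    for order in range(1, poly_order_u + 1):
        for combo in combinations_with_replacement(range(n_ctrl), order):
            names.append('*'.join(f'u_{i}' for i in combo))
    if include_cross:
        # Only first-order state-control products, regardless of the polynomial orders
        names.extend(f'x_{i}*u_{j}' for i in range(n_vars) for j in range(n_ctrl))
    return names

names = library_term_names(n_vars=2, n_ctrl=2, poly_order_x=2, poly_order_u=1)
print(len(names), names)
```

With two states, two controls, and second-order state polynomials, this gives 12 candidate terms per equation.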

Algorithm

The actual sindyc algorithm is more or less the same as the sindy method:

def sindyc(X, U, dt, poly_order_x=3, poly_order_u=1,
           include_cross=True, lambda_reg=0.1, max_iter=10,
           include_sine=False, include_cosine=False):
    dXdt = finite_difference(X, dt) 
    Theta, descriptions = sindyc_library(
        X, U,
        poly_order_x=poly_order_x,
        poly_order_u=poly_order_u,
        include_cross=include_cross,
        include_sine=include_sine,
        include_cosine=include_cosine
    )
    Xi = sequential_threshold_least_squares(Theta, dXdt, lambda_reg=lambda_reg, max_iter=max_iter)
    return Xi, descriptions

We first use finite differences to get the derivatives of X, then we use sequential threshold least-squares to compute the actual coefficients.
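
The finite_difference helper comes from the earlier SINDy post and isn’t reproduced here. As a rough idea of what such a helper does, here is a central-difference sketch on a plain list. The name and the one-sided handling of the endpoints are my assumptions, and the original helper may differ.

```python
from typing import List, Sequence

def central_difference(samples: Sequence[float], dt: float) -> List[float]:
    """Approximate d(samples)/dt: central differences inside, one-sided at the two ends."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    deriv = [0.0] * n
    deriv[0] = (samples[1] - samples[0]) / dt        # forward difference
    deriv[-1] = (samples[-1] - samples[-2]) / dt     # backward difference
    for k in range(1, n - 1):
        deriv[k] = (samples[k + 1] - samples[k - 1]) / (2 * dt)
    return deriv

dt = 0.1
samples = [(dt * k) ** 2 for k in range(11)]   # x(t) = t^2, so dx/dt = 2t
deriv = central_difference(samples, dt)
print(deriv[5])  # interior estimate at t = 0.5
```

The interior estimates are exact for a quadratic, while the endpoints are only first-order accurate, which is one reason noisy or short trajectories can hurt the fit.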

Example

Let’s take a look at a simple example⁴: a predator-prey system with a sinusoidal forcing function:

where

We’ll pick values for the coefficients as such:

Data Generation

We need to generate the data. Let’s write up the code for the predator-prey system and for our controls:

def predator_prey_control_rhs(state, u, alpha=1.0, beta=0.5, delta=0.5, gamma=1.0, k1=0.8, k2=0.6):
    x, y = state
    u1, u2 = u
    dx = alpha*x - beta*x*y + k1*u1
    dy = delta*x*y - gamma*y - k2*u2
    return np.array([dx, dy])

Above is the predator-prey system. Here’s what our control functions will look like:

u1 = lambda t: 0.3*np.sin(0.3*t) + 0.2*np.cos(0.11*t)
u2 = lambda t: 0.25*np.sin(0.17*t + 0.7)
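
For reference, here is the system written out from the code above. The symbols follow the argument names of predator_prey_control_rhs, which may differ slightly from the paper’s notation:

```latex
\begin{aligned}
\dot{x} &= \alpha x - \beta x y + k_1 u_1(t), \\
\dot{y} &= \delta x y - \gamma y - k_2 u_2(t), \\
u_1(t) &= 0.3 \sin(0.3 t) + 0.2 \cos(0.11 t), \\
u_2(t) &= 0.25 \sin(0.17 t + 0.7),
\end{aligned}
```

with the function defaults $\alpha = 1.0$, $\beta = 0.5$, $\delta = 0.5$, $\gamma = 1.0$, $k_1 = 0.8$, $k_2 = 0.6$.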

Now we can simulate the system:

# Helper function, takes a step of fourth order Runge-Kutta
# See any numerical methods book, like https://link.springer.com/book/10.1007/978-3-540-78862-1
def rk4_step(ode_func_f_X, y_current, dt, **ode_kwargs): 
    k1 = ode_func_f_X(y_current, **ode_kwargs)
    k2 = ode_func_f_X(y_current + dt/2 * k1, **ode_kwargs)
    k3 = ode_func_f_X(y_current + dt/2 * k2, **ode_kwargs)
    k4 = ode_func_f_X(y_current + dt * k3, **ode_kwargs)
    y_next = y_current + dt/6 * (k1 + 2*k2 + 2*k3 + k4)
    return y_next

def simulate_predator_prey_with_control(x0, t, u1, u2, rhs=predator_prey_control_rhs):
    n  = len(t)
    dt = t[1] - t[0]
    X  = np.zeros((2, n))
    U  = np.zeros((2, n))
    X[:, 0] = np.asarray(x0, dtype=float)

    for k in range(1, n):
        u_vec = np.array([u1(t[k-1]), u2(t[k-1])], dtype=float)
        U[:, k-1] = u_vec
        X[:, k] = rk4_step(rhs, X[:, k-1], dt, u=u_vec)

    U[:, -1] = np.array([u1(t[-1]), u2(t[-1])], dtype=float)
    return X, U, dt
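
Before trusting the simulated data, it is cheap to sanity-check the integrator on a problem with a known answer. Here is a sketch using a scalar copy of the RK4 step (keyword arguments dropped for brevity) on exponential decay. The test problem is my choice.

```python
import math

def rk4_step(f, y, dt):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(y)."""
    k1 = f(y)
    k2 = f(y + dt / 2 * k1)
    k3 = f(y + dt / 2 * k2)
    k4 = f(y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, y0, dt, n_steps):
    """Apply rk4_step n_steps times starting from y0."""
    y = y0
    for _ in range(n_steps):
        y = rk4_step(f, y, dt)
    return y

decay = lambda y: -y   # dy/dt = -y, exact solution y(t) = y0 * exp(-t)
y_end = integrate(decay, 1.0, dt=0.01, n_steps=100)
error = abs(y_end - math.exp(-1.0))
print(f"y(1) = {y_end:.10f}, abs error = {error:.2e}")
```

At the same step size used for the predator-prey data (0.01), the integration error on this test problem is far below the coefficient errors we see later, so the integrator is unlikely to be the bottleneck.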

Outcome

Putting the code together, we get:

if __name__ == "__main__":
    t = np.arange(0.0, 50.0, 0.01) 

    u1 = lambda _t: 0.3*np.sin(0.3*_t) + 0.2*np.cos(0.11*_t)
    u2 = lambda _t: 0.25*np.sin(0.17*_t + 0.7)

    Xpp, Upp, dt_pp = simulate_predator_prey_with_control(
        x0=(1.5, 1.0),
        t=t,
        u1=u1,
        u2=u2,
        rhs=predator_prey_control_rhs
    )

    Xi_c, desc_c = sindyc(
        Xpp, Upp, dt_pp,
        poly_order_x=2,
        poly_order_u=1,
        include_cross=True,
        lambda_reg=0.05,
        max_iter=15
    )

    print_equations(Xi_c, desc_c, feature_names=['x','y'])

And the outcome:

dx/dt = 0.999839*x_0 - 0.499924*x_0*x_1 + 0.799839*u_0
dy/dt = -0.999844*x_1 + 0.499927*x_0*x_1 - 0.599914*u_1

Which matches our expected coefficients closely.
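
To put a number on “closely”, here is a quick check of the printed coefficients against the values used to generate the data (signs follow the right-hand side in predator_prey_control_rhs):

```python
true_coeffs = {
    ('dx/dt', 'x_0'): 1.0,        # alpha
    ('dx/dt', 'x_0*x_1'): -0.5,   # -beta
    ('dx/dt', 'u_0'): 0.8,        # k1
    ('dy/dt', 'x_1'): -1.0,       # -gamma
    ('dy/dt', 'x_0*x_1'): 0.5,    # delta
    ('dy/dt', 'u_1'): -0.6,       # -k2
}
recovered = {                     # copied from the printed output above
    ('dx/dt', 'x_0'): 0.999839,
    ('dx/dt', 'x_0*x_1'): -0.499924,
    ('dx/dt', 'u_0'): 0.799839,
    ('dy/dt', 'x_1'): -0.999844,
    ('dy/dt', 'x_0*x_1'): 0.499927,
    ('dy/dt', 'u_1'): -0.599914,
}
rel_errors = {
    key: abs(recovered[key] - true_coeffs[key]) / abs(true_coeffs[key])
    for key in true_coeffs
}
worst = max(rel_errors.values())
print(f"worst relative error: {worst:.2e}")
```

Every coefficient is recovered to within roughly 0.02%, and no spurious terms survive the thresholding.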

Conclusion

This post was a straightforward extension of SINDy to accommodate controls.

Footnotes

See this paper by Brunton, Proctor, and Kutz. I may use slightly different notation.↩︎
The here is a placeholder indexing the family of possible controls. That is, ↩︎
“Persistent excitation” is required (I don’t fully understand this requirement yet but it’s a requirement for identifiability). Brunton, Proctor, and Kutz recommend injecting a sufficiently large white noise signal, or occasionally kicking the system with a large impulse or step in the control input.↩︎
Also drawn from Brunton, Proctor, and Kutz.↩︎

Learning Equilibria by Gradient Descent

Thu, 21 Aug 2025 04:00:00 GMT

Introduction

Given a set of agents playing a game, how do we determine their optimal strategic behavior?

In the last post we looked at ways to differentiably identify equivalence classes of games. In this short post, we’ll use gradient descent to identify the Nash equilibria for some simple games.

Background

Identifying Nash equilibria (or other strategic behavior) is difficult in general¹.

At some point I will get into traditional algorithms like support enumeration or Lemke-Howson for finding equilibria. However, for the purposes of this post I will investigate composing parametrized agents using games, and learning the Nash equilibria via gradient descent.

Implementation

We’ll build some simple classes to implement this experiment.

Game

First, we need some game representation.

class Game:
    def __init__(self, payoffs: torch.Tensor | nn.Parameter, actions: List[List[str]], name=None):
        self.num_players = payoffs.shape[0]
        self.payoffs = payoffs
        self.actions = actions
        self.name = name or "Unnamed Game"
        self._size = payoffs.shape 

    @property
    def size(self):
        return self._size
    
    def payoff(self, action_indices):
        return self.payoffs[(slice(None),) + tuple(action_indices)]
    
    def to(self, device):
        return Game(self.payoffs.to(device), self.actions, self.name) 

    def clone(self):
        if isinstance(self.payoffs, nn.Parameter):
            return Game(nn.Parameter(self.payoffs.detach().clone()), self.actions, self.name)
        else:
            return Game(self.payoffs.detach().clone(), self.actions, self.name)

    def __repr__(self):
        learnable = isinstance(self.payoffs, nn.Parameter) and self.payoffs.requires_grad
        return f'<Game "{self.name}" size={self._size} learnable={learnable}>'

Here, the payoffs are either raw tensors or learnable (if you pass in parameters). Parametrized payoffs are useful for tasks like optimizing welfare, mechanism design, inverse RL, etc.

Agent

Next, we need agents that can play the game.

class Agent:
    def __init__(self, policy, name: str):
        self.name = name
        self.policy = policy

    def act(self, actions: List[str]):
        probs = self.policy(actions)
        dist = torch.distributions.Categorical(probs)
        action_idx = dist.sample().item()
        return action_idx, actions[action_idx]

An agent just samples from the policy distribution and takes an action.

Policies

What kind of policies might the agent have?

Abstract

We’ll start with the abstraction. We have a _run_once helper to prevent double initializing a policy.

def _run_once(method):
    attr_flag = f"__{method.__name__}_has_run"

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, attr_flag, False):
            return                         
        setattr(self, attr_flag, True)
        return method(self, *args, **kwargs)

    return wrapper

class Policy(ABC):
    def __init__(self):
        pass

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # If the subclass overrides 'initialize', wrap it exactly once.
        if "initialize" in cls.__dict__:
            cls.initialize = _run_once(cls.__dict__["initialize"])

    @abstractmethod
    def initialize(self, actions: List[str]) -> None:
        pass
    
    @abstractmethod
    def forward(self, actions: List[str]) -> torch.Tensor:
        pass
    
    def __call__(self, actions: List[str]) -> torch.Tensor:
        if hasattr(self, "initialize"):
            self.initialize(actions)
        return super().__call__(actions)
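
The run-once wrapping is the least obvious part here, so here is a stripped-down sketch of the same pattern without PyTorch. The class names are made up for illustration.

```python
import functools

def run_once(method):
    """Wrap a method so that only the first call per instance does anything."""
    flag = f"_{method.__name__}_has_run"

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, flag, False):
            return None                      # already ran on this instance
        setattr(self, flag, True)
        return method(self, *args, **kwargs)

    return wrapper

class Base:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Wrap 'initialize' only where the subclass itself defines it
        if "initialize" in cls.__dict__:
            cls.initialize = run_once(cls.__dict__["initialize"])

class CountingPolicy(Base):
    def __init__(self):
        self.calls = 0

    def initialize(self, actions):
        self.calls += 1

policy = CountingPolicy()
policy.initialize(["Stag", "Hare"])
policy.initialize(["Stag", "Hare"])   # ignored: the flag is already set
print(policy.calls)
```

This is why it is safe for both the Arena constructor further down and `__call__` to invoke initialize: whichever comes first creates the parameters, and later calls are no-ops.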

Logits

Our first policy is just a logits policy.

class LogitsPolicy(Policy, nn.Module):
    def __init__(self, initialization='uniform'):
        Policy.__init__(self)
        nn.Module.__init__(self)
        self.initialization = initialization
        self.logits = None  
    
    def initialize(self, actions: List[str]) -> None:
        num_actions = len(actions)
        if self.initialization == 'uniform':
            self.logits = nn.Parameter(torch.zeros(num_actions))  
        elif self.initialization == 'random':
            self.logits = nn.Parameter(torch.rand(num_actions))
        else:
            raise ValueError(f"Unknown initialization: {self.initialization!r}")

    def forward(self, actions: List[str]) -> torch.Tensor:
        return torch.softmax(self.logits, dim=0)

We’ll hold off on other policies until later posts.

Arena

Let’s compose Agents and Games in “Arena” objects. This is just to cleanly separate Agents from Games.

class Arena:

    def __init__(self, game: Game, agents: List[Agent]):
        assert len(agents) == game.num_players
        self.game = game
        self.agents = agents

        for i, agent in enumerate(self.agents):
            if hasattr(agent.policy, 'initialize'):
                agent.policy.initialize(self.game.actions[i])

    def play(self):
        action_indices = []
        actions_chosen = []

        for player_idx, agent in enumerate(self.agents):
            actions = self.game.actions[player_idx]
            action_idx, action = agent.act(actions)
            action_indices.append(action_idx)
            actions_chosen.append(action)

        payoffs = self.game.payoff(action_indices)
        return actions_chosen, payoffs

    def expected_payoffs(self):
        dists = [agent.policy(self.game.actions[i]) for i, agent in enumerate(self.agents)]
        joint_dist = dists[0]
        for dist in dists[1:]:
            joint_dist = torch.einsum('i,j->ij', joint_dist.flatten(), dist).flatten()

        payoffs_flat = self.game.payoffs.view(self.game.num_players, -1)
        exp_payoffs = (joint_dist * payoffs_flat).sum(-1)
        return exp_payoffs

The “play” function runs a round of the game. The “expected payoffs” returns the expected payoffs for each agent.

Example

Let’s look at an example. This first example is Stag Hunt.

if __name__ == "__main__":
    staghunt_actions = [["Stag", "Hare"], ["Stag", "Hare"]]
    p1_payoffs = [
        [8, 0],  # P1 plays Stag vs P2's [Stag, Hare]
        [2, 3]   # P1 plays Hare vs P2's [Stag, Hare]
    ]

    p2_payoffs = [
        [3, 1],  # P1 plays Stag vs P2's [Stag, Hare]
        [0, 2]   # P1 plays Hare vs P2's [Stag, Hare]
    ] 
    payoffs =  nn.Parameter(torch.tensor([p1_payoffs, p2_payoffs], dtype=torch.float))
    stag_hunt = Game(payoffs, staghunt_actions, "Stag Hunt")
    print(stag_hunt)
    
    alice = Agent(
        policy=lambda actions: torch.tensor([1.0 if a=="Stag" else 0.0 for a in actions]), 
        name="Alice"
    )
    bob = Agent(
        policy=lambda actions: torch.ones(len(actions))/len(actions), 
        name="Bob"
    )
    
    stag_hunt_arena = Arena(stag_hunt, [alice, bob])
    
    print(stag_hunt_arena.expected_payoffs())
    print(stag_hunt_arena.play())
    print(stag_hunt_arena.play())

We initialize two fixed-policy (non-learning) agents, Alice and Bob. Alice always plays Stag. Bob plays uniformly at random. We then compute their expected payoffs (and play two rounds).


tensor([4., 2.], grad_fn=<...>)
(['Stag', 'Hare'], tensor([0., 1.], grad_fn=<...>))
(['Stag', 'Stag'], tensor([8., 3.], grad_fn=<...>))

We can see the expected payoff for Alice is 4 and the expected payoff for Bob is 2. In the two rounds they play, Bob first plays “Hare” (low payoffs for both players), then plays “Stag” (high payoffs).
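
These expected payoffs are easy to verify by hand. Here is the same computation in plain Python, using the payoff matrices and the two fixed policies from the example:

```python
from itertools import product

p1_payoffs = [[8, 0], [2, 3]]   # rows: P1's action, columns: P2's action
p2_payoffs = [[3, 1], [0, 2]]   # same indexing, payoffs to P2
alice = [1.0, 0.0]              # always Stag
bob = [0.5, 0.5]                # uniform over [Stag, Hare]

def expected_payoff(matrix, row_probs, col_probs):
    """Sum of payoff * joint probability over every action pair."""
    return sum(
        row_probs[i] * col_probs[j] * matrix[i][j]
        for i, j in product(range(len(row_probs)), range(len(col_probs)))
    )

expected = [
    expected_payoff(p1_payoffs, alice, bob),
    expected_payoff(p2_payoffs, alice, bob),
]
print(expected)  # [4.0, 2.0]
```

This is the two-player special case of what Arena.expected_payoffs does with einsum over the flattened joint distribution.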

Now let’s build differentiable agents:

...
    diff_alice = Agent(
        policy=LogitsPolicy(initialization='random'),
        name="DiffAlice"
    )
    diff_bob = Agent(
        policy=LogitsPolicy(initialization='random'),
        name="DiffBob"
    )
    diff_agents = [diff_alice, diff_bob]
    diff_stag_hunt_arena = Arena(stag_hunt, diff_agents)
    
    optimizers = [
        optim.Adam(diff_alice.policy.parameters(), lr=0.1),
        optim.Adam(diff_bob.policy.parameters(), lr=0.1)
    ]

    # Training loop
    for step in range(200):
        exp_payoffs = diff_stag_hunt_arena.expected_payoffs()
        
        # Player 1 update
        optimizers[0].zero_grad()
        (-exp_payoffs[0]).backward(retain_graph=True)  
        optimizers[0].step()

        # Player 2 update
        optimizers[1].zero_grad()
        (-exp_payoffs[1]).backward()
        optimizers[1].step()

        if step % 20 == 0:
            print(f"Step {step}, Expected Payoffs: {exp_payoffs.detach().cpu().numpy()}")

    # Final Policies
    for i, agent in enumerate(diff_agents):
        logits = agent.policy.logits.detach().numpy()
        probs = agent.policy(stag_hunt.actions[i]).detach().numpy()
        print(f"Agent {i+1} final logits: {logits}")
        print(f"Agent {i+1} final probabilities: {probs}")
        print(f"Agent {i+1} prefers: {'Stag' if probs[0] > probs[1] else 'Hare'}")
        print()

In the main loop, we optimize Alice and Bob’s policies separately via simultaneous updates (there are other choices, like alternating updates). We see:

Step 0, Expected Payoffs: [2.874151  1.4458582]
Step 20, Expected Payoffs: [7.565231 2.889052]
Step 40, Expected Payoffs: [7.9332013 2.9811559]
Step 60, Expected Payoffs: [7.9630227 2.9893801]
Step 80, Expected Payoffs: [7.972061 2.991976]
Step 100, Expected Payoffs: [7.9772897 2.9934921]
Step 120, Expected Payoffs: [7.9810405 2.994578 ]
Step 140, Expected Payoffs: [7.9839015 2.995403 ]
Step 160, Expected Payoffs: [7.986139  2.9960463]
Step 180, Expected Payoffs: [7.9879236 2.9965587]
Agent 1 final logits: [ 4.3300014 -2.8580525]
Agent 1 final probabilities: [9.9924505e-01 7.5498747e-04]
Agent 1 prefers: Stag

Agent 2 final logits: [ 3.8065035 -3.3711162]
Agent 2 final probabilities: [9.9923706e-01 7.6290034e-04]
Agent 2 prefers: Stag

which is indeed the Nash equilibrium².

Conclusion

Our differentiable agents successfully discovered that mutual cooperation (both playing Stag) is a Nash equilibrium of the Stag Hunt game. This approach can scale to continuous action spaces and handle n-player games, and this framework is also compositional (we can swap in different policy architectures, loss functions, or optimization algorithms to explore different solution concepts or learning dynamics)³.

Gradient descent doesn’t guarantee convergence to Nash equilibria in all games. Zero-sum games may cycle, games with multiple equilibria depend on initialization, and simultaneous updates can lead to instability.

Footnotes

PPAD-complete. See here or here.↩︎
One of them. Running over and over again you can also see the game converge to [Hare, Hare].↩︎
In a future post we will hopefully take composition further, and compose game inputs/outputs.↩︎

Differentiable Game Canonicalization

Tue, 19 Aug 2025 04:00:00 GMT

Introduction

Given two games, how can we tell if they are “strategically equivalent”?

For example, consider the following 2x2 games:

Game A:

Game B:

They have different players, action labels, and payoff values. But strategically, both are “equivalent” (to Battle of the Sexes).

In this post, we build a game “canonicalizer” for 2x2 strict ordinal games. This allows a user to quickly identify the game type based on the payoff matrix¹. Furthermore, this implementation is end-to-end differentiable, so it can be plugged into a PyTorch deep learning pipeline.

Example use cases might be fast lookup in a “game zoo” database, curriculum learning for agents (train on progressively “harder” games), boundary analysis², or graph search over ordinal neighborhoods.

What’s in a Game?

We consider a game to be “finite normal form” if the following conditions are all true:

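There is a finite set of players, indexed by $i \in \{1, \dots, n\}$.
Each player $i$ has finitely many pure strategies they can play, denoted $S_i$.
There is a payoff function $u_i$ for each player such that

$$u_i : S_1 \times \cdots \times S_n \to \mathbb{R}.$$

We can stack these to form a tensor:

$$U \in \mathbb{R}^{n \times |S_1| \times \cdots \times |S_n|}, \qquad U[i, s_1, \dots, s_n] = u_i(s_1, \dots, s_n).$$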

For a 2-player, 2-strategy game, this is simply two 2x2 matrices, or one 2x2x2 tensor³.

We say a game is a “strict ordinal” game if all the payoffs for each player are distinct. For 2×2 games, each player has exactly 4 outcomes, ranked 1st through 4th (or 0-3 in my implementation).

There are $4! \times 4! = 576$ possible strict ordinal 2x2 games, but many are “strategically equivalent”. In their 2005 book, Robinson and Goforth note that accounting for “strategic equivalence” leaves us with 144 strict ordinal 2x2 games. They go on to arrange these in a periodic table⁴:

The periodic table organizes the 144 canonical strict ordinal 2x2 games into a structured topology. Each cell represents a unique game type, with well-known games highlighted in colored regions. Moving horizontally changes one player’s preference ordering, while vertical movement affects the other player’s. Adjacent games differ by a single ordinal swap, creating natural “neighborhoods” in game space.
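
As a sanity check on these counts, the enumeration can be reproduced by brute force. The sketch below is not taken from Robinson and Goforth. It assumes each game is encoded as two tuples of ranks 0-3 in row-major order, and the helper names (`relabel`, `swap_roles`, `count_classes`) are made up for illustration. Counting orbits under row and column relabeling alone gives 144; additionally identifying games that differ only by swapping the two players gives 78.

```python
from itertools import permutations

def relabel(game, swap_rows, swap_cols):
    """Relabel actions. Each matrix is (a, b, c, d), meaning [[a, b], [c, d]]."""
    def apply(m):
        a, b, c, d = m
        if swap_rows:
            a, b, c, d = c, d, a, b
        if swap_cols:
            a, b, c, d = b, a, d, c
        return (a, b, c, d)
    return tuple(apply(m) for m in game)

def swap_roles(game):
    """Swap the two players: each matrix is transposed and the pair is exchanged."""
    first, second = game
    transpose = lambda m: (m[0], m[2], m[1], m[3])
    return (transpose(second), transpose(first))

def count_classes(include_player_swap):
    """Count orbits of strict ordinal 2x2 games under the chosen symmetries."""
    seen, classes = set(), 0
    for first in permutations(range(4)):
        for second in permutations(range(4)):
            game = (first, second)
            if game in seen:
                continue
            classes += 1
            bases = [game, swap_roles(game)] if include_player_swap else [game]
            for base in bases:
                for r in (False, True):
                    for c in (False, True):
                        seen.add(relabel(base, r, c))
    return classes

total_games = len(list(permutations(range(4)))) ** 2
print(total_games, count_classes(False), count_classes(True))
```

Because every payoff is distinct, no nontrivial row or column swap can fix a game, which is why the 576 games split evenly into 576 / 4 = 144 classes.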

Strategic Equivalence

Two games are “strategically equivalent” if “rational” players would behave identically in both games. There are different ways to formalize this notion. Commonly, we require that the games have the same Nash equilibria⁵.

Nash Equilibria

A Nash equilibrium is a situation where no player could gain more by changing their own strategy (holding all other players’ strategies fixed).

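Formally, a strategy profile $s^* = (s_1^*, \dots, s_n^*)$ is a Nash equilibrium if for all players $i$ and all alternative strategies $s_i' \in S_i$:

$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i', s_{-i}^*)$$

where $s_{-i}^*$ denotes the strategies of all players except $i$.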

In other words, each player’s strategy is a best response to the other players’ strategies. For 2x2 games, this can be computed by checking each cell to see if either player wants to deviate unilaterally.
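
The cell-by-cell check can be sketched in a few lines. This is a minimal illustration for pure strategies only, assuming payoffs are given as nested lists `A[r][c]` (row player) and `B[r][c]` (column player); the function name `pure_nash_equilibria` is hypothetical and not part of the canonicalizer built later in this post.

```python
def pure_nash_equilibria(A, B):
    """Return all (row, col) cells of a 2x2 game where neither player gains by deviating."""
    equilibria = []
    for r in range(2):
        for c in range(2):
            # Row player compares against the other row, holding the column fixed.
            row_ok = A[r][c] >= A[1 - r][c]
            # Column player compares against the other column, holding the row fixed.
            col_ok = B[r][c] >= B[r][1 - c]
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

# Payoffs match the Prisoner's Dilemma and Stag Hunt tensors used in the tests below.
pd_eq = pure_nash_equilibria([[3, 0], [5, 1]], [[3, 5], [0, 1]])
sh_eq = pure_nash_equilibria([[4, 0], [3, 3]], [[4, 3], [0, 3]])
print(pd_eq, sh_eq)
```

Mixed-strategy equilibria (such as the one in Matching Pennies) are not found by this check.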

Equilibria-Preserving Transformations

It’s well-known⁶ that Nash equilibria are preserved under the following three actions:

per‑player positive affine transforms.
permuting actions per player.
permuting player order.

Positive Affine Transformations

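For each player $i$, let

$$u_i' = a_i u_i + b_i, \qquad a_i > 0.$$

Claim. Best responses and Nash equilibria are unchanged by $u_i \mapsto u_i'$.

Proof. For each player $i$ and fixed $s_{-i}$, $\arg\max_{s_i} u_i'(s_i, s_{-i}) = \arg\max_{s_i} u_i(s_i, s_{-i})$, because $x \mapsto a_i x + b_i$ is strictly increasing. Thus, the Nash set is invariant.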

Action Permutations

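Let $\pi_i : S_i \to S_i$ be a permutation for each player $i$. Define the relabeled game by

$$u_i'(\pi_1(s_1), \dots, \pi_n(s_n)) = u_i(s_1, \dots, s_n).$$

Claim. Nash equilibria are preserved under action relabeling.

Proof. Consider a strategy profile $s^*$ that is a Nash equilibrium in the original game. We show that $\pi(s^*) = (\pi_1(s_1^*), \dots, \pi_n(s_n^*))$ is a Nash equilibrium in the relabeled game.

In the original game, for each player $i$ and any alternative strategy $s_i'$:

$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i', s_{-i}^*).$$

In the relabeled game, for any deviation to action $\pi_i(s_i')$ (every action has this form, since $\pi_i$ is a bijection):

$$u_i'(\pi_i(s_i'), \pi_{-i}(s_{-i}^*)) = u_i(s_i', s_{-i}^*) \le u_i(s_i^*, s_{-i}^*) = u_i'(\pi(s^*)).$$

Thus no player can improve by deviating in the relabeled game.

The mapping $s \mapsto \pi(s)$ is a bijection between strategy profiles, establishing a one-to-one correspondence between Nash equilibria in the original and relabeled games.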

Player Permutations

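Let $\sigma$ be a permutation of players. Define $G^\sigma$ by reindexing: $u^\sigma_{\sigma(i)}(t) = u_i(\sigma^{-1} \cdot t)$, where $(\sigma^{-1} \cdot t)_i = t_{\sigma(i)}$ reorders the profile back to the original coordinate order.

Claim. Permuting player order preserves Nash equilibria.

Proof. Similar to action permutations. Consider a strategy profile $s^*$ that is a Nash equilibrium in the original game. We show that $t^*$ defined by $t^*_{\sigma(i)} = s_i^*$ is a Nash equilibrium in $G^\sigma$.

In the original game, for each player $i$ and any alternative strategy $s_i'$:

$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i', s_{-i}^*).$$

In the reindexed game $G^\sigma$, for player $\sigma(i)$ considering deviation to strategy $s_i'$:

$$u^\sigma_{\sigma(i)}(s_i', t^*_{-\sigma(i)}) = u_i(s_i', s_{-i}^*) \le u_i(s_i^*, s_{-i}^*) = u^\sigma_{\sigma(i)}(t^*).$$

Thus no player can improve by deviating in the reindexed game.

The mapping $s \mapsto t$ where $t_{\sigma(i)} = s_i$ is a bijection between strategy profiles, establishing a one-to-one correspondence between Nash equilibria in the original and reindexed games.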

Example

Returning to our original example, we can transform Game A into Game B via:

Affine transformation: Transform both players’ payoffs by .
Relabel actions: for Player 1, for Player 2
Relabel players: Swap Player 1 and Player 2

Thus, Game A and Game B are in some sense “equivalent”. We can also think about decomposing the original game into a tuple consisting of the canonical id, permutations on the row and columns, and an affine transformation, wherein Game A and Game B are equal in the first entry.

Differentiable Permutations

Our goal is to construct an invariant, differentiable, and deterministic function that maps a given game to its canonical representation. To do this, we will need to implement differentiable versions of each type of transformation that preserves the Nash equilibria. Since positive affine transformations are already differentiable, we just need a way to differentiably handle permutations. For this post we solve the differentiability problem by using the Sinkhorn algorithm to generate “soft” permutations⁷.

Doubly-Stochastic Matrices

Permutation matrices are discrete, but we can approximate them with doubly-stochastic matrices (non-negative matrices where rows and columns sum to 1).

A permutation matrix has exactly one 1 in each row and column, with all other entries being 0. For example, the permutation that swaps two items:

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

A doubly-stochastic matrix relaxes this constraint: entries can be any values in $[0, 1]$, as long as each row sums to 1 and each column sums to 1. For example:

$$\begin{pmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{pmatrix}$$

This “soft” permutation mostly swaps the items (0.8 weight) but keeps some probability mass (0.2) on not swapping. As we make the entries more extreme (closer to 0 or 1), we approach a hard permutation.

Sinkhorn Algorithm

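The Sinkhorn algorithm iteratively normalizes a matrix to make it doubly-stochastic. We start with a matrix $K$ containing positive entries $K_{ij} > 0$.

To build $K$ from a vector of scores $s$, we first create a cost matrix

$$C_{ij} = (s_i - p_j)^2$$

where $p_j$ are target positions.

Then, we convert to “soft positions” with temperature $\tau$:

$$K_{ij} = \exp\left(-\frac{C_{ij}}{2\tau}\right).$$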

Finally, we alternate normalizing the rows and columns until converged (usually 20-30 iterations).
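
The three steps above can be sketched without any tensor library. This is a simplified pure-Python illustration, not the PyTorch method used later; the function names `logsumexp` and `soft_permutation` and the example scores are assumptions made for the demo, and it requires at least two scores.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_permutation(scores, tau=0.05, n_iters=30):
    n = len(scores)
    lo, hi = min(scores), max(scores)
    s = [(x - lo) / (hi - lo + 1e-12) for x in scores]   # normalize scores to [0, 1]
    positions = [j / (n - 1) for j in range(n)]          # evenly spaced target positions
    # Log of the kernel: -cost / (2 * tau), with a quadratic cost.
    log_p = [[-((s[i] - positions[j]) ** 2) / (2 * tau) for j in range(n)]
             for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):                               # row normalization
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        for j in range(n):                               # column normalization
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

P = soft_permutation([0.9, 0.1, 0.5])
for row in P:
    print([round(v, 3) for v in row])
```

Row `i` of the result says where item `i` lands: the highest score puts most of its mass on the last position, the lowest score on the first, so the soft permutation approximates an ascending sort.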

Practical Considerations

The algorithm is guaranteed to converge to a unique solution for strictly positive matrices.

Since we use $\exp(\cdot)$, all entries are positive. In practice, we work in log-space for numerical stability. As temperature $\tau \to 0$, the soft permutation approaches a hard permutation while remaining differentiable.

When scores are identical, the cost matrix has ties and the soft permutation becomes ambiguous. We break ties using secondary criteria:

$$\text{score} = \mu + \epsilon_1 \sigma^2 + \epsilon_2 m$$

where $\mu$ is the mean score, $\sigma^2$ is the variance, $m$ is the maximum, and $\epsilon_1, \epsilon_2$ are tiny weights. This ensures a deterministic ordering even for symmetric games.
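
A small numeric illustration of the tie-breaking idea, assuming the same default weights (1e-3 for the variance, 1e-6 for the maximum) that the class below uses. The function name `tie_broken_score` and the two example slices are made up for the demo.

```python
def tie_broken_score(values, w_var=1e-3, w_max=1e-6):
    """Mean as the primary score; variance and maximum as tiny secondary terms."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    return mean + w_var * var + w_max * max(values)

# Two slices with the same mean (1.5) but different spread.
score_a = tie_broken_score([1.0, 2.0])   # variance 0.25, maximum 2
score_b = tie_broken_score([0.0, 3.0])   # variance 2.25, maximum 3
print(score_a, score_b)
```

The means tie, but the secondary terms separate the two scores, so the ordering no longer depends on arbitrary floating-point noise. Slices that agree on mean, variance, and maximum would still tie under this scheme.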

Finally, we need to handle permutations on multiple axes. To manage this, we compute all permutations from the original ordinal rankings, then apply them in a fixed order. This ensures the canonicalization is consistent and differentiable end-to-end.

Implementation

Now that we have differentiable permutations, here is our general approach for implementing the key functions:

Convert payoffs to ordinal rankings.
Apply soft permutations to sort players and actions.
Use temperature annealing to sharpen soft permutations.
Generate a unique hash to identify the game.

We’ll organize this code as a GameCanonicalizer class. The remaining functions will be methods.

import hashlib
from typing import Tuple  # used in the type hints below

import torch
import torchsort
from torch import nn
from scipy.optimize import linear_sum_assignment as hungarian

class GameCanonicalizer(nn.Module):
    
    def __init__(
            self, 
            num_players: int, 
            tau_players: float = 0.02, 
            tau_actions: float = 0.02, 
            sinkhorn_iters: int = 30,
            rank_reg: float = 1e-4, 
            tiny_tie: Tuple[float, float] = (1e-3, 1e-6)
            ):
        super().__init__()

        self.num_players = num_players

        self.rank_reg = rank_reg
        self.tau_players = tau_players
        self.tau_actions = tau_actions
        self.sinkhorn_iters = sinkhorn_iters
        self.tiny_tie = tiny_tie

Ordination

First, we create our ordinal values using torchsort’s soft_rank function, which uses projections onto the permutahedron to generate differentiable ranks.

    def ordinate(self, payoffs: torch.Tensor) -> torch.Tensor:
        original_shape = payoffs.shape
        flattened = payoffs.reshape(self.num_players, -1) 
        rankings = torchsort.soft_rank(flattened, regularization_strength=self.rank_reg)
        return rankings.view(original_shape)

Soft Permutations

Next, we need to implement our soft permutations.

Sinkhorn

We’ll start with the Sinkhorn algorithm itself:

    def sinkhorn(self, log_alpha: torch.Tensor, n_iters: int = 30, eps: float = 1e-9) -> torch.Tensor:
        log_P = log_alpha
        for _ in range(n_iters):
            log_P = log_P - torch.logsumexp(log_P, dim=1, keepdim=True) # rownorm
            log_P = log_P - torch.logsumexp(log_P, dim=0, keepdim=True) # colnorm
        return torch.exp(log_P).clamp_min(eps)

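The input log_alpha is already in log-space, so we alternate normalizing the rows and columns by subtracting the corresponding logsumexp, as described above. Finally, we exponentiate to recover the (approximately) doubly-stochastic matrix.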

Converting Scores to Permutations

Now that we have the Sinkhorn method, we need to run it. The soft_perm_from_scores method creates these soft permutations from scores by first normalizing scores to $[0, 1]$, then computing a quadratic cost matrix between scores and target positions, then applying Sinkhorn.

    def soft_perm_from_scores(self, scores: torch.Tensor, tau: float = 0.05, n_iters: int = 30) -> torch.Tensor:
        n = scores.shape[0]
        positions = torch.linspace(0.0, 1.0, n, device=scores.device, dtype=scores.dtype)
        s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12) 
        cost = (s[:, None] - positions[None, :]) ** 2  
        log_alpha = -cost / (2 * tau)  
        P = self.sinkhorn(log_alpha, n_iters=n_iters) 
        return P

Scoring Functions with Tie-Breaking

Where do we get the scores from in the previous section? The _action_scores and _player_scores methods compute summary statistics for each action or player from the ordinal tensor. Both use the mean as the primary score, with variance and maximum as tie-breakers weighted by tiny constants. This ensures a deterministic ordering even for symmetric games where multiple permutations could be valid.

    def _action_scores(self, ordinal: torch.Tensor, player_idx: int) -> torch.Tensor:
        axes = list(range(ordinal.ndim))
        reduce_axes = [a for a in axes if a != player_idx]
        mean = ordinal.mean(dim=reduce_axes)
        var = ordinal.var(dim=reduce_axes, unbiased=False)
        mx = ordinal.amax(dim=reduce_axes)
        w_var, w_max = self.tiny_tie
        return mean + w_var * var + w_max * mx  

    def _player_scores(self, ordinal: torch.Tensor) -> torch.Tensor:
        P = ordinal.shape[0]
        flat = ordinal.reshape(P, -1)   
        mean = flat.mean(dim=1)
        var  = flat.var(dim=1, unbiased=False)
        mx   = flat.max(dim=1).values
        w_var, w_max = self.tiny_tie
        return mean + w_var * var + w_max * mx

Applying Permutations to Tensors

We also need to be able to apply permutations to tensors. The _mode_matmul method applies a permutation matrix to a specific axis of a tensor. Since matrix multiplication only works on 2D tensors, we permute dimensions to bring the target axis to the front, apply the permutation matrix transpose (which maps items to their sorted positions), then permute back. This allows us to sort along any axis of our multi-dimensional payoff tensor.

    @staticmethod
    def _mode_matmul(t: torch.Tensor, P: torch.Tensor, axis: int) -> torch.Tensor:

        perm = list(range(t.ndim))
        perm[axis], perm[0] = perm[0], perm[axis]
        t_perm = t.permute(perm) 
        n = t_perm.shape[0]
        assert P.shape == (n, n)
        
        t_sorted = torch.tensordot(P.T, t_perm, dims=([1], [0]))  
        inv = list(range(t.ndim))
        inv[0], inv[axis] = inv[axis], inv[0]
        return t_sorted.permute(inv)

Forward Pass

Now that we have the above methods, we can put them together. The forward method orchestrates the soft canonicalization process. First, it converts payoffs to ordinal rankings. Then it computes action permutations for each player from the original ordinals and applies them to their respective axes (axes 1, 2, etc.). Finally, it computes and applies the player permutation on axis 0. The key insight is that action permutations are computed from the original ordinals before any transformations, ensuring consistency—each player’s actions are sorted based on their own payoffs, not influenced by other permutations.

    def forward(self, payoffs: torch.Tensor):
        # Returns (soft canonical tensor, soft player permutation, tuple of soft action permutations)
        P = payoffs.shape[0]
        ordinated_payoffs = self.ordinate(payoffs)

        P_actions = []
        for i in range(P):
            s_actions = self._action_scores(ordinated_payoffs[i], player_idx=i)
            P_i = self.soft_perm_from_scores(s_actions, tau=self.tau_actions, n_iters=self.sinkhorn_iters)
            P_actions.append(P_i)

        canon = ordinated_payoffs
        for i, P_i in enumerate(P_actions):
            canon = self._mode_matmul(canon, P_i, axis=1 + i)

        s_players = self._player_scores(canon)
        P_players = self.soft_perm_from_scores(s_players, tau=self.tau_players, n_iters=self.sinkhorn_iters)
        canon = self._mode_matmul(canon, P_players, axis=0)

        return canon, P_players, tuple(P_actions)

Hard Canonicalization

We also want a “hard” path to verify everything is working properly. We can map the “soft” permutations back to “hard” permutations and then use those to recover the discrete cases.

Project Soft Permutations to Hard

The _project_soft_to_perm method converts a soft permutation matrix to a hard permutation (using the Hungarian algorithm). We add tiny tie-breaking noise to ensure deterministic results even when the soft matrix has ambiguous assignments.

    @staticmethod
    def _project_soft_to_perm(P_soft: torch.Tensor) -> torch.Tensor:
        n = P_soft.shape[0]
        idx = torch.arange(n, device=P_soft.device, dtype=P_soft.dtype)
        eps = 1e-11 * (idx[:, None] + 0.73 * idx[None, :])

        cost = (-P_soft + eps).detach().cpu().numpy()
        r, c = hungarian(cost)
        Pi = torch.zeros_like(P_soft)
        Pi[r, c] = 1.0
        return Pi
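
To make the projection step concrete, here is a brute-force sketch that swaps the Hungarian algorithm for an exhaustive search over all permutations. That is only viable for tiny matrices, but it makes the objective explicit: pick the permutation that captures the most mass of the soft matrix. The name `project_to_permutation` is hypothetical and the example matrix is made up.

```python
from itertools import permutations

def project_to_permutation(P_soft):
    """Return the 0/1 permutation matrix maximizing sum_r P_soft[r][cols[r]]."""
    n = len(P_soft)
    best = max(
        permutations(range(n)),
        key=lambda cols: sum(P_soft[r][cols[r]] for r in range(n)),
    )
    return [[1.0 if best[r] == c else 0.0 for c in range(n)] for r in range(n)]

soft = [
    [0.1, 0.7, 0.2],
    [0.6, 0.2, 0.2],
    [0.3, 0.1, 0.6],
]
hard = project_to_permutation(soft)
print(hard)
```

The Hungarian algorithm solves the same assignment problem in polynomial time, which is why the method above negates the soft matrix to turn it into a cost.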

Lexicographic Ordering For Symmetric Games

The _axis_lexperm method performs lexicographic sorting along a specified axis, which is crucial for handling symmetric games like Matching Pennies. It treats each slice along the axis as a multi-digit number in base (max_rank + 1) and sorts these “numbers” to get a canonical ordering. The _permute_along_axis helper applies the resulting permutation to reorder the tensor. This deterministic tie-breaking ensures that even perfectly symmetric games get a unique canonical form.

    def _axis_lexperm(self, ranks: torch.Tensor, axis: int) -> torch.Tensor:

        perm = list(range(ranks.ndim))
        perm[axis], perm[0] = perm[0], perm[axis]
        X = ranks.permute(perm) 

        n = X.shape[0]
        S = X.reshape(n, -1)

        maxv = int(S.max().item()) if S.numel() > 0 else 0
        base = maxv + 1

        K = torch.zeros(n, dtype=torch.float64, device=S.device)
        pow_ = 1.0
        for j in range(S.shape[1]-1, -1, -1):
            K += (S[:, j].to(torch.float64)) * pow_
            pow_ *= base

        order = torch.argsort(K, stable=True)
        return order

    @staticmethod
    def _permute_along_axis(t: torch.Tensor, order: torch.Tensor, axis: int) -> torch.Tensor:
        perm = list(range(t.ndim))
        perm[axis], perm[0] = perm[0], perm[axis]
        t0 = t.permute(perm)
        t0 = t0.index_select(0, order.to(t0.device))
        inv = list(range(t.ndim))
        inv[0], inv[axis] = inv[axis], inv[0]
        return t0.permute(inv)
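
The base-(max_rank + 1) key can be illustrated on plain lists. This sketch uses Python integers instead of float64 tensors, and the name `lex_order` is an assumption for the example. The first entry of each slice is the most significant digit, as in the method above.

```python
def lex_order(rows):
    """Order rows by reading each one as a number in base (max entry + 1)."""
    base = max(max(row) for row in rows) + 1
    def key(row):
        k = 0
        for v in row:          # Horner's rule: earlier entries are more significant
            k = k * base + v
        return k
    return sorted(range(len(rows)), key=lambda i: key(rows[i]))

rows = [[2, 0, 3, 1], [1, 3, 0, 2]]
order = lex_order(rows)
print(order)
```

Here the keys are 141 and 114 in base 4, so the second slice sorts first. With integer ranks this agrees with ordinary tuple comparison.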

Integer Ordinals

The _integerize_ordinals method converts soft ordinal rankings to hard integer ranks (0, 1, 2, 3 for 2×2 games). It adds tiny deterministic noise based on the golden ratio to break ties, then uses argsort twice to get proper rankings. This ensures each player’s outcomes are mapped to distinct integers while preserving the ordering from the soft ranks.

    @staticmethod
    def _integerize_ordinals(ord_tensor: torch.Tensor) -> torch.Tensor:
        P = ord_tensor.shape[0]
        M = ord_tensor[0].numel()
        out = []
        for p in range(P):
            x = ord_tensor[p].flatten()
            idx = torch.arange(M, device=x.device, dtype=x.dtype)
            x_eps = x + 1e-9 * ((idx * 0.61803398875) % 1.0) # Golden ratio trick
            order = torch.argsort(x_eps, stable=True)      
            ranks = torch.empty_like(order)
            ranks[order] = torch.arange(M, device=x.device)
            out.append(ranks.view_as(ord_tensor[p]))
        return torch.stack(out, dim=0).to(torch.int32)
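
The "argsort twice" idea is easier to see on a plain list. A minimal sketch, with a hypothetical helper name `hard_ranks`: sort the indices by value, then write each index's position in that ordering back to its original slot.

```python
def hard_ranks(values):
    """Return 0-based ranks; ties are broken by original position (stable sort)."""
    order = sorted(range(len(values)), key=lambda i: values[i])   # argsort
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):                            # invert the argsort
        ranks[idx] = rank
    return ranks

# Player 1's payoffs in the Prisoner's Dilemma, flattened row-major.
ranks = hard_ranks([3.0, 0.0, 5.0, 1.0])
print(ranks)
```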

Putting it Together

The hard_canonical method performs the complete canonicalization with hard permutations. It follows the same flow as the soft version but uses the Hungarian algorithm via _project_soft_to_perm to convert each soft permutation to a hard one. After applying player and action permutations, it performs an additional lexicographic sorting step on the integer ordinals to handle symmetric games. This final step ensures a unique canonical form even when the initial permutations leave multiple valid orderings.

    def hard_canonical(self, payoffs: torch.Tensor):
        P = payoffs.shape[0]

        ordinal = self.ordinate(payoffs)

        Pi_actions = []
        hard = ordinal
        for i in range(P):
            s_actions = self._action_scores(ordinal[i], player_idx=i)
            P_i_soft = self.soft_perm_from_scores(s_actions, tau=self.tau_actions, n_iters=self.sinkhorn_iters)
            Pi_i = self._project_soft_to_perm(P_i_soft.detach())
            hard = self._mode_matmul(hard, Pi_i, axis=1 + i)
            Pi_actions.append(Pi_i)

        s_players = self._player_scores(hard)
        P_players_soft = self.soft_perm_from_scores(s_players, tau=self.tau_players, n_iters=self.sinkhorn_iters)
        Pi_players = self._project_soft_to_perm(P_players_soft.detach())
        hard = self._mode_matmul(hard, Pi_players, axis=0)

        ranks = self._integerize_ordinals(hard)

        order0 = self._axis_lexperm(ranks, axis=0)

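        # Column j selects source index order0[j], so the float path matches the
        # index_select applied to the integer ranks below (identical for 2 players).
        E0_h = torch.eye(P, device=hard.device, dtype=hard.dtype)[:, order0]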
        hard  = self._mode_matmul(hard, E0_h, axis=0)

        ranks = self._permute_along_axis(ranks, order0, axis=0)

        for i in range(P):
            ord_i = self._axis_lexperm(ranks, axis=1 + i)

            # float path
            n_i  = hard.shape[1 + i]
            # Column j selects source index ord_i[j], consistent with the int path below.
            Ei_h = torch.eye(n_i, device=hard.device, dtype=hard.dtype)[:, ord_i]
            hard  = self._mode_matmul(hard, Ei_h, axis=1 + i)

            # int path
            ranks = self._permute_along_axis(ranks, ord_i, axis=1 + i)
        return hard, Pi_players, tuple(Pi_actions)

Hashing

The class_id method generates a unique identifier for each game’s strategic equivalence class. It runs the hard canonicalization, converts the result to integer ordinals, then computes a SHA-256 hash of the binary representation. This allows us to quickly identify when two games are strategically equivalent.

    def class_id(self, payoffs: torch.Tensor):
        hard_ord, _, _ = self.hard_canonical(payoffs)
        ranks = self._integerize_ordinals(hard_ord)
        b = ranks.detach().cpu().numpy().tobytes()
        digest = hashlib.sha256(b).hexdigest()
        return digest[:12], digest, ranks
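
The hashing step on its own looks roughly like this. The sketch assumes the canonical ranks have already been flattened to a list of integers and packs them as little-endian 32-bit values with `struct`. That byte layout is an assumption made for the example and is not guaranteed to match the NumPy `tobytes()` output on every platform.

```python
import hashlib
import struct

def class_digest(ranks):
    """SHA-256 hex digest of a flat list of integer ranks."""
    payload = struct.pack(f"<{len(ranks)}i", *ranks)   # little-endian int32
    return hashlib.sha256(payload).hexdigest()

canonical_a = [2, 0, 3, 1, 2, 3, 0, 1]
canonical_b = [2, 0, 3, 1, 2, 3, 0, 1]   # same canonical form, e.g. from a relabeled game
other = [3, 0, 2, 1, 3, 2, 0, 1]

print(class_digest(canonical_a)[:12])
print(class_digest(canonical_a) == class_digest(canonical_b))
```

Equal canonical forms always hash to the same ID; distinct forms collide only with negligible probability, although truncating to 12 hex characters raises that probability somewhat.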

Complexity Analysis

The canonicalization process has the following complexity for 2×2 games:

Ordinal ranking: $O(n \log n)$, where $n = 4$ (number of outcomes per player)
Sinkhorn iterations: $O(k \cdot m^2)$ for an $m \times m$ matrix, where $k$ is the number of iterations
Hungarian algorithm for hard permutation: $O(m^3)$

For 2×2 games, this is effectively constant time. For larger games with $p$ players and $s$ strategies each:

Space: $O(p \cdot s^p)$ for the payoff tensor
Time: $O(p \cdot s^p \log(s^p))$ for ranking + $O(p \cdot k \cdot s^2)$ for permutations

Edge Cases and Limitations

This implementation handles several tricky cases:

Symmetric games (e.g., Matching Pennies): The lexicographic ordering ensures deterministic canonicalization even when multiple permutations could be valid.
Ties in ordinal rankings: While we assume strict ordinal games, near-ties are handled via the tie-breaking parameters tiny_tie.

The main limitation is the restriction to strict ordinal games. Games with payoff ties would require a different approach or explicit tie-breaking rules.

Example

Let’s look at a simple example.

# Our original games from the introduction
game_a = torch.tensor([
    [[3.0, 0.0],[5.0, 1.0]],
    [[3.0, 5.0],[0.0, 1.0]],
])

game_b = torch.tensor([
    [[10.0, 25.0],[4.0, 16.0]],
    [[10.0, 4.0],[25.0, 16.0]],
])

canon = GameCanonicalizer(num_players=2)

# Ordinal conversion
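ord_a = canon.ordinate(game_a)
# 0-based ranks: [[2, 0], [3, 1]] for P1, [[2, 3], [0, 1]] for P2
# (soft_rank itself returns floats that are approximately these values plus one)

ord_b = canon.ordinate(game_b)
# 0-based ranks: [[1, 3], [0, 2]] for P1, [[1, 0], [3, 2]] for P2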

# Canonicalization
canon_a, _, _ = canon.hard_canonical(game_a)
canon_b, _, _ = canon.hard_canonical(game_b)

# Verify they're the same
id_a = canon.class_id(game_a)[0]
id_b = canon.class_id(game_b)[0]

print(f"Game A canonical: {canon_a}")
print(f"Game B canonical: {canon_b}")
print(f"Game A ID: {id_a}")
print(f"Game B ID: {id_b}")
print(f"Same game? {id_a == id_b}")  # True!

Tests

Let’s now look at some tests. We’ll start with some helper functions⁸.

Helper Functions

def affine(payoffs: torch.Tensor, a0=1.0, b0=0.0, a1=1.0, b1=0.0) -> torch.Tensor:
    P = payoffs.clone().float()
    P[0] = a0 * P[0] + b0
    P[1] = a1 * P[1] + b1
    return P

def permute_actions(payoffs: torch.Tensor, perm0=(0,1), perm1=(0,1)) -> torch.Tensor:
    P = payoffs.clone()
    P = P[:, perm0, :]   
    P = P[:, :, perm1]  
    return P

def swap_players(payoffs: torch.Tensor) -> torch.Tensor:
    P0 = payoffs[0].transpose(0,1)  
    P1 = payoffs[1].transpose(0,1)
    return torch.stack([P1, P0], dim=0)

def class_id(canon, U):
    short, full, _ = canon.class_id(U)
    return full

The first three functions will let us create variants of games. The last function returns the full class_id for a given game.

Game Payoffs

Here are a number of “base games” we can test:

def prisoners_dilemma():
    return torch.tensor([
        [[3.0, 0.0],[5.0, 1.0]],
        [[3.0, 5.0],[0.0, 1.0]],
    ])

def stag_hunt():
    return torch.tensor([
        [[4.0, 0.0],[3.0, 3.0]],
        [[4.0, 3.0],[0.0, 3.0]],
    ])

def battle_of_sexes():
    return torch.tensor([
        [[2.0, 0.0],[0.0, 1.0]],
        [[1.0, 0.0],[0.0, 2.0]],
    ])

def matching_pennies():
    P1 = torch.tensor([[1.0, -1.0],[-1.0, 1.0]])
    P2 = -P1
    return torch.stack([P1, P2], dim=0)

def hawk_dove():
    return torch.tensor([
        [[0.0, 3.0],[1.0, 2.0]],
        [[0.0, 1.0],[3.0, 2.0]],
    ])

Stability Under Transformation

We can combine our base payoffs with our transformation functions to build variants:

def pd_variants():
    U = prisoners_dilemma()
    return [
        U,
        affine(U, a0=2.0, b0=5.0, a1=0.5, b1=-1.0),                         
        permute_actions(U, perm0=(1,0), perm1=(1,0)),                       
        swap_players(U),                                                    
        permute_actions(affine(U, a0=3.0, b0=7.0, a1=1.7, b1=2.0), (1,0), (0,1)),  
    ]

def sh_variants():
    U = stag_hunt()
    return [
        U,
        affine(U, a0=4.0, b0=10.0, a1=2.0, b1=-3.0),
        permute_actions(U, perm0=(1,0), perm1=(1,0)),
        swap_players(U),
        permute_actions(affine(U, a0=0.7, b0=0.0, a1=5.0, b1=1.0), (0,1), (1,0)),
    ]

def bos_variants():
    U = battle_of_sexes()
    return [
        U,
        permute_actions(U, perm0=(1,0), perm1=(1,0)),   
        swap_players(U),                                  
        affine(U, a0=2.0, b0=1.0, a1=3.0, b1=-2.0),       
        permute_actions(affine(U, 1.0, 0.0, 4.0, 10.0), (1,0), (0,1)),
    ]

def mp_variants():
    U = matching_pennies()
    return [
        U,
        permute_actions(U, perm0=(1,0), perm1=(1,0)),     
        swap_players(U),                                   
        affine(U, a0=5.0, b0=3.0, a1=2.0, b1=-7.0),       
        permute_actions(affine(U, 2.0, 9.0, 1.0, -4.0), (1,0), (0,1)),
    ]

def hd_variants():
    U = hawk_dove()
    return [
        U,
        permute_actions(U, perm0=(1,0), perm1=(1,0)),
        swap_players(U),
        affine(U, a0=2.0, b0=0.0, a1=3.0, b1=1.0),
        permute_actions(affine(U, 0.5, 2.0, 1.5, -3.0), (0,1), (1,0)),
    ]

We can then test this as follows:

def test_family_equivalence(canonicalizer):
    families = [
        ("PrisonersDilemma", pd_variants()),
        ("StagHunt",         sh_variants()),
        ("BattleOfSexes",    bos_variants()),
        ("MatchingPennies",  mp_variants()),
        ("HawkDove",         hd_variants()),
    ]

    for name, variants in families:
        ids = [class_id(canonicalizer, U) for U in variants]
        assert len(set(ids)) == 1, f"{name}: expected all variants equivalent, got IDs: {ids}"
        print(f"[OK] {name}. ID {ids[0]} (x{len(variants)})")

Negative Controls

def test_negative_controls(canonicalizer):
    reps = [
        prisoners_dilemma(),
        stag_hunt(),
        battle_of_sexes(),
        matching_pennies(),
        hawk_dove(),
    ]
    ids = [class_id(canonicalizer, U) for U in reps]
    assert len(set(ids)) == len(ids), f"Negative control failed: collisions among base games: {ids}"
    print("[OK] Negative controls. Distinct IDs: ", ids)

Stability Under Small Noise

def test_stability_small_noise(canonicalizer):
    U = stag_hunt()
    base = class_id(canonicalizer, U)
    eps = 1e-9
    U_noisy = U + eps * torch.randn_like(U)
    pert = class_id(canonicalizer, U_noisy)
    assert base == pert, "Tiny noise changed ID; consider epsilon tie-handling."
    print("[OK] Stability to tiny noise.")

Conclusion

We have successfully constructed a differentiable way to produce a unique identifier for each strict ordinal 2x2 game. In the next post in this series, we will expand on this capability (or incorporate it into another pipeline for some practical use case).

Footnotes

Or vice-versa, to retrieve the game behavior based on some identifier.↩︎
i.e. “When does Stag Hunt turn into Chicken?” Hopefully more on this in a future post.↩︎
For an $n$-player, $m$-strategy game, we would need an $n \times m \times \cdots \times m$ tensor ($n$ copies of $m$).↩︎
I will hopefully also further investigate the algebraic structure of games, and this periodic table, in a future post.↩︎
Other forms of strategic equivalence might revolve around dominance relationships or best-response correspondences.↩︎
See Myerson (1991), Game Theory: Analysis of Conflict, Chapter 3.↩︎
There are alternative methods. From Claude: NeuralSort (Grover et al., 2019), SoftSort (Prillo & Eisenschlos, 2020), Fast Differentiable Sorting (Blondel et al., 2020), Optimal Transport Sort (Cuturi et al., 2019), Blackbox Differentiable Ranking (Vlastelica et al., 2020; Rolínek et al., 2020), Relaxed Bubble Sort (Petersen et al., 2021). We could also use non-differentiable methods, like REINFORCE. Sinkhorn should be sufficient for these purposes: it’s well-known, controllable, and stable, even for near-ties (common in symmetric games). If we need a faster algorithm later we can swap Sinkhorn for something else.↩︎
Disclosure: ChatGPT and Claude helped write the tests.↩︎

SINDy Method for Learning Dynamical Systems

Sun, 06 Jul 2025 04:00:00 GMT

Introduction

In the last post I looked at DMD and EDMD, two methods for analyzing dynamical systems. Both methods look at pairs of snapshots and fit a linear update rule that best predicts the next snapshot. By inspecting that linear map’s eigenvalues and eigenvectors we can learn which patterns dominate and how fast they grow or decay.

Unfortunately, both of these methods rely on a set of features you have chosen in advance. Without the right features you might miss important physics; too many features and the model will become unwieldy. Additionally, while the spectral methods provide some useful insight, they might not have the most easily interpretable output.

What we need is a sparse method, so that we can test a wide number of features and only select the relevant ones for the dynamics, ideally to recover an actual set of interpretable governing equations.

Sparse Identification of Nonlinear Dynamical Systems (SINDy)

SINDy creates a library of candidate basis functions for the eigenfunctions (like EDMD), then does an L1-regularized¹ regression to determine which ones to use. There are two formulations: discrete-time and continuous-time.

Discrete Time

This setup is the same as in DMD and EDMD. We have a discrete-time dynamical system of form:

We have our data stream, which we use to create our matrix :

and its time-shifted counterpart :

Now, we need some “library matrix” of relevant basis functions applied to the data (similar to EDMD). This may look something like:

This matrix can be quite large. For example, if the basis functions are drawn from combinations of the monomials , , and (up to quadratic), we would have something like:

We are looking for a sparse set of coefficients such that

This leads us to the SINDy equation:

Each basis function is weighted by a . To find the , we run a sparse -regularized regression:

where is a parameter controlling the strength of the regularization.
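
To make the sparse regression concrete, here is a toy sketch that solves the L1-regularized least-squares problem by coordinate descent with soft-thresholding. This is one standard solver, chosen here for brevity, and not necessarily the one used for the results in this post. The data, the library [1, x, x²], and the function names are all assumptions for the example: the samples come from the map x ↦ 0.5x evaluated on a symmetric grid rather than from a real trajectory.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(Theta, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - Theta @ xi||^2 + lam * ||xi||_1 one coefficient at a time."""
    n_samples, n_features = len(Theta), len(Theta[0])
    xi = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            rho, norm_sq = 0.0, 0.0
            for i in range(n_samples):
                # Prediction from every feature except j.
                partial = sum(Theta[i][k] * xi[k] for k in range(n_features) if k != j)
                rho += Theta[i][j] * (y[i] - partial)
                norm_sq += Theta[i][j] ** 2
            xi[j] = soft_threshold(rho, lam) / norm_sq
    return xi

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
Theta = [[1.0, x, x * x] for x in xs]   # library columns: 1, x, x^2
y = [0.5 * x for x in xs]               # "next state" under x -> 0.5 x
xi = lasso_coordinate_descent(Theta, y, lam=0.1)
print(xi)
```

The constant and quadratic coefficients come out exactly zero, so only the linear term survives. Its value (0.46) is shrunk slightly below the true 0.5, which is the usual bias introduced by the L1 penalty.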

To see the mapping at a single time-step, take the -th row of :

where

Continuous Time

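A more typical formulation is to assume a dynamical system of the form

$$\frac{d}{dt} x(t) = f(x(t))$$

where $x$ is the state (possibly a vector) and $t$ is the time. We have moved from discrete time to continuous time. We now seek to learn the vector field $f$, which relates the current state to the rate of change in the state, rather than the next state. Luckily, as we described in the previous post, the Koopman eigenfunctions $\varphi$ satisfy:

$$\frac{d}{dt} \varphi(x(t)) = \lambda \varphi(x(t)).$$

By the chain rule, we also have

$$\frac{d}{dt} \varphi(x(t)) = \nabla \varphi(x(t)) \cdot f(x(t)).$$

Combining the two, we have

$$\nabla \varphi(x) \cdot f(x) = \lambda \varphi(x).$$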

So we can approximate the eigenfunctions via regression². Approximate by a sparse weighted sum of dictionary elements:

This gives:

Evaluating at the sample states:

Now we have

and so we can solve the continuous-time problem with the following regression

Conveniently, this is exactly the same as the previous regression equation, except here we are approximating:

Implementation

To implement SINDy, we need a library matrix function and a method of handling the regression. We may also want a finite differences method, to calculate derivatives.

Finite Differences

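We will use finite differences³ to approximate the time derivative by dividing small state increments by the timestep $\Delta t$.

At an interior point $t_i$ we use the second-order central stencil

$$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{2 \Delta t}$$

which cancels the first-order truncation error, yielding $O(\Delta t^2)$ accuracy.

At the boundaries we fall back to the first-order forward and backward stencils (for $m$ samples $t_0, \dots, t_{m-1}$):

$$\dot{x}(t_0) \approx \frac{x(t_1) - x(t_0)}{\Delta t}, \qquad \dot{x}(t_{m-1}) \approx \frac{x(t_{m-1}) - x(t_{m-2})}{\Delta t}.$$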

import numpy as np

def finite_difference(X, dt):
    # X has shape (n_vars, n_samples); the result has shape (n_samples, n_vars)
    n_vars, n_samples = X.shape
    
    # Use central differences where possible, forward/backward at boundaries
    dXdt = np.zeros((n_samples, n_vars))
    
    # Forward difference at first point
    dXdt[0, :] = (X[:, 1] - X[:, 0]) / dt
    
    # Central differences for interior points
    for i in range(1, n_samples - 1):
        dXdt[i, :] = (X[:, i + 1] - X[:, i - 1]) / (2 * dt)
    
    # Backward difference at last point
    dXdt[-1, :] = (X[:, -1] - X[:, -2]) / dt
    
    return dXdt

The helper above implements exactly this scheme: it accepts a state matrix of shape (n_vars, n_samples) and returns the derivative matrix of shape (n_samples, n_vars), transposed so that rows correspond to time snapshots for the subsequent regression.
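
The same stencils can also be written without the Python loop. The sketch below is one possible vectorized variant (the name `finite_difference_vectorized` and the sine/cosine test signal are assumptions made for illustration), cross-checked against `np.gradient`, which uses the same interior and boundary stencils when given a uniform step.

```python
import numpy as np

def finite_difference_vectorized(X, dt):
    """Central differences in the interior, one-sided at the ends.

    X has shape (n_vars, n_samples); the result has shape (n_samples, n_vars).
    """
    dXdt = np.empty_like(X, dtype=float)
    dXdt[:, 0] = (X[:, 1] - X[:, 0]) / dt               # forward difference
    dXdt[:, 1:-1] = (X[:, 2:] - X[:, :-2]) / (2 * dt)   # central differences
    dXdt[:, -1] = (X[:, -1] - X[:, -2]) / dt            # backward difference
    return dXdt.T

dt = 0.01
t = np.arange(0, 1, dt)
X = np.vstack([np.sin(t), np.cos(t)])

dXdt = finite_difference_vectorized(X, dt)

# np.gradient with a uniform step applies the same stencils.
assert np.allclose(dXdt, np.gradient(X, dt, axis=1).T)

# Compare against the exact derivatives of the test signal.
exact = np.vstack([np.cos(t), -np.sin(t)]).T
interior_err = np.abs(dXdt[1:-1] - exact[1:-1]).max()
boundary_err = np.abs(dXdt[[0, -1]] - exact[[0, -1]]).max()
print("max interior error:", interior_err)
print("max boundary error:", boundary_err)
```

The boundary error is noticeably larger than the interior error, which is why the endpoints are sometimes dropped before the regression.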

Library Matrix

Next we build the candidate feature matrix by stacking a constant term, all monomials up to poly_order, and optionally sine/cosine transforms of each state variable. The function returns that matrix, with one row per time snapshot and one column per candidate function, along with a parallel descriptions list so the sparse coefficients can later be mapped back to human-readable terms.

from itertools import combinations_with_replacement

def sindy_library(X, poly_order=3, include_sine=False, include_cosine=False):
    n_vars, n_samples = X.shape
    
    # Start with constant term
    library_functions = [np.ones(n_samples)]
    descriptions = ['1']
    
    # Add polynomial terms
    for order in range(1, poly_order + 1):
        for combo in combinations_with_replacement(range(n_vars), order):
            if order == 1:
                var_idx = combo[0]
                library_functions.append(X[var_idx, :])
                descriptions.append(f'x_{var_idx}')
            else:
                term = np.ones(n_samples)
                term_desc = []
                for var_idx in combo:
                    term *= X[var_idx, :]
                    term_desc.append(f'x_{var_idx}')
                library_functions.append(term)
                descriptions.append('*'.join(term_desc))
    
    # Add trigonometric terms if requested
    if include_sine:
        for i in range(n_vars):
            library_functions.append(np.sin(X[i, :]))
            descriptions.append(f'sin(x_{i})')
            
    if include_cosine:
        for i in range(n_vars):
            library_functions.append(np.cos(X[i, :]))
            descriptions.append(f'cos(x_{i})')
    
    Theta = np.column_stack(library_functions)
    
    return Theta, descriptions
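
To get a feel for how quickly the polynomial part of the library grows, the short sketch below counts its columns using the same `combinations_with_replacement` enumeration and checks the count against the closed form for the number of monomials of degree at most `poly_order` in `n_vars` variables. The helper name `n_polynomial_terms` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def n_polynomial_terms(n_vars, poly_order):
    """Count the constant term plus all monomials up to poly_order."""
    return 1 + sum(
        sum(1 for _ in combinations_with_replacement(range(n_vars), order))
        for order in range(1, poly_order + 1)
    )

for n_vars, poly_order in [(2, 2), (3, 2), (3, 3), (10, 3)]:
    count = n_polynomial_terms(n_vars, poly_order)
    # Closed form: number of monomials of degree <= d in n variables.
    assert count == comb(n_vars + poly_order, poly_order)
    print(f"n_vars={n_vars}, poly_order={poly_order}: {count} columns")
```

For three state variables and quadratic terms this gives 10 columns, but the count climbs quickly with either the number of variables or the polynomial order.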

Sequential Threshold Least Squares

We will use Sequential Threshold Least Squares to obtain the full coefficient matrix, starting from an ordinary least-squares fit over the whole library.

At each iteration:

Threshold: set every coefficient whose magnitude is below the threshold lambda_reg to zero, forcing small terms out of the model⁴.
Refit: for each state component, solve a new least-squares problem using only the surviving (non-zero) columns of the library matrix to get updated weights.
Repeat until convergence or for at most max_iter cycles; the final coefficient matrix contains only the terms that remain large after repeated shrink-and-refit, giving a sparse governing equation.

def sequential_threshold_least_squares(Theta, dXdt, lambda_reg=0.1, max_iter=10):
    n_states = dXdt.shape[1]
    
    # Initialize coefficient matrix
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    
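    # Iterative thresholding
    prev_small_inds = None
    for iteration in range(max_iter):

        # Find small coefficients to remove
        small_inds = np.abs(Xi) < lambda_reg

        # Stop early once the sparsity pattern no longer changes
        if prev_small_inds is not None and np.array_equal(small_inds, prev_small_inds):
            break
        prev_small_inds = small_inds

        # Set small coefficients to zero
        Xi[small_inds] = 0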
        
        # Identify active (non-zero) coefficients for each state
        for i in range(n_states):
            big_inds = ~small_inds[:, i]
            if np.any(big_inds):
                
                # Recompute non-zero coefficients using least squares
                Xi[big_inds, i] = np.linalg.lstsq(
                    Theta[:, big_inds], dXdt[:, i], rcond=None
                )[0]
            else:
                Xi[:, i] = 0
    
    return Xi
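
To see the threshold-and-refit loop in isolation, here is a small self-contained sketch on synthetic data. It uses a single-target variant with an explicit stopping test on the sparsity pattern; the function name `stlsq_single_target`, the random library, and the noise level are all assumptions made for the demonstration.

```python
import numpy as np

def stlsq_single_target(Theta, y, threshold=0.1, max_iter=10):
    """Sequential thresholded least squares for one target vector."""
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    active = np.ones(xi.shape, dtype=bool)
    for _ in range(max_iter):
        new_active = np.abs(xi) >= threshold
        if np.array_equal(new_active, active):
            break  # sparsity pattern is stable: converged
        active = new_active
        # Zero the pruned terms and refit on the surviving columns only.
        xi = np.zeros_like(xi)
        if active.any():
            xi[active] = np.linalg.lstsq(Theta[:, active], y, rcond=None)[0]
    return xi

# Synthetic problem: only columns 1 and 4 of a random library matter.
rng = np.random.default_rng(0)
Theta = rng.normal(size=(200, 6))
true_xi = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0])
y = Theta @ true_xi + 0.01 * rng.normal(size=200)

xi = stlsq_single_target(Theta, y)
print("true:     ", true_xi)
print("recovered:", xi.round(3))
```

The initial dense fit assigns small spurious weights to the irrelevant columns; one thresholding pass removes them, and the refit then recovers the two active coefficients.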

`print_equations`

I’ll add one bonus helper function, which will be useful later for printing the discovered equations in readable form.

def print_equations(Xi, descriptions, feature_names=None):
    n_functions, n_features = Xi.shape
    
    if feature_names is None:
        feature_names = [f'x_{i}' for i in range(n_features)]
        
    for i in range(n_features):

        # Build equation string
        terms = []
        for j in range(n_functions):
            coef = Xi[j, i]
            if abs(coef) > 1e-10:  # Only include non-zero terms
                if abs(coef - 1.0) < 1e-10:
                    terms.append(f"{descriptions[j]}")
                elif abs(coef + 1.0) < 1e-10:
                    terms.append(f"-{descriptions[j]}")
                else:
                    terms.append(f"{coef:.6f}*{descriptions[j]}")
        
        if terms:
            equation = " + ".join(terms).replace(" + -", " - ")
            print(f"d{feature_names[i]}/dt = {equation}")
        else:
            print(f"d{feature_names[i]}/dt = 0")
    print()

Integration

Putting the pieces together:

def sindy(X, dXdt, poly_order=3, lambda_reg=0.1, include_sine=False, 
          include_cosine=False, max_iter=10):
    
    # Build library of candidate functions
    Theta, descriptions = sindy_library(
        X, 
        poly_order=poly_order, 
        include_sine=include_sine, 
        include_cosine=include_cosine
    )
    
    # Sparse regression using sequential thresholded least squares
    Xi = sequential_threshold_least_squares(Theta, dXdt, lambda_reg, max_iter)
    
    return Xi, descriptions

For a production version you may want to factor the library construction out of the SINDy method, so that the same library matrix can be built once and reused across regressions.

Experiments

Let’s generate data from a few known systems and see if SINDy can recover their governing equations.

Simple Harmonic Oscillator

Definition

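The simple harmonic oscillator is described by the following second-order ODE:

$$\ddot{x} = -\omega^2 x$$

If we let:

$$x_0 = x, \qquad x_1 = \dot{x}$$

We can then derive

$$\dot{x}_0 = x_1, \qquad \dot{x}_1 = -\omega^2 x_0$$

So it has an alternative formulation as a set of coupled first-order ODEs.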

Approximation

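The simple harmonic oscillator has a well-known analytic solution:

$$x(t) = A\cos(\omega t + \phi)$$

Let $A = 1$ and $\phi = 0$, so that $x(t) = \cos(\omega t)$ and $\dot{x}(t) = -\omega\sin(\omega t)$: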

def generate_simple_harmonic_oscillator_data(omega=2.0, t_final=10, dt=0.01):
    t = np.arange(0, t_final, dt)
    
    # Analytical solution
    x = np.cos(omega * t)
    xdot = -omega * np.sin(omega * t)
    
    X_train = np.vstack([x, xdot])
    
    return X_train

This will give us data for the harmonic oscillator⁵.

Now we run SINDy on this data:

dt = 0.01
X = generate_simple_harmonic_oscillator_data(dt=dt)
dXdt = finite_difference(X, dt)
Xi, desc = sindy(X, dXdt, poly_order=1, lambda_reg=0.1)
print_equations(Xi, desc, feature_names=['x_0', 'x_1'])

We discover the following equations:

dx_0/dt = 0.999925*x_1
dx_1/dt = -3.999764*x_0

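Which is close to the true generating equations (the data were generated with $\omega = 2$, so $\omega^2 = 4$):

$$\dot{x}_0 = x_1, \qquad \dot{x}_1 = -4x_0$$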
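
The small gap between the discovered coefficients and the exact values is largely a finite-difference effect: applied to a sinusoid of angular frequency $\omega$, the central stencil returns the true derivative scaled by $\sin(\omega\,\Delta t)/(\omega\,\Delta t)$. The sketch below checks this by fitting once with exact derivatives and once with finite-difference derivatives. It uses `np.gradient` and a plain least-squares fit in place of the helpers above, which is an assumption made to keep the block self-contained.

```python
import numpy as np

omega, dt = 2.0, 0.01
t = np.arange(0, 10, dt)
x, xdot = np.cos(omega * t), -omega * np.sin(omega * t)

X = np.vstack([x, xdot])                             # shape (n_vars, n_samples)
Theta = np.column_stack([np.ones_like(t), x, xdot])  # library: [1, x_0, x_1]

# Fit with exact derivatives: recovers the coefficients to rounding error.
dXdt_exact = np.column_stack([xdot, -omega**2 * x])
Xi_exact = np.linalg.lstsq(Theta, dXdt_exact, rcond=None)[0]

# Fit with finite-difference derivatives (np.gradient uses central
# differences in the interior and one-sided differences at the ends).
dXdt_fd = np.gradient(X, dt, axis=1).T
Xi_fd = np.linalg.lstsq(Theta, dXdt_fd, rcond=None)[0]

# Central differences scale a sinusoid's derivative by sin(w*dt)/(w*dt).
shrink = np.sin(omega * dt) / (omega * dt)

print("exact-derivative fit:\n", Xi_exact.round(6))
print("finite-difference x_1 coefficient in dx_0/dt:", Xi_fd[2, 0])
print("predicted shrink factor:", shrink)
```

With exact derivatives the regression is essentially perfect, so reducing the timestep or using a higher-order stencil is the natural way to tighten the recovered coefficients.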

Lorenz System

Definition

The Lorenz-63 model is a three-dimensional ODE derived from a truncated Fourier expansion of the Boussinesq equations for thermal convection. In its nondimensional form the state evolves as

$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$

Let us recover these equations from data.

Approximation

def generate_lorenz_data(initial_conditions, sigma=10, rho=28, beta=8/3, t_final=10, dt=0.01):
    
    def lorenz_rhs(state, sigma=sigma, rho=rho, beta=beta):
        x, y, z = state
        return np.array([
            sigma * (y - x),
            x * (rho - z) - y,
            x * y - beta * z
        ])
    
    # Generate training data
    t_train = np.arange(0, t_final, dt)
    n_steps = len(t_train)
    
    X_train = np.zeros((3, n_steps))
    X_train[:, 0] = initial_conditions 
    
    for i in range(1, n_steps):
        k1 = lorenz_rhs(X_train[:, i-1])
        k2 = lorenz_rhs(X_train[:, i-1] + dt/2 * k1)
        k3 = lorenz_rhs(X_train[:, i-1] + dt/2 * k2)
        k4 = lorenz_rhs(X_train[:, i-1] + dt * k3)
        X_train[:, i] = X_train[:, i-1] + dt/6 * (k1 + 2*k2 + 2*k3 + k4)

    return X_train
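
The time-stepping loop above is the classical fourth-order Runge-Kutta scheme. As a sanity check on the integrator itself, the sketch below factors the step into a standalone function (the name `rk4_step` is hypothetical) and tests it on the harmonic oscillator, where the exact solution is known.

```python
import numpy as np

def rk4_step(rhs, state, dt):
    """Advance `state` by one classical fourth-order Runge-Kutta step."""
    k1 = rhs(state)
    k2 = rhs(state + dt / 2 * k1)
    k3 = rhs(state + dt / 2 * k2)
    k4 = rhs(state + dt * k3)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Test system with a known solution: x'' = -omega^2 * x, x(0)=1, x'(0)=0.
omega, dt, n_steps = 2.0, 0.01, 100

def oscillator_rhs(s):
    return np.array([s[1], -omega**2 * s[0]])

state = np.array([1.0, 0.0])
for _ in range(n_steps):
    state = rk4_step(oscillator_rhs, state, dt)

t_end = n_steps * dt
exact = np.array([np.cos(omega * t_end), -omega * np.sin(omega * t_end)])
rk4_error = np.abs(state - exact).max()
print("RK4 error after", n_steps, "steps:", rk4_error)
```

For a chaotic system like Lorenz the trajectory itself is sensitive to integration error, but SINDy only needs the sampled states and derivatives to be locally consistent with the vector field, which a fourth-order scheme at this timestep provides.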

Following the same steps:

dt = 0.01
initial_conditions = [-8, 8, 27]
X = generate_lorenz_data(initial_conditions, dt=dt)

dXdt = finite_difference(X, dt)
Xi, desc = sindy(X, dXdt, poly_order=2, lambda_reg=0.05)

print_equations(Xi, desc, feature_names=['x', 'y', 'z'])

We discover the following equations:

dx/dt = -9.971816*x_0 + 9.972789*x_1
dy/dt = 27.823397*x_0 - 0.970096*x_1 - 0.994849*x_0*x_2
dz/dt = -2.658203*x_2 + 0.996862*x_0*x_1

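Which is once again pretty close to the actual generating equations (rewritten with $\sigma = 10$, $\rho = 28$, $\beta = 8/3 \approx 2.667$, and with $x_0, x_1, x_2$ standing for $x, y, z$):

$$\dot{x} = -10x + 10y, \qquad \dot{y} = 28x - y - xz, \qquad \dot{z} = -\tfrac{8}{3}z + xy$$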

Conclusion

We used SINDy to recover governing equations for two toy problems.