advanced category-theory 50 min read

Monads & Comonads

The categorical calculus of effects and contexts — from Bayesian inference to backpropagation, from probability monads to graph comonads

Prerequisites: Adjunctions

Overview and Motivation

We have arrived at the capstone. Over the previous three topics we built the language of categories and functors, extended it with natural transformations, and used these pieces to define adjunctions — the universal pattern connecting free constructions to forgetful functors, discrete spaces to indiscrete ones, and primal optimization to dual. The preview at the end of the Adjunctions topic left an open thread: every adjunction F \dashv G generates something called a monad T = GF on the source category, with the adjunction's unit and counit providing the monad's structure maps.

This topic resolves that thread completely. A monad is an endofunctor equipped with two natural transformations — a unit \eta and a multiplication \mu — satisfying associativity and unit laws. Mac Lane's famous dictum captures the essence: a monad is a monoid in the category of endofunctors. But what makes monads essential for machine learning is not the abstract definition — it is the gallery of instances and the computational structures they organize:

  • The Maybe monad captures partiality — computations that may fail.
  • The List monad captures nondeterminism — computations with multiple outcomes.
  • The Giry monad on \mathbf{Meas} captures probabilistic computation — its Kleisli arrows are Markov kernels, and Kleisli composition is the Chapman-Kolmogorov equation. This is the categorical foundation of Bayesian inference.
  • The continuation monad captures CPS (continuation-passing style) — and backpropagation is Kleisli composition in this monad, with the chain rule emerging as functoriality.

Dually, comonads encode contextual computation — where every point carries information about its surroundings:

  • The Stream comonad models signal processing — each time step sees the entire future.
  • The Neighborhood comonad on a graph models exactly the message-passing operation in GNNs: a coKleisli arrow extracts features from a local neighborhood, and the coKleisli extension applies this extraction at every node simultaneously.

The deep structural result is Beck’s Monadicity Theorem, which tells us precisely when a category of structured objects (groups, vector spaces, modules) can be described as the category of algebras for a monad — drawing the fundamental boundary between algebraic and non-algebraic mathematics.


Monads: Definition and Laws

Definition 1 (Monad).

A monad on a category \mathcal{C} is a triple (T, \eta, \mu) where:

  • T: \mathcal{C} \to \mathcal{C} is an endofunctor,
  • \eta: \mathrm{Id}_{\mathcal{C}} \Rightarrow T is a natural transformation called the unit,
  • \mu: T^2 \Rightarrow T is a natural transformation called the multiplication,

satisfying the monad laws (Definition 2).

Definition 2 (Monad Laws).

A monad (T, \eta, \mu) must satisfy:

Associativity. \mu \circ T\mu = \mu \circ \mu_T, i.e., for every object A: \mu_A \circ T(\mu_A) = \mu_A \circ \mu_{TA}

Left unit. \mu \circ \eta_T = \mathrm{id}_T, i.e., for every object A: \mu_A \circ \eta_{TA} = \mathrm{id}_{TA}

Right unit. \mu \circ T\eta = \mathrm{id}_T, i.e., for every object A: \mu_A \circ T(\eta_A) = \mathrm{id}_{TA}

These laws say that \mu is an associative operation with \eta as its two-sided unit — exactly the axioms of a monoid, but internal to the category of endofunctors.

Remark (Mac Lane's Dictum).

A monad is a monoid in the category of endofunctors [\mathcal{C}, \mathcal{C}]. The functor category [\mathcal{C}, \mathcal{C}] carries a monoidal structure with composition \circ as the tensor and \mathrm{Id}_{\mathcal{C}} as the unit. A monoid object in this monoidal category is precisely an endofunctor T equipped with a multiplication \mu: T \circ T \Rightarrow T and a unit \eta: \mathrm{Id} \Rightarrow T satisfying associativity and unit laws — which is exactly a monad.

Concrete Example — Maybe

The Maybe monad captures partiality — computations that may fail. For input x = 3: the unit gives \eta(3) = \mathrm{Just}(3); applying T twice gives \mathrm{Just}(\mathrm{Just}(3)); and the multiplication flattens, \mu(\mathrm{Just}(\mathrm{Just}(3))) = \mathrm{Just}(3). In general, \eta(x) = \mathrm{Just}(x), \mu(\mathrm{Just}(\mathrm{Just}(x))) = \mathrm{Just}(x), and \mu(\mathrm{Just}(\mathrm{Nothing})) = \mathrm{Nothing}.

From Adjunctions to Monads

We now make good on the promise from Adjunctions. Every adjunction produces a monad.

Proposition 1 (Adjunction → Monad).

Let F \dashv G be an adjunction with unit \eta: \mathrm{Id}_{\mathcal{C}} \Rightarrow GF and counit \varepsilon: FG \Rightarrow \mathrm{Id}_{\mathcal{D}}. Then (T, \eta, \mu) is a monad on \mathcal{C}, where:

  • T = GF,
  • \eta is the adjunction unit,
  • \mu = G\varepsilon F, i.e., \mu_A = G(\varepsilon_{F(A)}): GFGF(A) \to GF(A).

Proof. We must verify the three monad laws.

Associativity: We need \mu_A \circ T(\mu_A) = \mu_A \circ \mu_{TA}, i.e., G(\varepsilon_{FA}) \circ GF(G(\varepsilon_{FA})) = G(\varepsilon_{FA}) \circ G(\varepsilon_{FGFA}). Both sides are G applied to the square \varepsilon_{FA} \circ FG(\varepsilon_{FA}) = \varepsilon_{FA} \circ \varepsilon_{FGFA}, which commutes because \varepsilon is natural.

Left unit: We need \mu_A \circ \eta_{TA} = \mathrm{id}_{TA}, i.e., G(\varepsilon_{FA}) \circ \eta_{GFA} = \mathrm{id}_{GFA}. This is exactly the first triangle identity G\varepsilon \circ \eta G = \mathrm{id}_G applied at FA.

Right unit: We need \mu_A \circ T(\eta_A) = \mathrm{id}_{TA}, i.e., G(\varepsilon_{FA}) \circ GF(\eta_A) = \mathrm{id}_{GFA}. Applying functoriality of G, the left side is G(\varepsilon_{FA} \circ F(\eta_A)) = G(\mathrm{id}_{FA}), which follows from the second triangle identity \varepsilon F \circ F\eta = \mathrm{id}_F. \square

The construction T = GF, \mu = G\varepsilon F is the canonical way to build monads. The converse question — does every monad arise from an adjunction? — is answered by the Kleisli and Eilenberg-Moore categories below.


Gallery of monads: Maybe, List, Giry, Power Set, Reader, Continuation

  • Maybe: T(X) = X \cup \{\bot\}; unit \eta(x) = \mathrm{Just}(x); multiplication flattens, \mathrm{Just}(\mathrm{Just}(x)) \mapsto \mathrm{Just}(x). Effect: partiality.
  • List: T(X) = X^* (free monoid); unit \eta(x) = [x]; multiplication concatenates, [[a,b],[c]] \mapsto [a,b,c]. Effect: nondeterminism.
  • Giry: T(X) = \mathrm{Dist}(X); unit \eta(x) = \delta_x (Dirac); multiplication integrates, \mu(\Phi) = \int p \, d\Phi(p). Effect: probability.
  • Power Set: T(X) = \mathcal{P}(X); unit \eta(x) = \{x\}; multiplication takes unions, \mu(\mathcal{A}) = \bigcup_{A \in \mathcal{A}} A. Effect: nondeterminism (in \mathbf{Set}).
  • Reader: T(X) = (E \to X); unit \eta(x) = \lambda e.\, x; multiplication is the diagonal, \mu(f) = \lambda e.\, f(e)(e). Effect: environment.
  • Continuation: T(X) = (X \to R) \to R; unit \eta(x) = \lambda k.\, k(x); multiplication \mu(\Phi) = \lambda k.\, \Phi(\lambda f.\, f(k)). Effect: CPS/backprop.

Each monad captures a specific notion of computational effect. The unit \eta is the “pure” computation — wrapping a value with no effect. The multiplication \mu collapses a doubled effect into a single one: flattening nested Maybes, concatenating nested lists, or integrating over distributions of distributions.


Kleisli Categories

Definition 3 (Kleisli Category).

Let (T, \eta, \mu) be a monad on \mathcal{C}. The Kleisli category \mathcal{C}_T has:

  • Objects: The same as \mathcal{C}.
  • Morphisms: A morphism A \to B in \mathcal{C}_T is a morphism A \to TB in \mathcal{C} (a Kleisli arrow).
  • Identity: \eta_A: A \to TA.
  • Composition: Given f: A \to TB and g: B \to TC, the Kleisli composite is g \circ_T f = \mu_C \circ Tg \circ f.

Definition 4 (Kleisli Composition (Fish Operator)).

The Kleisli composition or fish operator is: (g \mathbin{>=>} f)(a) = \mu_C(T(g)(f(a))) for f: A \to TB and g: B \to TC, yielding g \mathbin{>=>} f: A \to TC.

In Haskell notation, this is (>=>) (with arguments in the opposite order), and its “bind” variant is (>>=): for m: TA and f: A \to TB, m \mathbin{>>=} f = \mu_B(T(f)(m))
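As a minimal sketch (not from the source), the fish operator for the Maybe monad can be written directly from the bind equation, following the document's convention that g \mathbin{>=>} f applies f first; safe_recip and double are hypothetical example arrows:

```python
# Maybe values encoded as ('Just', x) or ('Nothing',), as in Computational Notes.
def maybe_bind(mx, f):
    """bind: m >>= f — apply the Kleisli arrow f to the wrapped value, if any."""
    return ('Nothing',) if mx[0] == 'Nothing' else f(mx[1])

def fish(g, f):
    """Kleisli composite g >=> f : A -> T C, for f : A -> T B, g : B -> T C."""
    return lambda a: maybe_bind(f(a), g)

safe_recip = lambda x: ('Just', 1.0 / x) if x != 0 else ('Nothing',)  # may fail
double = lambda x: ('Just', 2 * x)                                    # total

h = fish(double, safe_recip)   # first take the reciprocal, then double
assert h(4) == ('Just', 0.5)
assert h(0) == ('Nothing',)    # failure propagates through the composite
```

Failure short-circuits: once any arrow in a Kleisli chain returns Nothing, the rest of the chain is skipped.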

Proposition 2 (Kleisli Adjunction).

The Kleisli category \mathcal{C}_T carries a canonical adjunction F_T \dashv G_T where:

  • F_T: \mathcal{C} \to \mathcal{C}_T sends A to A and f: A \to B to \eta_B \circ f: A \to TB.
  • G_T: \mathcal{C}_T \to \mathcal{C} sends A to TA and (f: A \to TB) to \mu_B \circ T(f): TA \to TB.

Moreover, G_T F_T = T, recovering the original monad.

Proof. We verify G_T F_T(A) = T(A) and that the unit of this adjunction is \eta. For any object A, F_T(A) = A and G_T(A) = TA, so G_T F_T(A) = TA = T(A). The unit component \eta_A: A \to G_T(F_T(A)) = TA is the monad unit. The adjunction condition follows from the monad laws. \square

Worked example — Chapman-Kolmogorov as Kleisli composition. A Kleisli arrow \{s_1, s_2, s_3\} \to \mathrm{Dist}(\{s_1, s_2, s_3\}) is a Markov kernel, represented by a row-stochastic transition matrix. Take

K_1 = \begin{pmatrix} 0.20 & 0.50 & 0.30 \\ 0.10 & 0.60 & 0.30 \\ 0.40 & 0.20 & 0.40 \end{pmatrix}, \qquad K_2 = \begin{pmatrix} 0.70 & 0.20 & 0.10 \\ 0.30 & 0.40 & 0.30 \\ 0.10 & 0.30 & 0.60 \end{pmatrix}

Their Kleisli composite (first K_1, then K_2) is the matrix product K_1 K_2 in the row-vector convention:

K_2 \mathbin{>=>} K_1 = \begin{pmatrix} 0.32 & 0.33 & 0.35 \\ 0.28 & 0.35 & 0.37 \\ 0.38 & 0.28 & 0.34 \end{pmatrix}

Kleisli composition of Markov kernels is matrix multiplication of transition matrices.

Markov Kernels and the Giry Monad

The Giry monad \mathcal{G} on \mathbf{Meas} (the category of measurable spaces) sends a measurable space X to the space of probability measures \mathcal{G}(X) = \mathrm{Prob}(X). Its Kleisli category has a beautiful probabilistic interpretation.

A Kleisli arrow A \to \mathcal{G}(B) is a function sending each point a \in A to a probability distribution on B — this is precisely a Markov kernel (also called a stochastic kernel or transition kernel). Kleisli composition of two kernels K_1: A \to \mathcal{G}(B) and K_2: B \to \mathcal{G}(C) is:

(K_2 \mathbin{>=>} K_1)(a) = \int_B K_2(b) \, dK_1(a)(b) = \mu_C(\mathcal{G}(K_2)(K_1(a)))

This is the Chapman-Kolmogorov equation — the composition law for Markov chains. When A = B = C is a finite set with n states, each kernel is a row-stochastic matrix, and Kleisli composition is matrix multiplication.

The connection to random walks is immediate: a Markov chain with transition matrix P is a Kleisli endomorphism P: S \to \mathcal{G}(S). Running the chain for k steps is the k-fold Kleisli composite P^{\mathbin{>=>} k}, and the Chapman-Kolmogorov equation ensures consistency.
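The finite case can be sketched in a few lines of plain Python (no libraries assumed): a kernel on n states is a list of rows, and the Kleisli composite is ordinary matrix multiplication. The matrix P below reuses the K_1 data from the worked example; the function name kleisli is illustrative:

```python
def kleisli(P, Q):
    """Composite kernel (first P, then Q): entry [i][j] = sum_k P[i][k] * Q[k][j].
    This is the discrete Chapman-Kolmogorov equation."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.2, 0.5, 0.3],
     [0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4]]

P2 = kleisli(P, P)   # two-step transition probabilities
# Kleisli composition preserves row-stochasticity: each row still sums to 1
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P2)
```

Iterating kleisli(P, P), kleisli(P2, P), ... runs the chain forward one step at a time; convergence of these powers is the subject of mixing-time analysis.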


Eilenberg-Moore Algebras

Definition 5 (T-Algebra (Eilenberg-Moore)).

Let (T, \eta, \mu) be a monad on \mathcal{C}. A T-algebra (or Eilenberg-Moore algebra) is a pair (X, h) where:

  • X is an object of \mathcal{C} (the carrier),
  • h: TX \to X is a morphism (the structure map),

satisfying:

Unit law: h \circ \eta_X = \mathrm{id}_X (the structure map inverts the unit).

Associativity: h \circ Th = h \circ \mu_X (flattening before or after applying h gives the same result).

Definition 6 (T-Algebra Homomorphism).

A T-algebra homomorphism (X, h) \to (Y, k) is a morphism f: X \to Y in \mathcal{C} such that f \circ h = k \circ Tf, i.e., f commutes with the structure maps.

Gallery of T-algebras:

  • Maybe: pointed sets (X, x_0), with structure map h(\bot) = x_0, h(\mathrm{Just}(x)) = x.
  • List: monoids (M, \cdot, e), with h([a_1, \ldots, a_n]) = a_1 \cdot \ldots \cdot a_n.
  • Giry: convex spaces, with h(\sum p_i \delta_{x_i}) = \sum p_i x_i (convex combination).
  • Power Set: complete join-semilattices, with h(\mathcal{A}) = \bigvee \mathcal{A}.

Theorem 2 (List-Algebras are Monoids).

A T-algebra (M, h) for the List monad is exactly a monoid.

Proof. Given (M, h: M^* \to M), we define:

  • Binary operation: a \cdot b = h([a, b]).
  • Identity element: e = h([]) (the empty list).

Associativity: (a \cdot b) \cdot c = h([h([a,b]), c]) and a \cdot (b \cdot c) = h([a, h([b,c])]). The T-algebra associativity law h \circ Th = h \circ \mu says that applying h to a list of h-values equals applying h to the concatenation of the underlying lists, so both sides equal h([a,b,c]).

Identity: e \cdot a = h([h([]), a]) = h([h([]), h([a])]) = h([a]) = a, using the unit law a = h([a]) and then the associativity law on the nested list [[], [a]]. Similarly a \cdot e = a.

Conversely, given a monoid (M, \cdot, e), the structure map h([a_1, \ldots, a_n]) = a_1 \cdot \ldots \cdot a_n (with h([]) = e) satisfies the algebra axioms. \square
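The converse direction of this proof can be spot-checked concretely. The sketch below (illustrative, not from the source) takes the monoid of strings under concatenation, builds its structure map h, and verifies the unit and associativity laws on sample lists:

```python
def h(xs):
    """List-algebra structure map for strings: h([a1,...,an]) = a1 + ... + an,
    with h([]) = "" (the monoid identity)."""
    out = ""
    for x in xs:
        out += x
    return out

# Unit law: h([x]) = x
assert h(["ab"]) == "ab"

# Associativity law on a nested list: h after map(h) = h after concat
nested = [["a", "b"], [], ["c"]]
flat = [x for xs in nested for x in xs]
assert h([h(xs) for xs in nested]) == h(flat)
```

Swapping in any other monoid (integers under +, lists under ++) leaves the two assertions intact — that generality is exactly what Theorem 2 asserts.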

Remark (Giry-Algebras and Convex Spaces).

A Giry-algebra (X, h: \mathcal{G}(X) \to X) assigns to each probability distribution on X a “center” in X. The algebra axioms force this assignment to behave like taking convex combinations: h(\sum p_i \delta_{x_i}) = \sum p_i x_i. Thus Giry-algebras are precisely convex spaces — spaces where you can form weighted averages. This connects the Giry monad to Bayesian nonparametrics: the Dirichlet process is a Giry-algebra on the space of probability measures.
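On finite distributions over the reals this structure map is just the barycenter. A minimal sketch (illustrative names, not from the source), with a check of the unit law h(\delta_x) = x:

```python
def h(dist):
    """Giry-algebra structure map on finite distributions over the reals:
    h(sum p_i * delta_{x_i}) = sum p_i * x_i (the barycenter)."""
    return sum(p * x for x, p in dist.items())

def dirac(x):
    """Giry unit: the Dirac measure concentrated at x."""
    return {x: 1.0}

# Unit law: averaging a point mass returns the point
assert h(dirac(3.0)) == 3.0

# Barycenter of a two-point distribution: 0.25 * 0 + 0.75 * 4 = 3
assert abs(h({0.0: 0.25, 4.0: 0.75}) - 3.0) < 1e-12
```

The algebra associativity law corresponds to the fact that a mixture of barycenters equals the barycenter of the mixture — the defining property of a convex space.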


Two Canonical Adjunctions

Proposition 3 (Eilenberg-Moore Adjunction).

The Eilenberg-Moore category \mathcal{C}^T carries a canonical adjunction F^T \dashv U^T where:

  • F^T: \mathcal{C} \to \mathcal{C}^T sends A to the free T-algebra (TA, \mu_A).
  • U^T: \mathcal{C}^T \to \mathcal{C} is the forgetful functor sending (X, h) to X.

Moreover, U^T F^T = T, recovering the original monad.

Proof. U^T F^T(A) = U^T(TA, \mu_A) = TA = T(A). The unit is \eta_A: A \to U^T F^T(A) = TA. The counit at (X, h) is h: F^T U^T(X, h) = (TX, \mu_X) \to (X, h), which is a T-algebra homomorphism by the associativity law. \square

Proposition 4 (Kleisli is Initial, Eilenberg-Moore is Terminal).

In the category of adjunctions that give rise to the monad T (where objects are adjunctions F \dashv G with GF = T, and morphisms are comparison functors), the Kleisli adjunction F_T \dashv G_T is the initial object and the Eilenberg-Moore adjunction F^T \dashv U^T is the terminal object.

Every adjunction F \dashv G with GF = T factors through both: \mathcal{C}_T \xrightarrow{\text{comparison}} \mathcal{D} \xrightarrow{\text{comparison}} \mathcal{C}^T

This tells us that every monad arises from an adjunction — but not uniquely. The Kleisli category gives the smallest target, the Eilenberg-Moore category the largest, and any other adjunction inducing T sits in between.


Beck’s Monadicity Theorem

Definition 7 (Monadic Functor).

A functor U: \mathcal{D} \to \mathcal{C} is monadic if:

  1. U has a left adjoint F,
  2. The comparison functor K: \mathcal{D} \to \mathcal{C}^T (where T = UF) is an equivalence of categories.

Equivalently, \mathcal{D} is (equivalent to) the category of T-algebras.

Theorem 1 (Beck's Monadicity Theorem).

A functor U: \mathcal{D} \to \mathcal{C} is monadic if and only if:

  1. U has a left adjoint,
  2. U reflects isomorphisms: if Uf is an isomorphism in \mathcal{C}, then f is an isomorphism in \mathcal{D},
  3. U creates coequalizers of U-split pairs: if a parallel pair in \mathcal{D} has a split coequalizer after applying U, then the pair has a coequalizer in \mathcal{D} and U preserves it.

Proof sketch. The comparison functor K: \mathcal{D} \to \mathcal{C}^T sends D to the T-algebra (UD, U\varepsilon_D). For K to be an equivalence, we need essential surjectivity (every T-algebra arises from some object of \mathcal{D}) and full faithfulness.

Necessity: If U is monadic, then \mathcal{D} \simeq \mathcal{C}^T, and the forgetful functor U^T: \mathcal{C}^T \to \mathcal{C} reflects isomorphisms (an isomorphism of carriers that respects the algebra structure is an algebra isomorphism) and creates coequalizers of U^T-split pairs (by the general machinery of algebras).

Sufficiency: Condition (2) ensures K is faithful and conservative. Condition (3) allows us to construct the required coequalizers in \mathcal{D} to build the inverse of K, establishing the equivalence. \square


Monadic examples (the forgetful functor is monadic):

  • \mathbf{Grp} \to \mathbf{Set}: Groups are algebras for the free-group monad.
  • \mathbf{Vec}_k \to \mathbf{Set}: Vector spaces are algebras for the free-vector-space monad.
  • \mathbf{Mod}_R \to \mathbf{Set}: R-modules are algebras for the free-R-module monad.
  • \mathbf{CompHaus} \to \mathbf{Set}: Compact Hausdorff spaces are monadic over \mathbf{Set} (via the ultrafilter monad).

Non-monadic examples:

  • \mathbf{Top} \to \mathbf{Set}: Topological spaces are not monadic — the forgetful functor has a left adjoint (discrete topology) but does not reflect isomorphisms (a continuous bijection need not be a homeomorphism).
  • \mathbf{Pos} \to \mathbf{Set}: Partially ordered sets are not monadic.

Beck’s theorem draws the fundamental boundary between algebra and topology: algebraic structures (groups, rings, modules, lattices) are monadic over \mathbf{Set}, while topological and order-theoretic structures are not.


Comonads

Definition 8 (Comonad).

A comonad on a category \mathcal{C} is a triple (W, \varepsilon, \delta) where:

  • W: \mathcal{C} \to \mathcal{C} is an endofunctor,
  • \varepsilon: W \Rightarrow \mathrm{Id}_{\mathcal{C}} is a natural transformation called the counit (or extraction),
  • \delta: W \Rightarrow W^2 is a natural transformation called the comultiplication (or duplication).

Definition 9 (Comonad Laws).

A comonad (W, \varepsilon, \delta) must satisfy:

Coassociativity. W\delta \circ \delta = \delta_W \circ \delta

Left counit. \varepsilon_W \circ \delta = \mathrm{id}_W

Right counit. W\varepsilon \circ \delta = \mathrm{id}_W

The intuition is dual to monads: if a monad wraps values with effects, a comonad unwraps contexts to extract values. The counit \varepsilon reads the focus from a context, and the comultiplication \delta creates a “context of contexts” — every point can see not just its immediate surroundings, but the surroundings of its surroundings.

Proposition 5 (Adjunction → Comonad).

Every adjunction F \dashv G with unit \eta and counit \varepsilon yields a comonad (W, \varepsilon, \delta) on the target category \mathcal{D}, where:

  • W = FG,
  • \varepsilon is the adjunction counit,
  • \delta = F\eta G, i.e., \delta_B = F(\eta_{G(B)}): FG(B) \to FGFG(B).

Proof. Dual to the monad case: the triangle identities imply the comonad laws. \square

Concrete Example — Stream. The Stream comonad models signal processing — each position sees the entire future stream. The counit extracts the head, \varepsilon(s) = s_0, and the comultiplication produces the stream of all shifts, \delta(s) = [s, \mathrm{shift}(s), \mathrm{shift}^2(s), \ldots].

Monad-Comonad Duality

Every adjunction F \dashv G yields both a monad T = GF on the source and a comonad W = FG on the target. This duality runs deep:

Monad (T, \eta, \mu) versus comonad (W, \varepsilon, \delta):

  • Interpretation: effects (wrap) vs. contexts (unwrap).
  • Unit / counit: \eta: \mathrm{Id} \Rightarrow T (pure) vs. \varepsilon: W \Rightarrow \mathrm{Id} (extract).
  • Multiplication / comultiplication: \mu: T^2 \Rightarrow T (flatten) vs. \delta: W \Rightarrow W^2 (duplicate).
  • (co)Kleisli arrows: A \to TB (effectful) vs. WA \to B (contextual).
  • Composition: \mu \circ Tg \circ f vs. g \circ Wf \circ \delta.
  • Algebras / coalgebras: (X, h: TX \to X) vs. (X, k: X \to WX).
  • ML examples: Giry (probability), Continuation (backprop) vs. Neighborhood (GNN), Stream (signals).

Definition 10 (CoKleisli Category).

The coKleisli category \mathcal{C}_W of a comonad (W, \varepsilon, \delta) has the same objects as \mathcal{C}, with morphisms A \to B being morphisms WA \to B in \mathcal{C}. Composition of f: WA \to B and g: WB \to C is: g \circ_W f = g \circ Wf \circ \delta_A
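The composition law g \circ_W f = g \circ Wf \circ \delta_A can be sketched for the Stream comonad, with a stream encoded as a generator function plus a focus index (an illustrative encoding, echoing the one in Computational Notes; the arrows diff and pair_sum are hypothetical examples):

```python
def cokleisli(g, f):
    """coKleisli composite g o_W f = g . Wf . delta: refocus everywhere
    (delta), apply f at every focus (Wf), then apply g at the original focus."""
    def composite(gen, i):
        mapped = lambda j: f(gen, j)   # Wf . delta: value at j is f refocused at j
        return g(mapped, i)
    return composite

# coKleisli arrows Stream -> value: each reads a window of future context
diff = lambda gen, i: gen(i + 1) - gen(i)       # forward difference
pair_sum = lambda gen, i: gen(i) + gen(i + 1)   # sum of two consecutive values

squares = lambda n: n * n   # the underlying signal 0, 1, 4, 9, 16, ...

h = cokleisli(pair_sum, diff)   # sum of two consecutive forward differences
# diff of squares at i is 2i + 1, so h at i is (2i + 1) + (2i + 3) = 4i + 4
assert h(squares, 3) == 16
```

Each arrow consumes context rather than producing effects — the mirror image of Kleisli composition.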

Definition 11 (W-Coalgebra).

A W-coalgebra for a comonad (W, \varepsilon, \delta) is a pair (X, k: X \to WX) satisfying:

Counit law: \varepsilon_X \circ k = \mathrm{id}_X

Coassociativity: \delta_X \circ k = Wk \circ k

Gallery of comonads: Stream, Store, Neighborhood, Environment

  • Stream: W(X) = X^{\mathbb{N}} (infinite sequences); counit extracts the head; comultiplication gives all shifts, \delta(s)_i = \mathrm{shift}^i(s). Context: signal processing.
  • Store: W(X) = (S \to X) \times S; counit evaluates at the current position; comultiplication gives all refocusings. Context: cellular automata, lenses.
  • Neighborhood: W(v) = (v, N(v)); counit extracts the focus feature; comultiplication gives nested neighborhoods. Context: GNN message passing.
  • Environment: W(X) = E \times X; counit extracts the value; comultiplication duplicates the environment. Dual of the Reader monad.

Monads and Comonads in Machine Learning

The Giry monad organizes Bayesian inference end to end: the prior is built from the Giry unit (Dirac deltas), the likelihood is a Kleisli arrow, the posterior is computed by Kleisli composition (Chapman-Kolmogorov), and marginalization is the Giry multiplication \mu.

Giry monad and Bayesian inference. A prior \pi \in \mathcal{G}(\Theta) is a Giry element. A likelihood function \ell: \Theta \to \mathcal{G}(X) is a Kleisli arrow. The posterior is computed via Kleisli composition — the Chapman-Kolmogorov equation applied to Bayesian updating. See Bayesian Nonparametrics for the Dirichlet process as a Giry-algebra.

Continuation monad and backpropagation. Each differentiable layer f: \mathbb{R}^m \to \mathbb{R}^n wraps into a Kleisli arrow of the continuation monad: \hat{f}(x) = \lambda k.\, k(f(x)) where k is the continuation (what comes after). Composing two Kleisli arrows chains the forward computation. The backward pass emerges by running the continuation in reverse — applying k extracts the gradient. The chain rule \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} is the functoriality of CPS. See Gradient Descent.
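The wrapping \hat{f}(x) = \lambda k.\, k(f(x)) and its Kleisli composition can be sketched directly; this is a minimal illustration of CPS chaining (the names wrap, compose_k, and the toy layers are hypothetical, not from the source), not a full reverse-mode AD implementation:

```python
def wrap(f):
    """Lift a plain function into a Kleisli arrow of the continuation monad:
    wrap(f)(x) = lambda k: k(f(x))."""
    return lambda x: (lambda k: k(f(x)))

def compose_k(g_hat, f_hat):
    """Kleisli composite: run f_hat first, feed its result to g_hat,
    threading the continuation k through."""
    return lambda x: (lambda k: f_hat(x)(lambda y: g_hat(y)(k)))

f = lambda x: x + 1   # toy layer 1
g = lambda y: y * y   # toy layer 2

h_hat = compose_k(wrap(g), wrap(f))
# Running with the identity continuation recovers the forward pass g(f(x))
assert h_hat(3)(lambda r: r) == 16
```

Supplying a non-identity continuation is what makes the backward pass possible: whatever k does with the result happens "after" the whole forward chain, which is exactly where gradient accumulation lives in CPS-based autodiff.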

Neighborhood comonad and GNNs. On a graph G = (V, E), the neighborhood comonad sends each node v to the pair (v, \{u : u \in N(v)\}). A coKleisli arrow f: W(v) \to \mathbb{R}^d extracts features from a neighborhood — this is exactly the aggregation function in message-passing GNNs. The coKleisli extension \hat{f}: W(v) \to W(\mathbb{R}^d) applies f at every node simultaneously — one GNN layer. Multi-hop aggregation is iterated coKleisli composition.
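A toy sketch of this picture (the graph, features, and mean aggregation are hypothetical examples, not from the source): a coKleisli arrow reads one node's neighborhood, and the extension applies it at every node at once — one message-passing layer.

```python
adj = {0: [1, 2], 1: [0], 2: [0, 1]}   # toy directed adjacency lists
feat = {0: 1.0, 1: 2.0, 2: 4.0}        # scalar feature per node

def extract(focus, features):
    """Counit: read the focused node's own feature."""
    return features[focus]

def agg(focus, features):
    """coKleisli arrow: mean over the node and its neighbors."""
    nbhd = [focus] + adj[focus]
    return sum(features[v] for v in nbhd) / len(nbhd)

def extend(f, features):
    """coKleisli extension: apply f with every node as focus (one GNN layer)."""
    return {v: f(v, features) for v in features}

layer1 = extend(agg, feat)             # one round of message passing
layer2 = extend(agg, layer1)           # two-hop aggregation = iterate extend
assert layer1[1] == 1.5                # mean of features at nodes 1 and 0
```

Iterating extend is exactly the "multi-hop aggregation is iterated coKleisli composition" claim in miniature.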

Entropy as a monad morphism. Shannon entropy H: \mathcal{G}(X) \to [0, \infty) is a monad morphism from the Giry monad to the additive reals. The data processing inequality H(f(X)) \leq H(X) follows from the monad morphism property. See Shannon Entropy.
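A numerical spot-check of the data processing inequality on a finite distribution (a sketch under the assumption of a deterministic pushforward map; function names are illustrative):

```python
from math import log

def H(dist):
    """Shannon entropy in nats of a finite distribution {outcome: probability}."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

def pushforward(f, dist):
    """Image distribution of dist under the deterministic map f."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0.0) + p
    return out

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}
q = pushforward(lambda x: 'm' if x in ('a', 'b') else x, p)  # merge a and b

# Data processing: a deterministic map cannot increase entropy
assert H(q) <= H(p) + 1e-12
```

Merging outcomes can only destroy information, so the entropy of the image distribution never exceeds that of the original.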

Remark (Distributive Laws (Preview)).

When a monad T and a comonad W live on the same category, a distributive law \lambda: TW \Rightarrow WT allows their effects and contexts to interact coherently. This arises in probabilistic programming (combining the Giry monad with the Store comonad for spatial probabilistic models) and in reinforcement learning (combining the continuation monad with the Stream comonad for temporal credit assignment). The full development of distributive laws is beyond this topic but connects to recent work in compositional game theory and open dynamical systems.


Computational Notes

The following Python code verifies the monad and comonad laws computationally.

Maybe monad verification

# Maybe monad: T(X) = X ∪ {⊥}, with Just(x) encoded as ('Just', x) and ⊥ as ('Nothing',)
def maybe_unit(x): return ('Just', x)
def maybe_bind(mx, f):
    if mx[0] == 'Nothing': return ('Nothing',)
    return f(mx[1])

# Test laws
f = lambda x: ('Just', x * 2)
g = lambda x: ('Just', x + 5) if x > 0 else ('Nothing',)

# Left unit: bind(unit(x), f) = f(x)
assert maybe_bind(maybe_unit(5), f) == f(5)  # Just(10) == Just(10)

# Right unit: bind(m, unit) = m
m = ('Just', 42)
assert maybe_bind(m, maybe_unit) == m  # Just(42) == Just(42)

# Associativity: bind(bind(m, f), g) = bind(m, lambda x: bind(f(x), g))
m = ('Just', 5)
assert maybe_bind(maybe_bind(m, f), g) == maybe_bind(m, lambda x: maybe_bind(f(x), g))

Giry monad (discrete) verification

def giry_unit(x, support):
    """Dirac delta: delta_x gives probability 1 to x."""
    return {s: (1.0 if s == x else 0.0) for s in support}

def giry_bind(p, kernel, support):
    """Kleisli composition = Chapman-Kolmogorov."""
    result = {y: 0.0 for y in support}
    for x, px in p.items():
        k = kernel(x)
        for y in support:
            result[y] += px * k.get(y, 0.0)
    return result

support = ['a', 'b', 'c']
p = {'a': 0.5, 'b': 0.3, 'c': 0.2}
kernel = lambda x: {'a': 0.1, 'b': 0.7, 'c': 0.2} if x == 'a' else \
                    {'a': 0.3, 'b': 0.4, 'c': 0.3} if x == 'b' else \
                    {'a': 0.6, 'b': 0.1, 'c': 0.3}

# Left unit: bind(delta_a, kernel) = kernel(a)
result = giry_bind(giry_unit('a', support), kernel, support)
assert all(abs(result[s] - kernel('a')[s]) < 1e-10 for s in support)

Stream comonad verification

class Stream:
    """Stream comonad: infinite sequence with a focus."""
    def __init__(self, gen_fn, index=0):
        self.gen_fn = gen_fn
        self.index = index

    def extract(self):
        return self.gen_fn(self.index)

    def duplicate(self):
        return Stream(lambda i: Stream(self.gen_fn, i), self.index)

    def extend(self, f):
        return Stream(lambda i: f(Stream(self.gen_fn, i)), self.index)

    def take(self, n):
        return [self.gen_fn(self.index + i) for i in range(n)]

# Fibonacci stream
def fib(n):
    a, b = 0, 1
    for _ in range(n): a, b = b, a + b
    return a

s = Stream(fib, 0)

# Counit law: extract(duplicate(s)) = extract(s)
assert s.duplicate().extract().extract() == s.extract()

# Extend-extract law: extend(extract)(s) = s
id_s = s.extend(lambda w: w.extract())
assert id_s.take(5) == s.take(5)

Monad from adjunction

# The List monad arises from the Free monoid adjunction:
#   F: Set → Mon (free monoid = list construction)
#   U: Mon → Set (forgetful: forget the multiplication)
# T = UF: Set → Set sends X to X* (lists over X)
print("T(X) = X* = Free(X) = List(X)")
print("η('a') = ['a']")       # unit = singleton list
print("μ([['a','b'],['c']]) = ['a','b','c']")  # multiplication = concat

Connections and Further Reading

Within-track connections

  • Categories & Functors: all foundational definitions — categories, functors, endofunctors, products, coproducts; the category \mathbf{Cat} of small categories.
  • Natural Transformations: the unit \eta and multiplication \mu are natural transformations; a monad is a monoid in the functor category [\mathcal{C}, \mathcal{C}].
  • Adjunctions: direct prerequisite — every adjunction yields a monad (T = GF, \mu = G\varepsilon F) and a comonad (W = FG, \delta = F\eta G); the Kleisli and EM categories provide canonical adjunctions.

Cross-track connections

  • Bayesian Nonparametrics: the Giry monad is the categorical foundation of Bayesian probability; Kleisli arrows = Markov kernels; Giry-algebras = convex spaces; the Dirichlet process is a Giry-algebra.
  • Random Walks & Mixing: Markov chains are Kleisli arrows of the Giry monad; Chapman-Kolmogorov = Kleisli composition.
  • Message Passing & GNNs: GNN layers are coKleisli extensions of the neighborhood comonad; aggregation = comonadic extend.
  • Gradient Descent: backpropagation is Kleisli composition in the continuation monad; the chain rule = functoriality of CPS.
  • Lagrangian Duality: the monad T = GF from the duality Galois connection; T-algebras are problems with strong duality.
  • Measure-Theoretic Probability: Giry monad unit = Dirac delta; Giry multiplication = integration over probability measures.
  • Shannon Entropy: entropy H is a monad morphism from Giry to the additive reals; the data processing inequality follows.

Notation Summary

  • (T, \eta, \mu) — monad (endofunctor, unit, multiplication)
  • \eta: \mathrm{Id} \Rightarrow T — monad unit
  • \mu: T^2 \Rightarrow T — monad multiplication
  • \mathcal{C}_T — Kleisli category
  • \mathcal{C}^T — Eilenberg-Moore category
  • F_T \dashv G_T — Kleisli adjunction
  • F^T \dashv U^T — Eilenberg-Moore adjunction
  • (W, \varepsilon, \delta) — comonad (endofunctor, counit, comultiplication)
  • \varepsilon: W \Rightarrow \mathrm{Id} — counit (extraction)
  • \delta: W \Rightarrow W^2 — comultiplication (duplication)
  • \mathcal{G} — Giry monad on \mathbf{Meas}
  • \delta_x — Dirac delta measure at x
  • g \mathbin{>=>} f — Kleisli composition (fish operator)

Connections

  • Direct prerequisite. Every adjunction F ⊣ G yields a monad T = GF on the source and a comonad W = FG on the target. The Kleisli and Eilenberg-Moore categories provide canonical adjunctions that recover the monad. The triangle identities imply the monad laws. adjunctions
  • Transitive prerequisite. The unit η and multiplication μ are natural transformations. Monad laws are commutative diagrams. A monad is a monoid in the functor category [C, C] — whose morphisms are natural transformations. natural-transformations
  • Transitive prerequisite. All foundational definitions. Endofunctors T: C → C, functor composition, the category Cat. Products and coproducts connect to monadic constructions via RAPL. categories-functors
  • Cross-track prerequisite. The Giry monad G on Meas is the categorical foundation of Bayesian probability. Kleisli arrows are Markov kernels. The Dirichlet process — the central object of Bayesian nonparametrics — is a G-algebra on the space of probability measures. bayesian-nonparametrics
  • Cross-track connection. Markov chains are Kleisli arrows of the Giry monad. The Chapman-Kolmogorov equation is Kleisli composition. Random walk mixing corresponds to iterating the Kleisli endomorphism and studying convergence in the EM category. random-walks
  • Cross-track connection. GNN message-passing layers are coKleisli extensions of the neighborhood comonad. The aggregation step is the comonadic extend operation. Multi-hop aggregation corresponds to iterated coKleisli composition. message-passing
  • Cross-track connection. The continuation monad C_R(X) = (X → R) → R provides the categorical framework for automatic differentiation. Backpropagation is Kleisli composition in the continuation monad — the chain rule is functoriality of CPS. gradient-descent
  • Cross-track connection. The monad T = GF from the Lagrangian duality Galois connection captures the primal-dual relationship. T-algebras are optimization problems where strong duality holds. lagrangian-duality
  • Cross-track connection. The Giry monad's unit is the Dirac delta x ↦ δ_x, and its multiplication is integration over probability measures. Meas is the natural habitat of the Giry monad. measure-theoretic-probability
  • Cross-track connection. Entropy H is a monad morphism from the Giry monad to the additive reals. The data processing inequality follows from the monad morphism property. shannon-entropy

References & Further Reading

  • Mac Lane (1998), Categories for the Working Mathematician. Chapter VI covers monads (triples) definitively — the standard reference for monad laws, Kleisli and EM categories, and Beck's theorem.
  • Riehl (2016), Category Theory in Context. Chapter 5 develops monads with Beck's theorem and excellent exercises — freely available online.
  • Awodey (2010), Category Theory. Chapter 10 on monads, with accessible examples for algebraists.
  • Fong & Spivak (2019), An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Monads in applied settings — databases, probability, programming.
  • Plotkin & Power (2002), “Notions of Computation Determine Monads”. The fundamental correspondence between computational effects and monads.
  • Shiebler, Gavranović & Wilson (2021), “Category Theory in Machine Learning”. Sections on monadic and comonadic ML structures.
  • Uustalu & Vene (2008), “Comonadic Notions of Computation”. Foundational treatment of comonadic computation patterns.
  • Fong, Spivak & Tuyéras (2019), “Backprop as Functor: A compositional perspective on supervised learning”. CPS and the continuation monad in automatic differentiation.
  • Giry (1982), “A Categorical Approach to Probability Theory”. The original construction of the probability monad on \mathbf{Meas}.
  • Bradley, Bryson & Terilla (2020), Topology: A Categorical Approach. The ultrafilter monad; compact Hausdorff spaces as monadic over \mathbf{Set}.