Cyclic Structures and Immutability

8 September 2019

A piece of folk knowledge that doesn’t really seem to be laid out explicitly anywhere on the Internet is that in a language that (a) has no mutable data types and (b) isn’t lazy, you can’t create cyclic structures.

You can of course always create semantically cyclic structures. For instance, you could represent a cyclic linked list with the array [1, 2, 3, 4, 0], where each element holds the index of its successor. What we’re talking about here are structures that the language understands as cyclic, i.e. where there exists a set of language-level pointers that form a cycle. We’ll come back to the significance of this distinction.
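
To make this concrete, here is a minimal sketch (in OCaml, which we’ll meet again below; the names are mine) of walking that index-based representation. The language sees only an array of integers, yet repeated lookups loop forever:

let succ_of = [| 1; 2; 3; 4; 0 |]

let () =
  let step i = succ_of.(i) in
  (* Five steps from index 0 wrap back around to index 0. *)
  assert (step (step (step (step (step 0)))) = 0)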

Mutation

It’s obviously easy to create a cycle if you allow mutation. In Scheme (I’ve been using a lot of Python on this blog, so let’s mix things up):

(import (scheme base) (scheme write))

(define a (list 1))   ; a = (1)
(define b (cons 2 a)) ; b = (2 1), sharing a's cell
(set-cdr! a b)        ; mutate a's tail to point back through b
(display a)
; => #0=(1 2 . #0#)

(The output uses R7RS Scheme’s datum labels, where #0= labels the first pair and #0# refers back to it: this is the notation for the infinite repeating list (1 2 1 2 …).)

[Figure: mutably constructing a cyclic structure. In black, data that was immutably constructed; in green, the mutation to the original data.]

This extends easily to arbitrary cyclic structures, since we can add the backward links one by one by mutating the existing structure.
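
As a minimal sketch of the same recipe in OCaml (the node type is hypothetical): build the forward links immutably, then close the cycle with a single mutation.

type node = { value : int; mutable next : node option }

let () =
  let a = { value = 1; next = None } in
  let b = { value = 2; next = Some a } in  (* forward link, built immutably *)
  a.next <- Some b;                        (* the mutation that closes the cycle *)
  let next n = match n.next with Some m -> m | None -> assert false in
  assert (next (next a) == a)              (* two steps lead back to a itself *)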

Laziness

Being immutable but lazy also lets you create cyclic structures. Haskell, for instance, is famous for being lazy:

let a = 1 : b ; b = 2 : a in a
{- => [1,2,1,2 …] -}

This works because a and b are initially bound to unevaluated thunks rather than finished values: the cell for a can refer to b before b has been computed, and by the time anything actually demands the second element of b, a is fully created and the reference points to the correct location.

[Figure: the times at which the different structures are created and the variables assigned; the order is yellow, red, violet.]

Note that Haskell has other ways to make lists that look infinite, because lists can be generated lazily on demand. But there’s a difference: the list we defined above is really just two cons cells that point at each other and repeat, whereas a truly infinite list is a stream of distinct cells, as the sketch below makes observable.
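
Jumping ahead to the OCaml construction from the next section, the sharing is directly visible with physical equality: walking two steps around the knotted list lands on the very same cell, whereas a genuinely infinite list would yield a fresh cell at every step.

let rec a = 1 :: b and b = 2 :: a

(* (==) is physical equality, i.e. pointer identity: two cells, shared forever. *)
let () = assert (a == List.tl (List.tl a))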

In the example above, the structure of the cycle (that is, the number and position of backward pointers) was known at compile time. Can we create arbitrary cyclic structures, for example structures defined by a user? We can; a well-known technique for this in Haskell is called “tying the knot”.

Take a representation of an arbitrary graph as a list in which each element represents a node and is itself the list of that node’s children, given as indices into the top-level list. The following constructs a cyclic data structure, using exactly the same principle as above:

indirectGraph :: [[Int]]
indirectGraph = [[0, 2], [], [1, 2], [0]]

-- No Show instance: printing a cyclic value would never terminate.
data DirectGraph t = DirectGraph t [DirectGraph t]

toDirect :: [[Int]] -> DirectGraph Int
toDirect g =
    let tieList = [direct i ns | (i, ns) <- zip [0 ..] g]
        direct idx ns = DirectGraph idx [tieList !! i | i <- ns]
    in head tieList

directGraph :: DirectGraph Int
directGraph = toDirect indirectGraph

You can also use laziness explicitly to create these graph structures in a strict language, but you will end up with a lazy graph of lazy values, which, in a strict language, is not the same type as a graph of ordinary, non-lazy values.
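
Here is a minimal sketch of what that looks like in OCaml (the names are mine): the whole node array is a single lazy value that each successor thunk refers back to, and the laziness leaks into the node type.

type 'a lnode = { value : 'a; succs : 'a lnode list Lazy.t }

let to_direct (g : int list array) : int lnode =
  let rec nodes = lazy (Array.mapi
    (fun i children ->
      { value = i;
        (* Nothing is forced here; each thunk just closes over nodes. *)
        succs = lazy (List.map (fun j -> (Lazy.force nodes).(j)) children) })
    g)
  in
  (Lazy.force nodes).(0)

let g0 = to_direct [| [0; 2]; []; [1; 2]; [0] |]
(* Node 0's first successor is node 0 itself: a real cycle, with sharing. *)
let () = assert (List.hd (Lazy.force g0.succs) == g0)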

Special handling

Some languages muddy the waters by allowing a restricted form of implicit laziness in recursive declarations. For instance, in OCaml:

let rec a = 1::b and b = 2::a in a;;
(* - : int list = [1; 2; <cycle>] *)

The restrictions on what can happen in a let rec are laid out in the documentation, but the OCaml typechecker code has comments that I find more enlightening. Essentially, the compiler recognizes applications of static data constructors, for which it can reserve space and obtain a pointer to the constructed block before the block’s contents are filled in.
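
For instance, a right-hand side that is an arbitrary computation rather than a constructor application is rejected outright (the exact error message varies by compiler version):

let rec xs = List.map (fun x -> x + 1) xs
(* Error: This kind of expression is not allowed as right-hand side of "let rec" *)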

Note that this OCaml trick only lets you define a finite (in fact, fixed) number of recursive arcs. There’s no way to transcribe the indirectGraph Haskell method from above using let rec alone in OCaml; you need explicit laziness, as in the earlier sketch.

Not all languages have recursive let constructs that let you do this, however. In R7RS Scheme, we get:

(display
  (letrec ((a (cons 1 b))
           (b (cons 2 a)))
    a))
; => (1 . #<undefined>) or throws an error

Scheme has no special handling for this situation. It creates storage locations for a and b, then tries to evaluate (cons 1 b), but b is not yet assigned! Depending on the implementation, this creates the cell (1 . #<undefined>) or fails. Even if it doesn’t fail, assigning b later does not change the value in the cell: cons copied whatever value b held at the time, so the cell’s tail is not tied to b’s location in any way.

[Figure: letrec in Scheme. letrec does not help, as b’s location already exists but does not point anywhere yet.]

Reference counting

Why do cyclic structures matter? After all, we’ve said we can always create semantically cyclic data, which is what matters for the expressive power of a language. Language-level cycles do matter, however. First, structures that rely on indices into arrays incur the cost of additional indirection when accessing their children or successors, as a single pointer access gets replaced with an array access followed by the actual pointer dereference.
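
A minimal sketch of the two access paths (all type and field names are mine):

type direct = { dvalue : int; next : direct }    (* the link is a pointer *)
type indirect = { ivalue : int; succ : int }     (* the link is an index *)

let step_direct (n : direct) : direct =
  n.next                                         (* one dereference *)

let step_indirect (nodes : indirect array) (n : indirect) : indirect =
  nodes.(n.succ)                                 (* read the index, then index the array *)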

Garbage collection is also affected by cycles. Reference counting is much simpler than tracing garbage collection, but it famously cannot reclaim cycles, since every cell in a cycle always retains at least one reference. In a non-lazy immutable language, cycles can’t exist, so reference counting becomes a complete garbage collection strategy. This is what jq’s (Turing-complete!) interpreter does, and it’s also what you can do in Unlambda.
