Breandan's Blog

Computation graphs and graph computation

A carefully edited anthology in which I vindicate my illustrious career as a hype-chasing Hacker News junkie, AI astrologer, and Twitter fortune-teller, while debunking my imaginary critics in the peanut gallery. I also extol the virtues of graphs, algebra, types, and how these concepts can help us think about software. Finally, I share my predictions for the path ahead, which I consider to be the start of an exciting new chapter in computing history.

TLDR: Research has shown a great many algorithms can be expressed as matrix multiplication, suggesting an unrealized connection between linear algebra and computer science. I speculate that graphs are the missing piece of the puzzle. Graphs are not only useful as cognitive aides, but are suitable data structures for performing a wide variety of computation, particularly on modern parallel processing architectures. Finally, I propose a computational primitive based on matrix multiplication, bridging graphs and computation.

n.b.: None of these ideas are mine. Shoulders of giants. Use landscape mode for optimal reading experience.

New decade, new delusions

Over the last decade, I bet on some strange ideas. A lot of people I looked up to at the time laughed at me. I’ll bet they aren’t laughing anymore. I ought to thank them one day, because their laughter gave me a lot of motivation. I’ve said some idiotic things to be sure, but I’ve also made some laughable predictions which were correct. Lesson learned: aim straighter.

In 2012, I was in Austin sitting next to an ex-poker player named Amir who was singing Hinton’s praises. Hypnotized by his technicolor slides, I quit my job in a hurry and started an educational project using speech recognition and restricted Boltzman machines. It never panned out, but I learned a lot about ASR and Android audio. Still love that idea.

In 2016, I quit my next job as a tech evangelist to run around the world giving incoherent talks about deep learning. Met Yoshua at the United Nations and decided to study in Canada. I applied to UofT and UdeM. Ended up at UdeM because I hate asking for recommendations, and they were the only ones who didn’t care about them anyway. Best decision I ever made. Move to Montréal, thank me later.

In 2017, I started writing a book on the ethics of automation and predicted mass unemployment and social unrest. Although I got the causes wrong (pandemic, go figure), the information economy and bias takes were all dead right. Sadly, this is now driving the world completely insane. Don’t say I warned you, go out and fix our broken systems. The world needs more engineers who care.

In 2017, I witnessed the birth of differentiable programming, which I stole from Chris Olah and turned into a master’s thesis. Had a lot of trouble convincing people that programs could be made differentiable, but look at the proceedings of any machine learning conference today and you’ll find dozens of papers on differentiable sorting and rendering and simulation. Don’t thank me, thank Chris and the Theano guys.

In 2018, I correctly predicted Microsoft would acquire GitHub to mine code. Why MS and not Google? I’ll bet they tried, but Google’s leadership had fantasies of AGI and besides JetBrains, MS were the only ones who gave a damn about developers. Now ML4SE is a thriving research area and showing up in real products, much to the chagrin of those who believed ML was a fad. I suspect their hype filter blinded them to the value those tools provide. Lesson learned: focus on tools, not hype.

But to heck with everything I’ve said! If I had just one idea to share with these ML people, it would be types. Beat that drum as loud as I could. Types are the best tool we know for synthetic reasoning. If you want to build provably correct systems that scale on real world applications, types are the answer. Not everyone is convinced yet, but mark my words, types are coming. Whoever figures out how to connect types and learning will be the next Barbara Liskov or Frances Allen.

This year, I predicted the pandemic weeks before the lockdown, exited the market, and turned down a job at Google. Some people called me crazy. Now I’m going all-in on some new ideas (none of which are mine). I’m making some big bets and some will be wrong, but I see the very same spark of genius in them. Hang on to your hats, because if I’m right, these ideas are going to shake the foundations of modern computing.

Everything old is new again

As a kid, I was given a book on the history of mathematics. I remember it had some interesting puzzles, including one with some bridges in a town divided by rivers, once inhabited by a man called Euler. Was there a tour crossing each bridge exactly once? Was it possible to tell without checking every path? I remember spending days trying to figure out the answer.

In the late 90s, my mom and I went to Ireland. I remember visiting Trinity College, and learning about a mathematician called Hamilton who discovered a famous formula connecting algebra and geometry, and carved it onto a bridge. We later visited the bridge, and the tour guide pointed out the stone, which we touched for good luck. The Irish have a thing for stones.

In 2007, I was applying to college and took the train from Boston to South Bend, Indiana, home of the Fighting Irish. Wandering about, I picked up a magazine article by a Hungarian mathematician called Barabási then at Notre Dame, who had some interesting things to say about complex networks. Later in 2009, while studying in Rochester, I carpooled with a nice professor, and learned complex networks are found in brains, languages, social networks and many marvelous places.

Fast forward to 2017. I was lured by the siren song of algorithmic differentiation. Olivier Breleux presented Myia and Buche. Matt Johnson gave a talk on Autograd. I met Chris Olah in Long Beach, who gave me the idea to study differentiable programming. I stole his idea, dressed it up in Kotlin and traded it for a POPL workshop paper and later a Master’s thesis. Our contributions were using algebra, shape inference and presenting AD as term rewriting.

In 2019, I joined a lab with a nice professor at McGill applying knowledge graphs to software engineering. Like logical reasoning, knowledge graphs are an idea from the first wave of AI in the 1960s and 70s which have been revived and studied in light of recent progress in the field. I believe this is an important area of research with a lot of potential. Knowledge and traceability plays an important role in software engineering, and it’s the bread-and-butter of a good IDE. The world needs better IDEs if we’re ever going to untangle this mess we’re in.

This Spring, I took a fascinating seminar on Graph Representation Learning. A lot of delightful graph theory had been worked out over the preceding decade. PageRank turned into power iteration. People made lots of interesting connections to linear algebra, including Weisfeiler-Lehman graph kernels, graph Laplacians and spectral graph theory. There are some elegant mathematics for representing graphs, and choosing the right representation can be very powerful. More on that later.

What are graphs?

Graphs are general-purpose data structures used to represent a variety of data types and procedural phenomena. Unlike most languages, which are highly sequential, graphs are capable of expressing a much richer family of relations between entities. Consider the following hierarchy of data structures, all of which are graphs with increasing expressive power:

Directed graphs can be used for modeling mathematical expressions as I show in Kotlin∇, as well as other formal languages, including source code, intermediate representations and markup. There are many recent examples of learning directed graphs for neuro-symbolic applications:

Graphs are also be found in natural language, such as constituency, dependency, link and other common grammars. Research has begun to show many practical applications for such grammars in the extraction and organization of human knowledge stored in large text corpora. Those graphs can be further processed into ontological representations for logical reasoning.

Using coreference resolution and entity alignment techniques, we can reconstruct internally consistent relations between entities, reflecting cross-corpus consensus in natural language datasets. These relationships can be stored in knowledge graphs, and used for information retrieval and question answering, e.g. on wikis and other content management systems. Recent techniques have shown promise in automatic knowledge base construction (cf. Reddy et al., 2016).

Lo and behold, the key idea behind knowledge graphs is our old friend, types. Knowledge graphs are multi-relational graphs whose nodes and edges possess a type. Two entities can be related by multiple types, and each type can relate many pairs of entities. We can index an entity based on its type for knowledge retrieval, and use types to reason about compound queries, e.g. “Which company has a direct flight from a port city to a capital city?”, which would otherwise be difficult to model explicitly without a type system.

Induction introduction!

In this section, we will review some ideas from Chomskyan linguistics, including structural induction, rewriting systems and λ-calculus. If you are already familiar with these ideas, feel free to skim or skip to the next section.

Regular languages

One thing that always fascinated me is the idea of inductively defined languages, also known as recursive, or structural induction. Consider a very simple language which accepts strings of the form 0, 1, 100, 101, 1001, 1010, et cetera, but rejects 011, 110, 1011, or any string containing 11. The symbol indicates a “production”. The | symbol, which we read as “or”, is just a shorthand for defining multiple productions on a single line:

true → 1
term → 0 | 10 | ε
expr → term | expr term

We have two sets of productions, ones which can be expanded, called “nonterminals”, and ones which can be expanded no further, called “terminals”. Notice how each non-terminal occurs at most once in any single production. This property guarantees the language is recognizable by a special kind of graph, called a finite state machine. As their name indicates, FSMs contain a finite number of states, with labeled transitions between them:

Finite State Machine Library Courtesy Bell


Please ring the bell once
and wait for assistance.

Imagine a library desk: you can wait quietly and eventually you will be served. You can ring the bell once, and wait quietly to be served. Should no one arrive after some time, you may press the bell again and continue waiting. Though you must never ring the bell twice, lest you disturb the patrons and be tossed out.

Arithmetic

Now suppose we have a slightly more expressive language which accepts well-formed arithmetic expressions with up to two variables, in either infix or unary operator notation. In this language, a non-terminal occurs twice inside a single production – an <expr> can be composed of two shorter <expr>s:

term → 1 | 0 | x | y
  op → + | - | ·
expr → term | op expr | expr op expr

This is known as a context-free language (CFL). We can represent strings in this language using a special kind of graph, called a syntax tree. Each time we expand an <expr> with a production rule, this generates a rooted subtree on <op>, whose branch(es) are <expr>s. Typically, syntax trees are inverted, with branches extending downwards, like so:

Syntax Tree Peach Tree

While syntax trees can be interpreted computationally, they do not actually perform computation until evaluated. To evaluate a syntax tree, we will need to introduce some new rules. Instead of just allowing terminals to occur on the right hand side of a grammar production, suppose we also allow terminals on the left, and applying a rule can reduce the size of a string in our language. Here, we use capital letters on the same line to indicate an exact match, e.g. a rule U + V → V + U would replace x + y with y + x:

                                         E + E → +E
                                         E · E → ·E
                  E + 1 | 1 + E | +1 | -0 | ·1 → 1
                         E + 0 | 0 + E | E - 0 → E
  E - E | E · 0 | 0 · E | 0 - E | +0 | -1 | ·0 → 0

This is known as a recursively enumerable language, or string rewrite system. This particular example produces directed acyclic graphs, which we can think of as grafting or pruning the branches of a tree. If we must combine two identical expressions, why evaluate them twice? If we need to multiply an expression by 0, why evaluate it at all? Some say, “all trees are DAGs, but not all DAGs are trees”. Growing up in the woods, I prefer to think of a DAG as a tree with a gemel:

Rewrite Rule Deformed Tree



Let us now introduce a new operator, Dₓ, and some corresponding rules. In effect, these rules will push Dₓ as far towards the leaves as possible, while rewriting terms along the way. We will also introduce some terminal rewrites:

[R0]       term → Dₓ(term)
[R1]      Dₓ(x) → 1                  
[R2]      Dₓ(y) → 0                  
[R3]    Dₓ(U+V) → Dₓ(U) + Dₓ(V)      
[R4]    Dₓ(U·V) → U·Dₓ(V) + Dₓ(U)·V  
[R5]     Dₓ(+U) → +Dₓ(U)
[R6]     Dₓ(-U) → -Dₓ(U)
[R7]     Dₓ(·U) → +U·Dₓ(U)
[R8]      Dₓ(1) → 0
[R9]      Dₓ(0) → 0

Although we assign an ordering R0-R9 for notational convenience, an initial string, once given to this system, will always converge to the same result, no matter the order in which we perform the substitutions (proof required):

Term Confluence Ottawa-St. Lawrence Confluence


This feature, called confluence, is an important property of some rewrite systems: regardless of the substitution order, we will always arrive at the same result. If all strings in a language reduce to a form which can be simplified no further, we call such systems strongly normalizing, or terminating. If a rewriting system is both confluent and terminating it is said to be convergent.

λ-calculus

Consider a language which has the following grammar:

expr → var | func | appl
func → (λ var.expr)
appl → (expr expr)

To evaluate an expr in this language, we will need a single substition rule. The notation expr[var → val], is read as, “within expr, var becomes val”:

(λ var.expr) val → (expr[var → val])

For example, applying the above rule to the expression (λy.y z) 1 yields (λy.1 z). With this seemingly trivial addition, our language is now powerful enough to encode any computable function! This language is known as the pure untyped λ-calculus.

While grammatically compact, computation in the λ-calculus is not particularly terse. In order to perform any computation using this system, we will need a way to encode values. For example, we can encode the boolean algebra like so:

[D1]           λx.λy.x = T     "true"
[D2]           λx.λy.y = F     "false"
[D3]       λp.λq.p q p = &     "and"
[D4]       λp.λq.p p q = |     "or"
[D5]    λp.λa.λb.p b a = !     "not"

To evaluate a boolean expression !T, we will first need to encode it as a λ-expression. Then, we can evaluate it using the λ-calculus as follows:

  (           !          ) T
→ (λp.λa.λb.    p     b a) T   [D5]
→ (   λa.λb.    T     b a)     [p → T]
→ (   λa.λb.(λx.λy.x) b a)     [D1]
→ (   λa.λb.(   λy.b)   a)     [x → b]
→ (   λa.λb.(   λy.b)    )     [y → a]
→ (   λa.λb.b            )     [y →  ]
→ (   F                  )     [D2]

We have now reached a terminal, and can recurse no further. Unlike its typed cousin, the untyped λ-calculus not strongly normalizing and thus not guaranteed to converge. If it were convergent, it would not be Turing complete.

Cellular automata

Consider the elementary cellular automata, which consists of a one dimensional array, and a 3-cell rewrite system. There are 223=256 possible rules for rewriting the tape. It turns out even in this tiny space, there are remarkable automata. Consider the following rewrite system, known as Rule 110:

current pattern 111 110 101 100 011 010 001 000
new pattern 0 1 1 0 1 1 1 0

We can implement this by scanning the tape and replacing any cells matching the centermost element in the first row with the second row’s value. This system is known to be Turing complete. Disregarding efficiency, we could encode any computable function as an initial state and mechanically apply Rule 110 until fixpoint termination to simulate a TM.

Graphs, inductively

Just like grammars, we can define graphs themselves inductively. As many graph algorithms are recursive, this choice considerably simplifies their implementation. Take one definition for an unlabeled directed graph, proposed by Erwig (2001). Here, the notation list → [item] is a shorthand for list → item list, where item is some terminal, and list is just a list of items:

vertex  → int
adj     → [vertex]
context → (adj, vertex, adj)
graph   → empty | context & graph

Erwig defines a graph in four parts. First, we have a vertex, which is simply an integer. Next we have a list of vertices, adj, called an adjacency list. The context is a 3-tuple containing a vertex and symmetric references to its inbound and outbound neighbors, respectively. Finally, we have the inductive case: a graph is either (1) empty, or (2) a context and a graph.

String
Graph

    ([3],       4, [1, 3])  &
    ([1, 2, 4], 3, [4]   )  &
    ([1],       2, [1, 3])  &
    ([2, 4],    1, [2, 3])

Let us consider a simple graph implementation in Kotlin. We do not record inbound neighbors, and attempt to define a vertex as a closed neighborhood:

open class Graph(val vertices: Set<Vertex>) { ... }
data class Vertex(neighbors: Set<Vertex>): Graph(this + neighbors)
//                                               ↳ Compile error!

Note the coinductive definition, which creates problems right off the bat. Since this is not accessible inside the constructor, we cannot have cycles or closed neighborhoods. Maybe we can come up with a definition which allows cycles and closed neighborhoods by avoiding coinduction:

class Graph(val vertices: Set<Vertex>) { ... }
class Vertex(val neighbors: Set<Vertex>)

Already, this definition admits a nice k-nearest neighbors implementation:

tailrec fun Vertex.neighbors(k: Int = 0, vertices: Set<Vertex> =
                             neighbors + this): Set<Vertex> =
  if (k == 0 || vertices.neighbors() == vertices) vertices
  else knn(k - 1, vertices + vertices.neighbors() + this)

fun Set<Vertex>.neighbors() = flatMap { it.neighbors() }.toSet()

// Removes all vertices outside the set
fun Set<Vertex>.closure() = map { vertex ->
  Vertex(neighbors.filter { it in this@closure })
}.toSet()

fun Vertex.neighborhood(k: Int = 0) = Graph(neighbors(k).closure())

But what about cycles? To support cycles, we will need to modify our definition slightly, to delay edge instantiation until after construction:

class Graph(val vertices: Set<Vertex>) { ... }
class Vertex(map: (Vertex) -> Set<Vertex>) {
    val neighbors = map(this).toSet()
}

We can now call Vertex() { setOf(it) } to create a vertex with a self-loop.

Let us consider an algorithm called the Weisfeiler-Lehman isomorphism test, which my colleague David Bieber wrote a nice piece about. I’ll focus on the implementation. First, we need a pooling operator, which will aggregate all neighbors in our neighborhood using some summary statistic:

fun Graph.poolBy(statistic: Set<Vertex>.() -> Int): Map<Vertex, Int> =
  nodes.map { it to statistic(it.neighbors()) }.toMap()

Next, we’ll need a histogram, which counts each node’s neighborhood:

val histogram: Map<Vertex, Int> by lazy { poolBy { size } }

Now we’re ready to define the Weisfeiler-Lehman operator, which recursively computes a hash on the histogram for k rounds.

tailrec fun wl(k: Int, labels: Map<Vertex, Int>): Map<Vertex, Int> =
  if (k <= 0) labels
  else wl(k - 1, poolBy { map { labels[it]!! }.sorted().hashCode() })

We compute the hashcode of the entire graph by hashing the multiset of WL labels. With one round, we’re just comparing the degree histogram. The more rounds we add, the more likely we are to detect a symmetry breaker:

override fun Graph.hashCode(rounds: Int = 10) = 
    wl(rounds, histogram).values.sorted().hashCode()

Finally, we can define a test to detect if one graph is isomorphic to another:

fun Graph.isIsomorphicTo(that: Graph) =
  this.nodes.size == that.nodes.size && 
  this.numOfEdges == that.numOfEdges && 
  this.hashCode() == that.hashCode()

That’s it! This algorithm works on almost every graph you will ever encounter. For a complete implementation of Graph, refer to this repository.

TODO: Graph grammars are grammars on graphs.

TODO: Single/Double pushout

Graphs, visually

Graphs have also found many interesting applications as reasoning devices in various domains:

Diagramming Language Example
Feynman diagram
Category theory
Penrose notation
Tensor network notation
Finite state machines
Petri networks
Proof networks

The λ-calculus can also be interpreted graphically. I refer the gentle reader to the following proposals:

As Tae Danae Bradley vividly portrays, matrices are not just 2D arrays, matrices are functions on vector spaces. This has a nice visual representation using a bipartite graph:

Not only do matrices correspond to graphs, graphs also correspond to matrices. One way to think of a graph is just a boolean matrix, or real matrix for weighted graphs. Consider an adjacency matrix containing nodes V, and edges E, where:

Just like matrices, we can also think of a graph as a function which carries information from state to state - given a state, it tells us which next states are accessible. This correspondence suggests an unrealized connection between graph theory and linear algebra which is still being explored, and promises important applications for signal processing on graphs.

Geometric
Matrix

Note the lower triangular structure of this matrix, indicating there are no cycles, a property which is not immediately obvious from the naïve geometric layout. Iff the vertices of a directed graph can be reordered to produce an adjacency matrix in triangular form, this graph is said to be a directed acyclic graph. Commonly encountered in introductory CS classes, this ordering, called a topological ordering, can also be implemented using matrix multiplication on the adjacency matrix.

Both geometric and matrix representations impose a extrinsic perspective on graphs, each with its own advantages and disadvantages. 2D renderings can be visually compelling, but require solving a minimal crossing number or similar minimization to make network connectivity plain to the naked eye. While graph drawing is an active field of research, matrices can often reveal symmetries that are not obvious from a naive graph layout.

Matrices are problematic for other reasons. Primarily, by treating a graph as a matrix, we impose an ordering over all vertices which is often arbitrary. Note also its sparsity, and consider the size of the matrix required to store even small graphs. While problematic, this can be overcome with certain optimizations. Despite their disadvantages, matrices and are a natural representation choice for many graph algorithms, particularly on modern parallel processing hardware. More on that later.

Graphs, computationally

What happens if we define arithmetic operators on graphs? How could we define and interpret these operations in a meaningful way? As we have seen, one way to represent a directed graph is just a square matrix whose non-zero entries indicate edges between nodes. Just like real matrices in linear algebra, we can add, subtract, multiply and exponentiate them.

One interesting game mathematicians like to play, is to design a square matrix ℝK×K and raise it to a power. There are various tricks for designing the matrix and normalizing the product so it does not explode or vanish. If we then multiply the matrix by a state vector ℝK, we are effectively “simulating” the system at discrete time steps. This game has many important applications in control theory, dynamical systems and deep learning (RNNs).

We can think about this as either a matrix product (MM…M)V, or function application M(M(…M(V)…)), where M is a function on a vector space (these two views are equivalent). There are various names for M, such as the transition matrix, stochastic matrix, or Markov matrix. It turns out the very same idea is not just valid over real matrices, but can be generalized to boolean and integer matrices. We are primarily interested in the deterministic version, whose variables inhabit 𝔹K×K.

It turns out that power iteration of a square matrix converges to its the eigenvector. This has important consequences for dynamical systems on networks. Researchers are just beginning to understand how eigenvalues of the adjacency matrix govern long timescale dynamical processes on graphs. In this section, we will explore some examples of dynamical processes on graphs.

We have previously seen an example of graph computation, Weisfeiler-Lehman, and topsort. Three steps of Barabási’s preferential attachment algorithm:

DOT Graph Matrix

Another early example of graph computation can be found in Valiant (1975):

This astonishing result suggests that, at least for the context free languages, there is a parsing algorithm which is equivalent to matrix multiplication. For example, all of the following automata can be simulated using matrix multiplication:

We now attempt to show a few examples simulating a state machine using matrix multiplication. For illustrative purposes, the state simply holds a vector of binary or integer values, however we can also imagine it carrying other “messages” around the graph in a similar manner, using their corresponding algebras. Here, we will use the boolean algebra for matrix multiplication, where + corresponds to , and * corresponds to :

┌───┬───┬─────┬─────┐
│ x │ y │ x*y │ x+y │        Boolean Matrix Multiplication 
├───┼───┼─────┼─────┤ ┌─       ─┐ ┌─ ─┐ ┌─                     ─┐
│ 0 │ 0 │  0  │  0  │ │ a  b  c │ │ j │ │ a * j + b * k + c * l │
│ 0 │ 1 │  0  │  1  │ │ d  e  f │*│ k │=│ d * j + e * k + f * l │
│ 1 │ 0 │  0  │  1  │ │ g  h  i │ │ l │ │ g * j + h * k + i * l │
│ 1 │ 1 │  1  │  1  │ └─       ─┘ └─ ─┘ └─                     ─┘
└───┴───┴─────┴─────┘ 

Linear chains

To get started, let’s simply iterate through a linked list. We initialize the pointer to the head of the list, and each matmul advances the pointer by a single element. We add an implicit self loop to the final element, and halt whenever we detect a fixpoint.

Graph
Matrix
S
S'
    a  b  c
   ________
a | 0  0  0
b | 1  0  0
c | 0  1  1



1
0
0



0
1
0

    a  b  c
   ________
a | 0  0  0
b | 1  0  0
c | 0  1  1



0
1
0



0
0
1

    a  b  c
   ________
a | 0  0  0
b | 1  0  0
c | 0  1  1



0
0
1



0
0
1

Directed acyclic graphs

Simulating a DFA using a matrix is wasteful, since we only ever inhabit one state at a time. The real benefit of using matrices comes in when simulating nondeterminstic finite automata (NFA). Typical implementations require cloning the NFA when multiple transitions are valid. Instead of cloning the machine, we can simulate the superposition of all states using a single matrix.

Graph
Matrix
S
S'
    a  b  c  d
   ___________
a | 0  0  0  0
b | 1  0  0  0
c | 1  0  0  0
d | 0  1  1  1



1
0
0
0



0
1
1
0

    a  b  c  d
   ___________
a | 0  0  0  0
b | 1  0  0  0
c | 1  0  0  0
d | 0  1  1  1



0
1
1
0



0
0
0
1

    a  b  c  d
   ___________
a | 0  0  0  0
b | 1  0  0  0
c | 1  0  0  0
d | 0  1  1  1



0
0
0
1



0
0
0
1

We encode the accept state as a self cycle in order to detect the fixpoint S = MS = MMS, after which we halt execution.

Dataflow graphs

Suppose we have the function f(a, b) = (a + b) * b and want to compute f(2, 3). For operator indices, we will need two tricks. First, all operators will retain their state, i.e. 1s along all operator diagonals. Second, when applying the operator, we will combine values using the operator instead of performing a sum.

Graph
Matrix
S
S'
    a  b  +  *
   ___________
a | 0  0  0  0
b | 0  0  0  0
+ | 1  1  1  0
* | 0  1  1  1



2
3
0
0



0
0
5
3

    a  b  +  *
   ___________
a | 0  0  0  0
b | 0  0  0  0
+ | 1  1  1  0
* | 0  1  1  1



0
0
5
3



0
0
0
15

    a  b  +  *
   ___________
a | 0  0  0  0
b | 0  0  0  0
+ | 1  1  1  0
* | 0  1  1  1



0
0
0
15



0
0
0
15

Graphs, efficiently

One issue with efficient representation of graphs is space complexity. Suppose we have a graph with 105=100,000 nodes, but only a single edge. We will need 105*2 bits, or about 1 GB to store it in adjacency matrix form, whereas if we use an adjacency list, we will need only ⌈ 2*log2105 ⌉ = 34 bits. Most graphs are similarly sparse. But how do you multiply adjacency lists? Unclear. The solution is to use sparse matrices. That was easy.

Another, thornier, problem with graph algorithms is their time complexity. Many interesting problems on graphs are NP-complete, including Hamiltonian paths and subgraph isomorphism. If so, how are we supposed to do any computation if every operation may take nondeterminstic polynomial time? Computer science people are mostly concerned with worst case complexity, which rarely ever occurs in practice. Real world isomorphism can be solved quickly using approximate algorithms, such as the one we saw earlier.

One issue with computation graphs is that in most programming languages, they are not reified. That is, for a given value, we would like some method which programmatically returned its provenance, as a computation graph. Which we can encode that program as a graph.

A lot of the stuff in Graph Representation Learning is motivated by computational constraints. You can’t instantiate the adjacency matrix, because it’s too large, so you need all kinds of mathematical tricks to sum over or approximate it. But most graphs are sparse and have all kinds of symmetries. Finding the right graph embedding can get you real far…

Programs as graphs

It turns out graphs are not only useful as data structures, but we can think of the computation itself as a graph on a binary state space. Each tick of the clock corresponds to one matrix multiplication on a boolean tape.

Futamura (1983) shows us that programs can be decomposed into two inputs, static and dynamic. This can be viewed as a function mapping inputs to output:

Consider the static case, in which we have all the information available at compile time, we just need to multiply the state P: 𝔹|S|×|S| by the vector S until termination:

    [P]--------------------------------           } Program
      \          \          \          \
[S₀]---*---[S₁]---*---[S₂]---*---[..]---*---[Sₜ]  } TM tape

Now the dynamic case, P might be governed by another program:

        [Q]---------------------                  } Dynamics
          \          \          \
    [P₀]---*---[P₁]---*---[..]---*---[Pₜ₋₁]       } Program
      \          \          \          \
[S₀]---*---[S₁]---*---[S₂]---*---[..]---*---[Sₜ]  } TM tape

We might also imagine these inputs as being generated by higher order programs.

                     ⋮
            [R₀]---------                         } World model
              \          \
        [Q₀]---*---[..]---*---[Pₜ₋₂]              } Dynamics
          \          \          \
    [P₀]---*---[P₁]---*---[..]---*---[Pₜ₋₁]       } Program
       \         \          \           \
[S₀]---*---[S₁]---*---[S₂]---*---[...]---*---[Sₜ] } TM tape

What about programs of varying length? It may be the case we want to learn programs where t varies. The key is, we can choose an upper bound on t, and search for a fixpoint. That is, we halt whenever St = St+1.

There will always be some program, at the interface of the machine and the real world, which must be approximated. One question worth asking is how large does k need to be in order to do so? If it is very large, this procedure might well be intractable. Time complexity appears to be at least O(tk2), using Strassen.

Program synthesis

Many people have asked me, “Why should developers care about automatic differentiation?” Yes, we can use it for building machine learning systems. Yes, it has specialized applications in robotics and physical simulation. But does it really matter for software engineers?

I have been thinking carefully about this question, and although it is not clear to me yet, I am starting to see how some pieces fit together. A more complete picture will require a lot more research, engineering and rethinking the role of software, compilers and machine learning.

    [P]--------------------------------           } Program
      \          \          \          \
[S₀]---*---[S₁]---*---[S₂]---*---[..]---*---[Sₜ]  } TM tape

Consider the static case seen above. Since the matrix P is fixed throughout execution, to learn P, we need to solve the following minimization problem:

One issue with this formulation is we must rely on a loss function over St, which is often too sparse and generalizes poorly. It may be the case that many interesting program synthesis problems have optimal substructure, so we should be making “progress” towards a goal state, and can define a loss over intermediate states. This needs to be explored in more depth.

Some, including Gaunt et al. (2016), have shown gradient is not very effective, as the space of boolean circuits is littered with islands which have zero gradient. Their representation is also relative complex – effectively, they are trying to learn a recursively enumerable language using something like a Neural Turing Machine (Graves et al., 2014).

More recent work, including that of Lample et al. (2019), have demonstrated Transformers are capable of learning programs, where the program belongs to a much simpler class of context free languages. This space is often much more tractable to search and generate synthetic training data, and appears to be well within the reach of modern language models.

In the last year, a number of interesting reults in differentiable architecture search started to emerge. DARTS (Liu et al., 2019) proposes to use gradient to search through the space of directed graphs. The authors first perform a continuous relaxation of the discrete graph, by reweighting the output of each each potential value by a hyperparameter, optimizing over the space of operations, then discretizing the output graph.

Solar-Lezma calls this latter approach, “program extraction”, where the network implicitly or explicitly parameterizes the function, which after training, can be decoded into a symbolic expression. This also aligns with Goodfellow’s notion of deep networks as programs, where each step performs a certain “step” of computation.

A less charitable interpretation is that Goodfellow is simply using an metaphor to explain deep learning to lay audience, but I prefer to think he is communicating something deeper about the role of stacked nonlinear function approximators as computational primitives in a chain of function compositions.

References