Lambda calculus

What’s the least we can put in a programming language and still have it be able to do anything other languages can notei.e. turing complete? It turns out we can get away with including only a single built in feature: function definition, written using \(\lambda\). The language that has only this one feature is called “lambda calculus” or “$\lambda$ calculus.” We also need a way to apply our functions to inputs; we write \(f\ x\) to indicate the value of the function \(f\) on \(x\), perhaps more familiarly written \(f(x)\).

A \(\lambda\) expression is something of the form \(\lambda x.f(x)\), which means “take in an input \(x\), and return \(f(x)\). For example, \(\lambda x.x+1\) takes in a number and returns one more than it. We only directly have functions with one input, but we can build functions with multiple inputs out of this. For example, applying \(\lambda x.\lambda y.x+y\) to \(3\), and applying the result to \(4\) gives \(3+4=7\) . We can think of this as a two-input function, and abbreviate it as \(\lambda xy.x+y\); “$\lambda xy$” is simply a shorthand for “$\lambda x.\lambda y$.” We don’t lose any power by restricting ourselves to unary functions, because we can think of functions with multiple inputs as taking only one input at a time, and returning another function at each intermediate step.

We said the only feature we have is function definition, so what’s this about “$+$” and “$1$”? These symbols don’t exist in \(\lambda\) calculus, so we’ll have to define numbers and operations on them using \(\lambda\) noteThe standard way of doing this is called the Church encoding.. But \(\lambda\) expressions are useful even when we have other features as well; programming languages such as Python allow function definition using \(\lambda\). It’s also useful when trying to understand what \(\lambda\) means to see examples with functions you’re more familiar with.

We’re not going to put constraints on what inputs \(\lambda\) expressions can take. If we allowed expressions like \(\lambda x.x+1\), it would get confused when we try to give it something other than a number; this is why we can’t have \(+\) and \(1\). So what sorts of objects will we give our functions? The only objects have we: other functions, of course!

Formal definition

Let’s formally define \(\lambda\) calculus. First, the language of \(\lambda\) calculus contains

  • Parentheses ( and )

  • Symbols \(\lambda\) and .

  • Variables \(v_1,v_2,\dots\) (in practice we’ll use other letters too).

What counts as a \(\lambda\) expression? We define them recursively:

  • For any variable \(x\), \(x\) is a \(\lambda\) expression.

  • For any variable \(x\) and \(\lambda\) expression \(M\), \((\lambda x.M)\) is a \(\lambda\) expression.

  • If \(M\) and \(N\) are \(\lambda\) expressions, so is \((M\ N)\).

We also want to define free and bound variables, and also do this recursively:

  • The variable \(x\) is free in the expression \(x\).

  • Every instance of \(x\) is bound in the expression \((\lambda x.M)\). If an instance of \(x\) is free in \(M\), we say the initial \(\lambda x\) in \((\lambda x.M)\) binds it. We say that \(M\) is the scope of that \(\lambda x\).

  • An instance of \(x\) in \((M\ N)\) is free if it is free in \(M\) or \(N\), and bound otherwise.

We can use the rules to build expressions like \(((\lambda x.(\lambda y.(x\ y))))\ x)\), in which the first instance of \(x\) and the only instance of \(y\) are bound, and the second instance of \(x\) is free. We have an oppressive amount of parentheses; we’ll establish informal conventions to reduce them in the next section.

Parenthesis conventions

If we want to be extremely careful, we should write the add-two-numbers function as \((\lambda x.(\lambda y.(x+y)))\), using parentheses as in the formal definition. But we often don’t need them, and use the following rules:

  • We don’t need outermost parentheses. We can write \(\lambda x.(\lambda y.(x+y))\) instead of \((\lambda x.(\lambda y.(x+y)))\).

  • Application is, by default, left associative.\(f\ x\ y\) means “the function \(f\) with inputs \(x\) and \(y\),” i.e. \((f\ x)\ y\), not \(f\ (x\ y)\).

  • The scope of any \(\lambda\) is as large is possible.\(\lambda x.\lambda y.x+y\) means \(\lambda x.(\lambda y.(x+y))\), not \((\lambda x.\lambda y.x)+y\) or \(\lambda x.(\lambda y.x)+y\) or anything like that.

  • We can abbreviate sequences of \(\lambda\)s. As mentioned before, we can write \(\lambda xy.x+y\) for \(\lambda x.\lambda y.x+y\).

Of course, we don’t have to omit parentheses whenever we can. These rules govern what it means when we don’t have all the parentheses strictly necessary, but we can include them if we want to be extra clear.

Reducing lambda expressions

So far, \(\lambda\) expressions don’t have any meaning; we’ve said what we’re allowed to write down but not how the relate to each other. What we usually do with a \(\lambda\) expression is reduce it, and there are three rules that we can use: \(\beta\)-reduction, \(\alpha\)-conversion, and \(\eta\)-conversion.

\(\beta\)-reduction is the fancy name for “stick in the inputs to a function.” For example, to evaluate \((\lambda x.\lambda y.x+y)\ 6\ 3\), we first notice that \(6\) is the input to a \(\lambda\) expression that starts \(\lambda x\), and substitute \(6\) for \(x\) in the expression, giving us \((\lambda y.6+y)\ 3\). Now \(3\) is substituted for \(y\), giving us \(6+3=9\).

Formally, for a variable \(x\) and \(\lambda\) expressions \(M\) and \(N\), \(\beta\)-reduction converts \(((\lambda x.M)\ N)\) to \(M[N/x]\), i.e.\(M\) with every free instance of \(x\) replaced with \(N\). The expressions \(((\lambda x.M)\ N)\) and \(M[N/x]\) are equivalent.

todo: is there a more standard notation for this?

\(\alpha\)-conversion says we can rename variables; \(\lambda x.f\ x\) is the same as \(\lambda y.f\ y\). Formally, if \(M\) is a \(\lambda\) expression containing \(\lambda x\), and no instance of \(x\) bound by that \(\lambda x\) is within the scope of a \(\lambda y\), then we can replace every \(x\) in the scope of the \(\lambda x\) with a \(y\) to get an equivalent expression. We need the second criterion because otherwise we could replace the \(x\)s in \(\lambda x.\lambda y.x\) with \(y\)s to get \(\lambda y.\lambda y.y\), and then replace the outer \(y\) with \(x\) to get \(\lambda x.\lambda y.x\), which is not equivalent (the second substitution here is valid; the first one is the problem).

Finally, \(\eta\)-conversion says that if two expressions give the same result on any input, then they’re equivalent. We should have \(\lambda x.f\ x\) and \(f\) be the same, since \(\beta\)-reduction converts \((\lambda x.f\ x)\ x\) to \(f\ x\), so they’re the same on any input. Formally, if \(x\) isn’t free in \(M\), then \((\lambda x.(M\ x))\) is equivalent to \(M\).


Let’s look again at the function that adds two numbers, \(\lambda x.\lambda y.x+y\). There are two different ways we can think of this: first, as a function that takes in two numbers, and returns their sum. If we’re talking about natural numbers, this is a function \(\mathbb N^2\to\mathbb N\).

We could also think of the function as taking one input at a time. It says “take in input \(x\), return the expression \(\lambda y.x+y\).” For example, on input \(6\) we have \((\lambda x.\lambda y.x+y)\ 6=\lambda y.6+y\). Now we have a function that takes in a number, and gives us a function that takes in a number and returns a number. We can write this as \(\mathbb N\to(\mathbb N\to\mathbb N)\).

This equivalence between functions on two (or more) inputs, and functions on one input that return a new function, is known as Currying and is important in \(\lambda\) calculus, where the only objects are functions. We have to get used to the idea that a function can take other functions as arguments and give functions as return values.

Defining variables

\(\lambda\) calculus doesn’t have a built in way to define a variable. Suppose we have the function \(d=\lambda f.\lambda x.f\ (f\ x)\) which applies its first input to its second input twice. Currying, \(d\) takes in a function and returns that function composed with itself. Say we want to know what \(d\ d\) is. We could just write \(d\) twice, to get \((\lambda f.\lambda x.f\ (f\ x))\ (\lambda f.\lambda x.f\ (f\ x))\). \(\beta\)-reduction eventually reduces this to \(\lambda f.\lambda x.f\ (f\ (f\ (f\ x)))\), the function that applies its first input to its second input four times (as you might have expected, applying a function twice, twice, is the same as applying it four times).

We can also get \(d\ d\) while only writing out \(d\) once. Take the \(\lambda\) expression \((\lambda x.x\ x)\ d\) or, writing out \(d\), \((\lambda x.x\ x)\ (\lambda f.\lambda x.f\ (f\ x))\). A single step of \(\beta\)-reduction turns this into the previous version, where we wrote \(d\) out twice. The expression we started with was shorter, and expressed better the idea that we wanted to use the same value of \(d\) twice. In general, if we want to say something like \(x = a; f\ x\) (i.e. define the variable \(x\), and then use it in some expression), we can write \((\lambda x.f\ x)\ a\), first building the expression we want to use the value of \(x\) in, and then substituting in the value using function application. This will be very useful in the next section when we make recursive functions.


It might seem like \(\lambda\) expressions are very finite, and \(\beta\)-reduction will always finish after a reasonable amount of time. But for \(\lambda\) calculus to be as powerful as normal programming languages, there must be \(\lambda\) expressions that don’t reduce in a finite amount of time. The way to resolve this is to figure out how to build while loops in \(\lambda\) calculus; if you can make a while loop, you can write a program that runs forever pretty easily.

Since all we have is functions, we’ll use recursion to make loopy programs. For example, consider this factorial function in Python:

factorial = lambda n: 1 if n==0 else n*factorial(n-1)

How can we write this in lambda calculus? Let’s assume we have numbers, arithmetic, booleans,

todo: possibly change these links to church encoding pages once they exist
and a function \(Z\) that returns True on \(0\) and False on anything else. Let’s also a assume we have an “if” function \(I\) that takes in three inputs: \(I = (\lambda p.\lambda x.\lambda y.\)if \(p\), then \(x\), else \(y)\). So \(I\ True\ x\ y=x\) and \(I\ False\ x\ y=y\).

Starting to translate from Python to \(\lambda\) calculus, we have \(F = \lambda n.I\ (Z\ n)\ 1\ (n\times(F\ (n-1)))\), which says “if \(n\) is \(0\), then \(1\), else \(n\times F(n-1)\). But we can’t define \(F\) like this; we have to use the method described in the previous section. In order for each recursive call of the factorial function to know how to call further recursive calls, we need to pass in \(F\) to itself. Consider the expression

$$F=(\lambda x.x\ x)\ (\lambda r.\lambda n.I\ (Z\ n)\ 1\ (n\times(r\ r\ (n-1))))$$

Let \(g=\lambda r.\lambda n.I\ (Z\ n)\ 1\ (n\times(r\ r\ (n-1)))\). Then \(F=(\lambda x.x\ x)\ g=g\ g\) is \(g\) applied to itself. Plugging in \(g\) for \(r\), \(g\ g=\lambda n.I\ (Z\ n)\ 1\ (n\times(g\ g\ (n-1)))\). But since \(g\ g=F\), this is just \(\lambda n.I\ (Z\ n)\ 1\ (n\times(F\ (n-1)))\), exactly what we wanted \(F\) to be.

Let’s generalize this. Say we want to define a recursive formula \(f=\lambda x.h\ f\ x\), where \(h\) is any expression (in the factorial example \(h=\lambda f.\lambda n.I\ (Z\ n)\ 1\ (n\times(f\ (n-1)))\)). We define

\begin{align} g&=\lambda r.h\ (r\ r)\newline &=\lambda r.\lambda x.h\ (r\ r)\ x \end{align}

and use this to build

\begin{align} f&=(\lambda x.x\ x)\ g\newline &=g\ g\newline &=(\lambda r.\lambda x.h\ (r\ r)\ x)\ g\newline &=\lambda x.h\ (g\ g)\ x\newline &=\lambda x.h\ f\ x. \end{align}

We can express this process taking any \(h\) to the recursive function \(f\) as a \(\lambda\) expression:

\begin{align} Y&=\lambda h.(\lambda x.x\ x)\ (\lambda r.h\ (r\ r))\newline &=\lambda h.(\lambda r.h\ (r\ r))\ (\lambda r.h\ (r\ r)). \end{align}

For any \(h\), which takes as arguments \(f\) and \(x\) and returns some expression using them, \(Y\ h\) recursively calls \(h\). Specifically, letting \(f=Y\ h\), we have \(f=\lambda x.h\ f\ x\).

For fun, let’s make a \(\lambda\) expression that enters an infinite loop by reducing to itself. If we make \(h\) the function that applies \(f\) to \(x\), recursively calling it should accomplish this, since it’s analogous to the Python function f = lambda x: f(x), which just loops forever. So let \(h=\lambda f.\lambda x.f x=\lambda f.f\). Now \begin{align} Y\ h&=(\lambda x.x\ x)\ (\lambda r.(\lambda f.f)\ (r\ r))\newline &=(\lambda x.x\ x)\ (\lambda r.r\ r)\newline &=(\lambda r.r\ r)\ (\lambda r.r\ r). \end{align}

Continued \(\beta\)-reduction doesn’t get us anywhere. This expression is also interesting because it’s a simple quine, a program that outputs its own code, and this is a sort of template you can use to make quines in other languages.

This strategy of building recursive functions results in some very long intermediate computations. If \(F\) is the factorial function, we compute \(F(100)\) as \(100\times F(99)=100\times 99\times F(98)\), etc. By the time the recursive calls finally stop, we have the very long expression \(100\times99\times\dots\times2\times1\), which takes a lot of space to write down or store in a computer. Many programming languages will give you an error if you try to do this, complaining that you exceeded the max recursion depth.

The fix is something called tail recursion, which roughly says “return only a function evaluation, so you don’t need to remember to do something with it when it finishes.” The general way do accomplish this is to pass along the intermediate computation to the recursive calls, instead of applying it to the result of the recursive calls. We can make a tail recursive version of our Python factorial like so notePython will still give you a “maximum recursion depth exceeded” error with this, but the strategy works in lambda calculus and lazy languages like Lisp and Haskell, and Python is easier to demonstrate with.:

f = lambda n, k: k if n==0 else f(n-1, k*n)

We pass along \(k\) as an intermediate value. It has to start with \(k=1\); we can do this by defining a new function g = lambda n: f(n,1) or using Python’s default value feature to have \(k\) be \(1\) be default. Now if we want \(100\) factorial, we get \(f(100,1)=f(99,1*100)=f(99,100)=f(98,9900)=\dots\). We compute the products as we go instead of all at the end, and only ever have one call of \(f\) active at a time.

Writing a tail recursive version of factorial in \(\lambda\) calculus is left as an exercise for the reader.

Y combinator

You might be wondering why we called the recursive-function-maker \(Y\), so let’s look at that again. We said earlier that \(Y\ h=\lambda x.h\ (Y\ h)\ x\). By \(\eta\)-conversion, \(Y\ h=h\ (Y\ h)\). That means \(Y\ h\) is a fixed point of \(h\)! For any function \(h\), \(Y\) finds a value \(f\) such that \(h\ f=f\). This kind of object is called a fixed point combinator; this particular one is called the Y combinator, conventionally denoted \(Y\).

Why is a fixed point finder the right way to make recursive functions? Consider again the example of factorial. If a function \(f\) says “do factorial for \(k\) recursive steps,” then \(h\ f\) says “do factorial for \(k+1\) recursive steps;” \(h\) was designed to be the function which, calling recursively, gives factorial. Naming the identity function \(i=\lambda n.n\), \(i\) does factorial for \(0\) steps, so \(h\ i\) does it for \(1\) step, \(h\ (h\ i)\) for \(2\), and so on. To completely compute factorial, we need a function \(F\) that says “do factorial for infinity recursive steps.” But doing one more than infinity is the same as just doing infinity, so it must be true that \(h\ F=F\); i.e. that \(F\) is a fixed point of \(h\). There may be other fixed points of \(h\), but the Y combinator finds the fixed point that corresponds to recursively calling it.

With \(Y\) and without \(Y\)

What do we get, and what do we lose, if we allow the use of \(Y\)?

\(Y\) takes a lambda-term which may have corresponded to a primitive recursive function, and outputs a lambda term which need not correspond to a primitive recursive function. That is because it allows us to perform an unbounded search: it expresses fully-general looping strategies. (As discussed earlier, without a fixed-point finder of some kind, we can only create lambda-terms corresponding to programs which provably halt.)

On the other side of the coin, \(Y\) may produce non-well-typed terms, while terms without a fixed-point finder are guaranteed to be well-typed. That means it’s hard or even impossible to perform type-checking (one of the most basic forms of static analysis) on terms which invoke \(Y\).

Strongly relatedly, in a rather specific sense through the Curry-Howard correspondence, the use of \(Y\) corresponds to accepting non-constructiveness in one’s proofs. Without a fixed-point finder, we can always build a constructive proof corresponding to a given lambda-term; but with \(Y\) involved, we can no longer keep things constructive.


  • Mathematics

    Mathematics is the study of numbers and other ideal objects that can be described by axioms.