Lambda calculus

What’s the least we can put in a programming language and still have it be able to do anything other languages can (i.e., be Turing-complete)? It turns out we can get away with including only a single built-in feature: function definition, written using \(\lambda\). The language that has only this one feature is called “lambda calculus” or “$\lambda$ calculus.” We also need a way to apply our functions to inputs; we write \(f\ x\) to indicate the value of the function \(f\) on \(x\), perhaps more familiarly written \(f(x)\).

A \(\lambda\) expression is something of the form \(\lambda x.f(x)\), which means “take in an input \(x\), and return \(f(x)\).” For example, \(\lambda x.x+1\) takes in a number and returns one more than it. We only directly have functions with one input, but we can build functions with multiple inputs out of this. For example, applying \(\lambda x.\lambda y.x+y\) to \(3\), and applying the result to \(4\), gives \(3+4=7\). We can think of this as a two-input function, and abbreviate it as \(\lambda xy.x+y\); “$\lambda xy$” is simply a shorthand for “$\lambda x.\lambda y$.” We don’t lose any power by restricting ourselves to unary functions, because we can think of functions with multiple inputs as taking only one input at a time, and returning another function at each intermediate step.
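This one-input-at-a-time style can be checked directly in Python, whose lambda works the same way (a quick illustration, not part of \(\lambda\) calculus itself):

```python
# lambda x. lambda y. x + y, applied to 3 and then to 4.
add = lambda x: lambda y: x + y

step = add(3)     # applying to 3 returns a new one-input function
result = step(4)  # applying that to 4 gives 3 + 4
print(result)     # 7
```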

We said the only feature we have is function definition, so what’s this about “$+$” and “$1$”? These symbols don’t exist in \(\lambda\) calculus, so we’ll have to define numbers and operations on them using \(\lambda\) (the standard way of doing this is called the Church encoding). But \(\lambda\) expressions are useful even when we have other features as well; programming languages such as Python allow function definition using \(\lambda\). When trying to understand what \(\lambda\) means, it also helps to see examples with functions you’re more familiar with.

We’re not going to put constraints on what inputs \(\lambda\) expressions can take. If we allowed expressions like \(\lambda x.x+1\), they would get confused when we tried to give them something other than a number; this is why we can’t have \(+\) and \(1\). So what sorts of objects will we give our functions? The only objects we have: other functions, of course!

Formal definition

Let’s formally define \(\lambda\) calculus. First, the language of \(\lambda\) calculus contains

  • Parentheses ( and )

  • Symbols \(\lambda\) and .

  • Variables \(v_1,v_2,\dots\) (in practice we’ll use other letters too).

What counts as a \(\lambda\) expression? We define them recursively:

  • For any variable \(x\), \(x\) is a \(\lambda\) expression.

  • For any variable \(x\) and \(\lambda\) expression \(M\), \((\lambda x.M)\) is a \(\lambda\) expression.

  • If \(M\) and \(N\) are \(\lambda\) expressions, so is \((M\ N)\).

We also want to define free and bound variables, and we define these recursively as well:

  • The variable \(x\) is free in the expression \(x\).

  • Every instance of \(x\) is bound in the expression \((\lambda x.M)\). If an instance of \(x\) is free in \(M\), we say the initial \(\lambda x\) in \((\lambda x.M)\) binds it. We say that \(M\) is the scope of that \(\lambda x\).

  • An instance of \(x\) in \((M\ N)\) is free if it is free in \(M\) or \(N\), and bound otherwise.

We can use the rules to build expressions like \(((\lambda x.(\lambda y.(x\ y)))\ x)\), in which the first instance of \(x\) and the only instance of \(y\) are bound, and the second instance of \(x\) is free. We have an oppressive amount of parentheses; we’ll establish informal conventions to reduce them in the next section.
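The recursive definition of free variables turns into code almost verbatim. A sketch in Python, using an ad-hoc tuple encoding of terms (a bare string for a variable, ('lam', x, body) for abstraction, ('app', m, n) for application):

```python
def free_vars(expr):
    """Return the set of variable names occurring free in expr."""
    if isinstance(expr, str):        # a variable is free in itself
        return {expr}
    if expr[0] == 'lam':             # lambda x binds x throughout its scope
        _, x, body = expr
        return free_vars(body) - {x}
    _, m, n = expr                   # free in (M N) iff free in M or in N
    return free_vars(m) | free_vars(n)

# ((lambda x.(lambda y.(x y))) x): only the last instance of x is free.
example = ('app', ('lam', 'x', ('lam', 'y', ('app', 'x', 'y'))), 'x')
print(free_vars(example))            # {'x'}
```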

Parenthesis conventions

If we want to be extremely careful, we should write the add-two-numbers function as \((\lambda x.(\lambda y.(x+y)))\), using parentheses as in the formal definition. But we often don’t need them, and use the following rules:

  • We don’t need outermost parentheses. We can write \(\lambda x.(\lambda y.(x+y))\) instead of \((\lambda x.(\lambda y.(x+y)))\).

  • Application is, by default, left associative: \(f\ x\ y\) means “the function \(f\) with inputs \(x\) and \(y\),” i.e. \((f\ x)\ y\), not \(f\ (x\ y)\).

  • The scope of any \(\lambda\) is as large as possible: \(\lambda x.\lambda y.x+y\) means \(\lambda x.(\lambda y.(x+y))\), not \((\lambda x.\lambda y.x)+y\) or \(\lambda x.(\lambda y.x)+y\) or anything like that.

  • We can abbreviate sequences of \(\lambda\)s. As mentioned before, we can write \(\lambda xy.x+y\) for \(\lambda x.\lambda y.x+y\).

Of course, we don’t have to omit parentheses whenever we can. These rules govern what it means when we don’t have all the parentheses strictly necessary, but we can include them if we want to be extra clear.

Reducing lambda expressions

So far, \(\lambda\) expressions don’t have any meaning; we’ve said what we’re allowed to write down but not how the expressions relate to each other. What we usually do with a \(\lambda\) expression is reduce it, and there are three rules that we can use: \(\beta\)-reduction, \(\alpha\)-conversion, and \(\eta\)-conversion.

\(\beta\)-reduction is the fancy name for “stick in the inputs to a function.” For example, to evaluate \((\lambda x.\lambda y.x+y)\ 6\ 3\), we first notice that \(6\) is the input to a \(\lambda\) expression that starts \(\lambda x\), and substitute \(6\) for \(x\) in the expression, giving us \((\lambda y.6+y)\ 3\). Now \(3\) is substituted for \(y\), giving us \(6+3=9\).

Formally, for a variable \(x\) and \(\lambda\) expressions \(M\) and \(N\), \(\beta\)-reduction converts \(((\lambda x.M)\ N)\) to \(M[N/x]\), i.e. \(M\) with every free instance of \(x\) replaced with \(N\). The expressions \(((\lambda x.M)\ N)\) and \(M[N/x]\) are equivalent.
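One step of \(\beta\)-reduction can be sketched in code. This uses an ad-hoc tuple encoding of terms (a bare string for a variable, ('lam', x, body), ('app', m, n)) and substitutes naively, so it is only safe when no free variable of \(N\) would be captured by a binder inside \(M\):

```python
def subst(expr, x, n):
    """Compute expr[n/x]: replace the free occurrences of x in expr with n.

    Naive sketch: no capture-avoiding renaming is performed.
    """
    if isinstance(expr, str):
        return n if expr == x else expr
    if expr[0] == 'lam':
        _, v, body = expr
        if v == x:                      # an inner lambda x shadows x: stop
            return expr
        return ('lam', v, subst(body, x, n))
    _, f, a = expr
    return ('app', subst(f, x, n), subst(a, x, n))

def beta_step(expr):
    """Reduce a top-level redex ((lambda x.M) N) to M[N/x]."""
    _, (_, x, m), n = expr
    return subst(m, x, n)

# ((lambda x.(lambda y.(x y))) z)  reduces to  (lambda y.(z y))
redex = ('app', ('lam', 'x', ('lam', 'y', ('app', 'x', 'y'))), 'z')
print(beta_step(redex))   # ('lam', 'y', ('app', 'z', 'y'))
```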


\(\alpha\)-conversion says we can rename variables; \(\lambda x.f\ x\) is the same as \(\lambda y.f\ y\). Formally, if \(M\) is a \(\lambda\) expression containing \(\lambda x\), and no instance of \(x\) bound by that \(\lambda x\) is within the scope of a \(\lambda y\), then we can replace every \(x\) in the scope of the \(\lambda x\) with a \(y\) to get an equivalent expression. We need the second criterion because otherwise we could replace the \(x\)s in \(\lambda x.\lambda y.x\) with \(y\)s to get \(\lambda y.\lambda y.y\), and then replace the outer \(y\) with \(x\) to get \(\lambda x.\lambda y.x\), which is not equivalent (the second substitution here is valid; the first one is the problem).

Finally, \(\eta\)-conversion says that if two expressions give the same result on any input, then they’re equivalent. We should have \(\lambda x.f\ x\) and \(f\) be the same, since \(\beta\)-reduction converts \((\lambda x.f\ x)\ x\) to \(f\ x\), so they’re the same on any input. Formally, if \(x\) isn’t free in \(M\), then \((\lambda x.(M\ x))\) is equivalent to \(M\).


Currying

Let’s look again at the function that adds two numbers, \(\lambda x.\lambda y.x+y\). There are two different ways we can think of this: first, as a function that takes in two numbers and returns their sum. If we’re talking about natural numbers, this is a function \(\mathbb N^2\to\mathbb N\).

We could also think of the function as taking one input at a time. It says “take in input \(x\), return the expression \(\lambda y.x+y\).” For example, on input \(6\) we have \((\lambda x.\lambda y.x+y)\ 6=\lambda y.6+y\). Now we have a function that takes in a number, and gives us a function that takes in a number and returns a number. We can write this as \(\mathbb N\to(\mathbb N\to\mathbb N)\).

This equivalence between functions on two (or more) inputs, and functions on one input that return a new function, is known as currying and is important in \(\lambda\) calculus, where the only objects are functions. We have to get used to the idea that a function can take other functions as arguments and give functions as return values.
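The two views convert into each other mechanically. A sketch in Python, where curry and uncurry are our own helper names, not library functions:

```python
def curry(f):
    """Turn the N^2 -> N view into the N -> (N -> N) view."""
    return lambda x: lambda y: f(x, y)

def uncurry(g):
    """Turn the N -> (N -> N) view back into the N^2 -> N view."""
    return lambda x, y: g(x)(y)

add2 = lambda x, y: x + y    # two inputs at once
add1 = curry(add2)           # one input at a time

print(add1(6)(3))            # 9
print(uncurry(add1)(6, 3))   # 9, round-tripped back
```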

Defining variables

\(\lambda\) calculus doesn’t have a built-in way to define a variable. Suppose we have the function \(d=\lambda f.\lambda x.f\ (f\ x)\) which applies its first input to its second input twice. Currying, \(d\) takes in a function and returns that function composed with itself. Say we want to know what \(d\ d\) is. We could just write \(d\) twice, to get \((\lambda f.\lambda x.f\ (f\ x))\ (\lambda f.\lambda x.f\ (f\ x))\). \(\beta\)-reduction eventually reduces this to \(\lambda f.\lambda x.f\ (f\ (f\ (f\ x)))\), the function that applies its first input to its second input four times (as you might have expected, applying a function twice, twice, is the same as applying it four times).

We can also get \(d\ d\) while only writing out \(d\) once. Take the \(\lambda\) expression \((\lambda x.x\ x)\ d\) or, writing out \(d\), \((\lambda x.x\ x)\ (\lambda f.\lambda x.f\ (f\ x))\). A single step of \(\beta\)-reduction turns this into the previous version, where we wrote \(d\) out twice. The expression we started with was shorter, and expressed better the idea that we wanted to use the same value of \(d\) twice. In general, if we want to say something like \(x = a; f\ x\) (i.e. define the variable \(x\), and then use it in some expression), we can write \((\lambda x.f\ x)\ a\), first building the expression we want to use the value of \(x\) in, and then substituting in the value using function application. This will be very useful in the next section when we make recursive functions.
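Both ideas, defining \(d\) and self-applying it via \((\lambda x.x\ x)\ d\), run directly in Python; the successor function succ here is just a helper we add to observe how many times the argument gets applied:

```python
d = lambda f: lambda x: f(f(x))   # apply f twice

# "x = d; x x" written as a single expression, as in the text:
dd = (lambda x: x(x))(d)

succ = lambda n: n + 1
print(d(succ)(0))                 # 2: twice
print(dd(succ)(0))                # 4: twice, twice
```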


Recursion

It might seem like \(\lambda\) expressions are very finite, and \(\beta\)-reduction will always finish after a reasonable amount of time. But for \(\lambda\) calculus to be as powerful as normal programming languages, there must be \(\lambda\) expressions that don’t reduce in a finite amount of time. The way to resolve this is to figure out how to build while loops in \(\lambda\) calculus; if you can make a while loop, you can write a program that runs forever pretty easily.

Since all we have is functions, we’ll use recursion to make loopy programs. For example, consider this factorial function in Python:

factorial = lambda n: 1 if n==0 else n*factorial(n-1)

How can we write this in lambda calculus? Let’s assume we have numbers, arithmetic, booleans, and a function \(Z\) that returns True on \(0\) and False on anything else. Let’s also assume we have an “if” function \(I\) that takes in three inputs: \(I = (\lambda p.\lambda x.\lambda y.\)if \(p\), then \(x\), else \(y)\). So \(I\ True\ x\ y=x\) and \(I\ False\ x\ y=y\).

Starting to translate from Python to \(\lambda\) calculus, we have \(F = \lambda n.I\ (Z\ n)\ 1\ (n\times(F\ (n-1)))\), which says “if \(n\) is \(0\), then \(1\), else \(n\times F(n-1)\).” But we can’t define \(F\) like this; we have to use the method described in the previous section. In order for each recursive call of the factorial function to know how to make further recursive calls, we need to pass \(F\) in to itself. Consider the expression

$$F=(\lambda x.x\ x)\ (\lambda r.\lambda n.I\ (Z\ n)\ 1\ (n\times(r\ r\ (n-1))))$$

Let \(g=\lambda r.\lambda n.I\ (Z\ n)\ 1\ (n\times(r\ r\ (n-1)))\). Then \(F=(\lambda x.x\ x)\ g=g\ g\) is \(g\) applied to itself. Plugging in \(g\) for \(r\), \(g\ g=\lambda n.I\ (Z\ n)\ 1\ (n\times(g\ g\ (n-1)))\). But since \(g\ g=F\), this is just \(\lambda n.I\ (Z\ n)\ 1\ (n\times(F\ (n-1)))\), exactly what we wanted \(F\) to be.
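This construction runs as-is in Python, using Python’s numbers and conditionals in place of the assumed arithmetic and \(I\)/\(Z\); g and F below mirror the terms in the text:

```python
# g takes itself (as r) and n; the recursive call goes through r(r).
g = lambda r: lambda n: 1 if n == 0 else n * r(r)(n - 1)

# F = (lambda x. x x) g  =  g g
F = (lambda x: x(x))(g)

print(F(5))   # 120
```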

Let’s generalize this. Say we want to define a recursive formula \(f=\lambda x.h\ f\ x\), where \(h\) is any expression (in the factorial example \(h=\lambda f.\lambda n.I\ (Z\ n)\ 1\ (n\times(f\ (n-1)))\)). We define

\begin{align} g&=\lambda r.h\ (r\ r)\newline &=\lambda r.\lambda x.h\ (r\ r)\ x \end{align}

and use this to build

\begin{align} f&=(\lambda x.x\ x)\ g\newline &=g\ g\newline &=(\lambda r.\lambda x.h\ (r\ r)\ x)\ g\newline &=\lambda x.h\ (g\ g)\ x\newline &=\lambda x.h\ f\ x. \end{align}

We can express this process of taking any \(h\) to the recursive function \(f\) as a \(\lambda\) expression:

\begin{align} Y&=\lambda h.(\lambda x.x\ x)\ (\lambda r.h\ (r\ r))\newline &=\lambda h.(\lambda r.h\ (r\ r))\ (\lambda r.h\ (r\ r)). \end{align}

For any \(h\), which takes as arguments \(f\) and \(x\) and returns some expression using them, \(Y\ h\) recursively calls \(h\). Specifically, letting \(f=Y\ h\), we have \(f=\lambda x.h\ f\ x\).
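We can try \(Y\) in Python, with one caveat: Python evaluates arguments eagerly, so we have to use the \(\eta\)-expanded body \(\lambda r.\lambda x.h\ (r\ r)\ x\) (the second form of \(g\) above); with the unexpanded \(\lambda r.h\ (r\ r)\), the self-application would loop forever before \(h\) ever ran. A sketch:

```python
# Y = lambda h. (lambda x. x x) (lambda r. lambda x. h (r r) x)
Y = lambda h: (lambda x: x(x))(lambda r: lambda x: h(r(r))(x))

# h takes "the recursive function" f and the argument n.
h = lambda f: lambda n: 1 if n == 0 else n * f(n - 1)

factorial = Y(h)
print(factorial(5))   # 120
```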

For fun, let’s make a \(\lambda\) expression that enters an infinite loop by reducing to itself. If we make \(h\) the function that applies \(f\) to \(x\), recursively calling it should accomplish this, since it’s analogous to the Python function f = lambda x: f(x), which just loops forever. So let \(h=\lambda f.\lambda x.f\ x=\lambda f.f\). Now \begin{align} Y\ h&=(\lambda x.x\ x)\ (\lambda r.(\lambda f.f)\ (r\ r))\newline &=(\lambda x.x\ x)\ (\lambda r.r\ r)\newline &=(\lambda r.r\ r)\ (\lambda r.r\ r). \end{align}

Continued \(\beta\)-reduction doesn’t get us anywhere. This expression is also interesting because it’s a simple quine, a program that outputs its own code, and this is a sort of template you can use to make quines in other languages.

This strategy of building recursive functions results in some very long intermediate computations. If \(F\) is the factorial function, we compute \(F(100)\) as \(100\times F(99)=100\times 99\times F(98)\), etc. By the time the recursive calls finally stop, we have the very long expression \(100\times99\times\dots\times2\times1\), which takes a lot of space to write down or store in a computer. Many programming languages will give you an error if you try to do this, complaining that you exceeded the max recursion depth.

The fix is something called tail recursion, which roughly says “return only a function evaluation, so you don’t need to remember to do something with it when it finishes.” The general way to accomplish this is to pass along the intermediate computation to the recursive calls, instead of applying it to the result of the recursive calls. We can make a tail recursive version of our Python factorial like so (Python will still give you a “maximum recursion depth exceeded” error with this, but the strategy works in lambda calculus and in languages that optimize tail calls, like Scheme and Haskell; Python is just easier to demonstrate with):

f = lambda n, k: k if n==0 else f(n-1, k*n)

We pass along \(k\) as an intermediate value. It has to start with \(k=1\); we can do this by defining a new function g = lambda n: f(n,1), or by using Python’s default value feature to have \(k\) be \(1\) by default. Now if we want \(100\) factorial, we get \(f(100,1)=f(99,1*100)=f(99,100)=f(98,9900)=\dots\). We compute the products as we go instead of all at the end, and only ever have one call of \(f\) active at a time.
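The default-value version looks like this; it folds the wrapper g and the starting value \(k=1\) into one definition (a Python convenience, not something \(\lambda\) calculus itself has):

```python
# k accumulates the running product; it defaults to 1 so callers can omit it.
f = lambda n, k=1: k if n == 0 else f(n - 1, k * n)

print(f(5))              # 120
print(f(5) == f(5, 1))   # True: the default just fills in k = 1
```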

Writing a tail recursive version of factorial in \(\lambda\) calculus is left as an exercise for the reader.

Y combinator

You might be wondering why we called the recursive-function-maker \(Y\), so let’s look at that again. We said earlier that \(Y\ h=\lambda x.h\ (Y\ h)\ x\). By \(\eta\)-conversion, \(Y\ h=h\ (Y\ h)\). That means \(Y\ h\) is a fixed point of \(h\)! For any function \(h\), \(Y\) finds a value \(f\) such that \(h\ f=f\). This kind of object is called a fixed point combinator; this particular one is called the Y combinator, conventionally denoted \(Y\).

Why is a fixed point finder the right way to make recursive functions? Consider again the example of factorial. If a function \(f\) says “do factorial for \(k\) recursive steps,” then \(h\ f\) says “do factorial for \(k+1\) recursive steps;” \(h\) was designed to be the function which, called recursively, gives factorial. Naming the identity function \(i=\lambda n.n\), \(i\) does factorial for \(0\) steps, so \(h\ i\) does it for \(1\) step, \(h\ (h\ i)\) for \(2\), and so on. To completely compute factorial, we need a function \(F\) that says “do factorial for infinitely many recursive steps.” But doing one more step than infinitely many is the same as just doing infinitely many, so it must be true that \(h\ F=F\); i.e., \(F\) is a fixed point of \(h\). There may be other fixed points of \(h\), but the Y combinator finds the one that corresponds to calling \(h\) recursively.
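We can watch the fixed-point equation \(h\ F=F\) hold numerically. This sketch uses an \(\eta\)-expanded \(Y\) (matching \(g=\lambda r.\lambda x.h\ (r\ r)\ x\)) so that Python’s eager evaluation doesn’t loop, and it checks function equality only pointwise, on a few sample inputs:

```python
Y = lambda h: (lambda x: x(x))(lambda r: lambda x: h(r(r))(x))
h = lambda f: lambda n: 1 if n == 0 else n * f(n - 1)

F = Y(h)          # the fixed point: factorial
one_more = h(F)   # "do one more recursive step than F"

# Doing one more step than "all of them" changes nothing:
print([F(n) for n in range(6)])          # [1, 1, 2, 6, 24, 120]
print([one_more(n) for n in range(6)])   # the same list
```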

With \(Y\) and without \(Y\)

What do we get, and what do we lose, if we allow the use of \(Y\)?

\(Y\) takes a lambda-term which may correspond to a primitive recursive function, and outputs a lambda-term which need not correspond to a primitive recursive function. That is because it allows us to perform an unbounded search: it expresses fully-general looping strategies. (As discussed earlier, without a fixed-point finder of some kind, we can only create lambda-terms corresponding to programs which provably halt.)

On the other side of the coin, \(Y\) may produce non-well-typed terms, while terms without a fixed-point finder are guaranteed to be well-typed. That means it’s hard or even impossible to perform type-checking (one of the most basic forms of static analysis) on terms which invoke \(Y\).

Strongly relatedly, in a rather specific sense through the Curry-Howard correspondence, the use of \(Y\) corresponds to accepting non-constructiveness in one’s proofs. Without a fixed-point finder, we can always build a constructive proof corresponding to a given lambda-term; but with \(Y\) involved, we can no longer keep things constructive.

