Mike Shulman just wrote a very nice blog post on what is a formal proof. I much agree with what he says, but I would like to offer my own perspective. I started writing it as a comment to Mike’s post and then realized that it is too long, and that I would like to have it recorded independently as well. Please read Mike’s blog post first.

Just as Mike, I am discussing here formal proofs from the point of view of proof assistants, i.e., what criteria need to be satisfied by the things we call “formal proofs” for them to serve their intended purpose, which is: to convince machines (and indirectly humans) of mathematical truths. Just as Mike, I shall call a (formal) proof a *complete* derivation tree in a formal system, such as type theory or first-order logic.

What Mikes calls an *argument* I would prefer to call a *proof representation*. This can be any kind of concrete representation of the actual formal proof. The representation may be very indirect and might require a lot of effort to reconstruct the original proof. Unless we deal with an extremely simple formal system, there is always the possibility to have *invalid representations*, i.e., data of the correct datatype which however does not represent a proof.

I am guaranteed to reinvent the wheel here, at least partially, since many people before me thought of the problem, but here I go anyway. Here are (some) criteria that formal proofs should satisfy:

**Reproducibility:**it should be possible to replicate and communicate proofs. If I have a proof it ought to be possible for me to send you a copy of the proof.**Objectivity:**all copies of the same proof should represent the same piece of information, and there should be no question what is being represented.**Verifiability:**it should be possible to recognize the fact that something is a proof.

There is another plausible requirement:

**Falsifiability:**it should be possible to recognize the fact that something is*not*a proof.

Unlike the other three requirements, I find falsifiability questionable. I have received too many messages from amateur mathematicians who could not be convinced that their proofs were wrong. Also, mathematics is a cooperative activity in which mistakes (both honest and dishonest) are easily dealt with – once we expand the resources allocated to verifying a proof we simply give up. An adversarial situation, such as proof carrying code, is a different story with a different set of requirements.

The requirements impose conditions on how formal proofs in a proof assistant might be designed. Reproducibility dictates that proofs should be easily accessible and communicable. That is, they should be pieces of digital information that are commonly handled by computers. They should not be prohibitively large, of if they are, they need to be suitably compressed, lest storage and communication become unfeasible. Objectivity is almost automatic in the era of crisp digital data. We will worry about Planck-scale proof objects later. Verifiability can be ensured by developing and implementing algorithms that recognize correct representations of proofs.

This post grew out of a comment that I wanted to make about a particular statement in Mike’s post. He says:

“… for a proof assistant to honestly call itself an

implementationof that formal system, it ought to include, somewhere in its internals, some data structure that represents those proofs reasonably faithfully.”

This requirement is too stringent. I think Mike is shooting for some combination of reproducibility and verifiability, but explicit storage of proofs in raw form is only one way to achieve them. What we need instead is *efficient communication* and *verification *of (communicated) proofs. These can both be achieved without storage of proofs in explicit form.

Proofs may be stored and communicated in implicit form, and proof assistants such as Coq and Agda do this. Do not be fooled into thinking that Coq gives you the “proof terms”, or that Agda aficionados type down actual complete proofs. Those are not the derivation trees, because they are missing large subtrees of equality reasoning. Complete proofs are too big to be communicated or stored in memory (or some day they will be), and little or nothing is gained by storing them or re-verifying their complete forms. Instead, it is better to devise compact representations of proofs which get *elaborated* or *evaluated* into actual proofs on the fly. Mike comments on this and explains that Coq and Agda both involve a large amount of elaboration, but let me point out that even the elaborated stuff is still only a shadow of the actual derivation tree. The data that gets stored in the Coq .vo file is really a bunch of instructions for the proof checker to easily reconstruct the proof using a specific algorithm. The *actual* derivation tree is implicit in the execution trace of the proof checker, stored in the space-time continuum and inaccessible with pre-Star Trek technology. It does not matter that we cannot get to it, because the whole process is replicable. If we feel like going through the derivation tree again, we can just run the proof assistant again.

I am aware of the fact that people strongly advocate some points which I am arguing against, two of which might be:

- Proofs assistants must provide proofs that can be independently checked.
- Proof checking must be
*decidable*, not just*semi-decidable.*

As far as I can tell, nobody actually subscribes to these in practice. (Now that the angry Haskell mob has subsided, I feel like I can take a hit from an angry proof assistant mob, which the following three paragraphs are intended to attract. What I *really* want the angry mob to think about deeply is how their professed beliefs match up with their practice.)

First, nobody downloads compiled .vo files that contain the proof certificates, we all download other people’s original .v files and compile them ourselves. So the .vo files and proof certificates are a double illusion: they do not contain actual proofs but half-digested stuff that may still require a lot of work to verify, and nobody uses them to communicate or verify proofs anyhow. They are just an optimization technique for faster loading of libraries. The *real* representations of proofs are in the .v files, and those can only be *semi-*checked for correctness.

Second, in practice it is irrelevant whether checking a proof is decidable because the elaboration phase and the various proof search techniques are possibly non-terminating anyhow. If there are a couple of possibly non-terminating layers on top of the trusted kernel, we might as well let the kernel be possibly non-terminating, too, and instead squeeze some extra expressivity and efficiency from it.

Third, and still staying with decidability of proof checking, what actually *is* annoying are uncontrollable or unidentifiable sources of inefficiency. Have you ever danced a little dance around Coq or Agda to cajole its *terminating* normalization procedure into finishing before getting run over by Andromeda? Bow to the gods of decidable proof checking.

It is far more important that *cooperating* parties be able to communicate and verify proofs efficiently, than it is to be able to tell whether an *adversary* is wasting our time. Therefore, proofs should be, and in practice are communicated in the most flexible manner possible, as programs. LCF-style proof assistants embrace this idea, while others move slowly towards it by giving the user ever greater control over the internal mechanisms of the proof assistant (for instance, witness Coq’s recent developments such as partial user control over the universe mechanism, or Agda’s rewriting hackery). In an adversarial situations, such as proof carrying code, the design requirements for formalized proofs are completely different from the situation we are considering.

We do not expect humans to memorize every proof of every mathematical statement they ever use, nor do we imagine that knowledge of a mathematical fact is the same thing as the proof of it. Humans actually memorize “proof ideas” which allow them to replicate the proofs whenever they need to. Proof assistants operate in much the same way, for good reasons.

Thanks for the brilliant and useful summary! Perhaps the chokehold of decidable checking is loosening at last.

How does the adversarial situation of PCC change anything? What does decidability of proof checking protect the PCC consumer from?

In an adverserial situation you have to worry about denial-of-service attacks, so it is likely you’ll restrict the form of the proofs that you accept, i.e., they wouldn’t be arbitrary programs but rather something more restrictive, for instance you could require an explicitly given low time complexity. Consequently, the other party will have to do more work to find suitable proofs (and that’s not something you want to do to a

partner).Technically, denial-of-service attacks are very easy to protect against in this situation. Just give the PCC consumer a “cancel” button to interrupt the checker and reject the PCC.

Of course, the thoughtful PCC provider will provide proofs that check quickly. It’s really not much different from loading a web page. They should load fast. If they don’t, it’s the provider’s problem because the consumers will cancel. If it weren’t for security vulnerabilities (which a PCC client shouldn’t have) there would be no danger in attempting to load an arbitrary page.

So I figure proof checking should be properly sandboxed, but checking performance is not a security concern.

Thanks for this thoughtful write up. The preference for decidable over semi-decidable algorithms in the absence of any feasible bound on complexity always confused me a little.

I also enjoyed Mike’s post very much, though his conclusion in favor of decidable type checking seems like a philosophical technicality that hinges on the exact status of the various transformation stages.

“Now that the angry Haskell mob has subsided..”

That was fun. Although, the angriest person there was decidedly not in the Haskell camp.

> I also enjoyed Mike’s post very much, though his conclusion in favor of decidable type checking seems like a philosophical technicality that hinges on the exact status of the various transformation stages.

After a discussion, Mike seems to have changed his position:

https://golem.ph.utexas.edu/category/2016/08/what_is_a_formal_proof.html#c050923

Yes, but you should really link to the start of that comment thread:

https://golem.ph.utexas.edu/category/2016/08/what_is_a_formal_proof.html#c050919

When I have a moment, I’m going to add a note+link to the bottom of the main post for the benefit of future readers.

I am missing a distinction between

Proof assistants are great tools in

finding(or helping to find) formal proofs, and where would we be without them? The task that they do is by nature not computable/decidable. To “save” the proof, it makes sense to recordenough of the hints that were given to the system so that the program can recreate the proof next time (reproducibility). But these saved hints and intermediate steps ARE NOT the formal proof in the strict sense, they are only an implicit representation of it. The formal proof consists of the sequence of elementary steps in the underlying formal system. (Apparently it is called a derivation tree in Coq.) This is the formal proof. *In principle*, it can be printed out, and correctness can be checked trivially “by pencil and paper”. Most formal systems have simple deduction rules that fit on one page, so there is no issue of decidability of this checking task, a checking that is of course rather delegated to a simple computer program (an “independent checker/verifier”).

It is true that, in practice, no-one would want to look at formal proofs, but that is another matter. And why shouldn’t a computer examine a formal proof for bugs? Andrey you write that “Complete proofs are too big to be communicated”. So on the one hand you share the notion of a “complete proof”. On the other hand, I am questioning whether the proofs are indeed so big, and computer memory and storage is indeed so limited these days.

I have only (very limited) experience with Isabelle and HOL-type systems. Still I would strongly advocate the principle that formal proofs should not just be decidable, but trivially decidable in linear time. I am not comfortable with having to trust some single software for the correctness of my proofs. This is especially acute since the very point of having formally verified proofs is to avoid the errors that I might otherwise make. (Mike Shulman in his comment https://golem.ph.utexas.edu/category/2016/08/what_is_a_formal_proof.html#c050826 does mention bugs but he downplays their role.)

[As a side remark, let me recount an incident, which is only weakly related: One of my students proved a nonlinear inequality by typing it into Mathematica, and it returned “True”. I forgot which Mathematica function he used. I took me some effort to convince him that this was not an appropriate “proof” for inclusion into his thesis.]

Every formal proof system should (in the long run) be able to output a simple

“proof object” that can be independently verified.

(I was actually quite disappointed when I first found out that few systems can do it.) Maybe it works only for small proofs. “Modularizing” large proofs to make them manageable would thus be an interesting research project (much as we structure our hand-written proofs into lemmas). Andrey writes about formal proofs that “little or nothing is gained by storing them or re-verifying their complete forms.” Yes, we gain (A) confidence: Let the (complicated) system that produced the proof be buggy; as long as the verifier accepts the proof, this particular proof is fine. (I am preaching the benefits of “Certified Computing”.)

(B) As a side benefit, we might gain interoperability between different proof systems.

The OpenTheory project is a step in this direction. Among other things, it includes a file format for HOL-type proofs, and anybody can write an independent checking program for its proofs in a few days.

A small point first: it’s Andrej (Slovenians are obsessed with J’s: all my parents, grand-parents and children’s names start with J.)

The LCF-style evaluator (which only implicitly construct the proofs) are

simplerthan kernels of proof assistants that rely on checking proof objects. For instance, computing the result of a $\beta$-reduction is precisely as complicated as verifying that someone else gave you the correct $\beta$-reduction. There is no complicated system behind an LCF-style proof assistant. The complicated systems exist it proof assistants like Coq and Lean, and for those it makes sense to have an independent proof checker (which is then still more complicated than the core of HOL or Andromeda). So I do not actually believe that there is anything to be gained by producing complete derivation trees. It does not increase confidence in anythign. Whatdoesincrease confidence is an independent implementation of the nucleus.For what it’s worth, let me emphasize my own (current) perspective that the purpose of producing derivation trees is not to increase confidence, but to

havederivation trees, which you then might want to do something else with. I’m more or less content with Andrej’s proposal to beta-reduce away the derivation trees in the middle and instead do the “something else” directly as we are checking the proof that would have produced the derivation tree, but the point is that there’s more going on than “increasing confidence”.This

issomething I appreciate. I don’t really relate to all the philosophical talk about confidence that was actually detached from reality. It poses a very interesting design problem.You don’t appreciate confidence? Then what’s the point of developing formally verified mathematics and formally verified computer programs? I thought it is about how we know that a theorem (or the result of a program) is correct. Or is it just an exercise in software development? (And is the flyspeck project detached from reality?)

The process of formally verified proof, of course, is about increasing confidence relative to pencil-and-paper proof (among other things). Andrej is saying (if I understand him correctly) that producing and checking proof

objectsdoes not increase confidence appreciablymorethan checking proofstepsdoes.Günter said:

> Andrey writes about formal proofs that “little or nothing is gained by storing them or re-verifying their complete forms.” Yes, we gain (A) confidence: Let the (complicated) system that produced the proof be buggy; as long as the verifier accepts the proof, this particular proof is fine.

In LCF style, the complicated system producing a proof is some jumble of user tactics and proof script, not the kernel (“nucleus”). If the (simple) kernel is correct, it will reject an invalid proof as it’s being produced. There’s no need to build a derivation tree to look at the elementary steps of a proof.

It’s possible that the kernel is buggy. But it’s just as likely that an independent derivation validator would be buggy. They have the same checks to perform.

Andrej said:

> So I do not actually believe that there is anything to be gained by producing complete derivation trees. It does not increase confidence in anythign. What

doesincrease confidence is an independent implementation of the nucleus.I think an independent checker of derivation trees would increase confidence a little. But an independent implementation of the trusted kernel

isa better way to boost confidence.In both cases, we have two programs giving a proof the OK, instead of one. In both cases, we reuse the proof script and the tactic library that it’s based on. Avoiding actual derivation trees avoids possible bugs in serializing and deserializing derivations. (And, of course, makes things significantly faster.)

I wrote my reply of 19:16 before seeing this. The flyspeck experience I mention seems to contradict what you say about being faster. And it makes sense: The more powerful and sophisticated proof assistants and tactics libraries become, the more time it takes them to (re-)construct the proof, compared to simply replaying basic commands from a file.

Another practical issue: For an independent implementation of the kernel, I would be bound to using, say, OCAML. And compile the thing together with the whole proof assistant and libraries.

There are two major types of work a proof script might do. One is replaying and checking proof steps. The other is

searchingfor the winning proof steps in the first place. Because tactic languages are good for implementing both of these, it’s tempting to just dump them together. This can lead to proof scripts which take longer to check than derivations. But it’s not because LCF-style proof checking is slower, it’s because you’re comparing proof checking to proof search! Of course proof search is slower.I believe a tactic system should support a “phase distinction” between “search-time” tactics and “check-time” tactics. Search time tactics will be implemented such that running them records equivalent check-time scripts, but that have additional witnesses and/or annotations to avoid redoing the search.

It’s these check-time scripts, which are free of calls to expensive, searchy tactics, that check more quickly than derivations. Why? Think of video playback. Playing uncompressed video should be the fastest, right? After all, decoding is trivial. Nope. Playing uncompressed video is infeasible. It hits an I/O bottleneck

hard. The framesmustbe computed on the fly; there’s no practical alternative.(Analogy warning: Unlike video compression, proof scripts do not lose any information you care about.)

With proofs, the I/O disparity is less big, but in exchange, the decoding of “compressed derivations” is easy. A check-time script checker and a derivation checker would be doing mostly the same work, aside from I/O, where the derivation checker has to work harder.

Okay, now I’m confused again. Matt, presumably the process that you’re calling “proof checking” is not decidable, otherwise you wouldn’t be making a big deal out of how we don’t need decidability. But I don’t see a qualitative difference between “non-decidable checking” and “searching”. Is your “proof checking” really just a kind of “proof search” whose input contains hints to make it faster?

FYI. I wrote a reply at 19:16 (should appear below) which is apparently still awaiting moderation.

> But I don’t see a qualitative difference between “non-decidable checking” and “searching”. Is your “proof checking” really just a kind of “proof search” whose input contains hints to make it faster?

Yeah. The distinction I’m drawing is a practical proof engineering matter.

Replying to Matt (and trying to make sense).

Between “recording everything” and replaying it dumbly (derivation checking) on the one hand, and saving only the user inputs and searching for the proof from scratch on the other hand, there is apparently a spectrum of choices, offering a trade-off between storage space and processing time. What you suggest, (“check-time” tactics), with more hints to avoid expensive proof searching, is an intermediate choice. Apparently the Coq .vo file with a series of “instructions” is another intermediate choice. Finding the right balance in terms of running time and storage is an engineering decision.

Coming back to the original question, “What is a formal proof?”: Such practical consideration are irrelevant for this question. The proof IS the full derivation, in my view. Coming up with a smart succinct representation from which the proof can be reconstructed does not change the concept of a proof.

A “proof script” is not a formal proof. It is perhaps analogous to a proof sketch in a classical hand-written proof, as opposed to the complete proof with all details, which is given in the full version of the paper.

Compressed video has completely different side constraints. A lot of engineering effort (and running time) goes into video compression in order to make decompression easy. It is not (yet) the case that large stored proofs would routinely be replayed by thousands of eager proof aficionados. (This is not quite true: proof libraries ARE loaded by many users.) I guess the gzip compression that is used in the flyspeck project is probably quite appropriate at this stage of development, as opposed to some cleverer problem-specific shortcuts in the file format.

“Computing the result of a ?-reduction is precisely as complicated as verifying that someone else gave you the correct ?-reduction.” Indeed, namely both are straightforward, at least if you have a computer. This is still on the diametrically opposite side of the spectrum from the undecidable or semidecidable concepts of formal proof that have been discussed. I am fine with a formal proof format that would contain steps of the form “Now perform ?-reduction for the term starting at position 12 in expression 34 and call the result expression 567”. Essentially a trace of the commands that are passed to the nucleus/kernel, a record of the elementary steps that lead from the assumptions to the theorem. Why shouldn’t Coq or Isabelle output such a proof, and call it THE FORMAL PROOF, for everybody to see and scrutinize? Of course, programming capacity, other features more urgent, etc., but in the long run?

I see that the flyspeck project of the Kepler conjecture has apparently produced such run-down version of the proof that can be fed to the core, see http://www.proof-technologies.com/flyspeck/

It looks something like 100 MBytes compressed, but it is reported that it runs a lot faster than the original proof scripts. These people were really concerned about confidence! But nothing like “too big to be communicated”.

On that website I just learned that what I am advocating has a name: “De Bruijn criterion”.

This is a side issue, but here are some empirical data, from Mark Adams, Proof Auditing Formalized Mathematics, Journal of Formalized Reasoning 9, No.1 (2016), 3-32, Section 5.1.2 on p.24, about the “main text” formalization of the flyspeck project:

Running the scripts in HOL-Light with the export facility turned on took 7 hours (about twice the time as without export facility). On disk, it occupied 159 MByte compressed .tgz files.

Reimporting the files took just over 30 minutes in HOL-Light.

Reimporting the files took around 6 gours in HOL-Zero, a “stripped-down” HOL version that emphasizes correctness.

And actually it makes sense to me. The OpenHOL format is really very much like assembly language, human-readable in principle but with numeric addressing (similarly for OpenTheory). So there isn’t much “parsing” done when reading the files.

Thanks! Those are very useful figures. Do we also know if it is customary for people to download the compile files, or is the development cycle set up so that it encourages downloading of original source, which is then compiled on user’s machine?

> Coming back to the original question, “What is a formal proof?”: Such practical consideration are irrelevant for this question. The proof IS the full derivation, in my view.

Full derivations, as defined for formal metatheory, certainly have their place. But the topic of Mike and Andrej’s blog posts is the requirements for proof representations used in actual, implemented proof tools.

> there is apparently a spectrum of choices, offering a trade-off between storage space and processing time.

> I guess the gzip compression that is used in the flyspeck project is probably quite appropriate at this stage of development, as opposed to some cleverer problem-specific shortcuts in the file format.

I was wrong. It’s not actually raw I/O that I expect to slow down a derivation checker. It’s lexing, parsing, and well-formedness checking of the derivation format. I was sort of equating this with I/O overhead, since you have it when the data comes from outside the trusted kernel, but it’s not really I/O.

The point of check-time scripts is not to save disk space, and it’s not really to reduce raw I/O throughput either. It’s to reduce the amount of input that needs to be checked. Gzip and other general-purpose compression techniques don’t help with this because the uncompressed form still needs to get checked.

So don’t take the compression analogy too literally. Compression is actually pretty orthogonal to proof performance.

I still think check-time scripts would be faster, but it doesn’t really matter all that much. People will do something fast enough to stay sane, and we should be worrying more about usability improvements than performance improvements.

“But the topic of Mike and Andrej’s blog posts is the requirements for proof representations used in actual, implemented proof tools.”

My impression is that, currently, proof representations used in actual, implemented proof tools (and also published formal proof representations) are predominantly representations from which some particular piece of software was able to reconstruct a full derivation at some point (in finite but otherwise unspecified time). And Andrej, when arguing that this (re-)construction work should not be required to be decidable, seems to be fine with this sorry state of affairs, as I would call it.

I don’t want to insult the people who are doing great work in proof assistants, as an outsider, but the discussion is about requirements and goals for the future.

Again I am stressing the distinction between the proof script by which a proof was first FOUND, and the “representation” by which it is STORED and COMMUNICATED once it is found, in order to be REPLICATED. (Which of the two representations is the blog post about?) For the latter representation, I would impose not just decidability but rather stringent runtime bounds.

I have the feeling that by accepting undecidable argument steps, formal proofs would fall behind even classical pencil-and-paper proofs. Here is the analogy: If you think about proving a theorem, you have no clue about how and whether you will succeed, but once you have succeeded you WRITE DOWN THE PROOF in such a way the other people can follow it without having to face the same undecidable problem that you started with.

> I don’t want to insult the people who are doing great work in proof assistants, as an outsider, but the discussion is about requirements and goals for the future.

OK

> Andrej, when arguing that this (re-)construction work should not be required to be decidable, seems to be fine with this sorry state of affairs, as I would call it.

But what is so sorry about the state of affairs? Why should the use of proof representations with asymptotically-efficient worst-case checking be a requirement (or even a desirable feature) for future proof systems?

> Again I am stressing the distinction between the proof script by which a proof was first FOUND, and the “representation” by which it is STORED and COMMUNICATED once it is found, in order to be REPLICATED.

Good. This is an important distinction. I proposed search-time scripts for finding proofs, and check-time scripts for communicating them to “end-users” (users who will not modify the mathematical development). Realistically, proof writers will store both, since the check-time scripts are not as robust to changes in the development. This is analogous to source code vs object code. In Coq, .v and .vo files fill these respective roles. (.vo files are

comicallynon-robust to changes.) .vo file checking happens to be decidable, I think. But this is not important, I argue. As I understand it, our disagreement is only about requirements for check-time proof representation.> (Which of the two representations is the blog post about?)

I don’t know, but I’d guess the check-time one. It’s kind of taken for granted that search-time will allow tactics.

> For the latter representation, I would impose not just decidability but rather stringent runtime bounds.

Again, why?

> once you have succeeded you WRITE DOWN THE PROOF in such a way the other people can follow it without having to face the same undecidable problem that you started with.

Your intuition, that checking a proof someone else found should be easier than finding it all over again, is exactly right. Your mistake, it seems to me, is to assume that decidability, or for that matter, any of recursion theory or complexity theory, has anything to say about this matter.

These theories are about rigorously analyzing the “hardness” of

particular, well-specified problems. An input from a certain set ofspecificdata representations is presented to some algorithm, which must handle it correctly. Complexity theory talks about how efficiently algorithms can do this.But here’s the thing: What we are arguing about is

what the input format should be in the first place! Recursion and complexity theory provide no guidance about this. They get to work once a “sensible” input representation is chosen. So while it’s formally meaningful to talk about the complexity-theoretic consequences of an input representation, it doesn’t do what you want. Quite the opposite, in fact. To make a problem “look easy”, you can choose a very bad input representation, so that a bad algorithm is not so bad in comparison. Look at this comment of mine for an example, where I explain a completely-useless, decidable proof representation for a recursively-enumerable formal system. This basic trick of tagging with a derivation size works for any RE system. An even more misguided trick gets you a completely-useless representation with worst-case linear time checking. (Just put in a long string whose length is proportional to the number of steps that brute force search takes.) Without common sense, making proof checking “more efficient” means making the input format less efficient.To be clear, this does

notargue against any specific asymptotically-efficiently-checkable proof representation thatyouhave in mind. But it does point out that the practical efficiency of a proof representation is not something that complexity theory tells you anything about. You may have a wonderful proof representation in mind. I guarantee you its wonderfulness does not come from the asymptotic complexity of the proof checker.Here is another way to think about this point: Consider the “problem” of obtaining the result of an arbitrary program. Seems easy right? Just run the program. But according to recursion theory, this problem may appear unsolvable if you look at it the wrong way. The halting problem is undecidable, so if you were expecting to be told when a program doesn’t have a result, you’re out of luck. This does not in any way prevent you from getting the result of a program that has one. The mistake was to consider a program as a problem to solve, rather than a solution to some other problem.

What about total languages? Here, all programs have a result. What is the asymptotic complexity of getting the result of an arbitrary Coq program? Some insane function I don’t know how to write down, simply because some Coq programs are absurdly slow. This does not in any way prevent you from getting the result of a Coq program that is not absurdly slow.

So now let’s look again at proof checking. A proof, like a program, is a solution, not a problem. Should it be hard to check an arbitrary proof? Why not?! You don’t want an “efficient language”, you want a language with efficient programs. Proof checking is program running. (That is the foundational insight of LCF-style systems.) Assuming the proof writer actually wants to convince people, they will write reasonably fast proofs. It doesn’t matter if the language has pathological, absurdly slow proofs, or even non-terminating proofs that cannot generally be proven wrong (but are never accepted as valid), because no one will use them. They are not convincing. They are not good solutions. Considering this reality, language-wide restrictions can only get in the way.

So efficient checking of arbitrary proofs is not a requirement. The requirement is that, to the greatest extent possible, useful proofs can be represented so as to be checked efficiently. Worst-case efficiency of the proof checker is as irrelevant as worst-case efficiency of a program runner. Good proofs, like good programs, should be efficient. So proof languages should be like programming languages, and

enableefficiency, notguaranteeit. LCF-style systems reach the logical conclusion and use a programming language for the proof language.Matt, I agree very much with your arguments that decidability per se is useless in this context. Thanks for supporting my point so eloquently.

But I have the impression that by noting that decidability is not enough, people on this discussion jump to the conclusion that any precisely defined efficiency criterion must be avoided.

When I said that stringent run-time bounds should be imposed, I meant actually that reconstructing a proof from the script should not be an algorithmic issue at all (besides “trivial” book-keeping tasks perhaps).

Of course, I did not mean that “stringent run-time bounds” should be the only criterion and common sense should be abandoned. Artificially padding the proof format would not work because it would just make the proofs longer with no gain in overall running time.

The run-time bounds are a matter of available computational power and therefore it makes no sense prescribe them. We’re talking about

cooperationbetween parties, so again it makes no sense to prescribe run-time bounds. What does make sense is that we must give the communicating parties the freedom to express proof scripts in the most efficient way they can. If we start limiting the language they are supposed to use in the name of efficiency, run-time guarnatees, philosophy, or whatever, we will limit the expressive power. There is no reason to do that. I think we agree on this point?Your requirement that “reconstructing a proof from the script should not be an algorithmic issue at all (besides “trivial” book-keeping tasks perhaps)” is precisely what I am advocating. In combination with the previous paragraph, the result it: a proof checker should simply be an interpreter or a compiler for a specialized programming language. Executing a programming language

is“just book-keeping” and it gives complete control over algorithmic issues to the author of the source code (assuming it is a sanely implemented language).> But I have the impression that by noting that decidability is not enough, people on this discussion jump to the conclusion that any precisely defined efficiency criterion must be avoided.

No, I just pointed out that:

1) Efficiency criteria from complexity theory don’t work.

2) By the nature of the situation, there is no need for efficiency criteria.

You can have criteria, I just think it’s for nothing.

> When I said that stringent run-time bounds should be imposed, I meant actually that reconstructing a proof from the script should not be an algorithmic issue at all (besides “trivial” book-keeping tasks perhaps).

So it seems that you don’t have any precisely defined efficiency criterion. You have to just know a good proof object format when you see it. And apparently proof scripts ain’t one of ’em.

I think I might be able to predict your judgment though. I’ve replied to your question about good systems for independent checking. The reply currently awaits moderation.

This is an important and subtle point that I was stuck on for a while too. I agree entirely that in order for a “proof” to count as a “proof”, we have to be able to check that it is a proof without needing any of the intuitive leaps that might have gone into

findingthe proof. Otherwise, there is no qualitative difference between Pierre de Fermat and Andrew Wiles.But this doesn’t require that there is a particular notion of “putative proof” that can be decidably checked. What it requires is that

particularproofs can bealgorithmically(and, one hopes, feasibly) checked. Having an algorithm that I can feed Wiles’ proof of FLT to (or any particular proof that someone has written) and have it tell me “yes this is a proof” is quite a different thing from asking that the algorithm can be fedanystring of gibberish and have it answer either “yes this is a proof” or “no this is not a proof”. In other words, the requirement that “proofs be proofs” is only about thecorrectproofs, whereas giving up decidability is only about what happens with theincorrectproofs.People have devised formats for “stored” proofs (compiled .vo files and compiled Agda files etc.) but nobody ever uses them to communicate proofs. They are used as an optimization technique for faster loading. Everybody communicates

the original scriptswhich also search for proofs. Everyone, except in the case that there was an extraordinary amount of search involved, in which case people take additional care to store the results of the search. Flyspeck has a huge proof, so there they also optimized somewhat. But in the vast majority of cases people just communicate the original scripts, with proof search an all.I am advocating that

we look at what is actually going on. I think it is too easy to be blided by philosophy and logic. It is too easy to assume that the proof scripts which search for a proof are inefficient. They are not for a simple reason: their authors could not work with inefficient proof scripts because they had to rerun them many times, and modify them slightly, etc., before they took a final form. So by design they must be modular, efficient, and manageable. In contrast, the compiled versions of the same scripts are unmodifiable, impossible to inspect by humans, and yes, they are more efficient, but that does not matter so much.So, as far as I am concerned, the original proof scripts written by humans are as good a certificate as the “stored” communication. At least practice shows that is the norm.

About Flyspeck:

Efficiency was only one of the reasons why the flyspeck team exported the proof to an external format, and anyway, a running time of 3 hours is not such a big issue; as far as I understand, this export was only done at the end as part of an “audit”. The other reasons were (b) independent verification and (c) interoperability, making the intermediate results available to other users, see http://arxiv.org/abs/1501.02155, the end of section 4.

Answering another question: I don’t know if if it is customary for people to download exported files in HOL0 and OpenTheory. Last summer I went to a session on proof exchange formats when CADE was in Berlin, and there is a community. OpenTheory has a mailing list, and it generates traffic from time to time.

I think that independent verifiability is important, but the more pressing problem is interoperability. Do we have to formalize the whole body of basic mathematics five or ten times? Now that we have proved the Kepler conjecture in HOL-Light, why isn’t it true in Coq? (Such would be question of a naive plain mathematician.) But once it is exported, the Coq people could grab it and see what they can do about it, and Mike can do “something else” with it, transforming it beyond recognition.

I am convinced that we have to formalise basic mathemaics five or ten times, and moreover this is not a bad idea, nor is it waste of time. We are at a very early stage of computer-assisted mathematics and we still have a lot to discover. People should be and are experimenting. By redoing other people’s work they learn and they improve state-of-the-art (or not). One part of such experimentation is redoing things others have done. Another is looking for interoperability, of course.

But there will never be just one way of doing mathematics (that would be a disaster), nor will there be a sort of universal interoperability that some mathematicians imagine. For that to happen humanity would have to be an ant hill, or Borg.

Again we are mixing different concepts, inspection by humans, cooperation between parties, you want to modify or extend a proof. Proof scripts have their role, there is not doubt about that.

Independently, I am arguing that proof scripts are not “Proof objects”. Proof objects also have their role, and “Proofs assistants must provide proofs that can be independently checked.” It is those proof objects about which I am talking.

(Has the QED manifesto been abandoned?) And of course it does not make sense if this checking falls back on the search strategies by which the proof was constructed. Don’t take my running time prescriptions literally. I meant that the specification of the proof format is so simple that checking it is not a big issue; (the “interpreted specialized language” comparison is good, but there should not be a simple command that triggers a complicated computation. And no compiled universal language.)

“we look at what is actually going on”: I though this blog post was about criteria that formal proofs *should* satisfy, rather than *describing* the state of affairs? ;-) Anyway, this seems to be the main argument against requiring proof objects that are independently checked, that it does not conform to common practice. But people rarely have a choice these days. OpenHOL export for HOL-Light is proprietory. OpenTheory support seems to be a little better. (I don’t know much about Agda, and the .vo format; some issues with .vo have been mentioned in other comments.)

If I would like to use a proof system whose proofs I can independently check, which systems would you recommend? So much for mocking the angry mob about how their professed beliefs match up with their practice. (Not that I would count myself among the proof assistant mob and be offended.)

Two more items about “what is actually going on”:

a) Isabelle has the option to call external proof strategies over the internet. (Some of these are even proprietory.) When the proof succeeds, Isabelle tries to see which assumptions are used and reconstruct the proof internally with fewer assumptions, which sometimes helps. Sometimes it fails. Wouldn’t it be nice if Isabelle could examine the “proof object” and import it?

b) I hear that in those competitions that are held in connection with the ATP conferences, it is required that the participating algorithms provide some evidence of the proof. (Not that there would be a standard format for that). This is the reason why Isabelle never gets the prize, although it usually performs quite well.

> Independently, I am arguing that proof scripts are not “Proof objects”…

Suppose it was formally proven, once and for all, by a machine-checked software verification held to any standard of rigor of your choice, that a certain LCF-style kernel for a proof checker never accepts a statement as a theorem unless it’s formally provable (in the logic implemented by the proof checker). Would you still want that kernel to output the kind of dirt-simple proof object you’re talking about? Would you conclude, as I would, that theorems accepted by this system are as rock-solid as we can hope for? Keep in mind that you can ask this system to rerun the proof as many times as you want, on as many machines as you want, in order to shrink the possibility of a false theorem due to chance hardware failures.

(So that was two different questions just now.)

> “we look at what is actually going on”: I though this blog post was about criteria that formal proofs *should* satisfy, rather than *describing* the state of affairs? ;-) Anyway, this seems to be the main argument against requiring proof objects that are independently checked, that it does not conform to common practice.

That may or may not be Andrej’s argument; it’s definitely not mine. I currently consider a formally verified LCF-style system with a search/check-time phase distinction to be ideal. That’s what it *should* be, and it’s *not* the way it is now.

> If I would like to use a proof system whose proofs I can independently check, which systems would you recommend?

You should take a look at Dedukti. You configure it with a logic to check by giving it a signature in a fixed logical framework, and it checks proof terms for that logic. It seems there is/was a serious effort to extend various proof assistants to generate Dedukti-checkable proof terms.

Dedukti’s proof terms may not meet your standard for proof objects though: traces of reductions are not recorded explicitly, so checking a small proof term can involve an arbitrarily-large amount of computation, depending on the signature.

Plain old Edinburgh LF (another logical framework) terms are a good proof object format, because the reduction rules are fixed, and computationally very limited. Object language reductions (if there are any) have to be recorded explicitly. But I don’t know of much work to have proof assistants generate LF proof terms. There

wasan extremely minimal checker for LF, implemented with < 1000 lines of C. (And half of that was parser!)http://www.cs.princeton.edu/~appel/papers/flit.pdf

> Two more items about “what is actually going on”…

a) It’d be even nicer if the external proof tool would give Isabelle an efficient tactic script. That is probably asking too much though, since Isabelle is not the only proof assistant in the world. If proof terms win over proof scripts, it will probably be because they would turn out to be easier to standardize.

b) This is reasonable, because automated provers will probably be used as external tools by proof assistants. Their purpose is not to check a proof, but to produce a proof to check. (This is also the purpose of the search-time phase, in my proposal. Indeed, external tools should only be invoked at search time, and their witnesses used for check time.)

Thanks for the references; I had heard of Dedukti, I will check it out more.

“Suppose it was formally proven, once and for all, by a machine-checked software verification held to any standard of rigor of your choice, that a certain LCF-style kernel for a proof checker never accepts a statement as a theorem unless it’s formally provable (in the logic implemented by the proof checker). Would you still want that kernel to output the kind of dirt-simple proof object you’re talking about?”

Yes, for two reasons. (not by default of course but as an option).

a) Instead of trusting the LCF-style kernel, I would have to trust the software verification tool that has formally checked the LCF-style kernel. This does not solve the problem, it only pushes it somewhere else (and not even to a different level). Eventually the trust must come from my own intuitive understanding. That’s the reason for striving for a small “trusted code base”.

b) Even if I have inspected the kernel and am satisfied about its correctness,

the environment in which the kernel runs might foul up things. Some examples are mentioned in the article “Pollack-inconsistency” by Freek Wiedijk, doi:10.1016/j.entcs.2012.06.008. Other examples have to do with dirty tricks that are possible in the OCAML programming language. (I forgot where I saw this mentioned.)

The QED manifesto puts it nicely: “In any case, it is a highly desirable goal that a checker for the root logic can be easily written in common programming languages. The specification should be so unambiguous that many can easily implement it from its specification in a few pages of code, with total comprehension by a single person.”

> Instead of trusting the LCF-style kernel, I would have to trust the software verification tool that has formally checked the LCF-style kernel.

But that’s not true. The system that verifies the proof checker may have flaws that cause it to accept invalid proofs, but the verification of the proof checker could still be valid. At worst, you need to trust one verification of the proof checker. You might trust it even without machine checking! The fact that it’s been machine checked can only make it more trustworthy, not less. But you don’t even need to trust

thatverification, because you can demand multiple independent verification efforts, all possibly machine checked with different tools. This is what I meant by “held to any standard of rigor of your choice”.We’re talking about bootstrapping trust. You estimate the probability that a flaw in the proof checker exists (but has thus far escaped detection). The more verifications you do, the smaller you should estimate that probability to be.

> Eventually the trust must come from my own intuitive understanding.

Intuition should be a part of it, but so should science. Making a simpler checker does not decrease the likelihood of a flaw as much as a formal verification does. (Rule of thumb.) When Flit was made, formally verifying a proof checker was still beyond the state of the art.

> Even if I have inspected the kernel and am satisfied about its correctness, the environment in which the kernel runs might foul up things.

So demand that the environment be part of the system that gets verified. You can verify a proof checker and its runtime environment down to machine code. This has been done. (Well, they’ve not run these systems right over the metal, AFAIK, so in a sense we’re not there yet, but we have everything we need.)

> The QED manifesto puts it nicely…

You can use a totally-comprehended proof checker like that to check a verification of the “real” proof checker. But I expect that day-to-day use of such a simple checker would be too painful. Moreover, once the real proof checker has been verified to your satisfaction, you’ve bootstrapped trust, so I don’t see why it would be necessary to recheck proofs with such a simple checker. The real proof checker is by now

moretrustworthy than the simple bootstrap checkers.This discussion is similar to the one I had with Mike. Yes, proof scripts are not proof objects, but they are a good way of creating, managing, and communicating proof objects. The proof objects need not exist in computer memory because nobody needs them in that form. Nobody ever produces them in that form (not Dedukti, not Coq, not Agda, not Lean – they all make shortcuts and omit large chunks of proof objects). What we need to know is that the proof objects are there, somewhere. and that we can trust our proof assistant that the object

couldbe had in explicit form.Mike Shulman proposed a good criterion. Rather than prescribing a particular representation (“I must have a proof object in memory!”) he wants the universal property of proof objects, namely, he wants to be able to perfrom inductive constructions on them. There are many ways of providing such functionalty, only one of which is to “give proof objects”.

OpenTheory is an example of what I mean by a proof object, and there is a large codebase, apparently produced by Nobody Ever. (I once wrote a rudimentary checker for it, in order to understand what it is about.) There is no way to make shortcuts. (Common HOL seems to be similar in this respect, as far as I see. I am surprised by what you say about Dedukti. I thought it is purely a proof CHECKER and produces no output at all. So how would it make shortcuts?)

I want a proof object not only in memory, I want it on file! What’s the big deal about delivering it, instead of only making promises (“trust me that the object COULD be had in explicit form.”) But I am happy that you agree that it is a good thing to provide SOME sort of access to the proof objects as they are implicitly built, which otherwise would only have an ephemeral existence. Maybe theorem prover software could provide some hooks where users can insert their callback routines, for performing inductive constructions or whatever they like? Is that what you mean?

I realize that what I said about “stringent runtime bounds” and even “linear-time proof checking” is nonsense. It was a poorly thought-out metaphor for having a proof object that consists of a list of deduction steps, following a simple specification of deduction rules. For example: when forming a term that applies a function to an argument, or when performing a substitution, this is a single command in OpenTheory, but of course it requires some type checking or processing that may be linear in the size of the involved expressions if implemented naively. I believe one can construct OpenTheory files that would take even exponential time to check, in terms of their length (but never longer than the kernel took to run originally when the proof was produced).

> I realize that what I said about “stringent runtime bounds” and even “linear-time proof checking” is nonsense. … I believe one can construct OpenTheory files that would take even exponential time to check, in terms of their length (but never longer than the kernel took to run originally when the proof was produced).

OK. I haven’t looked at OpenTheory or Common HOL, so I have no idea if you’re right. Not that I care what its asymptotic runtime is, of course. :)

> OpenTheory is an example of what I mean by a proof object, and there is a large codebase, apparently produced by Nobody Ever. There is no way to make shortcuts. (Common HOL seems to be similar in this respect, as far as I see. I am surprised by what you say about Dedukti. I thought it is purely a proof CHECKER and produces no output at all. So how would it make shortcuts?)

I assume you’re talking about someting I said about Dedukti, but I don’t know what. What do you mean by shortcuts?

I can give an example of how Dedukti checking is computationally rich though: For a signature for a Turing-complete language with dependent types, Dedukti proof checking can loop, trying to normalize a term without a normal form. This is because, as far as I understand, Dedukti will attempt to do type checking modulo any congruent reduction relation you give it. (That is, it uses the congruence closure of the reduction rules you give it. So for untyped lambda calculus, you’d only give it the beta rule.)

For the people who may be curious how this is different from ETT: with Dedukti, the reduction rules are always directed, and are part of the signature (the logic definition). It’s impossible to add new reduction rules as a user of the logic, or assume reduction rules, and it’s impossible to add equations respected by typing, but that should not be implemented as reductions.

The whole paragraph with “shortcuts” refers to Andrej’s post from August 25, 2016 at 15:09, immediately above mine.

Ah OK. So the way Dedukti uses shortcuts is by omitting reduction sequences. The input Dedukti proof can use some term M instead of the expected N, for example, and Dedukti will try to find a common reduct for them, to prove that they’re equal. These steps are performed by Dedukti on the fly; they’re not stored in the input. But the input is only valid if Dedukti finds a common reduct.

At least I figured that’s how it works; if reductions steps needed to be explicit, what was the point of supporting reduction rules in the signature?

When browsing around about the Dedukti system, I found the following info about the mysterious .vo-Files of Coq:

“A Coq compiled file contains a copy of the Coq kernel’s in-memory representation of a module marshalled to disk. Since the marshalling facility is provided by the

OCaml compiler, the format of the content in a .vo file is dependent both on the version of the compiler that compiled Coq, and also the version of Coq itself.”

See http://www.ensiie.fr/~guillaume.burel/download/boespflug12coqine.pdf

about the CoqInE project, which translates .vo files into the format for the Dedukti proof checker, at least as far as the logic behind Dedukti allows it.

I draw from this two conclusions.

(a) Coq DOES maintain in memory a representation of a proof object. (This is what Mike originally wanted.)

(b) This proof object is in a form that it could be independently checked. (Dedukti does not have any proof-search facilities.)

I maintain that the object maintained by Coq is

notthe formal derivation. It is theconclusionof the formal derivation. So it is a bit misleading (but not a lot) to call it “proof object”.I am also surprised you consider the .vo representation, in memory or in a file, to be a proof object representation. Can you explain why .vo’s are proof objects and tactic scripts are not?

Coqine produces from the .vo file a Dedukti file of a proof that Dedukti can check. It follows that this file must be a proof object and not a proof script. (Dedukti certainly does not have Coq’ proof searching techniques built in, and such techniques can also not be described to Dedukti as part of the logic specification, I suppose.)

Now the question is: How much does Coqine do? I was under the assumption that Coqine reads the .vo file into memory (using the code from Coq) and then simply translates the proof steps into Dedukti format.

The other option (which I did not think about) is that the .vo file is really (some internal form of) a proof script, and Coqine has to search for the proofs just as Coq does, additionally recording and translating the deduction steps as they are passed through the kernel.

> Coqine produces from the .vo file a Dedukti file of a proof that Dedukti can check. It follows that this file must be a proof object and not a proof script. (Dedukti certainly does not have Coq’ proof searching techniques built in, and such techniques can also not be described to Dedukti as part of the logic specification, I suppose.)

But Coq type-checking involves normalization of Gallina terms. (Gallina is Coq’s term language.) Gallina is rich enough to express proof search algorithms, if you give a limit on search depth. So Coq’s type-checker can be convinced to do proof search, although it’s not a good idea. (But programming the type-checker to do more reasonable check-time things is a technique that is actually used in the Coq community. It’s a hackish form of proof by “computational reflection”.)

What seems to be the case (from your comments about CoqInE) is that CoqInE is indeed a more-or-less straightforward encoding of Coq’s proof terms into Dedukti. They are both able to bring practically any algorithm to bear on proof checking. This is why I was surprised you accepted these proof terms as a proof object representation.

It looks like you’re trying to put together an argument that .vo’s are a good proof object representation. Since I don’t understand what you consider to be a good proof object representation, I cannot dispute it. But I can point out that you seem to be assuming that .vo checking is computationally very limited, when in fact it’s not. In principle, Gallina’s richness means terms are already funny-looking proof scripts. In practice, Gallina is a crummy language for proof scripts. But that’s another discussion.

Here is a passage from Section 4 of the above-mentioned paper about Coqine:

“Indeed, although the translation is linear, some of them are too big. For instance, the Coq.Lists.List module, whose .vo file has length 260KB, is translated into a Dedukti file of length 1.2MB, which is in turn translated into a Haskell file of size 12MB.”

Note in particular that “the translation is linear”. This supports my assumption.

Maybe someone can figure out the length of the Coq.Lists.List source script. (I only find Coq.Lists.List.html, which contains lots of links and colors). Andrej what do you mean with storing the conclusion? Only the statements of the theorems (plus definitions etc.) without proofs? Does it make sense that storing the conclusions of Coq.Lists.List would take 260kB?

Furthermore, how should we trust a system that reads in conclusions and proceeds from there without checking them?

> Now the question is: How much does Coqine do? I was under the assumption that Coqine reads the .vo file into memory (using the code from Coq) and then simply translates the proof steps into Dedukti format.

> Here is a passage from Section 4… Note in particular that “the translation is linear”. This supports my assumption.

OK.

> The other option (which I did not think about) is that the .vo file is really (some internal form of) a proof script.

I’m pretty sure tactics are gone in .vo’s, if that’s what you’re thinking.

> Maybe someone can figure out the length of the Coq.Lists.List source script.

In the coq-8.5pl2 tarball, List.v is 62,044 bytes. That’s probably not quite the version the paper was using.

> Andrej what do you mean with storing the conclusion? Only the statements of the theorems (plus definitions etc.) without proofs?

I cannot directly answer your question to Andrej, but keep in mind that, with the context implicit, type theory has judgments of the form “

term:type“. These judgments are the conclusions of derivations. However, when proving a proposition in type theory, the technique is to represent the proposition as atype, and find some prooftermfor it. In the case of Coq’s type theory, the resulting (term:type) judgment is typically feasible to check without any additional information. So the conclusions of judgments are often quite sufficient as (proposition,proof) pairs. This also means they tend to be fairly large, due to the proof terms.> Furthermore, how should we trust a system that reads in conclusions and proceeds from there without checking them?

We should not. For Coq, (

term:type) judgments can and should be checked, using a type checker.> So the conclusions of judgments are often quite sufficient as (proposition,proof) pairs.

Oops, I meant “conclusions of derivations”.

Aha, I was really underestimating what Dedukti can do in terms of proof checking. The propositions-as-types proofs are really quite a different style of proof, and my focus on HOL-type systems made me narrow-minded.

Could one say that the (term : type) judgment is the proof object? It is what counts as a proof in the underlying theory. But, from the preceding discussion, I guess that type checking is not straightforward enough that I would be comfortable to accept it, with my “stringent requirements” on proof checkability.

Maybe I can try to explain again what I mean by a proof object.

Every proof system has an underlying logic with a handful of deduction rules.

A *formal derivation* of a theorem is a sequence of commands that might typically look as follows: “Apply rule 1 to theorem 23 and theorem 45 and call the result theorem 67”

A *proof object* is an explicit representation of a formal derivation. By explicit I mean that it is no big deal to reconstruct the derivation from it. For example, the “addresses” like 45 might be represented by references in memory. But it is not permitted to search for missing derivation steps. (The OpenTheory file format, as an example, isn’t very difficult: http://www.gilith.com/research/opentheory/article.html)

I have given some indications from the literature by the Coq team why I think that .vo files must contain proof objects. Please share your reasons for believing that they don’t.

The .vo files do not contain the normalization steps used to deduce equations. They are not formal derivations in your sense. Nobody stores those to my knowledge.