Hello. I'm a poor person looking for an advice. I know this is the right place since all of you have huge expertise in formal verification. I'm not a smart person (maybe I'm stupid), but I'm interested in open source contribution using C language. Is there any solution to write C code free from undefined behavior and vulnerability without paying for any tools?
At 2018-07-13T17:30:05+0700, Wean Irdeh wrote:
Hello.
Hi Wean,
I'm a poor person looking for an advice. I know this is the right place since all of you have huge expertise in formal verification.
Some of us lurkers have less than "huge" experience in formal verification. ;-) I've been a contributor to various Free/Libre/Open Source (FLOSS) projects for about 20 years, but will make no stronger claim to authority than that.
I'm not a smart person (maybe I'm stupid), but I'm interested in open source contribution using C language.
This is not a stupid idea. A great deal of software, both FLOSS and proprietary is written in C, and it is hugely beneficial to be familiar with the syntax of C. Numerous languages have adopted its syntax, often very closely, and the idioms of C are frequently pressed into service as a form of pseudocode in otherwise language-neutral discussions. Contributing to a FLOSS project gets you used to interacting with colleagues and users, helps you build portable skills in languages and tools that are known outside the walls of a single company, and can give your programming life continuity that persists longer than your tenure with a specific employer.
Is there any solution to write C code free from undefined behavior and vulnerability
My understanding as a largely self-taught software engineer is that this not possible in any general-purpose programming language...
without paying for any tools?
...at any price. There is a very famous idea at the intersection of computer science and mathematical logic called "the halting problem", and before we even had transistorized computers, logicians had determined that solving this problem for programs in general was impossible. This email ended up being much longer than I expected, so I will start with some bottom-line advice to start with and then you can bail out if you choose. :) I recommend Robert C. Seacord's _The CERT C Coding Standard_[11]. If you are on a tight budget, you can read even more material than the book offers at the SEI CERT C Coding Standard wiki[12]. It is often dry material, and yes there is a whole great big lot of it, but it will familiarize you with C and more importantly with the mistakes that are often made in it and with the way you need to think as a programmer to pursue your goal of writing code free from undefined behavior and vulnerability. Furthermore, the book and wiki are very good about distinguishing counter-examples (shaded in red on the wiki) from model examples you should follow (shaded in blue). Better still, many of their examples are drawn from real-life cases. Here's a much longer cogitation on your question: ------------------------------------------------------------------------ In mathematics, you may remember the "quadratic formula" from algebra class, which finds the roots of any equation of the form ax^2 + bx + c = 0 for you. There also exist "cubic" and "quartic" formulas, but they are seldom taught because they are impractically complex to use, especially the quartic. Nevertheless they were discovered hundreds of years ago in the heyday of Renaissance mathematics. It was thought by many that a formula would exist for polynomials of any integer degree no matter how high, though they might prove devilish to determine and extremely tedious to use. However, about 200 years ago Niels Henrik Abel proved that there is _no_ solution ("by radicals" as it is usually qualified) for polynomial equations of fifth degree or higher. There is no quintic equation and there never will be for our number system. Computer languages are something like this. Very simple declarative languages are often used to describe data structures. Markup languages are usually like this, from the familiar HTML to things like TOML, JSON, and YAML. One thing you will notice about these examples is that while they all have structure--they can nest--they have no way to express "jumps". Nothing in these languages says "stop reading here and go forward or backward in the file". They can make _references_ forward or backward, but they do not describe a thread of execution capable of looping or branching. Sometimes, however, this limitation is useful. Sometimes you don't _need_ to do more than describe data. And we can fully, formally validate a markup language (if it is well-designed). It turns out that once programming languages get beyond the very simple, complete formal verification becomes either challenging or impossible, depending on what you're willing to accept as "formal verification". seL4 is a monumental achievement but you will notice that what has been formally verified is not as complete as we might hope. Quoting Klein et al. 2009[1]: "We present seL4, a member of the L4 [46] microkernel family, designed to provide this ultimate degree of assurance of functional correctness by machine-assisted and machine-checked formal proof. We have shown the correctness of a very detailed, low-level design of seL4 and we have formally verified its C implementation. We assume the correctness of the compiler, assembly code, boot code, management of caches, and the hardware; we prove everything else. Specifically, seL4 achieves the following: • it is suitable for real-life use, and able to achieve performance that is comparable with the best-performing microkernels; • its behaviour is precisely formally specified at an abstract level; • its formal design is used to prove desirable properties, including termination and execution safety; • its implementation is formally proven to satisfy the specification; and • its access control mechanism is formally proven" Note the caveats there: the implementation (in C) has been formally verified against seL4's _specification_. You will often find in the software industry that the demands you or your customer place on a program are incompletely or contradictorily specified. (Recognition of this fact, and crude attempts to remedy it with "software development methodology" have slaughtered many trees and filled many shelf-feet of bookcase space with promises of process improvement that go unfulfilled. Put differently, the question "what do you WANT?" is often the hardest one a consumer of software will ever have to answer about a program. People think they know. The people who have to implement the system quickly find out how seldom that isn't the case. A client can talk about business logic all day long but even if they truly are domain experts, it's rare that they've thought through every detail of their use cases. Fortunately, our industry has long experience with operating systems development, and we have learned after many failures more or less what is required of a microkernel. Someone wanting your help to design their Web "presence" or use "Big Data" in "the cloud" to identify new markets for them to exploit are not likely to be as precise.) Furthermore, "the correctness of the compiler, assembly code, boot code, management of caches, and hardware" have been assumed. This is reasonable; NCITA couldn't solve the universe in one go. But this still leaves numerous vectors for trouble. Each and every one of them has been exploited. Compiler? See Thompson 1984, "Reflections on Trusting Trust"[2]. Boot code? A storied place to hide unverified and often hostile programs; for the good old days, see [3]; for more recent and institutionalized examples, see [4][5][6]. Management of caches? You may have heard recently of the Spectre[7] and Meltdown[8] attacks. Hardware? CPU flaws even apart from cache management are so commonplace that it seems like the industry largely regards them as merely as unfortunate designs, which may or may not be rectified in a future chip generation. I could name the infamous Pentium FDIV bug[9], but my preferred example is the otherwise beautifully-designed Motorola 68000 processor, which nevertheless leaked information about supervisor state upon transitions back to user state. Even apart from security concerns, this could be regarded as a correctness flaw because it prevented clean virtualiztion of the processor[10]. (To their credit, Moto fixed this problem in the 68010.) I skipped over assembly code on purpose. Any assembly language or its machine instruction synonyms could be decomposed conceptually into the instruction language for a simpler machine. Thought experiments like the "1-instruction set computer" and "0-instruction set computer" exist and you can look them up on Wikipedia. With those in mind, once you have become familiar with the concept of the Turing machine, it is easier to recognize how proving assembly code "correct" amounts to solving the Halting Problem, which is unsolvable. Again, none of this is to knock on the seL4 project--all of its principals are obvious clearly aware of the limits of their claims, doubtless better than I am. But it's still a fantastic project in my opinion. A long time ago I studied aeronautical engineering, and absorbed the lesson that in engineering we cannot design systems that will never, ever fail. We can only safeguard them against more and more demanding conditions. More than that, almost any system complex enough to be interesting and useful involves design tradeoffs. The classic example from aero is that the propulsion engineers want to put huge, heavy engines on aircraft, because they are able to scale up fuel economy and physical robustness that way. By contrast, the aerodynamics department recoils in horror at this, because an ideal wing is an infinitely thin sheet with no mass. Slapping a giant wart of an engine nacelle on their beautiful wing offends them deeply. The structures and materials people look at the demands of both of the foregoing and throw their hands up in despair at the requirements for infinitely strong, weightless structural components that have a zero coefficient of thermal expansion. They'll come up with some advanced carbon-fiber composite that's better than anything you can buy, but it will cost a billion dollars to produce. Meanwhile, the engineering managers look at the schedule and budget available and cut everybody else's dreams to ribbons. I expect there are many high-caliber engineers and computer scientists monitoring this mailing list, and I invite them to correct me on any point. I would appreciate it. ------------------------------------------------------------------------ Finally, it seems to my mailer that you may be CCing your replies to private responses back to the mailing list. That is generally considered a breach of email etiquette, unless you have received the permission of your correspondent to do so--and it typically helps to note this fact when it is the case. If someone replies privately, respond to them the same way until and unless you know it is okay with them for you to do otherwise. I hope I have been helpful. Regards, Branden [1] http://www.sigops.org/sosp/sosp09/papers/klein-sosp09.pdf [2] https://www.archive.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf [3] https://en.wikipedia.org/wiki/Boot_sector#Boot_sector_viruses [4] http://www.securitynewspaper.com/2017/09/01/cia-developed-windows-malware-al... [5] https://www.blackhat.com/eu-17/briefings/schedule/#how-to-hack-a-turned-off-... [6] https://arstechnica.com/information-technology/2018/03/a-raft-of-flaws-in-am... [7] https://spectreattack.com/spectre.pdf [8] https://support.apple.com/en-us/HT208394 [9] http://www.trnicely.net/pentbug/pentbug.html [10] https://en.wikipedia.org/wiki/Motorola_68010 [11] https://www.amazon.com/CERT%C2%AE-Coding-Standard-Second-Engineering/dp/0321... [12] https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard
participants (3)
-
David MENTRÉ
-
G. Branden Robinson
-
Wean Irdeh