At 2018-07-13T17:30:05+0700, Wean Irdeh wrote:
Hello.
Hi Wean,
I'm a poor person looking for an advice.
I know this is the right place since all of you have huge expertise in
formal verification.
Some of us lurkers have less than "huge" experience in formal
verification. ;-)
I've been a contributor to various Free/Libre/Open Source (FLOSS)
projects for about 20 years, but will make no stronger claim to
authority than that.
I'm not a smart person (maybe I'm stupid),
but I'm interested in open
source contribution using C language.
This is not a stupid idea. A great deal of software, both FLOSS and
proprietary is written in C, and it is hugely beneficial to be familiar
with the syntax of C. Numerous languages have adopted its syntax, often
very closely, and the idioms of C are frequently pressed into service as
a form of pseudocode in otherwise language-neutral discussions.
Contributing to a FLOSS project gets you used to interacting with
colleagues and users, helps you build portable skills in languages and
tools that are known outside the walls of a single company, and can give
your programming life continuity that persists longer than your tenure
with a specific employer.
Is there any solution to write C code free from
undefined behavior and
vulnerability
My understanding as a largely self-taught software engineer is that
this not possible in any general-purpose programming language...
without paying for any tools?
...at any price. There is a very famous idea at the intersection of
computer science and mathematical logic called "the halting problem",
and before we even had transistorized computers, logicians had
determined that solving this problem for programs in general was
impossible.
This email ended up being much longer than I expected, so I will start
with some bottom-line advice to start with and then you can
bail out if you choose. :)
I recommend Robert C. Seacord's _The CERT C Coding Standard_[11]. If
you are on a tight budget, you can read even more material than the book
offers at the SEI CERT C Coding Standard wiki[12].
It is often dry material, and yes there is a whole great big lot of it,
but it will familiarize you with C and more importantly with the
mistakes that are often made in it and with the way you need to think as
a programmer to pursue your goal of writing code free from undefined
behavior and vulnerability. Furthermore, the book and wiki are very
good about distinguishing counter-examples (shaded in red on the wiki)
from model examples you should follow (shaded in blue). Better still,
many of their examples are drawn from real-life cases.
Here's a much longer cogitation on your question:
------------------------------------------------------------------------
In mathematics, you may remember the "quadratic formula" from algebra
class, which finds the roots of any equation of the form
ax^2 + bx + c = 0
for you.
There also exist "cubic" and "quartic" formulas, but they are seldom
taught because they are impractically complex to use, especially the
quartic. Nevertheless they were discovered hundreds of years ago in the
heyday of Renaissance mathematics.
It was thought by many that a formula would exist for polynomials of any
integer degree no matter how high, though they might prove devilish to
determine and extremely tedious to use.
However, about 200 years ago Niels Henrik Abel proved that there is _no_
solution ("by radicals" as it is usually qualified) for polynomial
equations of fifth degree or higher. There is no quintic equation and
there never will be for our number system.
Computer languages are something like this. Very simple declarative
languages are often used to describe data structures. Markup languages
are usually like this, from the familiar HTML to things like TOML, JSON,
and YAML. One thing you will notice about these examples is that while
they all have structure--they can nest--they have no way to express
"jumps". Nothing in these languages says "stop reading here and go
forward or backward in the file". They can make _references_ forward or
backward, but they do not describe a thread of execution capable of
looping or branching.
Sometimes, however, this limitation is useful. Sometimes you don't
_need_ to do more than describe data. And we can fully, formally
validate a markup language (if it is well-designed).
It turns out that once programming languages get beyond the very simple,
complete formal verification becomes either challenging or impossible,
depending on what you're willing to accept as "formal verification".
seL4 is a monumental achievement but you will notice that what has been
formally verified is not as complete as we might hope.
Quoting Klein et al. 2009[1]:
"We present seL4, a member of the L4 [46] microkernel family, designed
to provide this ultimate degree of assurance of functional correctness
by machine-assisted and machine-checked formal proof. We have shown the
correctness of a very detailed, low-level design of seL4 and we have
formally verified its C implementation. We assume the correctness of
the compiler, assembly code, boot code, management of caches, and the
hardware; we prove everything else. Specifically, seL4 achieves the
following:
• it is suitable for real-life use, and able to achieve performance that
is comparable with the best-performing microkernels;
• its behaviour is precisely formally specified at an abstract level;
• its formal design is used to prove desirable properties, including
termination and execution safety;
• its implementation is formally proven to satisfy the specification;
and
• its access control mechanism is formally proven"
Note the caveats there: the implementation (in C) has been formally
verified against seL4's _specification_. You will often find in the
software industry that the demands you or your customer place on a
program are incompletely or contradictorily specified.
(Recognition of this fact, and crude attempts to remedy it with
"software development methodology" have slaughtered many trees and
filled many shelf-feet of bookcase space with promises of process
improvement that go unfulfilled.
Put differently, the question "what do you WANT?" is often the hardest
one a consumer of software will ever have to answer about a program.
People think they know. The people who have to implement the system
quickly find out how seldom that isn't the case. A client can talk
about business logic all day long but even if they truly are domain
experts, it's rare that they've thought through every detail of their
use cases.
Fortunately, our industry has long experience with operating systems
development, and we have learned after many failures more or less what
is required of a microkernel. Someone wanting your help to design their
Web "presence" or use "Big Data" in "the cloud" to identify
new markets
for them to exploit are not likely to be as precise.)
Furthermore, "the correctness of the compiler, assembly code, boot code,
management of caches, and hardware" have been assumed. This is
reasonable; NCITA couldn't solve the universe in one go. But this still
leaves numerous vectors for trouble. Each and every one of them has
been exploited.
Compiler? See Thompson 1984, "Reflections on Trusting Trust"[2].
Boot code? A storied place to hide unverified and often hostile
programs; for the good old days, see [3]; for more recent and
institutionalized examples, see [4][5][6].
Management of caches? You may have heard recently of the Spectre[7] and
Meltdown[8] attacks.
Hardware? CPU flaws even apart from cache management are so commonplace
that it seems like the industry largely regards them as merely as
unfortunate designs, which may or may not be rectified in a future chip
generation. I could name the infamous Pentium FDIV bug[9], but my
preferred example is the otherwise beautifully-designed Motorola 68000
processor, which nevertheless leaked information about supervisor state
upon transitions back to user state. Even apart from security concerns,
this could be regarded as a correctness flaw because it prevented clean
virtualiztion of the processor[10]. (To their credit, Moto fixed this
problem in the 68010.)
I skipped over assembly code on purpose. Any assembly language or its
machine instruction synonyms could be decomposed conceptually into the
instruction language for a simpler machine. Thought experiments like
the "1-instruction set computer" and "0-instruction set computer"
exist
and you can look them up on Wikipedia. With those in mind, once you
have become familiar with the concept of the Turing machine, it is
easier to recognize how proving assembly code "correct" amounts to
solving the Halting Problem, which is unsolvable.
Again, none of this is to knock on the seL4 project--all of its
principals are obvious clearly aware of the limits of their claims,
doubtless better than I am. But it's still a fantastic project in my
opinion. A long time ago I studied aeronautical engineering, and
absorbed the lesson that in engineering we cannot design systems that
will never, ever fail. We can only safeguard them against more and more
demanding conditions. More than that, almost any system complex enough
to be interesting and useful involves design tradeoffs. The classic
example from aero is that the propulsion engineers want to put huge,
heavy engines on aircraft, because they are able to scale up fuel
economy and physical robustness that way. By contrast, the aerodynamics
department recoils in horror at this, because an ideal wing is an
infinitely thin sheet with no mass. Slapping a giant wart of an engine
nacelle on their beautiful wing offends them deeply. The structures and
materials people look at the demands of both of the foregoing and throw
their hands up in despair at the requirements for infinitely strong,
weightless structural components that have a zero coefficient of thermal
expansion. They'll come up with some advanced carbon-fiber composite
that's better than anything you can buy, but it will cost a billion
dollars to produce. Meanwhile, the engineering managers look at the
schedule and budget available and cut everybody else's dreams to
ribbons.
I expect there are many high-caliber engineers and computer scientists
monitoring this mailing list, and I invite them to correct me on any
point. I would appreciate it.
------------------------------------------------------------------------
Finally, it seems to my mailer that you may be CCing your replies to
private responses back to the mailing list. That is generally
considered a breach of email etiquette, unless you have received the
permission of your correspondent to do so--and it typically helps to
note this fact when it is the case.
If someone replies privately, respond to them the same way until and
unless you know it is okay with them for you to do otherwise.
I hope I have been helpful.
Regards,
Branden
[1]
http://www.sigops.org/sosp/sosp09/papers/klein-sosp09.pdf
[2]
https://www.archive.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf
[3]
https://en.wikipedia.org/wiki/Boot_sector#Boot_sector_viruses
[4]
http://www.securitynewspaper.com/2017/09/01/cia-developed-windows-malware-a…
[5]
https://www.blackhat.com/eu-17/briefings/schedule/#how-to-hack-a-turned-off…
[6]
https://arstechnica.com/information-technology/2018/03/a-raft-of-flaws-in-a…
[7]
https://spectreattack.com/spectre.pdf
[8]
https://support.apple.com/en-us/HT208394
[9]
http://www.trnicely.net/pentbug/pentbug.html
[10]
https://en.wikipedia.org/wiki/Motorola_68010
[11]
https://www.amazon.com/CERT%C2%AE-Coding-Standard-Second-Engineering/dp/032…
[12]
https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard