A Famed Hacker Is Grading Thousands of Programs — and May Revolutionize Software in the Process

Peiter "Mudge" Zatko and his wife Sarah, formerly of the NSA, developed software that's already helped find flaws across 12,000 pieces of software.

Photo: Cole C Wilson

At the Black Hat cybersecurity conference in 2014, industry luminary Dan Geer, fed up with the prevalence of vulnerabilities in digital code, made a modest proposal: Software companies should either make their products open source so buyers can see what they’re getting and tweak what they don’t like, or suffer the consequences if their software failed. He likened it to the ancient Code of Hammurabi, which says that if a builder poorly constructs a house and the house collapses and kills its owner, the builder should be put to death.

No one is suggesting putting sloppy programmers to death, but holding software companies liable for defective programs, and nullifying licensing clauses that have effectively disclaimed such liability, may make sense, given the increasing prevalence of online breaches.

The only problem with Geer’s scheme is that no formal metrics existed in 2014 for assessing the security of software or distinguishing between code that is merely bad and code that is negligently bad. Now, that may change, thanks to a new venture from another cybersecurity legend, Peiter Zatko, known more commonly by his hacker handle “Mudge.”

Mudge and his wife, Sarah, a former NSA mathematician, have developed a first-of-its-kind method for testing and scoring the security of software — a method inspired partly by Underwriters Laboratories, that century-old entity responsible for the familiar circled UL seal that tells you your toaster and hair dryer have been tested for safety and won’t burst into flames.

Called the Cyber Independent Testing Lab, the Zatkos’ operation won’t tell you if your software is literally incendiary, but it will give you a way to comparison-shop browsers, applications, and antivirus products according to how hardened they are against attack. It may also push software makers to improve their code to avoid a low score and remain competitive.

“There are applications out there that really do demonstrate good [security] hygiene … and the vast majority are somewhere else on the continuum from moderate to atrocious,” Peiter Zatko says. “But the nice thing is that now you can actually see where the software package lives on that continuum.”

Joshua Corman, founder of I Am the Cavalry, a group aimed at improving the security of software in critical devices like cars and medical devices, and head of the Cyber Statecraft Initiative for the Atlantic Council, says the public is in sore need of data that can help people assess the security of software products.

“Markets do well when an informed buyer can make an informed risk decision, and right now there is incredibly scant transparency in the buyer’s realm,” he says.

Corman cautions, however, that the Zatkos’ system is not comprehensive, and although it will provide one indicator of security risk, it’s not a conclusive indicator. He also says vendors are going to hate it.

“I have scars to show how much the software industry resists scrutiny,” he says.


Software Seal of Approval

When Mudge announced on Twitter last year that the White House had asked him to create a cyber version of Underwriters Laboratories, praise poured in from around the security community.

No one knew the details, but people were confident that if he was involved, it would be great.

“Excellent! Something everyone has talked about for decades!” the Def Con hacker conference tweeted after his announcement.

“That’s a concept that really could make a difference if executed well,” wrote Bruce Potter, founder of the Shmoo Group crypto-security collective, which runs the annual Shmoocon security conference.

Mudge has been tightlipped about the nature of the cyber UL ever since, but he agreed to discuss the details in advance of a talk he’s presenting next week at the Black Hat conference in Las Vegas.


He says the method their lab uses to evaluate software is based on one he taught NSA hackers in the 1990s about how to find the softest targets on an adversary’s network. (During his run back then with the famed hacker think tank L0pht Heavy Industries, Mudge and his L0pht colleagues regularly provided advice to various parts of the government.)

The technique involves, in part, analyzing binary software files using algorithms created by Sarah to measure the security hygiene of code. During this sort of examination, known as “static analysis” because it involves looking at code without executing it, the lab is not looking for specific vulnerabilities, but rather for signs that developers employed defensive coding methods to build armor into their code.

“To use the car analogy, does it have seatbelts, does it have air bags, does it have anti-lock brakes? All the things that are going to make [a hacker’s] life more difficult,” Mudge says.

The Zatkos say the security hygiene of a program’s code, measured by the programming methods developers use as well as by the tools and settings used to compile the resulting software, is a good predictor of whether the application will have serious security vulnerabilities and reliability issues.

Their algorithms run through a checklist of more than 300 items, such as whether the compiler used to convert the source code into binary inserted common protective features, like preventing portions of memory reserved for program data — the “stack” and “heap” — from being used to hold additional software.

“Things like ASLR [address space layout randomization] and having a nonexecutable stack and heap and stuff like that, those are all determined by how you compiled [the source code],” says Sarah. “Those are the technologies that are really the equivalent of airbags or anti-lock brakes [in cars]. They’re the things that make software better than it used to be.”

Modern compilers for Linux and OS X not only add protective features, they also automatically replace unsafe functions in code with safer equivalents when those are available. Yet some companies still use old compilers that lack security features.
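
Some of these compile-time protections leave visible traces in the finished binary, which is why they can be checked without access to source code. The following is a minimal sketch in Python, not the lab’s tooling: it assumes a 64-bit little-endian Linux ELF file and reads only two header fields. The function name `hardening_flags` is made up for illustration. It reports whether the file is position-independent (a prerequisite for full ASLR, though shared libraries carry the same type) and whether the stack was marked nonexecutable at build time.

```python
import struct

PT_GNU_STACK = 0x6474E551  # program header type that records stack permissions
PF_X = 0x1                 # "executable" permission bit in a segment's flags
ET_DYN = 3                 # ELF type used by position-independent executables
                           # (shared libraries use the same type)


def hardening_flags(elf: bytes) -> dict:
    """Report two compile-time protections for a 64-bit little-endian ELF image.

    Hypothetical helper for illustration; a real assessment checks far more.
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")

    # Fixed offsets from the ELF64 file header.
    (e_type,) = struct.unpack_from("<H", elf, 16)
    (e_phoff,) = struct.unpack_from("<Q", elf, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 54)

    # Assumption: with no PT_GNU_STACK entry, treat the stack as unprotected.
    nx_stack = False
    for i in range(e_phnum):
        # In ELF64 program headers, p_type and p_flags are the first two fields.
        p_type, p_flags = struct.unpack_from("<II", elf, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            nx_stack = not (p_flags & PF_X)

    return {"pie": e_type == ET_DYN, "nx_stack": nx_stack}
```

Calling `hardening_flags(open(path, "rb").read())` on a binary returns a small dictionary with the two findings. Checks like these are cheap, which is part of what makes it plausible to run a long checklist across thousands of programs.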

The lab’s initial research has found that Microsoft’s Office suite for OS X, for example, is missing fundamental security settings because the company is using a decade-old development environment to build it, despite using a modern and secure one to build its own operating system, Mudge says. Industrial control system software, used in critical infrastructure environments like power plants and water treatment facilities, is also primarily compiled on “ancient compilers” that either don’t have modern protective measures or don’t have them turned on by default.

Asked about the findings, a Microsoft spokesperson would only say, “We are focused on security as a core component in the software development process. We developed and are committed to the Security Development Lifecycle, and continue to lead the industry in creating the most secure products across all platforms.”

The Zatkos’ algorithms also assess the number of branches in a program; more branches mean more complexity and more potential for error. And they look at the presence of complex algorithms that could be susceptible to algorithmic complexity attacks.
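
Branch counting is easy to illustrate on source code, even though the lab works on compiled binaries. This sketch swaps in Python’s standard `ast` module and counts decision points in Python source. The set of node types treated as branches is an assumption chosen for illustration, not the Zatkos’ metric.

```python
import ast

# Assumption: these syntax nodes are what we count as "branches" here.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler, ast.BoolOp)


def count_branches(source: str) -> int:
    """Count decision points in a piece of Python source code."""
    tree = ast.parse(source)
    return sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""

# One `if`, one `and` expression, and one `for` loop.
print(count_branches(SAMPLE))
```

A higher count does not prove a program is flawed. It only signals more paths that have to be tested, which is the sense in which complexity raises the potential for error.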

The lab is also looking at the number of external software libraries a program calls on and the processes it uses to call them. Such libraries make life more convenient for programmers, because they allow them to repurpose useful functions written by other coders, but they also increase the amount of potentially vulnerable code, increasing what security experts refer to as the “attack surface.” There are about 200 specific external library calls, Mudge says, that are particularly difficult to implement in a manner that ensures a given program executes safely.
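
A check of this kind can be sketched as a comparison between the functions a program imports and a watch list. The list below is a small illustrative subset of C library functions widely regarded as hard to use safely. It is an assumption for this example, not the roughly 200 calls Mudge refers to, and `flag_risky_calls` is a hypothetical name.

```python
# Illustrative subset only; not the lab's list.
RISKY_CALLS = {"gets", "strcpy", "strcat", "sprintf", "system"}


def flag_risky_calls(imported_symbols):
    """Return the imported function names that appear on the watch list."""
    return sorted(set(imported_symbols) & RISKY_CALLS)


# Hypothetical import list, as might be extracted from a binary's symbol table.
imports = ["printf", "strcpy", "malloc", "gets", "free"]
print(flag_risky_calls(imports))
```

Finding one of these names shows only that the program relies on a function that is easy to misuse, not that it misuses it, which is the objection vendors are expected to raise.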


The process they use to evaluate software allows them to easily compare and contrast similar programs. Looking at three browsers, for example — Chrome, Safari, and Firefox — Chrome came out on top, with Firefox on the bottom. Google’s Chrome developers not only used a modern build environment and enabled all the default security settings they could, Mudge says, they went “above and beyond in making things even more robust.” Firefox, by contrast, “had turned off [ASLR], one of the fundamental safety features in their compilation.”

Mudge worked for Google previously, so some might accuse him of bias, but he says their algorithms, which have been vetted by an outside technical board, ensure that the automated assessments aren’t biased.

Software vendors will no doubt object to the methods they’re using to score their code, arguing that the use of risky libraries and old compilers doesn’t mean the vendors’ programs have actual vulnerabilities. But Sarah disagrees.

“If they get a really good score, we’re not saying there are no vulnerabilities,” says Sarah. But if they get a really low score, “we can guarantee that … they’re doing so many things wrong that there are vulnerabilities [in their code].”

The lab aims to prove such vulnerabilities with the second part of its testing regimen, which uses fuzzing, a method that involves throwing a lot of data at a program to see if it crashes or does something else it shouldn’t do.

“In actually executing it and crashing it, we’re confirming that, yes, this thing has bugs, this thing crashed,” Mudge says. “We were able to give it input and it behaved abhorrently.”
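
The idea behind fuzzing fits in a few lines. This is a minimal sketch, assuming a toy target written for the example. `parse_record` and its bug are hypothetical, and the lab’s fuzzing of real binaries is far more involved than throwing random bytes at a function.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical target: the first byte declares how long the payload is."""
    length = data[0]
    terminator = data[1 + length]  # bug: trusts the declared length
    if terminator != 0:
        raise ValueError("record is not null-terminated")  # clean rejection
    return data[1:1 + length]


def fuzz(target, runs=1000, seed=0):
    """Feed random byte strings to target and collect inputs that crash it."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    crashes = []
    for _ in range(runs):
        size = rng.randrange(0, 16)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # the target rejected the input on purpose
        except Exception as exc:
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_record)
print(len(crashes), "crashing inputs found")
```

Each recorded crash is an input the program failed to handle. Deciding which of them an attacker could exploit is a separate and harder step.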

Not all crashes indicate the presence of a bug that hackers can exploit, but they do, at a minimum, indicate that a program may be unreliable for users. In the lab reports the Zatkos plan to make available to the public, they will note which crashes they found were potentially exploitable.

The Zatkos don’t plan to fuzz every program, only enough to show a direct correlation between programs that score low in their algorithmic code analysis and ones shown by fuzzing to have actual flaws. They want to be able to say with 90 percent accuracy that one is indicative of the other.
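
One simple way to express such a claim is the share of programs where the static score and the fuzzing result agree. The sketch below is an assumption about how that could be measured, not the Zatkos’ method. The cutoff of 50 and the sample numbers are invented for illustration.

```python
def prediction_accuracy(results, threshold=50):
    """Fraction of programs where a low static score matches a fuzzing crash.

    results: list of (static_score, crashed_under_fuzzing) pairs.
    threshold: hypothetical cutoff below which a score counts as "low".
    """
    agree = sum((score < threshold) == crashed for score, crashed in results)
    return agree / len(results)


# Made-up data: (score out of 100, did fuzzing crash the program?)
sample = [(20, True), (35, True), (80, False), (90, False), (45, False)]
print(prediction_accuracy(sample))
```

In this invented sample, four of the five programs behave as their score predicts. The fifth scores low but survives fuzzing.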

Mudge’s Storied Hacking History

Mudge has a long history in the hacker and security communities. While a member of L0pht, he and his L0pht colleagues testified to federal lawmakers in 1998 that the group could bring down the internet in 30 minutes using a serious flaw that still exists.


He also advised the Clinton administration on cybersecurity issues; was a program manager for DARPA’s Cyber FastTrack initiative, which offered fast-turnaround grants for short cybersecurity projects; and more recently, worked for Google’s Advanced Technologies and Projects Group, a sort of rapid-response skunkworks group, before leaving to launch the testing lab.

His interest in doing software security assessments dates back to a paper one of his L0pht colleagues wrote in 1998 about such evaluations. The idea moved from theory to practice when L0pht merged with a security startup called @Stake and began developing an automated way to do static analysis of code. That method became the basis for what a company called VeraCode does today: assess software for government and corporate clients before they buy it.

Chris Wysopal, CTO of VeraCode and a former L0pht colleague of Mudge’s, says clients generally won’t purchase software his company finds problematic until the software maker fixes the problems, which he says is great for other buyers.

“To me that’s like actually finishing the job; we’re not just pointing out the problems but helping make better software,” he says.

But these assessments are done privately and often on enterprise software, leaving the rest of the public with no way to assess the security of software and little leverage to force vendors to fix other poorly secured code. The Zatkos’ venture could fill that gap, Wysopal says.

Mudge says that two years ago, someone from the White House technology office approached him about helping to set up a government program to evaluate software. He had no interest in working inside the government and decided to set up a nonprofit instead. Although his tweet last year said the White House asked him to create the lab, the White House isn’t involved in his project.

Instead, with $600,000 in funding from DARPA, the Ford Foundation, and Consumers Union, he and Sarah set up the lab in the basement of their home. The outside technical board that vets their methodology and algorithms includes security notables such as former NSA hacker Charlie Miller; Dino Dai Zovi, a security engineer with Square; and Frank Rieger, CTO of the German firm GSMk, which makes the Cryptophone.

Vendors don’t pay for the evaluations. The Zatkos choose the software they evaluate and either buy it or obtain free evaluation copies from vendor websites. They’re examining both commercial software programs and open-source ones. For each software package they test, they produce three reports. The first, automatically generated by their algorithms, scores the software on a scale between 0 and 100. The second contains a detailed breakdown of what they found in the software and will be available for free on their website. The third report, which they plan to sell, will contain raw data from their assessments for anyone who wants to recreate them.

They’ve examined about 12,000 programs so far and plan to release their first reports in early 2017. They also plan to release information about their methodology and are willing to share the algorithms they use for their predictive fuzzing analysis if someone wants them.

There’s already growing interest in their work. They’re working with Consumer Reports, another inspiration for the lab, to develop a way to use their data to evaluate products the magazine tests. They’ve also had interest from AIG and other insurers who want to use the data to do risk assessments of companies seeking cyber insurance.

But there is at least one downside to scoring software like this: Attackers can use it to gauge where they should focus their energy to find vulnerabilities, targeting low-scoring applications. Lawyers will also likely want to use the data to assess liability for companies that get hacked. Did they install risky software on their network when a measurably more secure one was available?

Mudge says he’s not upset about the prospect of lawyers finding joy in their scores. “We’ve been begging people to give a shit about security for a decade. … [But] there’s very little incentive if they’ve already got a product to change a product. If you come out with a quantifier saying what you’ve got is not as secure as this other one, that’s going to be an incentive for them to go out and get the other one.”
