Researchers at the University of Cambridge, UK, have released details of a cunning and insidious new class of software vulnerability that allows attackers to hide code in plain sight, within the source code of computer programs. The techniques demonstrated by the researchers could be used to poison open source software, and the vast software supply chains it feeds, by adding flaws, vulnerabilities, or malicious code that are invisible to human code reviewers.

The new class of vulnerabilities, dubbed "Trojan Source", affects a who's who of the world's most widely used programming languages, including the five most popular (Python, Java, JavaScript, C#, and C), putting enormous numbers of computer programs at risk.

How it works

Most computer code starts life as a set of instructions written in a so-called "high-level" language, like Python or Java, which is designed to be easy for humans to read, write, and understand. These high-level instructions are then processed by a computer program (an interpreter or a compiler) into a low-level language, such as bytecode or machine code.

A lot of source code looks very English, and is written using the same limited set of letters, numbers, and punctuation I used to create this article. However, it can potentially include any of the roughly 150,000 characters included in the Unicode standard, a sort of grand-unified human alphabet. Unicode provides a unique number (called a code point) for almost all the characters we use to communicate—from Kanji and currency symbols to Roman numerals and emojis.

To a computer, every Unicode character is just a different number, but humans are less discerning. Some Unicode characters are invisible to humans, and many of them look very similar to one another.
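To make that concrete, here is a minimal Python sketch that compares two lookalike letters. The choice of characters (a Latin "a" and a Cyrillic "а") is mine, picked purely for illustration; the only library used is Python's standard unicodedata module.

```python
import unicodedata

latin_a = "a"          # U+0061, the ordinary Latin letter
cyrillic_a = "\u0430"  # U+0430, looks near-identical in most fonts

for ch in (latin_a, cyrillic_a):
    # Print the code point and the official Unicode name of each character
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# To the computer these are simply two different numbers
print(latin_a == cyrillic_a)  # False
```

A human sees the same letter twice; the program sees code points 0x61 and 0x430, which have nothing to do with each other.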

Trojan Source attacks exploit the fact that humans and compilers may interpret the same source code in two different ways. By playing on those differences, it's possible for attackers to create malicious source code that appears harmless to human eyes.

Trojan Source attacks come in two flavours:

Homoglyph attacks

Homoglyphs are sets of characters that look identical or very similar. They are already widely used by scammers to create lookalike web addresses and app names, such as FаⅽeЬoο and WhatѕAрp. (I've used examples that deliberately look odd, so you can see what I mean, but attackers aren't so charitable.)

The Trojan Source paper shows that the same trick can be used to mislead humans when they read source code, by using lookalike class names, function names, and variables. The researchers use the example of a malicious edit to an existing codebase that already contains a function called hashPassword, which might be called during a login process. They imagine an attacker inserting a similar-looking function called hаshPassword (the a has been replaced with a lookalike), which calls the original function but also leaks the user's password.

Would a busy code reviewer spot the imposter? I suspect not. The authors suspect not too, and say they were able to "successfully implement homoglyph attack proofs-of-concept in every language discussed in this paper; that is, C, C++, C#, JavaScript, Java, Rust, Go, and Python".
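A reviewer's eyes may not catch the swap, but a few lines of code can. The sketch below is my own illustration, not a method from the paper: it guesses each letter's script from the first word of its Unicode name (a rough heuristic, since Python's standard library has no direct "script" lookup) and flags identifiers that mix scripts. The function names scripts_in and looks_suspicious are hypothetical.

```python
import unicodedata

def scripts_in(identifier):
    """Return the set of scripts (guessed from Unicode names) used by the letters."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # Heuristic: "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_suspicious(identifier):
    """Flag identifiers whose letters come from more than one script."""
    return len(scripts_in(identifier)) > 1

genuine = "hashPassword"
imposter = "h\u0430shPassword"  # second letter is Cyrillic U+0430, not Latin "a"

print(looks_suspicious(genuine))   # False
print(looks_suspicious(imposter))  # True
```

A check like this would also flag legitimate identifiers written in mixed scripts, so in practice it is better suited to raising a warning than to rejecting code outright.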

Bidi attacks

Alongside the characters you can see, Unicode also contains a number of invisible control characters that indicate to computer programs how things should be interpreted or displayed. The most obvious and often used are probably the carriage return and line feed characters that mark the end of a line of text you write. Chances are, you use them every day without realising.

Among its invisible control characters, Unicode also includes characters for setting the direction of text, so that it can handle languages that are read left-to-right, like English, languages that are read right-to-left, like Hebrew, and mixtures of the two. The control characters allow a phrase like left-to-right to be reversed, so it reads thgir-ot-tfel, or for it to be rearranged so that chunks of left-to-right text are arranged in a right-to-left order (or vice versa), so it reads right-to-left, for example.
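Here is a small Python sketch of what that looks like from the computer's side. It wraps the phrase in two real Unicode control characters, RIGHT-TO-LEFT OVERRIDE (U+202E) and POP DIRECTIONAL FORMATTING (U+202C). The variable names are mine, and how the wrapped string is drawn on screen depends on the software displaying it.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE: display what follows right-to-left
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING: end the override

phrase = "left-to-right"
wrapped = RLO + phrase + PDF

# repr() escapes the invisible characters so we can see them
print(repr(wrapped))

# The stored text is unchanged; only two invisible characters were added
print(len(phrase), len(wrapped))  # 13 15
print(phrase in wrapped)          # True

for ch in (RLO, PDF):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.bidirectional(ch)})")
```

A display that honours the override would show the phrase reversed, yet the characters in memory are still in their original order. That gap between storage order and display order is what the attack relies on.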

Since these control characters are about arranging text for human consumption, the text editors used for reading source code tend to respect them, but compilers and interpreters don't. And while compilers and interpreters tend not to allow control characters in the source code itself, they often do allow them in the comments that document the code, and in text strings processed by the code.
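You can see this behaviour for yourself with Python's built-in compile() function. The sketch below is my own quick experiment, and it assumes a CPython 3 interpreter, where the control characters are accepted inside a comment or a string but rejected in the code itself. Other languages and toolchains may behave differently. The helper name compiles is hypothetical.

```python
def compiles(source):
    """Return True if Python accepts the source text, False on a syntax error."""
    try:
        compile(source, "<demo>", "exec")
        return True
    except SyntaxError:
        return False

RLO, PDF = "\u202e", "\u202c"  # two bidirectional control characters

in_comment = f"x = 1  # {RLO}note{PDF}\n"  # control characters inside a comment
in_string = f"s = '{RLO}note{PDF}'\n"      # control characters inside a string
in_code = f"x {RLO}= 1\n"                  # control character in the code itself

print(compiles(in_comment))  # True
print(compiles(in_string))   # True
print(compiles(in_code))     # False
```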

That difference between the way that humans and compilers "see" the source can be used to hide malicious code.

The researchers show that an attacker could use bidirectional control characters in comments to completely change the meaning of a piece of code, which they illustrate with a simple example.

In our fictional scenario, attackers have disabled a line of code that should only run if the user is an admin, by putting it in comments. The compiler sees this:

/* if (isAdmin) { begin admins only */

The attacker knows that a human code reviewer should identify this as a security problem, so they add some bidirectional control characters to rearrange the code for human eyes, making it look as if they have simply added a comment before the admin check, and that the check still works. The code reviewer sees this:

/* begin admins only */ if (isAdmin) {

This is a simple example to illustrate the point, but it's not difficult to imagine that an adversary with time and money could come up with attacks that are far more subtle and much harder to spot.

Of course the attacks only work if attackers have access to the source code, but that doesn't present the barrier you might expect. Modern software projects are often complex jigsaw puzzles composed of other, smaller projects in absurdly convoluted supply chains (although "supply webs" might be a more accurate description). Those supply webs invariably include some open source code, somewhere, and open source projects often allow anyone to make a contribution to their code, provided it gets past the watchful eye of a human reviewer.

Tinfoil hat time?

With so much software potentially at risk from Trojan Source, you might be tempted to throw your computers in the river, hide in the cellar, and put on your tinfoil headgear, but don't.

For a level-headed perspective, I spoke to Malwarebytes' security researcher and Director of Mac and Mobile, Thomas Reed. Reed's perspective is, yes, it's a supply-chain threat, but the problem isn't this specific vulnerability so much as the fragility of the supply chain itself.

"The biggest danger from my perspective is usage in open-source projects that are used by commercial software, which I imagine isn't all that unique a perspective. The danger is there, though, with or without Trojan Source, because a lot of open source projects aren't getting any kind of in depth source reviews."

This isn't the first research to find a vulnerability that could affect basically everything. In fact, they're surprisingly common. You may remember KRACK, the 2017 research that revealed that our Wi-Fi security was broken, everywhere. Or the Spectre and Meltdown vulnerabilities from a year later that affected generations of hard-to-patch and hard-to-replace processor chips. And what did we do? We patched and moved on, just like we always do.

The good news is that the work of fixing Trojan Source has already started, with an extensive process of vulnerability disclosure that began in July, when the researchers contacted nineteen separate organizations about their findings. They have since contacted more organizations, including the CERT Coordination Center, and been issued a pair of CVEs, CVE-2021-42574 and CVE-2021-42694.

There are plenty of choke points where Trojan Source attacks might be picked up, such as public code repositories like GitHub, code editors and Integrated Development Environments, static analysis tools, and the compilers themselves. A lot of code will have to pass through several of these choke points before going live, so we should soon have plenty of defence in depth.

And spotting or stopping the attacks should be fairly easy, now we know to look for them. The researchers suggest several methods, starting with the simplest: banning the use of bidirectional control characters in comments entirely. Where it's humans rather than computers that are reading the code, text editors could add a visual marker to control characters, just as word processors can be made to show paragraph marks and other invisible characters.
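Both ideas fit in a few lines of Python. What follows is a rough sketch of my own, not the researchers' tooling: find_bidi_controls reports where bidirectional control characters appear, and reveal swaps each one for a visible marker. The function names and the angle-bracket marker style are hypothetical. For simplicity it scans the whole file rather than just the comments, and it covers the nine directional formatting characters (embeddings, overrides, isolates, and their terminators).

```python
# The nine directional formatting characters and their standard abbreviations
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source):
    """Return (line, column, abbreviation) for every bidi control character."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits

def reveal(source):
    """Replace each invisible bidi control with a visible marker like <RLO>."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
        for ch in source
    )

sample = "ok = True\n/* \u202e comment */\n"

print(find_bidi_controls(sample))  # [(2, 4, 'RLO')]
print(reveal(sample))
```

A repository or build pipeline could run a check like the first function and refuse any change that produces hits, while an editor could use something like the second to show reviewers what is really in the file.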

If you want to know more about the research, check out the research paper Trojan Source: Invisible Vulnerabilities, by Nicholas Boucher and Ross Anderson.