Detecting malware capabilities in
superblocks using static binary analysis
University of Turku
Department of Computing
Master of Science (Tech) Thesis
Information and Communication Technology
June 2025
Yacine Elhamer
Supervisors:
Antti Hakkala
Tahir Mohammad
The originality of this thesis has been checked in accordance with the University of Turku quality assurance
system using the Turnitin OriginalityCheck service.
UNIVERSITY OF TURKU
Department of Computing
Yacine Elhamer: Detecting malware capabilities in superblocks using static bi-
nary analysis
Master of Science (Tech) Thesis, 69 p.
Information and Communication Technology
June 2025
Malware, or malicious software, is a central piece to the success of most cyber attacks
since often times it is the element that is programmed and used to carry out the
objectives of the threat actors once they gain access to a system. This has made
it vital for cybersecurity analysts to study malware and accurately categorize it, so
that they remain a step ahead of the attackers, and are able to respond quickly
in the case of time-sensitive cyber breach situations. The large volume of malware
that has created a necessity for tools that automate the process of analyzing new
malware, and especially tools aimed at extracting the behavior and capabilities of
the encountered malware. One of the most widely used tools for this purpose is the
capa tool by Mandiant’s FLARE team.
This thesis aims to improve FLARE’s capa malware capability extraction tool by
increasing the accuracy of its capability extractions and categorization. Currently,
it extract features from instructions, basic blocks, and functions, and then matches
them with the rules that apply to each individual scope. This extraction method
presents an issue of false negatives if a rule is written to be applied on basic blocks,
but the features to-be-matched end up being split into two or more basic blocks by
the compiler (in the case of the malware author adding error-checking code between
two related malicious API calls for instance); as well as potential false positives
if the previous rule’s matching scope is set to function instead (to circumvent the
stated limitation), which might lead to completely unrelated features being paired up
together in the case of large functions. The identified limitation is relevant because
usage of if-else statements for error checking are common in software in general and
malware specifically, and those are a major reason why semantically-related code
would end up being segmented into multiple basic blocks.
The work adds a new "superblock" matching scope to capa that makes it possible to
extract malware capabilities whose constituent features are spread across a number
of basic blocks in direct sequence, therefore reducing the possibility of false positives
and false negatives when matching some rules, as previously described.
Using the newly-introduced scope, we were able to rewrite an existing capa rule
intended for detecting RC4 encryption logic, and have it match the relevant logic
on the superblock scope as opposed to the old function scope. By doing so, we
were able to retain the tool’s true positive rate, while eliminating some observed
false positive cases such as the wmemcmp() buffer comparison method from the C
runtime being incorrectly identified as performing RC4 encryption.
Keywords: malware analysis, static binary analysis, security automation, capa
Contents
1 Introduction 1
1.1 Cyber Kill Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Malware and Malware Analysis . . . . . . . . . . . . . . . . . . . . . 3
1.3 Automating Malware Analysis . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Malware Capability Extraction Tools 8
2.1 YARA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Capa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Call Signatures Plugin (CSP) . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Superblock Static Binary Analysis 20
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 The Utility of Previous Research . . . . . . . . . . . . . . . . . . . . 22
4 Target System: Capa 24
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
i
4.2 Feature Extractors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Capa’s Rules Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Matching Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Results Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.1 Display Modes . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Orchestration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Design and Specifications 39
5.1 Design Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Superblocks must not contain irrelevant basic blocks . . . . . 40
5.1.2 Superblocks must contain only one execution path . . . . . . . 40
5.1.3 Superblocks must not contain loops . . . . . . . . . . . . . . . 41
5.1.4 Superblocks can be composed of just one basic block . . . . . 41
5.1.5 Superblock matches cannot be nested . . . . . . . . . . . . . . 41
5.2 Design Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 General approach . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.2 Generate the current function’s flow graph . . . . . . . . . . . 43
5.2.3 Blind matching on the entire set of rules . . . . . . . . . . . . 46
5.2.4 Retrieve the basic blocks used for each match . . . . . . . . . 46
5.2.5 Filter the results . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Integrating With Capa . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1 Rules Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.2 Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 Capabilities Module . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.4 Results Rendering . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Implementation 52
6.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
ii
6.2 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2.1 Verification sample . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2.2 Test rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3.1 Test case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3.2 Updated rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.3 Results and Comparison . . . . . . . . . . . . . . . . . . . . . 60
6.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7 Conclusion 67
7.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
References 70
iii
List of Figures
3.1 Literature Review Academic Journals Research Results . . . . . . . . 21
4.1 default display mode for capa . . . . . . . . . . . . . . . . . . . . . . 34
4.2 verbose display mode for capa . . . . . . . . . . . . . . . . . . . . . . 35
4.3 vverbose display mode for capa . . . . . . . . . . . . . . . . . . . . . 36
6.1 Disassembly View of the Target Match Superblock . . . . . . . . . . . 54
6.2 Capa Execution Result (Simple Output Mode) . . . . . . . . . . . . . 56
6.3 Capa Execution Result (Verbose Output Mode) . . . . . . . . . . . . 56
6.4 Capa Execution Result (Very Verbose Output Mode) . . . . . . . . . 57
6.5 Disassembly of the Match Location in the Test Sample . . . . . . . . 58
6.6 Capa Analysis Results for Sample TP1 . . . . . . . . . . . . . . . . . 61
6.7 Capa Analysis of Sample TP2 . . . . . . . . . . . . . . . . . . . . . . 62
6.8 Capa Analysis of Sample TP3 . . . . . . . . . . . . . . . . . . . . . . 62
6.9 Capa Analysis of Sample FP1 . . . . . . . . . . . . . . . . . . . . . . 63
6.10 Verbose Output for The Capa Analysis of Sample FP1 . . . . . . . . 64
6.11 IDA Disassembly of The wmemcmp() Standard C Function (in FP1) 65
iv
List of acronyms
API Application Programming Interface
ATT&CK Adversarial Tactics, Techniques, and Common Knowledge
C2 Command & Control
CIL Common Intermediate Language
CKC Cyber Kill Chain
CPU Central Processing Unit
CSP Call Signatures Plugin
ELF Executable and Linkable format
F:L:I:R:T Fast Library Identification and Recognition Technology
FP False Positive
GUI Graphical User Interface
IC3 Internet Crime Complaint Center
IDA Interactive Disassembler
JSON JavaScript Object Notation
Mach O Mach Object
v
MBC Malware Behavior Catalog
MD5 Message Digest 5
PE Portable Executable
PRGA Pseudo-Random Generation Algorithm
RC4 Rivest Cipher 4
SBox Substitution Box
TP True Positive
TTP Techniques, Tactics, and Procedures
Y AML Yet Another Markup Language
Y ARA Yet Another Recursive Accronym
vi
1 Introduction
The advent of computers and complex communication networks, such as the In-
ternet, has caused a large shift toward digitalization by people, organizations, and
governments alike in different sectors. This shift is largely influenced by the increased
efficiency and effectiveness that digital systems provide as opposed to purely man-
ual ones. Unfortunately, however, these benefits come at the cost of exposing more
potentially vulnerable aspects of systems to criminals, who were also among the
groups that adopted the cyberspace as a stage of operation.
According to the 2023 Internet Crime Report published by the Internet Crime
Complaint Center (IC3) [1], which is a division of the Federal Bureau of Investigation
of the United States, exactly 880,418 Internet-related complaints were received from
US citizens and nationals of other countries, with more than 12.5 billion dollars lost
to criminals in that year alone. This figure is 80,000 higher than the previous year
and continues a general rising trend for the frequency of cyber crime incidents.
As a consequence of the increased risk of falling victim to a cyber attack, organi-
zations and individuals alike increased their investments in security countermeasures
and products, with the aim of minimizing the threats facing their now-digital sys-
tems. This has caused many frameworks and models to emerge that attempt to help
defenders stay ahead of attackers. One of these frameworks that will prove helpful
in determining a key recurring aspect of cyber attacks (and the topic of this thesis)
is Lockheed Martin’s Cyber Kill Chain (CKC).
1.1 CYBER KILL CHAIN 2
1.1 Cyber Kill Chain
The Cyber Kill Chain is a framework developed by the American defense and
aerospace manufacturing company Lockheed Martin, and its goal is to assist de-
fenders in identifying and preventing digital compromise [2]. The model is made
up of seven essential steps that Lockheed Martin asserts must all be completed in
sequence by an attacker in order for them to fulfill their objectives. These steps are
as follows:
1. Reconnaissance: The threat actors conduct research to identify their targets
as well as potential points of interest.
2. Weaponization: The threat actors combine first stage malware, exploits, as
well as potential decoy files (such as Microsoft Office documents in the case of
Phishing attacks) into a deliverable payload.
3. Delivery: The attackers deliver the previously crafted payload to the target
using physical methods (such as malicious USB sticks) or digital methods (such
as phishing emails).
4. Exploitation: At this stage, attackers must exploit a component of the sys-
tem to gain access. This component could either be human or digital (such as
a software vulnerability or misconfiguration).
5. Installation: The attackers installs malware on the target, and then attempts
to establish persistence, backdoors, as well as attempting to hide their traces
by wiping out the logs or changing the creation dates for files that they created
(also known as a "timestomping attack").
6. Command & Control (C2): The malware that was installed in the previous
stage connects back to the attacker’s command and control server in order to
exfiltrate files or receive further commands.
1.2 MALWARE AND MALWARE ANALYSIS 3
7. Actions on Objectives: Once the attackers achieve direct access to the
target’s system, they then proceed to carry out their objectives
By examining the kill chain detailed above, we can identify and extract elements
that are shared by all cyber attacks. Then, by focusing on the least varying of these
elements we can devise strategies and create security controls that could then be
shared to and implemented by different organizations without requiring any major
tweaking or modification these controls. One of the best candidates for such an
element is malware.
1.2 Malware and Malware Analysis
Malware, which is an abbreviated term for Malicious Software, can be defined as any
software that performs malicious actions on a user, computer, or network, including
viruses, Trojan horses, worms, rootkits, ransomware, spyware, among others [3].
Typically, malware gets shared and reused a lot among threat actors, who often
modify it slightly to fit the needs of the ongoing attack, and then deploy it. Malware
samples with similar origins and traits are classified into families, with examples of
common malware families including: Cobalt Strike, LockBit, and Metasploit [4].
According to the document published by Lockheed Martin titled "Gaining The
Advantage: Applying Cyber Kill Chain Methodology to Network Defense" [2], which
is aimed at defenders aiming to thwart cyber attacks, it is sufficient to prevent
only one step of the kill chain’s seven steps in order to stop the entire attack from
reaching its objects. Additionally, the guide highlights malware analysis as a key
task in order to detect and prevent the second stage (weaponization) and fifth stage
(installation) of an attack. This is because effective analysis of the malware used
can help defenders create effective mitigations for future encounters with the same
novel piece of malware, as well as gain an idea about the actors behind the attack
1.3 AUTOMATING MALWARE ANALYSIS 4
and what their motives or attribution might be. However, malware analysis can be
a difficult and complex task as we will see later.
1.3 Automating Malware Analysis
According to the 2024 M-Trends report published by Mandiant [4], which is an
American cybersecurity firm and a subsidiary of Google, the company was able to
track 626 new malware families in that year alone, with 128 of them having been
observed by the firm’s incident response team during engagements with clients. This
figure represents an increase from 588 newly-tracked malware families in the year
prior, and continues a general upwards trend with this regard. As a consequence,
the rising number of unique malware families in the wild, as well as the large number
of samples that can result from each individual family, conveys a need for developing
and improving tools for automating and making the malware analysis process less
time consuming and more straightforward to analysts in order to keep up with threat
actors. One category of malware analysis automation tools that is seeing a rise in
usage and efficiency is malware capability extractors such as capa [5] and Forecast
[6].
Malware capability extraction tools typically operate on a simple shared prin-
ciple: the tool receives a malicious sample either in binary form or as the output
of a sandbox run, and tries to look for malware capabilities that are described by
specific rules that are either written out (in the case of capa for instance) or learned
(in the case of AI-based tools). One of the most widely adopted tools among mal-
ware analysts is capa, which was created by Mandiant’s FLARE team, and has the
ability to extract varying types of binary capabilities and alert the analyst to their
location in the executable or library.
The way capa operates is that it takes in a binary and then extracts all of its
constituent functions using a static analysis backend such as Mandiant FLARE’s
1.4 RESEARCH QUESTIONS 5
vivisect framework, then, extract all of the basic blocks that make up each function,
and finally, the instructions that make up each basic block. After that, the tool relies
on a set of rules crafted by other security researchers that define malware capabilities
in terms of smaller constituent instruction, basic block, or function features. Each
rule has a scope at which it is meant to be applied, and at the matching stage
capa reads each rule matching conditions as well as that rule’s specified scope (i.e.
function, basic block, or instruction), and then try to see if any of the binary’s
components of that scope’s type satisfy the listed conditions. If there is a match,
then the malware capability is deemed as present in the binary.
1.4 Research Questions
Although capa’s current architecture functions well for extracting some kinds of mal-
ware capabilities, a room for improvement has been identified in order to minimize
some of the tool’s false positives and false negatives. This is due to a present flaw
in the way capa currently performs capability matching, which is that rules made
to match at the basic block scope might not trigger since a control flow statement
might break what would have been a single basic block into two or more blocks (such
as in the case of if-else statements being used for error checking) thereby triggering
a false negative, and also because rules made to circumvent this issue by matching
at the function scope instead could trigger false positives in cases where a function
could be too large and thus unrelated binary artifacts (instructions, API calls, etc.)
could be falsely correlated together.
As a result, we identify the following research questions for our thesis:
1. RQ1: What are the most adopted tools for extracting malware capabilities?
And does any of them solve the accuracy issue that capa suffers from?
2. RQ2: Can adding a new analysis scope that provides analysts with the ability
1.5 RESEARCH OBJECTIVES 6
to match malware capability logic at the level of basic blocks in close sequence
improve the accuracy of capa’s results? And will it increase the expressibility
of capa rules?
3. RQ3: Are there any existing capa rules whose associated accuracy could be
immediately improved by changing their scope from function or basic block to
superblock?
1.5 Research Objectives
This thesis enhances capa’s current capability extraction logic by introducing a new
rule-matching scope named the "superblock" scope, which sits between the function
and the basic block scope. This scope correlates basic blocks in succession with
features that could result in a rule matches, thereby helping to reduce the rate of
false negatives and false positives in the tool’s output that are due to error checking
logic or other regular control-flow constructs common among malicious and non-
malicious software developers.
Our main objectives for this work are:
1. RO1: Examine the main malware capability extraction tools, and whether any
of them solves the identified issue that capa suffers from.
2. RO2: Design and implement a superblock scope for the capa tool, and showcase
how it increases the expressibility of malware capabilities for security analysts
authoring capa rules.
3. RO3: Find an example of an existing capa rule with the function or basic block
scope, and show how its accuracy can be improved by using the newly-introduced
superblock scope.
1.6 THESIS STRUCTURE 7
Obfuscated malware samples are out of the scope of the thesis, since static bi-
nary analysis is usually not effective on them due to the obscured logic that is
typically harder to decipher using static analysis tools, with sandboxes or manual
deobfuscation being used first instead. Additionally, most of the malware capabil-
ity extraction tools we examined were not able to handle such samples, with the
capa tool for instance having specific hard-coded logic in it that determines at first
whether a sample is packed or obfuscated at first, and halts the analysis entirely if
that is the case.
1.6 Thesis Structure
The rest of the thesis is organized as follows: Chapter 2 examines some common
malware capability extraction tools, the advantages and disadvantages of each one,
and why capa was picked for this thesis project. Chapter 3 is a semi-structured
literary review of static binary analysis of superblocks. Chapter 4 gives some back-
ground into the architecture of the target system which will be improved in this
thesis, which is the capa tool. Chapter 5 details the conceptual design and speci-
fications for the new scope that will be contributed to capa. Chapter 6 details the
implementation of the scope into the tool, as well as the verification and analysis
of results. Chapter 7 is the conclusion to the thesis, which recaps the contributions
made as well as possible future work.
2 Malware Capability Extraction
Tools
In this chapter, we will go over a set of existing tools and frameworks that are
commonly used by analysts for malware capability extraction, while mentioning the
advantages and disadvantages of each tool, and motivating in the process our choice
for the target system that we will be integrating our contribution in this thesis into.
We will begin by first describing the YARA rule matcher in the first section,
while providing example rules as well as highlighting the tool’s advantaged and
disadvantages. Then, we will detail the capa malware extraction capability and
what its rules look like in the second section, as well as the tool’s advantages and
disadvantages. After that, we will go through the Call Signatures Plugin (CSP)
tool and how it works, as well as its advantages and disadvantages. Finally, we will
briefly compare the different tools and motivate our choice for the target system of
this thesis in the fourth section of this chapter.
2.1 YARA
YARA (Yet Another Recursive Acronym) is a signature-based malware detection
and classification tool developed by VirusTotal [7]. Its operating principle is that
malware analysts write YARA rules describing different signatures for previously
analyzed malware, then, YARA takes those rules and scans a set of file samples for
2.1 YARA 9
potential signature matches.
Unlike simple byte and sequence matching tools (such as Linux’s "grep" utility,
for example), YARA offers analysts greater flexibility by supporting boolean logic
expressions for defining signatures instead of just simple constant byte sequences.
Additionally, YARA also gives analysts the ability to extend its functionality by
integrating it with custom written modules, with YARA already having built-in
support for a handful of modules offering functionalities such as hash functions,
mathematical functions, and file-format parsing functions, among others [8].
YARA rules are composed of 3 main sections:
• meta: contains author-defined key and value pairs that denote metadata
about the rule. Examples of such metadata could be: the author’s name,
threat classification, malware family’s first appearance date.
• strings: contains author-defined strings and byte sequences that are to be
referenced later by the "condition" section described below. This section can
be viewed as the "variables declaration section" for YARA rules.
• condition: contains the boolean algebra expression that denotes the condition
that must be met by the input file in order for the rule to be matched.
An example YARA rule that showcases its syntax can be seen in Listing 1.
2.1 YARA 10
Listing 1 Example YARA rule for deteciting the Zeus malware family
import "pe"
rule Windows_Malware_Zeus : Zeus_1134
{
meta:
author = "Xylitol xylitol@malwareint.com"
date = "2014-03-03"
description = "Match first two bytes, protocol and string present
in Zeus 1.1.3.4"
reference = "http://www.xylibox.com/2014/03/zeus-1134.html"
strings:
$mz = {4D 5A}
$protocol1 = "X_ID: "
$protocol2 = "X_OS: "
$protocol3 = "X_BV: "
$stringR1 = "InitializeSecurityDescriptor"
$stringR2 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1)"
condition:
($mz at 0 and all of ($protocol*) and ($stringR1 or $stringR2))
}
The rule shown in Listing 1 was taken from the Yara-Rules project, which is
a GitHub repository that contains numerous YARA rules under the GNU-GPLv2
license for malware analysts to share and contribute to [9].
Although YARA was initially developed to help identify variants of previously
analyzed malware and malware families, the general design of the tool also allows
one to extract and identify previously observed malware tactics, techniques, and
procedures (TTP). This is particularly useful for detecting novel malware that has
2.1 YARA 11
not been previously encountered but employs capabilities that are known to be
common in malware.
For example, cryptographic functions often rely on specific constants in their
functioning such as Rijndael’s SBox [10], MD5’s initial state [11], and Blowfish’s
usage of the binary representation of    3 to initialize its key schedule [12]. By
treating these constants as signatures, malware authors can detect the usage of such
cryptographic functions in malware.
Additionally, many other malware techniques rely on constants that could be
used as detection signatures. For example, presence of the following registry path
is indicative of the binary potentially employing registry-based persistence, and can
therefore be used as a YARA signature [13]:
"HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Run".
Another avenue for extracting capabilities from malware with YARA is convert-
ing the CPU instructions composing a capability into their binary representation,
and then setting that representation as our signature. For instance, the YARA rule
depicted in Listing 2 does just that in order to detect usage of the "Heaven’s Gate"
technique, which allows malware authors to run 64-bit code inside 32-bit processes,
thereby evading some security monitoring software [14].
2.1 YARA 12
Listing 2 Example YARA rule for detecting Heaven’s Gate
rule HeavensGate
{
meta:
author = "Andrew Williams (@recvfrom) @ Cisco Talos"
description = "Look for instructions associated with Heaven's Gate"
date = "2020-10-27"
strings:
/*
* 6A 33 push 33h ; '3'
* E8 00 00 00 00 call $+5
* 83 04 24 05 add dword ptr [esp+0], 5
* CB retf
*/
$retf_to_64bit_mode = {6a33e80000000083042405cb}
/*
* E8 00 00 00 00 call $+5
* C7 44 24 04 23 00 00 00 mov dword ptr [rsp+4], 23h ; '#'
* 83 04 24 0D add dword ptr [rsp], 0Dh
* CB retf
*/
$retf_to_32bit_mode = {e800000000c7442404230000008304240dcb}
condition:
(uint16(0) == 0x5A4D and uint32(uint32(0x3C)) == 0x00004550) and
$retf_to_64bit_mode or $retf_to_32bit_mode
}
2.1 YARA 13
While YARA’s rudimentary nature brings forth many benefits to malware an-
alysts such as the speed of execution and signature matching, its lack of support
for many higher-level constructs and abstractions such as functions, basic blocks,
and instructions makes it harder for analysts to write and maintain rules for specific
malware capabilities.
If a rule author wants to write a YARA rule describing a specific binary capa-
bility, they would need to lookup the individual opcodes for each CPU instruction
that makes up such a capability, and then construct their YARA rule with that.
Additionally, they would need to also be wary of the difference in opcodes between
x86 and x64 architectures, as well as the fact that the same capability could have
slightly different machine code depending on the compiler used. Furthermore, rule
authors would also need to make sure that their rules are being matched in the right
section of the binary (i.e., the .text section), and not in the data or overlay section
of the binary; failure to do so could result in false positives if the rule’s signature
bytes happen to occur inside an image stored in the executable’s resources section.
The conditions mentioned above can theoretically be handled by malware ana-
lysts, but in reality however, they would end up consuming a lot of the analysts’
time. Perhaps a fitting comparison might be that YARA to malware capability ex-
traction is like assembly to computer programming, with the former of the two being
somewhat more dire and demanding given the frequency at which new malware is
encountered by analysts [4].
2.2 CAPA 14
2.2 Capa
capa is a more advanced malware capability extraction tool and framework devel-
oped by the FLARE team from the Mandiant cybersecurity firm (now a subsidiary
of Google).
The tool’s general idea is similar to that of YARA in the sense that analyst-
written rules are used to find matches in input samples. However, unlike YARA
which was made with the goal of detecting previously-seen samples and malware
families, capa’s goal is to detect previously-seen malware capabilities and TTPs.
This makes capa’s rules more general and more useful to malware analysts given
that old and novel malware often times reuses the same TTPs and capabilities,
making detection using capa much more effective than YARA.
What distinguishes capa from YARA and makes it more suitable for the task of
capability extraction is that it offers malware analysts higher abstraction layers that
they could reason at, such as mnemonic instructions, basic blocks, and functions.
As a result, instead of analysts concerning themselves with the mnemonic-to-opcode
translations of different CPU instructions, functions-extraction from executables, or
file format parsing, they would focus solely on the overall picture and leave all of
those details to be handled by capa under the hood.
Additionally, it should be noted as well that capa supports all of the features
offered by YARA such as bytes and string matching, at different abstraction levels
(function, basic block, and instruction). However, this comes at a reduced execution
time since capa and its dependencies are written purely in Python, while YARA is
written in C.
An example capa rule for detecting memory region allocation or modification
can be seen in Listing 3. The rule was taken from the official capa rules GitHub
repository [15].
2.2 CAPA 15
Listing 3 Example capa rule for detecting RW memory allocation or changing
rule:
meta:
name: allocate or change RW memory
authors:
- 0x534a@mailbox.org
- "@mr-tz"
lib: true
scopes:
static: basic block
dynamic: call
mbc:
- Memory::Allocate Memory [C0007]
examples:
- Practical Malware Analysis Lab 17-02.dll_:0x1000D10D
features:
- and:
- or:
- match: allocate memory
- match: change memory protection
- or:
- number: 0x4 = PAGE_READWRITE
# lea r9d, [rcx+4] ; flProtect
# call cs:VirtualAlloc
- instruction:
- mnemonic: lea
- offset: 0x4 = PAGE_READWRITE
2.2 CAPA 16
As can be seen in Listing 3, capa rules are based on the YAML (Yet Another
Markup Language) data serialization language [16], and they are made of two main
sections: meta and features.
The meta section defines metadata for the rule such as its name, its static and
dynamic matching scopes, as well as other fields such as the rule’s authors and
examples from the capa-testfiles GitHub repository [17] where the capability can be
observed.
As for the features section, it contains a boolean algebra expression that is used
by capa for matching the rule on sample files. Non-leaf nodes represent boolean
algebra and capa-specific operators such as subscope specifiers (like "instruction" in
this example), while leaf nodes specify the features to be matched.
When capa is executed with that rule and a sample malware file passed onto
it, the tool first performs a deep analysis of the malware and extracts all of its
constituting functions, as well as each function’s constituting basic blocks, and the
instructions that make up each basic block. Then, capa parses that rule and gener-
ates an internal binary expression tree for it. Finally, capa’s matching engine keeps
only the leaf nodes whose representative feature was found in the malware sample,
and does not consider any leaf nodes that were not found in the malware sample;
then, the binary expression tree is evaluated, and if is found to be True, then the
capability is marked as "found" and is returned to the user, otherwise it is not.
One important nuance that should be mentioned is the subscope specifiers (such
as "instruction" in the previously mentioned rule). These are special capa operators
that mean that all of its child nodes must be in the same logical unit of the specified
subscope. In the previously mentioned rule for example, usage of the "instruction"
operator means that the "lea" CPU instruction and the presence of the "0x4" off-
set must be within the same operator (i.e., "lea" is the mnemonic and "0x4" is
the instruction operand). Other subscope operators include the "basic block" and
2.3 CALL SIGNATURES PLUGIN (CSP) 17
"function" operators, and similar logic applies to them. The general scope of the
rule (specified in the meta section) specifies the general scope of the rule.
The advantages of capa organizing rule writing and matching into scopes (i.e.,
instruction, basic block, and function) makes it easier for malware analysts to write
and maintain rules for describing capabilities, it introduces several disadvantages
that have to do with that particular scoping. For example, if a rule is set to match
at the function scope and the malware sample contains a large enough function in it,
then capa could end up correlating features (i.e., leaf nodes) that are not necessarily
related to each other, thereby resulting in a false negative. Furthermore, if rule
authors try to circumvent this issue by attempting to match at the narrower basic
block scope instead, then capa could end up overlooking features (i.e., leaf nodes)
that are related to each other but separated into different basic blocks due to a jump
instruction caused by some error checking code for instance.
2.3 Call Signatures Plugin (CSP)
This tool was proposed by Joren Vrancken as part of their Master’s thesis titled
"Detecting Capabilities in Malware Binaries by Searching for Function Calls" [18].
Its works by using Hex-Rays’ Interactive Disassembler (IDA) as a binary analysis
backend that provides the needed information about the analyzed binary, which is
then used by the tool alongside a pre-written capability-describing set of rules to
determine what capabilities a malware sample has.
First, and once the binary has been loaded and analyzed in IDA, the CSP tool
(which is implemented as an IDA plugin) uses IDA to get a list of all functions present
in the binary using IDA’s Python API. Then, the tool uses IDA’s Fast Library
Identification and Recognition Technology (F.L.I.R.T) to determine which functions
are library functions and which ones were authored by the malware sample’s author.
The internals of discovered library functions are not analyzed.
2.4 DISCUSSION 18
After the malware-specific functions have been determined, CSP gets a list of
all of the function calls being made by each malware-specific function. This is done
with the help of IDA’s built-in commercial decompiler, which traces back in the code
before each "call" instruction to determine its arguments according the determined
calling convention.
Finally, and once CSP has obtained the list of function calls (along with their
arguments) using IDA’s decompiler, and after parsing the rules that were passed onto
it, the tool proceeds to analyze the IDA-extraction function calls and arguments to
determine if there exists any matches with the pre-written rules. If a match is found,
then it is reported to the analyst.
While the tool has some disadvantages relative to capa such as its lack of support
for instruction mnemonic and bytes matching, it still manages to outperform capa
in certain areas. Namely, in the identified weakness discussed in the previous section
related to capa. CSP had a slightly higher detection rate using a dataset of malware
samples compiled by the tool’s author, as well as beating some of the cases where
capa had false positives due to function-scope capa rules incorrectly correlating
irrelevant features together, which is an issue that has already been identified in the
previous section.
2.4 Discussion
In summary, we have showcased three state-of-the-art static malware capability
extraction tools, as well as the advantages and disadvantages of each one of them.
From the review, it can be seen that capa is the most advanced and sophisticated
of the bunch for this task, given its support for both the same low primitives sup-
ported by YARA, as well as its rich and existing code infrastructure which allows
for high level reasoning and abstractions, thereby making the tool more extensible.
Additionally, we also showcased the weaknesses of capa that had been discovered
2.4 DISCUSSION 19
by contemporary research, as well as their relevance to the thesis’ topic.
As a result, capa will be the tool that will be utilized in this thesis for imple-
menting superblocks malware capability extraction.
3 Superblock Static Binary Analysis
In this chapter, we aim to perform a semi-structured literary review for the research
conducted into superblock formation. Specifically, in the context of static binary
analysis and malware analysis. We also speculate on how the analyzed results could
help us in answering the research problem of this thesis.
We will begin the chapter by describing the methodology we followed and the
results we obtained in the first section. Then, we will analyze the given results and
provide an overall view of the related literature in the following section. Finally, we
will theorize on how previous work could help guide our contribution in this thesis
in the final section of this chapter.
3.1 Methodology
In order to find academic work that is relevant to the topic of superblocks formation
and malware capability extraction, we perform a search on 5 major computer science
academic journals with the general term: "superblock". When applicable, we use
the journal’s own field filtering options to select only topics related to Computer
Science.
Results can be seen in Figure 3.1.
3.2 ANALYSIS 21
Figure 3.1: Literature Review Academic Journals Research Results
We also note that we tried using other search terms in the aforementioned jour-
nals, such as "superblock formation" and "superblock static binary analysis", but
the general term "superblock" seemed to provide the most results given its general
nature.
3.2 Analysis
Given the narrow scope of the topic at hand, it was expected that academic litera-
ture on it would be scarce, especially considering that most malware analysis field
advancements tend to happen in the commercial sector. This was indeed the case
with our search, and no articles were found that were directly related to the topic
of the thesis.
The majority of the articles encountered throughout our search were coming
from the compiler optimization field. The scope and goal of those articles was
3.3 THE UTILITY OF PREVIOUS RESEARCH 22
the development of heuristic or probabilistic models that could predict with decent
certainty the likelihood of different execution paths being taken during the program’s
execution.
The utility of such models is that they could be incorporated into different com-
pilers that could utilize them to determine the paths that a program is most likely to
take, and therefore generate machine code in such a way that those specific execution
paths would be given priority.
Another observed use case for the papers we went through was in the field of
CPU caching and optimization. Since accurate prediction of which chain of basic
blocks (i.e., a superblock) are likely to be taken at a specific point of the program’s
execution can help the CPU cache the right code segments, and therefore speed up
program’s execution by preventing cache misses.
3.3 The Utility of Previous Research
While the encountered research does not prove directly useful to our research prob-
lem, several portions of it can be utilized and incorporated into our contribution in
a constructive manner.
One instance for this can be extracted from the research paper published in 1993
by Hank et al. titled "Superblock Formation Using Static Program Analysis" [19],
wherein the authors define many foundational terms and concepts that will prove
useful to our thesis.
For example, the authors of that paper define a superblock as a "block of in-
structions in which control may only enter at the top but may leave at one or more
exit points", and that "when the execution stream enters a superblock, it is likely
that all basic blocks in that superblock will be executed". This definition seems to
be agreed upon in other literature that followed this paper (albeit in CPU cache
algorithms and compiler optimization), and provides us with legitimacy for naming
3.3 THE UTILITY OF PREVIOUS RESEARCH 23
the scope we propose in this thesis the "superblock" scope.
Additionally, the definition given above —in addition to illustrations from that
paper— also provide us with guidelines for how superblocks should be formed. For
instance, if at a certain point of the superblock’s execution it loops back to itself
(i.e., a basic block in that superblock), then that superblock should ideally be di-
vided into two superblocks that are split at the basic block which was branched
into. However, this suggestion might not be very fruitful in our specific application
(malware analysis), and we might choose to relax this assumption a bit).
Additionally, the superblock formation strategies illustrated by diagrams in the
aforementioned paper also suggest that when a basic block branches into two other
basic blocks that are not part of the current superblock, then if we wish to extend
the current superblock to include those two newly-encountered basic blocks, then
we must split the current superblock into two superblocks, one for each branch.
4 Target System: Capa
The contribution we make in this thesis to capa spans a large set of the tool’s
components, such as its rules’ parsing and matching logic, its static binary analysis
process, and the organization and display of discovered malware capabilities. As a
result, it is vital that we provide an adequate description of the components and
functioning of the target system in order to better describe the contribution we make
to the tool.
In this chapter, we will describe the general architecture of capa and its opera-
tional model. Due to the large size of the tool, We will not give extensive details
about elements of the tool that are not relevant to the implementation of our con-
tribution in this thesis.
4.1 Overview
Capa takes in as input analyst-written rules and attempts to match them across
the sample provided to it by the user. In order to do this, it relies on three ma-
jor components: a rules parser, a binary analyzer (named internally as a Feature
Extractor), and a logic matching engine that tries to see which rules apply on the
output provided by the binary analyzer.
Additionally, normal execution of capa (as opposed to it being invoked as a
Python module) utilizes two additional modules: a "capabilities" module that is
responsible for orchestration between the aforementioned components, as well as a
4.2 FEATURE EXTRACTORS 25
results rendering component for displaying neat output to the user (including the
matched rules and where exactly in the binary did they match).
In the following subsections, we will provide the necessary details and structure
behind each of these components.
4.2 Feature Extractors
Capa utilizes several sample analysis tools as backends for extracting features and
artifacts from the provided samples. Each of the backends that capa supports are
wrapped internally by classes named "Feature Extractors", which abstract away the
usage details of the specific backend at hand, and provide a common programming
interface that makes it easier for capa developers to maintain the tool and add
support for other analysis backends as needed. Two types of feature extractors
exist, static, and dynamic.
Static feature extractors operate on the original provided binary sample, and
directly instrument the binary via static analysis in order to extract features and
artifacts from it such as functions, strings, and control flow information.
Dynamic feature extractors on the other hand operate on a sandbox analysis
report (in JSON or LOG format), and they extract artifacts and sample features
from the information contained in that sandbox analysis report.
Both static and dynamic feature extractors operate in terms of "scopes", which
can be defined as logical units or containers within which sample features are orga-
nized and grouped. These scopes are later used by the capa rule matching engine
to see which capabilities exist at each of these logical grouping units.
The static scopes are:
• instruction: this logical unit contains the features extracted at each instruc-
tion of the program, such as the instruction’s mnemonic and operands.
4.2 FEATURE EXTRACTORS 26
• basic block: this scope entails features identified across a group of successive
instructions with no branch-related instructions (such as jmp and ret) in it,
other than the final instruction of this block. Features contained in this scope
include all of the features extracted at the constituent instructions, in addi-
tion to basic block-exclusive features such as the detection of stack strings for
instance.
• function: this scope provides logical grouping for the sample’s methods. Capa
rules that have the function scope assigned to them are evaluated on the set of
features extracted from each function in the sample. This logical unit contains
all of the features identified at the constituent basic blocks (and therefore
instructions), as well as function-related features such as identified loops and
recursive calls.
• file: this scope contains all of the features identified across all of the func-
tions contained in the sample, in addition to information about the sample’s
sections, read-only data, and imports for instance.
• global: this logical grouping contains features that are not extracted from an
instruction of a basic block or a function per se, but are still made available and
shared across all of the static scopes. Examples of this are the "architecture"
and "os" features which are often extracted from the sample’s binary metadata,
and then made available at each scope of the sample to aide to with writing
rules’ logic.
Dynamic scopes on the other hand are:
• calls: this logical unit groups features identified at a single API call that was
logged by the sandbox. Examples of such features include function names and
the provided arguments.
4.2 FEATURE EXTRACTORS 27
• span-of-calls: made up of a constant number of calls that are in succession.
Evaluation at this scope is performed in a sliding-window fashion, and this
scope helps prevent false positives in cases where a thread might contain a
large volume or calls.
• thread: a logical unit for grouping features on a per-thread basis.
• process: a logical unit for grouping features on a per-process basis.
• file: similar to the static "file" scope.
• global: similar to the static "global" scope.
One downside for the usage of static backends as opposed to dynamic ones is
that they do not work well on packed or obfuscated samples, because they have no
meaningful way of obtaining the hidden contents of the binary at hand. However,
they are often preferred over dynamic extractors whenever the sample is unpacked,
because they give analysts the ability to navigate to the location of a capa rule match,
and then further analyze the function or basic block where the match occurred. In
this work, we will focus only on static feature extractors, with the dynamic ones
being out of the scope and relevance of this thesis’ contribution.
As previously mentioned, all static feature extractors implement a common in-
terface. The methods we are are most interested in for the scope of this thesis are
the following:
• get_functions(self): this object method retrieves all of non-library func-
tions identified in the sample.
• extract_function_features(self, fh): this method takes in as argument a
function handle fh, which contains an address (alongside possible metadata),
and returns all of the function-exclusive features identified in that function.
4.2 FEATURE EXTRACTORS 28
• get_basic_blocks(self, fh): this method takes in a function address (con-
tained in the variable fh of type "Function Handle"), and returns a list of all
the basic blocks extracted from that function.
• extract_basic_block_features(self, fh, bh): this method takes in a
function identifier, as well as a basic block identifier, and extracts all of the
basic block-exclusive features identified in that block.
• get_instructions(self, fh, bh): this method takes in a function address
(stored in the fh function handle) and a basic block address (stored in the bh
basic block handle), and returns all of the instructions that compose the basic
block identified by the fh and bh handles.
• extract_instruction_features(self, fh, bh, ih): this method takes in a
function handle, basic block handle, and an instruction handle, and returns a
list of all features identified in the instruction referenced by the input function,
basic block, and instruction handles.
These methods were listed because they are are the core of how capa extracts
features within their right scope from the provided binary sample. We will give
more details on these methods later in this chapter.
Regarding the static binary analysis frameworks that capa supports, we note the
following:
Vivisect: a free, open-source, and well-documented binary analysis framework
that supports multiple file formats such as Linux’s ELF, Windows’ PE, and MacOS’
Mach-O. The contribution made in this thesis will be based primarily on the interface
provided by this framework given its free-to-use nature.
dncil: this is an open-source tool for disassembling Microsoft .NET ’s Common
Intermediate Language (CIL) instructions. We will not be handling .NET binaries
4.3 CAPA’S RULES PARSER 29
in this thesis, since the basic block scope is not supported for them as of the writing
of this thesis.
Binary Ninja: an interactive disassembler, decompiler, and debugger developed
by Vector35. It has support for Python-based scripting as well as headless execution
(no GUI) which cape makes use of in order to extract features. However, the free
version of this software does not support come with access to the tool’s API, and
therefore no Python scripting ability.
4.3 Capa’s Rules Parser
After having discussed the portion of capa that handles instrumentation of the
malware sample, we now turn our attention to the component responsible for parsing
and processing the capa rules authored by security analysts.
Capa relies on the YAML human-readable data serialization language as a medium
for describing malware capabilities. This not only makes it easier for analysts who
are already familiar with YAML to write new rules, but also has the benefit of
making processing and storing rules during capa’s execution and easier task since
prebuilt libraries could be used in this matter.
Each capa rule is stored in a "Rule" object. These objects are initialized from
the parsed YAML code, and are checked for semantic and syntactic correctness upon
initialization. If a rule is determined to be invalid, capa raises an exception which
if left unhandled (in the case of command line execution for instance) then capa
simply displays the error and exists.
Internally, the logic of capa rules is represented using a tree data structure. The
leaf nodes of the tree are objects of the class "Feature", which hold the type of
the feature (API name, instruction mnemonic, extracted string, etc.) as well as
the feature’s value extracted from the binary; while the non-leaf nodes represent
the boolean logic operators (AND, OR, NOT, etc.) or subscopes (Basic Block,
4.3 CAPA’S RULES PARSER 30
instruction, etc.), and are stored in an object of the class "Statement" that contains
the operators name as well as a list of its children, be it other statements of features.
Additionally, each type of feature or statement implements an "evaluate()" method
that is used by capa’s rule matching engine to evaluate whether a node is true or
false in the context of the current scope unit (function, basic block, instruction,
etc.). We will cover this more in depth in the next subsection.
This rule’s logic tree, in addition to other details such as the rule’s metadata, is all
stored inside the "Rule" object which provides an interface for easy evaluation and
manipulation of capa rules by the developers. Some of the most relevant methods
that the Rule object provides are the following:
• from_dict(cls, d, definition): this method takes in a capa rule in the form
of a hashmap (Python dictionary), and uses the "build_statements()" helper
method to construct a logic statement tree while making sure the tree is both
syntactically and semantically correct. Then, the "from_dict()" method ini-
tializes a Rule object from that statement (in addition to the rule’s metadata)
and returns it to the caller.
• from_yaml_file(cls, path): this method reads a YAML file, parses it us-
ing a YAML parser, and passes the resulting dictionary to the previously-
mentioned "from_dict()" method and returns the resulting object.
• get_dependencies(self, namespaces): this method returns a list of names
of all of the rules that the current rule relies on for a successful match. All
that this method does is that it goes through the rule’s logic tree and searches
for features of type "match" and returns that rule’s name. This method will
be useful later on for ordering rules and rule evaluation in a way such that
dependency rules are always evaluated before the rules that depend on them.
• extract_subscope_rules(self): this method extracts subscope statements
4.4 MATCHING ENGINE 31
in a rule into independent rules with a unique name, that will be placed before
the current rule in the order of evaluation.
• evaluate(self, features): this method takes in a set of features and addresses
and recursively evaluates the boolean value of the current rule depending on
whether the input features satisfy the logic tree contained in the rule.
After all capa rules are converted from YAML format into a list of Rule objects,
that list is then used to construct another object of type "RuleSet" that provides an
interfaces of several methods that operate on the whole of the rules that it contains.
Examples of such methods include a "match()" method that takes in a set of features
and iterates through rules to see which ones match for that given set of features, as
well as getter methods for retrieving rules of a certain scope (i.e., only function or
basic block or instruction scope rules).
While the RuleSet object implements many methods vital to capa’s operation,
most of them are not relevant to the scope of this thesis (such as RuleSet optimization
methods), and will therefore not be covered in this work.
4.4 Matching Engine
Thus far, we have covered the capa component responsible for instrumenting the
input malware sample, as well as the component that responsible for handling the
capability-describing rules. Additionally, we have also noted that feature extractors
retrieve artifacts from the sample in the form of Feature objects of different types
(api name, instruction mnemonic, etc.), and that Rule objects hold representations
of the rule’s evaluation logic in the form of a binary tree whose leaf nodes are
Feature objects as well. Now, we will discuss the rule matching engine that sits
between these two components, and handles matching features retrieved from the
sample with features stored in a Rule object’s statement tree in order to generate a
4.4 MATCHING ENGINE 32
capa rule match.
Critical for the functioning of capa’s rule matching engine is two components, the
Statement class and its subclasses, as well as the match() method. The Statement
class has been briefly mentioned while describing the RuleSet component, and that
it is used for representing non-leaf nodes in a rule’s evaluation logic tree. More
specifically, each supported logic operator in capa is implemented in the form of a
subclass of the Statement class, with the evaluation logic (defined in the inherited
evaluate() method) being implemented differently depending on the operator being
represented.
For example, the operator "and" is represented by the class "And" which inherits
from the Statement class. When the evaluate() method is invoked on an "And"
object, all of the immediate children of the object in the statement tree are evaluated,
and the method returns True only if all of the child nodes evaluate to True. For the
"or" operator however (represented by the "Or" subclass of the parent Statement
class), only one child of the node needs to be true in order for the "or" statement
to evaluate to true.
As for subscope statements (nodes), then these cannot be evaluated directly, and
they must first be extracted into independent temporary rules and then linked to the
parent rule as a dependency (using a "match" feature with the temporarily-created
subscope rule’s name as value for the match feature).
The second component for the engine is the match() function, which takes in as
input a RuleSet, a set of features, and an address (this will be used to mark where
the matching is being tested, be it at the function, basic block, or instruction levels).
Upon invocation, the match() function goes through all rules in the input rule set,
and then attempts to evaluate it against the inputted set of features to see whether
a match exists or not. If a match is found, it is added to the set of discovered
matches, and also to the inputted set of features (in the form of a "match" feature)
4.5 RESULTS RENDERING 33
in order for it to be considered by later rule evaluations that might depend on this
recently matched rule.
While the match() function originally belongs to the capa engine, it should be
noted that a more performant version of it is implemented by the RuleSet object.
This more performant version relies on the fact that some features or rules are
more likely to occur or match than others, and therefore can be checked first in
order to short circuit the entire evaluation process. However, the more performant
RuleSet.match() method is outside of the scope of this thesis.
4.5 Results Rendering
Once the features have been extracted, rules have been parsed, and capabilities have
been matched using the capa matching engine, the results are then stored by capa
in a dataclass object of type "ResultDocument" implemented as a Pydantic model.
This object contains the processed sample’s metadata, as well as a list of all matched
rules and where they (and their constituent features) were found and matched. Then,
capa passes this result document object to one of four result rendering components
in order to display the matches to the user.
4.5.1 Display Modes
Capa supports the following display modes: default (show only metadata and rules),
JSON, verbose, and vverbose (very verbose). The specific display mode that will be
used is determine based on the arguments that capa was invoked with.
Default
The default (simple) capa display mode shows metadata, extracted ATT&CK TTPs,
Malware Behavior Catalog (MBC) objectives [20], as well as the rules names. An
4.5 RESULTS RENDERING 34
example can be seen in Figure 4.1.
Figure 4.1: default display mode for capa
Verbose
As for the verbose mode of display, then it does not display the ATT&CK and MBC
details, but instead shows where each capability was extracted. An example can be
seen in Figure 4.2.
4.5 RESULTS RENDERING 35
Figure 4.2: verbose display mode for capa
Very Verbose
As for the very verbose mode, it provides the utmost level of details by showing
the ATT&CK and MBC related information, as well as the rules authors and where
each feature in the rule was matched in the input sample. An example can be seen
in Figure 4.3.
4.6 ORCHESTRATION 36
Figure 4.3: vverbose display mode for capa
4.6 Orchestration
The last capa component we need to cover is the capabilities module, which is
responsible for orchestrating all of the modules we have covered thus far.
While the "main" and "loader" modules which are responsible to parsing com-
mand line arguments and instantiating different components are crucial to orches-
tration, we will not be focusing on them much in this section. Instead, we will
be focusing more on the "capabilities" module which is responsible for extracting
features and matching capa rules against these features, and then returning these
matches to the main module in order for them to be stored in a ResultDocument
object and later displayed by the main module.
4.6 ORCHESTRATION 37
The reason we give more attention to the capabilities module is that a large
portion of our contribution in this thesis will be made in this module, since it plays
a large part in implementing the scoping mechanism that capa has.
The capabilities module has two main components, a static and a dynamic ca-
pability extractor. The static component will be the focus of this thesis, while the
dynamic one is out of scope.
The static capability extractor is composed of four different functions defined
in a similar fashion. Each function corresponds to a scope, and performs feature
extraction on that scope as well as iterate through subscopes and match rules on
them. For example, the function corresponding to the basic block scope takes in as
argument a basic block, and then iterates through all instructions in it, passes each
one to the instruction scope-related function, and then extracts features from the
current basic block and attempts to match capabilities at that scope. The full list
of function is as follows:
• find_static_capabilities(): this function corresponds to the file scope, and it
iterates through all functions in the sample and passes them to the function
scope-related method for capability extraction from it. Then, capabilities (i.e.,
rule matches) extracted at lower subscopes are merged with file features, and
then this function attempts to match file-scope rules using this combined fea-
ture set.
• find_code_capabilities(): this method relates to the function scope, and it
work similar to the previously described function. The difference is that it
takes as argument a malware sample function handle, and iterates through all
basic blocks in it and passes each one to the basic block-related capability ex-
traction function. Then, after combining function-level features and extracted
basic block capabilities, rule matching is performed on the current function.
4.6 ORCHESTRATION 38
• find_basic_block_capabilities(): this method relates to the basic block scope.
It functions similar to the function-related method, except that it takes as ar-
gument a basic block and iterates through its constituent instructions, passing
each one of them to the instruction scope-related function, and then match-
ing basic block capabilities at the current block after all features and lower
instruction-scope matches have been merged.
• find_instruction_capabilities(): this function relates to the instruction scope.
This method differs in that it is the lowest possible scope and therefore does
not iterate through lower components. Instead, it extracts the features at
the provided instruction, matches rules based on those extracted features, and
returns that as output.
When the static capabilities module is invoked, the find_static_capabilities()
method is called first, and then the remaining functions get called automatically by
it. Output of the capabilities module is a set of extracted rule matches.
5 Design and Specifications
In this chapter, we will detail the objectives that our work aims to achieve, as well
as the proposed design and our suggestion for incorporating it into capa.
We will begin by defining the objectives of our contribution in the first section.
Then, we will detail the methodology we propose for forming superblocks in the
context of static binary analysis. Finally, we will describe how our method could be
incorporated into the target system.
5.1 Design Requirements
The primary objective behind our contribution is to provide capa rule authors with
a way that they could describe logic that must exist on a set of basic blocks in
sequence with no interruptions. Such addition would give capa rule authors more
expressive powers and would help reduce the rate of false positives stemming for
spurious correlations in the case of function scope rules, as well as false negatives
stemming form overconstrained basic block scope rules.
In order to achieve our objective, we will first detail the general requirements
and design principles behind our contribution in a concrete manner. This will help
guide our design process and make testing our solution clear and straightforward.
In the following subsections, we will detail the main tenets behind our approach,
and the motivation behind each one. Then, we will proceed to describe our proposed
design in the following section.
5.1 DESIGN REQUIREMENTS 40
5.1.1 Superblocks must not contain irrelevant basic blocks
The first requirement for our design is that the formed superblocks that capa rules
would match on must not contain any redundant basic blocks in them that do not
contribute to a superblock rule’s match. Put differently, if a capa rule matches on
a superblock, then all of the features that constitute that match must exist on a set
of basic blocks that are directly chained together with no gaps in between.
This requirement is important because it is one of the mechanisms that prevent
unrelated basic blocks from being incorrectly correlated by the capa engine, such as
the case with function scope rules that do not consider the relationships between
the constituent basic blocks.
It should also be noted that successful application of this design requirement
necessitates that it would be applied on a rule basis. This means that superblock
formation would depend on the capa rule being evaluated, and that a set of basic
blocks in succession might form a superblock under one capa rule, but not in another.
5.1.2 Superblocks must contain only one execution path
The second design principle is that superblocks must not contain basic blocks from
divergent execution paths inside the same superblock.
Such scenario might occur if a basic block at the tail of a superblock contains a
branch to an earlier parallel execution path, which would result both the current ex-
ecution path and the parallel one being wrongfully included in the same superblock.
This condition is set in order to conform to the definition of a superblock as seen
in the reviewed literature, as well as to prevent features extracted from basic blocks
in different parallel execution paths from being falsely correlated, thereby resulting
in a false positive.
5.1 DESIGN REQUIREMENTS 41
5.1.3 Superblocks must not contain loops
The third design principle we propose is for superblocks to be linear with no loops
therein, with the only exception being basic blocks that loop over themselves (oth-
erwise known as "tight loops").
The reason behind the introduction of this principle is to make writing rule logic
easier and simpler for analysts, and to instead user superblocks as subscopes inside
function scope rules whenever a rule author wants to describe a capa rule with
complex logic involving loops.
5.1.4 Superblocks can be composed of just one basic block
The fourth design principle we propose is to consider basic blocks as superblocks of
length one. The reason we consider a basic block as a superblock is primarily due
to the convenience of not having to duplicate similar rules twice, since sometimes
malware authors might skip error checking in their code for instance, which would
result in what would have been a large superblock instead being collected into a
singular large basic block, thereby necessitating a separate and similar basic block
rule in order to detect such cases. Instead, we opt for recognizing basic blocks
a superblocks of length one so that rule authors can avoid such redundancy and
duplication.
5.1.5 Superblock matches cannot be nested
The fifth and final design principle we propose is for superblock rule matches to not
be nestable. This means that a superblock rule cannot have as part of its required
rule matching logic another superblock rule as a dependency, thereby reducing rule
logic complexity.
Instead, if rule authors need to describe logic in terms of predefined superblock
rules, they would instead need to set the current rule’s scope to the function scope,
5.2 DESIGN DESCRIPTION 42
and then define superblock subscopes as needed and manage their dependencies as
part of the parent function-scope rule’s logic.
5.2 Design Description
After having listed the main conditions and requirements that our design must fulfill,
we now proceed to describe our proposed design in this section as well as highlight
how it satisfies the given requirements.
We will begin by providing an overview of our proposed design and the different
stages of matching capa rules on superblocks. Then, we will proceed to detail each
stage in its own subsection, while motivating the reasoning behind our approach
and how it helps achieve our design goals.
The details of integrating the proposed solution into the target system will be
discussed in the final section of this chapter.
5.2.1 General approach
Our proposed approach for matching malware capabilities on superblocks is com-
posed of four main steps. First, a flow graph is constructed for each of the malware’s
functions. Second, all of the features contained in the function are grouped together
as if they were part of the same basic block, and are then searched for all possible
matches for superblock-scoped capa rules. Third, each resulting capability match
is parsed, and the addresses at which its constituent features were found are ex-
tracted and their corresponding basic blocks are determined. Fourth and finally, we
go through each of the discovered matches, and determine whether its constituent
basic blocks (extracted from step three) form a non-cyclical path on the flow graph
generated in step one; If the basic blocks form a non-cyclical path, the superblock
match is determine to be valid and is therefore retained, otherwise it is discarded.
5.2 DESIGN DESCRIPTION 43
These steps will be executed in succession for each of the malware sample’s
functions in order to extract the superblock malware capabilities in that function.
These steps will next be explained in more detail in the following subsections.
5.2.2 Generate the current function’s flow graph
In this step, we will construct four data structures that will be required and used by
the remaining steps. These data structures are constructed on a per-function basis,
and they are as follows: a directed flow graph, a dictionary mapping basic block
addresses to nodes of the flow graph, an interval tree that maps any instruction’s
address to the address of the basic block that houses it, and a global function-wide
set of features.
These structures will be constructed in a gradual manner with each encountered
basic block in the function being appended into these structures until they are com-
plete. Once all of the basic blocks have been visited and the structures’ construction
has concluded, we then pass to the following steps.
Flow graph
The first structure of interest is a graph that models the function’s control flow in
terms of basic blocks. The graph will be a directed rooted graph composed of the
function’s constituent basic blocks as nodes, with branches between basic blocks
representing the edges between these said nodes.
Each node of the graph will contain the address of its associated basic block, as
well as up to two pointers to other nodes in the graph. The number of outgoing
edges from a given node depends on the type of branching that the associated basic
block performs, with possible cases being as follows:
• 0 outgoing edges: If a given node contains no outgoing edges (i.e., leaf node)
then its associated basic block ends with a return instruction, and is therefore
5.2 DESIGN DESCRIPTION 44
an exit point for the function.
• 1 outgoing edges: If a given node has exactly one outgoing edge, then its
associated basic block ends with an unconditional branch, such as the case
with "jmp" (jump) assembly instructions for instance.
• 2 outgoing edges: If a given node has exactly two outgoing edges, then its
associated basic block ends with a conditional branch, such as the case with
the "jne" (jump-if-not-equal) assembly instruction for instance.
In order to initiate a graph node object from a given basic block and then include
it in the graph, we must first retrieve the addresses that it branches to. For this, it
is sufficient to look at the final instruction of a given basic block, then retrieve the
offset by which the possible jump is performed, then add it to the final instruction’s
address, thereby giving us address that the basic block branches to. However, the
static binary analysis frameworks that capa works on already implement an inter-
face method that allows us to retrieve the addresses that a conditional instruction
branches to, and we will be using that instead for the sake of simplicity.
One thing that remains is how these gradually constructed graph nodes objects
will be constructed together, since we said that the graph will be formed one basic
block at a time, and for this, we will rely on the second data structure in this initial
step.
Hash set of graph nodes
The second data structure that will be constructed in this step will be a dictionary
(hash set) that maps basic block starting addresses to their corresponding flow graph
node objects, thereby allowing us to find specific nodes in the flow graph in constant
time without needing to search the graph each time.
One of the main benefits behind this structure is that it makes construction of
5.2 DESIGN DESCRIPTION 45
the function’s flow graph more straight forward. This is because when initiating
the node object for an encountered basic block, the branches of that node could be
set to the target addresses for that branch, which can be immediately computed at
any given basic block. Then when going through the flow graph, the code would
rely on the address-to-node dictionary to translate the branches’ addresses into their
associated nodes. This approach is more straight forward than having to directly
link node objects together by reference.
Another reason for this dictionary structure is that it will later make it easier
for us to filter discovered capa matches in the final step of our approach, since given
a set of basic blocks that house the features necessary for a match, we can index
into their corresponding graph nodes in constant time, and then check whether they
exist in sequence or not.
Interval tree of in-function addresses
The third data structure we will be constructing in this step will be an interval tree
that maps the address range of each basic block to the starting address of that basic
block.
This structure will later be used in the penultimate stage of our approach to
retrieve the constituent basic blocks for a capa rule match. Addresses of the features
that caused the match will be extracted from that match and looked up in this
interval tree, and the corresponding basic blocks will be determined and passed
from there to the final filtering step.
Global set of features
This is the final and simplest data structure that will be constructed in our approach.
It is a set of all unique features that are contained in the function. It is gradually
constructed by adding each of the features extracted from a basic block to it at a
5.2 DESIGN DESCRIPTION 46
time, until all of the function’s blocks had been parsed. This structure will be useful
in the second and following step of our approach.
5.2.3 Blind matching on the entire set of rules
In the second step of our approach, and once we have collected a list of all features
present in a function across all of its basic blocks, we proceed to match superblock-
scope rules on this large features set without taking into consideration that these
features might exist on distinct (and possibly distant) basic blocks.
As a result of this step, for any combination of features from the previously
constructed features set we will obtain a rule match if that said combination does
indeed match a described malware capability. This matching action is performed
using capa’s regular rule matching engine, which already extracts matches on all
possible combinations of features.
Once this set of superblock matches performed on an ambiguous set of features is
obtained, our next step is to go through them all and filter them to determine which
ones exist on a an actual sequence of uninterrupted basic blocks (i.e., superblocks).
But before that, we must first convert the addresses associated with each feature
present in a match, and convert it to the starting address of the basic block. This
address can then be used with the previously-constructed flow graph to determine
whether the match is a valid superblock match or not.
5.2.4 Retrieve the basic blocks used for each match
Given a certain match, and tasked with the goal of finding the set of basic blocks that
the match spans across, we must first iterate through the given match to extract all of
its constituent features. After that, we can simple utilize the previously-constructed
interval tree that maps basic block address ranges to the starting address of that
basic block in order to determine what basic block that specific feature at hand was
5.2 DESIGN DESCRIPTION 47
found in.
In order to list all of the given capability’s constituent features, we must traverse
the logic tree for the statement contained in the given result’s "match" object, and
extract leaf tree nodes of type "feature" while continuing to traverse intermediate
statement nodes (such as AND, OR, etc.).
Once a list of "feature" objects is obtained, we can iterate through it and search
the previously-constructed address interval tree with the current feature object’s
address value in order to determine the basic block that it is part of. After doing
this for all of the features in the extracted set, we obtain the set of all basic blocks
that were relevant to the match, which can then input into the following and final
step.
5.2.5 Filter the results
After we have obtained a list of all possible matches identified on all possible com-
binations of features in a function, as well as the list of basic block addresses that
pertain to each match, we now proceed to filtering these matches and only retaining
the ones that exist on superblocks only.
To begin, we first go through all of the basic blocks pertaining to a given match,
and if we find that one of them branches to two other basic blocks relevant to that
match at the same time, then we exclude that match since it was discovered on
divergent superblocks (i.e, features from two parallel execution paths might have
been wrongfully correlated together). We exclude however the edge case where a
basic block would branch to itself (i.e., tight loop) and to another basic block relevant
for that match at the same time.
Once we have filtered out matches whose constituent basic blocks resided on
divergent superblocks, we now proceed to filtering the rest of the remaining matches.
To do so, we use the following algorithm:
5.2 DESIGN DESCRIPTION 48
1. To start off, we pick a random basic block from the list of basic blocks related
to a match, then fetch the graph node associated with it using our previously-
constructed address-to-node dictionary (from step one of the approach). We
consider this node as the head of a basic block cycle of unknown length (at
the time of picking this node), and we add it to a set of encountered cycles.
2. Once an initial graph node has been selected, we traverse the graph starting
from that node in order to find the longest path composed of only unvisited
basic blocks that are relevant for our match. We do so by keeping a pointer
to the current node, which is initialized to the node picked in step 1, and then
moving that pointer along the branch whose related basic block is relevant for
the current match. We do this until we reach a graph node that is not in the
list of previously unseen match-related basic blocks, at which point we move
to the next step of the algorithm.
3. Once we have come to the node at the end of the current cycle, we check its
nature. If that node happens to be the start of another cycle (i.e., is contained
in the set of encountered cycles), then we merge the two cycles together and
keep only the current more-precedent cycle head in the maintained set of
encountered cycles. Otherwise, we deduce that we are either at the end of the
function (i.e., exit node), or that the basic block is irrelevant to the match,
and we therefore determine that the current cycle is finished, and we move
back to step 1 of this algorithm and pick another never-before-seen node to
repeat the process with until all nodes has been visited.
4. Once all of the basic blocks relevant to the match have been covered, and all
possible disjoint cycles related to them have been constructed and added to
the set of encountered cycles, we then proceed to check the length of that
set. If the set has only one cycle remaining, that means that the match’s
5.3 INTEGRATING WITH CAPA 49
given basic blocks exist in sequence and are therefore a superblock, otherwise,
we determine that the capability was matched on disjoint basic blocks of the
function, and the match is therefore discarded.
After applying the algorithm listed above, we are left with only the matches that
exist strictly on superblocks as previously defined in the next chapter, and these
matches are then what will be returned to the user.
5.3 Integrating With Capa
Now that we have outlined our proposed approach for extracting superblock malware
capabilities from binary samples, we proceed into discussing how our contribution
will be integrated into capa in this chapter. More specifically, we outline the capa
components that should be modified for our contribution to work, and how those
components should be modified.
5.3.1 Rules Parser
The first component that should be modified is the capa rules parser, which needs
to be modified to accept the keyword "superblock" as a valid capa rules scope
and subscope, as well as define the scope internally in the module’s static scopes
enumeration class which will be used by other rule constructing and matching logic.
Additionally, the supported features for the superblock scope need to be correctly
assigned to it. In this case, these would be all of the basic block features from
the lower scope, with the possibility of adding superblock-exclusive features in the
future.
Furthermore, the rules parser module needs to be updated with the new correct
static scopes hierarchy, which puts the newly created superblock scope in between
the function and basic block scopes.
5.3 INTEGRATING WITH CAPA 50
5.3.2 Feature Extractor
The next component that requires modification is the features extractors module.
The "StaticFeatureExtractor" interface needs to have two methods added to it,
get_next_basic_blocks() which will be used to retrieve the branches of a given basic
block, and the get_basic_block_size() which will return the size of a given basic
block in bytes, which in turn will be used when constructed the previously-mentioned
inner-function addresses interval tree.
Additionally, the feature extractors for static analysis backends that we wish to
support (in our case Vivisect since it is free and open-source) need to be updated to
support the newly introduced methods, with other feature extractors (such as the
.NET extractor or Hex-Rays’ IDA) being adjusted to provide a blank implementation
of the newly introduced abstract methods until support is added.
5.3.3 Capabilities Module
This component will house the majority of our contribution including flow graph
construction, capability extraction, and match filtering. The code for these tasks
and data structures will all be housed in a class named SuperblockMatcher, which
would internally house the data structures described in section 5.2, in addition to
the necessary methods required to operate on these structures and carry out all of
the tasks previously described.
A new SuperblockMatcher object will be instantiated each time a function is
encountered, with basic blocks being gradually added to it using a method it exposes
named add_basic_block(). This method will be responsible for generating a graph
node for that given basic block as well as appending it to the flow graph and its
address range to the addresses interval tree.
Once all basic blocks have been added to the matcher object, the match() method
that the SuperblockMatcher object exposes will be invoked, which will in turn search
5.3 INTEGRATING WITH CAPA 51
for capabilities on all of the detected function features, and then proceed to filtering
the found matches using a private _prune() method. Correct matches are then
added to the list of extracted capabilities, and are then returned for them to be
displayed to the user.
To make matters easier for storing matches and distinguishing superblocks apart,
we store the list of all constituent basic blocks for a superblock alongside the as-
sociated match, which will later simply the process of displaying where exactly a
superblock capability was extracted.
5.3.4 Results Rendering
The final component we need to adjust is the results rendering component, which
needs to be formatted and refactored to support displaying rule matches extracted
at the new superblock scope.
Most of the capa code revolving around results rendering is dynamic and ex-
tensible, meaning that minimal refactoring is needed in this component in order to
display simple superblock matches. All necessary information will be automatically
extracted by the relevant display component.
One exception however is that verbose and very verbose display modes need to be
slightly adjusted to display the addresses at which superblocks were matched. This
is because these components were originally designed to display single addresses
pertaining to an instruction, basic block, or function, whereas superblock matches
can span a number of basic blocks in sequence. To display this accurately, we rely
on the list of addresses that is stored alongside each superblock match (as explained
towards the end of the last subsection), and we simple format this list of addresses
into a presentable string for it to be displayed to the user.
6 Implementation
After having outlined the proposed design for our contribution, as well as how we
plan to integrate it into the target system (capa), we now proceed into implementing
our solution and testing its functionality, viability, and efficacy with regards to
addressing the problems outlined at the start of the thesis.
First, we will briefly go over general implementation details such as external
libraries used, as well as interfacing with the API’s of the static analysis backends.
Then, we will verify that our solution works in the following section, and showcase
the output across the different display modes. Finally, we will provide a real life
use case for our proposed solution, and motivate through it how our contribution
improves the target system.
6.1 Implementation
The written code concerning our contribution has been submitted in a Pull Request
to the capa GitHub repository. As of the writing of the thesis, it is awaiting approval
from the maintainers of the target system, followed by addition into the upcoming
version of the tool. The entirety of the code can be viewed using the following link:
https://github.com/mandiant/capa/pull/2665/
6.2 VERIFICATION 53
6.2 Verification
In order to verify that our code can indeed extract malware capabilities that exist
across superblocks, and to also showcase how it displays such matched superblock
rules, we will be doing a test run of the the updated target system with a custom
sample and rule to verify that our code functions.
6.2.1 Verification sample
The sample that we will be utilizing for our verification is taken from the follow-
ing GitHub repository belonging to the thesis’s authors: https://github.com/
yelhamer/light-C-server/.
The test sample is a lightweight file-sharing web server written entirely in C,
whose function is to listen for incoming connections, accept them when they arrive,
get the filename that the remote user wants to download, and send the contents of
that file back.
The superblock capability we will be aiming to match in our verification process
will be the setting up of a connections listener. This superblock can be found in
the init_server() method, and its disassembly using Hex-Rays’ IDA can be seen in
Figure 6.1.
6.2 VERIFICATION 54
Figure 6.1: Disassembly View of the Target Match Superblock
As we can see in the figure above, the superblock at hand spans the basic blocks
with addresses ending in: 0x1EF5, 0x1F46, and 0x1F98.
6.2 VERIFICATION 55
6.2.2 Test rule
In order to match the capability listed in the previous subsection, we use the fol-
lowing simple capa rule:
rule:
meta:
name: Initialize File Sharing Server (superblock test rule)
namespace: communication
authors:
- "Yacine Elhamer (@yelhamer)"
scopes:
static: superblock
dynamic: unsupported
features:
- and:
- basic block:
- or:
- api: getaddrinfo
- api: _getaddrinfo
- basic block:
- or:
- api: socket
- api: _socket
- basic block:
- or:
- api: bind
- api: _bind
The rule searches for at most 3 basic blocks in succession that contain the API
calls getaddrinfo(), socket(), and bind().
6.2.3 Results
Now with the rule and verification sample ready, we move to execution. We will be
running capa with the previously described sample and the previously-listed rule,
and will be showing the results in simple, verbose, and very verbose output modes.
6.2 VERIFICATION 56
Simple
As we can see in Figure 6.2, the superblock capability was indeed detected and
correctly outputted.
Figure 6.2: Capa Execution Result (Simple Output Mode)
Verbose
Moving on to the verbose display mode, we can see in Figure 6.3 more information,
and in particular, that the basic blocks that constitute the superblock are indeed
the ones we previously highlighted when selecting the sample.
Figure 6.3: Capa Execution Result (Verbose Output Mode)
6.3 TESTING 57
Very verbose
Finally, we execute capa for the last time using the very verbose (-vv) flag, and we
can see in Figure 4.3 that capa correctly outputs the addresses corresponding to
each basic block.
Figure 6.4: Capa Execution Result (Very Verbose Output Mode)
6.3 Testing
Now that we have confirmed that our contribution does indeed function and that
it is capable of extracting superblock matches, we now try to adjust one of capa’s
existing rules to make it more accurate using the new superblock scope.
6.3.1 Test case
We have identified a false positive for the "RC4 PRGA" malware capability present
in a native Windows binary named AgentService.exe. The match was identified in
6.3 TESTING 58
the function with the address 0x1400056E0, however, after navigating to that func-
tion we can see that it is a statically linked wmemcmp function from the standard C
run-time library as shown in Figure 6.5. This method is responsible for comparing
characters in two buffers [21].
Figure 6.5: Disassembly of the Match Location in the Test Sample
Further examining of the match reveals that the constituent features were found
on disperse basic blocks, and then falsely correlated together under the assumption
that they were all in sequence. Therefore, we deduce that by changing the scope fo
the original rule to the newly-added superblock scope we would fix this false positive.
6.3.2 Updated rule
The original rule relies on counting the number of references to the current function,
as well as the number of basic blocks it contains, in order to try to minimize the rate
of false positives. However, this does not completely prevent them as previously seen
with the wmemcmp() standard C function in the default AgentService.exe binary.
The original rule is shown in the following snippet:
6.3 TESTING 59
features:
- and:
- count(characteristic(nzxor)): 1
- or:
- match: calculate modulo 256 via x86 assembly
- count(mnemonic(movzx)): 4 or more
- count(characteristic(calls from)): (0, 4)
- count(basic blocks): (4, 50)
- match: contain loop
- optional:
- or:
- number: 0xFF
- number: 0x100
After inspecting the original rule that caused the false positive, we suggest an
updated rule that does not rely on counting function cross references or the num-
ber of basic blocks in a function to eliminate false positives. Instead, it follows a
more natural approach by trying to find the necessary logic and artifacts being in
close sequence. More precisely, it tries to find a non-zeroing xor operation on the
same superblock with other basic blocks that contain a total of 4 or more movzx
instruction (commonly used in RC4 PRGA implementations) or a basic block that
computes the 256 module using x86 assembly. If such superblock match is found,
and a loop is found alongside it in the same containing function, then a discovered
RC4 PRGA capability match is returned. Full description of the the updated rule
can be seen in the following snippet:
6.3 TESTING 60
rule:
meta:
name: encrypt data using RC4 PRGA (Superblock)
namespace: data-manipulation/encryption/rc4
authors:
- Yacine Elhamer (@yelhamer)
scopes:
static: function
dynamic: unsupported
features:
- and:
- superblock:
- and:
- or:
- match: calculate modulo 256 via x86 assembly
- count(mnemonic(movzx)): 4 or more
- characteristic: nzxor
- match: contain loop
6.3.3 Results and Comparison
In order to verify whether the new superblock-based RC4 PRGA rule eliminates
the observed false positive or not, we will execute capa with the new rule on the
AgentService.exe benign executable that contained the false positive, as well as three
different malware samples that were confirmed to be using RC4 encryption. These
three samples were originally listed as example malware samples in the original RC4
PRGA capa rule. The list of used false positive (FP) and true positive (TP) samples
6.3 TESTING 61
as well as their associated MD5 hashes can be seen in Table 6.3.3.
Sample Name Contains RC4 PRGA? MD5 Hash
FP1 No 5F2DD9A74F198BA69135BCE9BC534560
TP1 Yes 34404A3FB9804977C6AB86CB991FB130
TP2 Yes 9324D1A8AE37A36AE560C37448C9705A
TP3 Yes 73CE04892E5F39EC82B00C02FC04C70F
Table 6.1: Samples Used for Testing The RC4 PRGA Superblock Rule
Next, we proceed to execute capa with both the old and new RC4 PRGA detec-
tion rule, and we observe which of those rules matches for each sample. First, we
begin by running capa on the TP samples, and then the FP samples. After that, we
compile the obtained results and compare the old capa rule and the new superblock
one. Capa analysis results for the sample TP1 can be seen in Figure 6.6, with the
match for the superblock-based RC4 PRGA rule being highlighted in white.
Figure 6.6: Capa Analysis Results for Sample TP1
6.3 TESTING 62
Capa analysis results for sample TP2 can be seen in Figure 6.7, with the match
for the superblock-based RC4 PRGA rule being highlighted in white.
Figure 6.7: Capa Analysis of Sample TP2
Capa analysis results for sample TP3 can be seen in Figure 6.8, with the match
for the superblock-based RC4 PRGA rule being highlighted in white.
Figure 6.8: Capa Analysis of Sample TP3
Proceeding to the false positive sample (FP1), the capa analysis results for it
can be seen in Figure 6.9, with no match for the superblock RC4 PRGA rule.
6.3 TESTING 63
Figure 6.9: Capa Analysis of Sample FP1
We summarize the results obtained in table 6.3.3, which shows that the updated
RC4 PRGA capa rule that utilized the superblock scope was matched successfully
on a set of three malware samples (TP1, TP2, and TP3) with slightly different
implementations of the RC4 PRGA algorithm, while at the same time not matching
on the wmemcmp() standard C method within the benign sample (FP1).
6.3 TESTING 64
Sample Name Has RC4? Old Rule Match? New Rule Match?
FP1 No Yes No
TP1 Yes Yes Yes
TP2 Yes Yes Yes
TP3 Yes Yes Yes
Table 6.2: Capa Analysis Results for The Chosen Samples
After close examining the false positive match on the sample FP1, we can see
that one crucial element to the legacy RC4 PRGA rule match was the presence of
4 or more movzx CPU instructions in the given function. This condition as well as
the addresses for the retrieved movzx instructions can be seen highlighted in white
within Figure 6.10.
Figure 6.10: Verbose Output for The Capa Analysis of Sample FP1
By opening the FP1 sample in IDA then navigating to the location of the match
(i.e., the statically-linked wmemcmp() method), and then rearranging all of the
movzx CPU instructions present in the function to be adjacent to each other in
the flow graph (as shown in Figure 6.11), we can see that these instructions are far
apart within the function and not closely related such as the case with RC4’s PRGA
usually. And as a consequence of the legacy rule having the function scope, these
6.3 TESTING 65
sparse movzx instructions are incorrectly assumed by capa to be in close relationship
to each other like the RC4 PRGA necessitates, which is not the case, and we got a
false positive as a result.
Figure 6.11: IDA Disassembly of The wmemcmp() Standard C Function (in FP1)
6.3.4 Conclusion
In conclusion, we have demonstrated that our newly added superblock scope gives
rule authors more expressibility with regards to describing malware capabilities,
6.3 TESTING 66
which in turns will help reduce false positives and false negatives when using the
tool.
False negatives are reduced by giving rule authors a new matching scope that
is not vulnerable to variation in the control flow of programs. For instance, if a
superblock rule requires two Windows API calls to be made in order for it to match,
and the malware author performs some error checking after the first Windows API
call that exits the program on a fail, but continues to the second Windows API
call on success, then that would not affect the superblock-scope rule since the basic
blocks hosting each of the Windows API calls would be tied together and existing
on the same superblock. This can be clearly seen in the verification example shown
in section 6.2.1.
As for reducing false positives, the new superblock scope gives rule authors the
ability to write capability-describing rule logic that necessitates for its constituent
features to be present within small proximity (i.e., same or in-direct-sequence basic
blocks) in a function. This gives rule authors the ability to write precise rules that
are immune to branch instructions inserted by the compiler for either performance
or error-checking reasons, without sacrificing on the rule’s matching potential.
7 Conclusion
In this thesis, we have identified an area of improvement in automated static malware
capability extraction in general, and in the capa tool in precise. We examined other
major malware capability extraction tools in use today, and determined that none of
them provide a solution to the identified issue in capa, which was the first research
question of our thesis. Therefore, we proceeded with addressing the discovered issue,
and providing a solution for it.
The identified area of deficiency had to do with reducing the rate of false negatives
and false positives in the target system at hand, and that is by providing rule
malware capability rule authors with a new scope with which they could organize
and group rule matching logic.
The identified false negatives were due to existing solutions not being able to
circumvent error-checking logic in input samples that might split one basic block
into multiple ones, therefore resulting in the match not being discovered. As for false
positives, they were primarily due to the fact that many rule authors attempted to
reduce false negatives by making the scope wider which resulted in many matches
being the result of false correlation.
As for the second objective of this work, we proposed, designed, implemented,
and tested a working solution for a matching scope that exists between the too-
narrow basic block scope, and too-wide function scope as illustrated in the previous
paragraph. This scope was named the Superblock scope, and it was defined roughly
7.1 LIMITATIONS 68
to be a succession of basic blocks with no interruptions or inter-basic block loops in
it.
Our newly added superblock scope gave more flexibility for authoring rules, as
seen in the example regarding the extraction of network-related capabilities, as well
as the RC4 example. This positively answers the second research question we intro-
duced at the start of this thesis, and that the newly-added superblock scope does
indeed reduce false positives and gives rule authors more expressibility.
As for answering the third research question of this thesis, then we have showed
that the existing capa rule for detecting the pseudo-random generation algorithm
(PRGA) for the RC4 cryptographic algorithm suffered from a false positive, and
that it was incorrectly matching at the wmemcmp() standard C runtime method re-
sponsible for matching whether two buffers are equal, and we have shown that using
the newly-introduced superblock for that rule does indeed reduce the observed false
positive by correlating features relevant for the logic together, while also retaining
the true positives.
7.1 Limitations
One limitation for this work is the lack of extensive and wide-scoped testing that
showcases how the superblock scope improved the accuracy of capa. Instead, and
due to the effort required to assemble an extensive set of false positive and true
positive rules that we could accurately test on, as well as the need to verify whether
the resulting capabilities are true or false positives, we refrained from carrying out
such extensive testing and settled on verifying our contribution on a small scale, and
demonstrated how the results could be extrapolated to a larger dataset of samples.
While robust and wide-range testing of our contribution would make it more
concrete, we note that the introduced scope could still be added into capa as is,
which would provide capa rule authors with greater rule expressibility. Then, the
7.2 FUTURE WORK 69
scope could then be tested more thoroughly as it matures and more capa false
positives are identified, and more superblock scope rules are written.
7.2 Future Work
Future work that could build on this thesis might revolve around updating the
existing set of capa rules to better utilize the superblock scope, and thereby minimize
false positives and false negatives as much as possible.
Another avenue might be to modify our contribution to allow for basic block gaps
larger than one, and to give the choice of that number to the rule author, which
might provide them with more expressibility.
Additionally, it might also be worthwhile to integrate simple binary emulation
into capa to help with superblock formation by predicting the likely path of execution
of a binary at a given basic block.
References
[1] Internet Crime Complaint Center (IC3), Internet crime report 2023, https:
//www.ic3.gov/annualreport/reports/2023_ic3report.pdf, [Accessed
13-03-2025], 2023.
[2] Lockheed Martin, Gaining the advantage: Applying cyber kill chain methodol-
ogy to network defense, https://www.lockheedmartin.com/content/dam/
lockheed-martin/rms/documents/cyber/Gaining_the_Advantage_Cyber_
Kill_Chain.pdf, [Accessed 13-03-2025], 2015.
[3] M. Sikorski and A. Honig, Practical Malware Analysis: The Hands-On Guide
to Dissecting Malicious Software, 1st. USA: No Starch Press, 2012, isbn:
1593272901.
[4] Mandiant, M-Trends 2024 report, https://cloud.google.com/security/
resources/m-trends, [Accessed 13-03-2025], 2024.
[5] Capa: Automatically Identify Malware Capabilities | Mandiant | Google Cloud
Blog — cloud.google.com, https : / / cloud . google . com / blog / topics /
threat-intelligence/capa-automatically-identify-malware-capabilities/,
[Accessed 12-06-2025].
[6] O. Alrawi, M. Ike, M. Pruett, et al., “Forecasting malware capabilities from
cyber attack memory images”, in 30th USENIX Security Symposium (USENIX
Security 21), USENIX Association, Aug. 2021, pp. 3523–3540, isbn: 978-1-
REFERENCES 71
939133-24-3. [Online]. Available: https://www.usenix.org/conference/
usenixsecurity21/presentation/alrawi-forecasting.
[7] VirusTotal, YARA - The pattern matching swiss knife for malware researchers,
https://virustotal.github.io/yara/, [Accessed 23-03-2025].
[8] VirusTotal, Welcome to YARA’s documentation!; yara 4.5.0 documentation,
https://yara.readthedocs.io/en/latest/, [Accessed 23-03-2025].
[9] Y. team, YaraRules Project — yara-rules.github.io, https://yara-rules.
github.io/blog/, [Accessed 26-03-2025], 2015.
[10] J. Daemen and V. Rijmen, “The block cipher rijndael”, in Smart Card Re-
search and Applications. Springer Berlin Heidelberg, 2000, pp. 277–284, isbn:
9783540445340. doi: 10.1007/10721064_26.
[11] R. Rivest, The MD5 Message-Digest Algorithm. Apr. 1992. doi: 10.17487/
rfc1321. [Online]. Available: http://dx.doi.org/10.17487/RFC1321.
[12] B. Schneier, “Description of a new variable-length key, 64-bit block cipher
(blowfish)”, in International workshop on fast software encryption, Springer,
1993, pp. 191–204.
[13] Intel471, Hunting for Persistence: Registry Run Keys / Startup Folder —
intel471.com, https://intel471.com/blog/hunting-for-persistence-
registry-run-keys-startup-folder, [Accessed 27-03-2025], 2021.
[14] S.-J. Hwang, A. Utaliyeva, J.-S. Kim, and Y.-H. Choi, “Bypassing heaven’s
gate technique using black-box testing”, Sensors, vol. 23, no. 23, p. 9417, Nov.
2023, issn: 1424-8220. doi: 10.3390/s23239417.
[15] Mandiant,GitHub - mandiant/capa-rules: Standard collection of rules for capa:
The tool for enumerating the capabilities of programs — github.com, https:
//github.com/mandiant/capa-rules/, [Accessed 28-03-2025].
REFERENCES 72
[16] The Official YAML Web Site — yaml.org, https://yaml.org/, [Accessed
28-03-2025].
[17] Mandiant, GitHub - mandiant/capa-testfiles: Data to test capa’s code and
rules. — github.com, https://github.com/mandiant/capa- testfiles,
[Accessed 28-03-2025].
[18] J. Vrancken, Detecting capabilities in malware binaries by searching for func-
tion calls, M.Sc. thesis, Radboud University. 2022.
[19] R. E. Hank, S. A. Mahlke, R. A. Bringmann, J. C. Gyllenhaal, and W.-m. W.
Hwu, “Superblock formation using static program analysis”, in Proceedings
of the 26th Annual International Symposium on Microarchitecture, Austin,
Texas, USA, 1993, pp. 247–255. doi: 10.1109/MICRO.1993.282760.
[20] MITRE, Malware Behavior Catalog, version 3.0, 2024. [Online]. Available:
https://github.com/MBCProject/mbc-markdown.
[21] Microsoft,Memcmp, wmemcmp | microsoft learn, https://learn.microsoft.
com/en-us/cpp/c-runtime-library/reference/memcmp-wmemcmp, [Ac-
cessed 28-05-2025], 2022.