Introduction to binary exploitation#
In the next few lectures, we are going to be dealing with so-called binary exploitation, a field of security concerned with programs at the instruction level and with their memory management. First, we're going to give you a quick refresher on how computers execute programs at the binary level, which is already covered by courses such as NSWI120, NSWI170 and others, so feel free to skip ahead to Disassembling.
Instructions and registers#
The CPUs in today’s computers do not execute programs written in programming languages directly. The only language the CPU understands is the machine code, which is a sequence of simple instructions, encoded using some predetermined binary format.
Instead of variables, the CPU only uses registers, which are similar to variables in many ways, but they have a fixed size and each CPU has only a fixed and very limited amount of them — on today’s most common architecture, x86_64 (more on that later), there are only 16 general purpose registers.
So the instructions we have available might look something like: “take the values from registers A and B, add them, and save the result into register C”. The registers used are baked into the machine code, so the same snippet will access the same three registers every time.
Memory access#
Since the number and usage of registers is quite limited, we can also save data into the system’s memory. For now, we can imagine the memory as a long sequence of bytes, each one of which can be directly accessed using its address, which is simply the position of the byte in the sequence. The only things the memory can do are read a value from an address and write a value to an address.
To do this, there are dedicated instructions which move data from a register to a specific place in memory and vice versa. To specify the address we want to access, we can provide an immediate value (the address is directly encoded into the instruction as part of the machine code), or we can read the address from a register.
When storing more complex objects into memory, such as arrays or structures, we have to somehow store them as bytes in a contiguous block of memory. To access a field in a structure, we can store the address of the whole structure in a register; the offset of the field from the start of the structure is always the same, so it can be provided as an immediate value. When accessing an element of an array, we might store the start address of the array in one register and the index we’re interested in in another. In both cases, we can just add these two numbers together to get the actual address we want to access. Since “take a base address and add an index multiplied by the size of each element” is such a common and simple operation, there are many dedicated instructions for it.
Memory layout#
To aid with reserving memory for different objects, the memory is split into various sections, each of which may be managed differently. Firstly, there is the static memory. Static memory is reserved when the program is written and is never used for anything else. Objects in static memory are always available at the same address. This includes global variables, but also the machine code of the program itself and possibly some other data. Keep in mind that since the layout of static memory must be known when the program is created, the size of objects in the static memory must be known in advance as well — no growable arrays, for instance.
There is also the stack and the heap (not to be confused with the data structures of the same name), which can be reserved at runtime.
The call stack#
Before we explain how the stack works, we first need to understand how functions get called on the instruction level. When the CPU runs a program, it just reads one instruction after another from the memory and executes them. To start executing code from some other part of memory, a jump instruction can be used. This jump can also be conditional and only happen when some condition holds, otherwise the execution continues as if the jump wasn’t there.
When we want to execute a function, we can just jump to the start of the function and jump back to where we were before invoking the function. There’s only one issue: a function can be called from multiple places in the code, so the return jump cannot be constant. What about saving the current position to a register right before calling the function and jumping back to this position at the end? This could work, but what if we want to call a function from within a function? Or a function within a function within a function? Or a function within a… We need to somehow store a list of all the places from which a function was called and then use this list when trying to return from a function.
This is exactly what the call stack is for. It has its own dedicated section in
memory and a special register, which holds the address of the end of this list.
Each time we call a function, we save the current position into this list and
increment the address. When we want to return, we just read the value from the
end and decrement the address. It should be clear why the call stack shares its
name with the stack data structure, since they behave exactly the same. On
x86_64, the register is called rsp and there are dedicated instructions
call and ret, which push the current location to the stack and jump to a
provided location, or pop an address from the stack and jump to it, respectively.
Since functions usually have local variables, which need to be stored in
different places for each call, they are also stored in the call stack. Each
time a new variable is introduced, the rsp is incremented, and at the end of
the function, it has to be decremented back so that the top of the stack
contains the return address. Each local variable is then read at a given offset
from rsp. The portion of the stack used by each function call is called the
stack frame.
As this can be quite cumbersome and the offset of each local variable from
rsp changes as new variables are added, x86_64 also provides the rbp
register, which always points just above the return address of the current
function, at the beginning of the current stack frame. This means we can access
the local variables via offsets from rbp without needing to recalculate them
with each new variable and cleaning up the stack after each call can be done by
just copying the value from rbp to rsp. The only problem is that after a
return, the rbp no longer points to the start of the stack frame, but remains
equal to rsp. This means we also have to store the old value of rbp on the
stack when we enter a function and recover it from there when we leave. x86_64
provides the enter and leave instructions which perform just that.
This is what the stack will look like after entering function f2 called from
inside f1, called from inside main:
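A rough sketch of those three stack frames (the newest frame is at the top of the stack; each call pushed a return address and a saved rbp):

```
+------------------------------+
| main's local variables       |
+------------------------------+  <- f1's stack frame starts here
| return address into main     |     (pushed by the call to f1)
| saved rbp of main            |
| f1's local variables         |
+------------------------------+  <- f2's stack frame starts here
| return address into f1       |     (pushed by the call to f2)
| saved rbp of f1              |  <- rbp points here
| f2's local variables         |
+------------------------------+  <- rsp (the top of the stack)
```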
The heap#
Still, none of the aforementioned solutions can help when we don’t know the amount of memory we’ll need in advance. Well, we can keep one growable array on the stack, but trying to keep more than one gets ugly very quickly. That’s what the heap is for. It is a portion of memory controlled by an allocator, from which we can request pieces of memory and to which we can return these pieces of memory when we no longer need them. The allocator keeps some internal data structure to remember which parts of the memory have been handed out and which remain free.
In the C programming language, such an allocator is part of the standard library
and can be invoked by calling the void *malloc(size_t size) function. It
returns a pointer (which is just an address) to a chunk of memory of size
size, which we can use as we please. To return a no longer used chunk of
memory, we pass the pointer to it to void free(void *ptr).
The stack grows “down”#
As you might notice, there are two structures in memory which can both grow independently and indefinitely. How do we decide where to put them in memory? We could just split the non-static memory into two halves and put the stack into one and the heap into the other, but that would be very wasteful in cases when one uses a lot of memory and the other just a little — the larger may run out of space even though half of the system memory is still free! To mitigate this, it was decided that the stack starts at the very end of the available memory and grows downwards (towards lower addresses). This means that the heap and the stack can grow as they please until they meet.
Memory virtualization#
Nowadays, it seldom happens that only a single program runs on a CPU. Usually an
operating system manages thousands of processes, each one of which needs to have
access to its own memory and hopefully not anyone else’s. This is what memory
virtualization is for. From the point of view of each program, it is alone in
the world — every memory address from 0b0 to 0b111…111 is available to it,
even far outside of what would fit into the physical memory of the computer. The
operating system then keeps a page table, which contains information on where
chunks (called pages) of virtual memory for each process are stored in
actual physical memory. Since the virtual address space of each process is most
likely larger than the amount of actual available physical memory, the OS
dynamically maps only pages which are accessed by the process. This mapping
has to be requested explicitly by the process. When the process starts, some of
its memory is automatically prepared and populated, like the stack or the static
memory, but the memory for dynamic allocation must be requested explicitly. This
is actually what malloc does in the background — it does keep its own data
structure for allocating chunks smaller than a page, but each time it runs out
of space to give out, it requests another page from the OS. When a program tries
to access parts of memory which haven’t been mapped to physical memory yet, a
segmentation fault occurs, which usually leads to the program crashing.
The practical stuff#
Warning for ARM CPU users: The challenges we will be assigning will be compiled for the x86_64 architecture. This means you will not be able to run them and debug them on your ARM CPU. It would be highly advisable to get access to an x86_64 machine for solving the next few challenges. There are Linux x86_64 machines available for MFF students inside the Rotunda computer lab. They are even remotely accessible via SSH - run `ssh <SIS username>@<computer name>.ms.mff.cuni.cz` and then enter your SIS password.
The C language#
During our lectures about binary exploitation, we will be using the C programming language to write the challenges. If you are familiar with C++, or to some extent JavaScript, you will find C familiar, as the syntax is quite similar.
This is how Hello World looks in C:
```c
#include <stdio.h>

int main() {
    printf("Hello World!");
    return 0;
}
```

First, we include the stdio standard library header file - similarly to import in Python, this will allow us to use functions from that part of the standard library. stdio specifically has functions for handling input and output of the program. Then, we define the main function. The main function is the entry point of the execution of our program - if we run this program, main will be executed. In the main function, we execute printf with a string argument. As you might have guessed, printf is included from stdio and stands for print formatted, but since it gets only one argument, it will not do any actual formatting and just print the string argument. Information about functions from the standard library, like printf, can be found in the manual pages by running man printf, or online. Finally, we return 0, and the return value from the main function will be used as the return value of the program.
For learning more about C, we recommend checking out Medvědův průvodce po Céčku (Czech), or somewhere else online. It’s sufficient if you can read and understand C code, writing it won’t be necessary.
Compiling and running#
C is a compiled language. To compile it, we can run (assuming you are on a Linux/Unix system and have gcc installed):
```
gcc -o output_binary ./input_program.c
```

If compilation succeeds, a binary will be created that we can then run with `./output_binary`.
Disassembling#
To inspect the instructions of the compiled binary, we can use the objdump utility:
```
objdump -M intel -d ./output_binary
```

The `-M intel` tells objdump to use Intel-style disassembly, which we recommend as it’s much more common.
In the output we can see the instructions of our main function (on x86_64 architecture):
```
0000000000401136 <main>:
  401136: 55                    push   rbp
  401137: 48 89 e5              mov    rbp,rsp
  40113a: bf 04 20 40 00        mov    edi,0x402004
  40113f: b8 00 00 00 00        mov    eax,0x0
  401144: e8 e7 fe ff ff        call   401030 <printf@plt>
  401149: b8 00 00 00 00        mov    eax,0x0
  40114e: 5d                    pop    rbp
  40114f: c3                    ret
```

From this we see that the main function lies at virtual address 0x401136 (your address may differ); then we see the opcode bytes of the instructions (their hex-encoded form) and the human-readable mnemonic representation.
In the case of our main function, we see that the program prepares the stack (push rbp - save the stack base pointer, mov rbp,rsp - use the current stack top as the new stack base). Then it prepares the arguments according to the x86_64 calling convention (mov edi,0x402004 - the address of the hello world string, mov eax,0x0 - the number of floating-point arguments, zero). call 401030 <printf@plt> will jump with the execution to the printf function in the libc library.
Decompiling#
Even though a lot of information about the program is usually lost during compilation, there are tools that help recover some of it and reconstruct a more high-level view. These tools are called decompilers and are commonly combined with disassemblers into one tool. Examples include the open-source, NSA-developed Ghidra, IDA, or Binary Ninja.
Note that no reverse engineering or even decompilation should be necessary for this course - all source code will be provided. If you are interested in this topic, please check out NMMB460 Cryptanalysis Upon the Level of Instructions.
Debugging#
When analyzing and exploiting binaries, being able to inspect the program’s state during execution is invaluable. This is where debuggers come in - they allow us to pause execution, examine memory and registers, and step through instructions one by one.
The most commonly used debugger on Linux systems is GDB (GNU Debugger). GDB allows you to set breakpoints at specific addresses or function names, run the program until it hits a breakpoint, and then examine the state of the program at that point. You can inspect register values, read memory contents, modify both registers and memory, and continue execution step by step.
To start debugging a binary with GDB, simply run (assuming you have GDB installed):
```
gdb ./output_binary
```

You can also attach to an already running process, if you have the permissions to do that, with `gdb -p <PID> (<optional-binary-file>)`.
Once inside the GDB console, you can set a breakpoint at the main function with break main, start execution with run, and then use commands like info registers to see register values, x/10x $rsp to eXamine 10 words of memory at the stack pointer, nexti to execute a single instruction, or continue to continue execution until the next breakpoint.
While GDB by itself is powerful, its default interface can be quite sparse and unintuitive, especially for beginners. This is where pwndbg comes in (there are some other similar plugins, like gef, but we will use pwndbg here). pwndbg is a GDB plugin specifically designed for binary security analysis and exploit development. It enhances GDB with a much more user-friendly interface that, for example, automatically displays the current state of registers, or the stack. It also provides some useful commands we will use.
To install pwndbg, follow the instructions on its GitHub repository. Once installed, it will automatically activate when you start GDB.
This is how our binary looks after running gdb ./output_binary and then, in the console, break main and run:
For instance, we can see it automatically dereferenced the address of “Hello World!” for us in the disassembly view. How nice!
If we had compiled the binary with the -g option, which creates a debug build (gcc -g -o ./output_binary ./input_program.c), we would even see the binary’s source code and where we are in it.
Some of the commands that pwndbg provides and that will make analyzing binaries much easier are:

- `context` - This command displays a comprehensive view of the current program state. This is what is pictured in the screenshot above.
- `hexdump` - Displays memory contents in the hexdump format at a given address.
- `telescope` - Recursively dereferences memory addresses and displays what they point to. For instance, `telescope $rsp 20` will show 20 words starting at the stack pointer, and if any of those values are valid pointers, it will show what they point to as well.

Please go and try it for yourself with a binary you compiled.

For more tips, check out their cheat sheet.
Information for ARM CPU users: To install pwndbg without root on the Rotunda lab machines, you can do: `curl -qsL 'https://install.pwndbg.re' > ./install_pwndbg.sh`. Then manually inspect install_pwndbg.sh to check that it seems legit (you can compare it to the version on GitHub). And then run `sh ./install_pwndbg.sh -u`. This will install pwndbg in ~/.local/bin/pwndbg. Or you can download the corresponding releases from GitHub and unpack them manually.
Interacting programmatically#
When exploiting binaries, we often need to interact with them programmatically - sending input, receiving output, and automating the exploitation process. While we could use standard tools like netcat or write custom scripts, there’s a nicer™ solution: pwntools.
Pwntools is a Python library specifically designed for CTF challenges and exploit development. It provides a convenient API for interacting with processes (both local and remote), crafting payloads, and performing common exploitation tasks.
pwntools is a package available on PyPI; just install it through your favorite Python package manager.
Here’s a simple example of using pwntools to interact with a binary:
```python
from pwn import *

# Start a local process (this is a different binary now than just the simple hello world)
p = process('./output_binary')
# Or connect to a remote service
# p = remote('127.0.0.1', 1337)

# Receive output until a specific string appears
p.recvuntil(b'Enter your name:')
# Send input
p.sendline(b'Alice')
# Receive a line of output
response = p.recvline()
print(response)

# Interactive mode - allows you to interact with the process manually in your terminal
p.interactive()
```

Some of the most useful features of pwntools are:
- Process interaction - `process()` for local binaries, `remote()` for network connections over TCP

  ```python
  p = process('./vulnerable_binary')
  # or
  p = remote('challenge.server.com', 1337)
  ```

- Data sending/receiving - `send()`, `sendline()`, `recv()`, `recvline()`, `recvuntil()` for controlled communication

  ```python
  p.recvuntil(b'Enter password: ')
  p.sendline(b'secret123')
  response = p.recvline()
  ```

- Payload generation - Functions like `p64()` to pack a 64-bit integer into bytes (useful for addresses that are little endian on x86), `cyclic()` to generate patterns for finding buffer overflow offsets

  ```python
  address = 0x401234
  payload = b'A' * 40 + p64(address)  # Pack address as 8 bytes
  # Generate cyclic pattern to find offset
  pattern = cyclic(100)
  ```

- ELF parsing - Load binaries with `ELF('./binary')` to extract useful information like function addresses or the architecture

  ```python
  elf = ELF('./binary')
  main_addr = elf.symbols['main']
  ```

- Shellcode generation - Built-in shellcode for various architectures and purposes

  ```python
  shellcode = asm(shellcraft.amd64.linux.sh())  # Shellcode to spawn an interactive shell
  shellcode2 = asm(shellcraft.amd64.linux.cat("/flag.txt"))  # Shellcode to print the flag file
  ```

- CLI tools - One of the most useful ones being `checksec`, used to get the security features enabled for a given binary. Here is the output for the binary compiled above, from `checksec ./output_binary`:

  ```
  [*] './output_binary'
      Arch:       amd64-64-little
      RELRO:      Partial RELRO
      Stack:      No canary found
      NX:         NX enabled
      PIE:        No PIE (0x400000)
      Stripped:   No
      Debuginfo:  Yes
  ```
For more information, check out the pwntools documentation.
Exploiting a simple buffer overflow#
To illustrate how to exploit binary programs, let’s look at a simple example:
```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

void print_flag1() { /* ... */ }
void print_flag2() { /* ... */ }

char real_password[40];

int main() {
    bool ok = false;
    char password[40];
    printf("Input password: ");
    gets(password);
    if (strcmp(password, real_password) == 0)
        ok = true;
    if (ok)
        print_flag1();
    else
        printf("\nWrong password.\n");
}
```

An immediate red flag should be the gets function. The description in its man page starts with the sentence:
> Never use this function.
This function receives a single argument, a buffer, and reads characters from
stdin into it until it receives a newline or EOF. It however never checks for
bounds and will always continue writing into memory until it receives a newline
and unless the target has infinite memory[^1], it will always allow the
user to overflow the buffer and write into memory they shouldn’t be able to
access.
You might think to yourself “Oh, boohoo, the user can overwrite some memory, the worst that can happen is a SEGFAULT.” Throughout the coming lectures we’ll see that such an overflow can be easily exploited in multiple ways and that a segmentation fault almost always means the program can be somehow exploited, even if the impact is limited.
Let’s see what happens in our example. The overflow happens in the password
variable, which is a local variable and an array. In C, arrays are simply
contiguous bits of memory, large enough to store the specified number of
elements of a given type. When used, array variables behave like pointers, and
as you might know a[i] is just an alias for *(a+i), which adds the index
(implicitly multiplied by the size of whatever the pointer is pointing
to[^2]) to the pointer and then dereferences it[^3]. This means
gets will always just take the pointer it receives and sequentially write to
the pointer, increment it, and repeat, until it reaches a newline.
As we’ve learned, local variables are stored on the stack, so we’ll be able to
overwrite something on the stack, but what? Remember how we mentioned that the
stack grows downwards (towards lower addresses)? Since password is the last
local variable which gets defined, it is going to be at the top of the stack,
and so have the lowest address. When we’re passing around pointers, we have no
way to tell whether they reside on the stack or elsewhere, so arrays have to
behave the same regardless — so our array starts at the top of the stack and
grows towards the rest of the stack. That is actually great for us, because any
overflow we get on the stack will allow us to overwrite the entire rest of the
stack, including return addresses, but that’s skipping ahead a bit.
You can probably guess that the very next thing on the stack after password is
the ok variable. If we can overwrite it, we can get the first flag. In C,
booleans are actually just integers, 0 acts as false, any other value as
true. So writing 41 As to the input of the program should be enough to get
the flag? Well, not quite. C compilers like to keep variables stored
at “round” addresses and might keep some padding in between them to do so. You
usually have to figure that out experimentally, or by running the program in a
debugger and inspecting the memory. On the author’s machine, the number of As
needed was 48.
The next thing stored on the stack is, as we’ve learned earlier, the previous
value of rbp and the return address. This is probably the most important thing
on the stack. When the function ends and the ret instruction is called, 8
bytes are popped off the stack and jumped to, no matter what they are. And sure
enough, if we try sending many more As to the program, it will crash. Upon
closer inspection with gdb, we can see that when the program crashes, the last
attempted instruction was in fact a ret and the target address was
0x4141414141414141, which is in fact eight As. The segmentation fault
happens because 0x4141414141414141 is most likely unallocated memory which we
definitely should not be able to execute.
If we could overwrite the return address with the address of the print_flag2,
the CPU will happily “return” to the start of the function and execute it. To
figure out the address of print_flag2, we can again use gdb, or objdump.
Well, only if the binary isn’t a position-independent executable, for
which these addresses may change per execution. We’ll learn more about those in
the next lecture. After that we can just append eight bytes of whatever for
rbp and then eight bytes of the address to our original payload and we get the
second flag. Keep in mind that on x86_64, numbers are stored in
little-endian order, so the first byte is the least significant one — the
number 0x12345678 is stored in memory as 78 56 34 12.
TLDR#
- Processors execute machine code, which is a sequence of instructions.
- There are no variables, only finite registers and addressable memory.
- Local variables and other function state is stored on the call stack.
- Dynamically allocated memory is stored on the heap.
- Operating systems provide virtual memory — each process has its own address space and individual pages are mapped to physical memory by the OS.
- Use `man <function name>` to look up information about C functions.
- Use `objdump -d <executable>` to dump disassembled machine code of an executable.
- More advanced decompilers exist and can be used to more efficiently analyse unknown binaries.
- Use `gdb <executable>` to debug programs. `break <name>` creates a breakpoint, `run [args]` runs the program, `info registers` dumps register values, `x` reads memory, `nexti` executes a single instruction, `continue` resumes execution.
- The `pwndbg` plugin adds features useful for binary exploitation and many convenience features, like `context`, `hexdump` and `telescope`.
- `pwntools` is a Python library for writing (not only) binary exploits.
  - Interact with the program using tubes.
  - Generate payloads.
  - Parse executables.
  - Generate shellcodes.
  - Much more.
- Segmentation faults can often be exploited.
- A simple buffer overflow can lead to overwriting neighbouring values, but also to arbitrary code execution.
[^1]: Spoiler alert, it (most likely) doesn’t.

[^2]: Which means the result of pointer addition is dependent on the type of the pointer. This works since we’re always adding a pointer and a number, as adding two pointers together makes no sense.

[^3]: Yes, this does mean indexation is commutative and that `12[a]` is a perfectly valid and equivalent way to write `a[12]`.