AArch64 MMU Programming
MMU stands for Memory Management Unit. It is responsible for virtual address translation and memory access control. It is one of the most important subjects in OS development and, at the same time, one of the most confusing. In this post I will try to clear up the MMU programming process.
I've just merged the commit containing the MMU support implementation into the master branch of the LeOS repository. This post is not meant to be a step-by-step tutorial, but rather a developer's guide. So if you are looking for the code, go straight to GitHub.
Understanding the MMU
Let's start from the beginning of the boot process of the application from the previous post. It was compiled for the entry point at address 0x40080000. This exact address comes from the design of the QEMU virtual device:
- 0x00000000 - 0x3FFFFFFF is an area of memory-mapped peripherals. Using addresses from this range you can access the registers of multiple peripherals to configure and control them, just as we used the UART output register located at 0x09000000 to output a text string to the terminal.
- 0x40000000 - 0x4007FFFF is an area reserved for a bootloader.
- The kernel (or any bare-metal application) is loaded at address 0x40080000.
The address at which your kernel is loaded depends on the bootloader implementation. If you are using existing hardware or an emulator, you will most likely deal with an existing bootloader that loads your kernel file at some predefined address.
If you are about to try the same on Raspberry Pi 3, then start address will be
Also, for some bootloaders you will have to use a stripped binary instead of an ELF file, so there will be no information about the entry point address of your specific kernel at all. Loaded somewhere in memory, your kernel will still work, and even some branch operations (the PC-relative ones) will succeed, until you need to access or jump to an absolute address. If your kernel was compiled for an entry point matching the load address, it will work; otherwise the behavior is undefined. This problem can be completely solved by using the MMU.
And while it is just a side effect of MMU usage, I decided to start with it, because by understanding it, you will understand the whole concept.
So instead of having a separate Kernel build for each bootloader, we do the following:
- Choose a virtual address that will be set as the entry point of the Kernel at the compilation stage. For the LeOS kernel I've chosen 0xfffffff0_00000000. Later I will explain why.
- As the first instruction of the Kernel code, save the current address in some register that will be kept untouched for a while:
adr x20, .
- Implement position-independent start code that initializes the MMU and enables address translation from the virtual 0xfffffff0_00000000 to the real address stored in that register.
- Perform a long jump to the entry function of the Kernel,
kernel_main in case of
You may also want to map the peripherals memory area mentioned above,
which differs from SoC to SoC,
to a fixed virtual address. For LeOS it is set to be
Now let's see how an address is translated on AArch64.
Address Translation Process
A call to the kernel_main function is a jump to address 0xfffffff0_00000278 in the current build. The translation process for that address is shown and described below:
- The MMU checks whether the highest bits 63..37 of the address are all set to 1 or all set to 0. In the first case the MMU continues the lookup using the ttbr1_el1 register, and in the second case using ttbr0_el1. The difference between them will be explained later.
- From the chosen register the MMU gets the physical address of the so-called L1 translation table. A translation table is just an array of up to 512 descriptors, 8 bytes each.
- The MMU uses bits 36..30 of the address as the index of a descriptor in the L1 table. For the example address it is index 64. Descriptor 64 of the L1 table contains the address of the L2 translation table.
- The MMU uses bits 29..21 as the descriptor index in the L2 table, which is 0 for the example. That descriptor contains the address of the last translation table, L3.
- Bits 20..12 of the address are used by the MMU as the index of a descriptor inside the L3 translation table. For the example it is also 0. That descriptor contains the address of the target 4KB memory page.
- As the last step of the translation, the MMU takes the lowest 12 bits (11..0) of the address as the offset inside the target memory page. For the example address the offset is 0x278 = 632 bytes.
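The index arithmetic from the steps above can be sketched in C. This is only a minimal illustration, assuming the configuration used in this post (4KB granule, T1SZ = 27). The `va_parts` struct and the `split_va` function are names made up for the example and are not taken from the LeOS sources.

```c
#include <stdint.h>

/* Pieces of a virtual address, assuming 4KB granule and a 37-bit address space. */
struct va_parts {
    unsigned l1;      /* bits 36..30: index into the L1 table */
    unsigned l2;      /* bits 29..21: index into the L2 table */
    unsigned l3;      /* bits 20..12: index into the L3 table */
    unsigned offset;  /* bits 11..0:  offset inside the 4KB page */
};

/* Split a virtual address the same way the MMU does in the steps above. */
static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.l1     = (unsigned)((va >> 30) & 0x7F);   /* 7 bits because T1SZ = 27 */
    p.l2     = (unsigned)((va >> 21) & 0x1FF);  /* 9 bits */
    p.l3     = (unsigned)((va >> 12) & 0x1FF);  /* 9 bits */
    p.offset = (unsigned)(va & 0xFFF);          /* 12 bits */
    return p;
}
```

For the example address 0xfffffff0_00000278 this yields L1 index 64, L2 index 0, L3 index 0 and offset 0x278, matching the walk-through above.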
For a complete overview, here is the format of a translation table descriptor for this case:
+--------+--------+-----+-----+--------+------------------------+----+----+------+------+----+------+----+----+
| R      | SW     | UXN | PXN | R      | Output address [47:12] | R  | AF | SH   | AP   | NS | INDX | TB | VB |
+--------+--------+-----+-----+--------+------------------------+----+----+------+------+----+------+----+----+
| 63..59 | 58..55 | 54  | 53  | 52..48 | 47..12                 | 11 | 10 | 9..8 | 7..6 | 5  | 4..2 | 1  | 0  |
+--------+--------+-----+-----+--------+------------------------+----+----+------+------+----+------+----+----+
R - reserved
SW - reserved for software use
UXN - unprivileged execute never
PXN - privileged execute never
AF - access flag
SH - shareable attribute
AP - access permission
NS - security bit
INDX - index into MAIR register
TB - table descriptor bit
VB - validity descriptor bit
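The same layout can be written down as C bit masks. The sketch below follows the field positions from the diagram above. The macro names and the `make_page_desc` helper are invented for this example, and always setting the access flag is an assumption made to keep the sketch short.

```c
#include <stdint.h>

#define DESC_VALID      (1ULL << 0)                      /* VB */
#define DESC_TABLE      (1ULL << 1)                      /* TB: table, or page at the last level */
#define DESC_INDX(i)    ((uint64_t)((i) & 0x7) << 2)     /* index into MAIR */
#define DESC_NS         (1ULL << 5)
#define DESC_AP(ap)     ((uint64_t)((ap) & 0x3) << 6)
#define DESC_SH(sh)     ((uint64_t)((sh) & 0x3) << 8)
#define DESC_AF         (1ULL << 10)
#define DESC_PXN        (1ULL << 53)
#define DESC_UXN        (1ULL << 54)
#define DESC_ADDR_MASK  0x0000FFFFFFFFF000ULL            /* output address, bits 47..12 */

/* Build an L3 page descriptor. Extra bits (AP, SH, PXN, UXN) are passed in flags. */
static uint64_t make_page_desc(uint64_t pa, unsigned mair_index, uint64_t flags)
{
    return (pa & DESC_ADDR_MASK) | DESC_INDX(mair_index) | flags
         | DESC_AF | DESC_TABLE | DESC_VALID;
}

/* Extract the output address back from a descriptor. */
static uint64_t desc_output_address(uint64_t desc)
{
    return desc & DESC_ADDR_MASK;
}
```

With these definitions, `make_page_desc(0x40080000, 0, 0)` gives 0x40080403: the page address plus the AF, TB and VB bits.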
You can find quite nice documentation of the address translation process on the ARM website. It was my handbook during development.
Looks complicated, doesn't it? I will try to clarify everything step by step, but first there is a reasonable question to answer.
Why do we need this?
Long story short, for security. The MMU makes it possible to build independent, isolated virtual address spaces. Each process inside the operating system can run in its own address space, as if it were alone in memory, without any possibility of accessing the code or data of other processes. In fact the process could be placed at any address of physical memory and even be fragmented.
By splitting memory into pages, the MMU makes it possible to specify attributes for each page in the descriptors inside translation tables. Using these attributes the OS can control write and execute permissions, privileged access, cache options and even OS-specific features.
The AArch64 MMU comes with separate translation table base registers for specific
exception levels. It allows, for example, to always keep the Kernel address space in the
ttbr1_el1 register, keeping its cache always valid even during
context switching, which will affect only user space, by
The address translation example above is not comprehensive. There are various options the developer can choose between, and the choice can significantly change the behavior of the MMU, so it is important to understand them all.
The MMU is configured through an Exception Level specific translation control register. The EL1 exception level is meant to be used by a Kernel, while EL0, as the less privileged one, is for user applications. There are also EL2 for a hypervisor (virtualization) and EL3 for the secure monitor. From here on I will speak mainly about EL1, but keep in mind that there are other exception levels too.
AArch64 splits the address space into a lower and a higher half, making it possible to configure them separately. Registers and their fields for the lower half are marked with 0:
TCR_EL1.T0SZ, while higher half ones are marked with 1:
Translation Granule is the size of a page, the minimal memory area that can be mapped. It is controlled by
TCR_EL1 register and can be 4KB, 16KB or 64KB. Changing the translation granule changes everything: the size of a page, and therefore the size of a translation table, the number of address bits treated as indexes, the number of translation levels and the size of the output address inside the descriptor.
The size of the address space is also configurable by
TCR_EL1 register. The value inside the fields is the number of higher bits excluded from the translation process. These bits must be all 1 or all 0, depending on the address space half. If you scroll up to the address translation example, you may see that only 7 bits are used for the index into the L1 table, while there are 9 for the other tables. This is because
T1SZ is set to 27. It also removes one extra translation level, L0, as unnecessary, because the address space is limited to
64 - 27 = 37 bits, which still allows addressing up to 128GB of RAM, 8 times more than I have in my current workstation. It is worth mentioning that the number of translation levels affects the speed of the translation process.
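The arithmetic behind the size fields can be checked with a few lines of C. This is a sketch only: the function names are invented for the example, and the size value is assumed to be in the range 1..63 so that the shifts are well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of address bits that take part in translation for a given T0SZ/T1SZ. */
static unsigned va_bits(unsigned tnsz)
{
    return 64 - tnsz;
}

/* Size in bytes of one half of the address space. */
static uint64_t va_space_size(unsigned tnsz)
{
    return 1ULL << va_bits(tnsz);
}

/* An upper-half address is valid only if its top T1SZ bits are all 1. */
static bool is_upper_half_va(uint64_t va, unsigned t1sz)
{
    uint64_t mask = ~0ULL << (64 - t1sz);
    return (va & mask) == mask;
}

/* A lower-half address is valid only if its top T0SZ bits are all 0. */
static bool is_lower_half_va(uint64_t va, unsigned t0sz)
{
    uint64_t mask = ~0ULL << (64 - t0sz);
    return (va & mask) == 0;
}
```

With T1SZ = 27, `va_bits` returns 37 and `va_space_size` returns 128GB. The address 0xfffffff0_00000278 passes the upper-half check, while 0xffff0000_00000000 does not, because its bits 47..37 are zero.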
Descriptor validity is controlled by bit 0 of a descriptor in a translation table. If the bit is set to 1, the descriptor is valid and the MMU will use it for translation. If the bit is set to 0, the MMU will trigger an Exception and the OS can handle it in some way, by allocating memory or by terminating the process that tried to access a wrong memory address.
Block mapping is another option of descriptors in translation tables. It is controlled by bit 1 of the descriptor, which defines how the MMU should treat the output address: as the address of the next translation table or as the address of the target memory block. This feature makes it possible to map areas of memory larger than the page size defined by the translation granule. For the 4KB granule it is possible to map blocks of
4KB * 512 = 2MB and
2MB * 512 = 1GB, which is very useful in practice.
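The block sizes follow directly from the table geometry. A small sketch, assuming the 4KB granule with 512 descriptors per table (the `level_span` name is made up for the example):

```c
#include <stdint.h>

#define GRANULE_SIZE  4096ULL   /* 4KB translation granule */
#define TABLE_ENTRIES 512ULL    /* descriptors per translation table */

/* Bytes of memory covered by one descriptor at the given level (1, 2 or 3). */
static uint64_t level_span(int level)
{
    uint64_t span = GRANULE_SIZE;          /* an L3 descriptor maps one page */
    for (int l = 3; l > level; l--)
        span *= TABLE_ENTRIES;             /* each level up covers 512 times more */
    return span;
}
```

Here `level_span(3)` is 4KB, `level_span(2)` is 2MB and `level_span(1)` is 1GB.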
The OS can control access to memory pages and blocks with the
AP field of the descriptor, marking some pages as
read-only and explicitly allowing or denying access from unprivileged
Another possibility the OS can use is to mark pages as non-executable, separately for the privileged Kernel and the unprivileged user exception level.
AArch64 also provides the Memory Attribute Indirection Register (MAIR) for flexible configuration of memory areas. Each of EL1, EL2 and EL3 has its own MAIR. You can think of a MAIR as an array of 8 elements, 8 bits each. You can store up to 8 attribute sets inside it and refer to them by the index 0..7 stored in the
INDX field of the descriptor.
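The "array of 8 bytes" view of MAIR can be sketched as follows. The two attribute values are the ones mentioned in this post (0xFF for normal memory, 0b00000100 for Device-nGnRE). The helper names and the choice of which index holds which attribute are assumptions made for the example.

```c
#include <stdint.h>

#define MAIR_NORMAL        0xFFu   /* normal memory, as used in this post */
#define MAIR_DEVICE_nGnRE  0x04u   /* device memory, 0b00000100 */

/* Return a copy of mair with the 8-bit attribute at the given index (0..7) replaced. */
static uint64_t mair_set(uint64_t mair, unsigned index, uint8_t attr)
{
    unsigned shift = (index & 7u) * 8u;
    mair &= ~(0xFFULL << shift);             /* clear the old attribute */
    return mair | ((uint64_t)attr << shift); /* store the new one */
}

/* Read the attribute stored at the given index (0..7). */
static uint8_t mair_get(uint64_t mair, unsigned index)
{
    return (uint8_t)(mair >> ((index & 7u) * 8u));
}
```

Storing normal memory at index 0 and device memory at index 1 produces the value 0x04FF, which would then be written to the MAIR register. A descriptor with INDX = 1 refers to the device attribute.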
As a final stage of this post, I would like to share some development and troubleshooting tips that could help you with your own implementation.
Decide on translation granule size
For small applications the 4KB size is a good choice, and as a fan of simple kernels I would also recommend it for a kernel. But the most important thing, if you are just starting with MMU programming, is to focus on one case only, because otherwise it can create an unwanted mess in your head. Keep it simple.
For some reason, AArch64 uses different formats for
TG1 values. For example, for 4KB granule I had to set
TG1=0b10. Always refer to the official documentation to avoid mistakes.
Don't forget that target addresses inside the descriptors must be aligned to the granule size. You can achieve that with macros inside your assembler code or in the linker script:
. = ALIGN(0x1000);
LD_TTBR1_BASE = .;
. = . + 0x1000;
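If the tables are declared in C rather than in the linker script, the same alignment can be requested from the compiler and checked at run time. A minimal sketch, assuming a C11 compiler and the 4KB granule. The table and helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define GRANULE_SIZE 0x1000ULL   /* 4KB granule */

/* One translation table: 512 descriptors, aligned to the granule size. */
static _Alignas(4096) uint64_t l1_table[512];

/* True if the address can be used as a table or page address. */
static bool is_granule_aligned(uint64_t addr)
{
    return (addr & (GRANULE_SIZE - 1)) == 0;
}

/* Round an address up to the next granule boundary. */
static uint64_t align_up(uint64_t addr)
{
    return (addr + GRANULE_SIZE - 1) & ~(GRANULE_SIZE - 1);
}
```

A check like `is_granule_aligned((uintptr_t)l1_table)` before writing the table address to a base register catches alignment mistakes early.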
Start with block mapping
It is enough to start with just one translation table with just one descriptor inside, which maps, for example, 1GB of RAM, covering the whole area of your kernel or even all available memory. The translation process is hard to debug, so the fewer places where you can make a mistake, the easier it will be to find the problem.
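A single-descriptor setup like that could look roughly as follows. This is a sketch under several assumptions: the 4KB granule, an address space small enough for translation to start at L1, memory attributes taken from MAIR index 0, and default access permissions. The names `l1_table` and `map_1gb_block` are invented for the example.

```c
#include <stdint.h>

#define DESC_VALID  (1ULL << 0)    /* bit 0: descriptor is valid */
#define DESC_AF     (1ULL << 10)   /* access flag */
#define BLOCK_1GB   (1ULL << 30)   /* memory covered by one L1 block descriptor */

/* The only translation table: 512 descriptors, aligned to 4KB. */
static _Alignas(4096) uint64_t l1_table[512];

/* Map the 1GB block containing va to the 1GB block containing pa. */
static void map_1gb_block(uint64_t va, uint64_t pa)
{
    unsigned index = (unsigned)((va >> 30) & 0x1FF);   /* L1 index, bits 38..30 */
    uint64_t base  = pa & 0x0000FFFFC0000000ULL;       /* output address aligned to 1GB */

    /* Bit 1 stays 0, so the MMU treats this as a block, not as a table. */
    l1_table[index] = base | DESC_AF | DESC_VALID;
}
```

Calling `map_1gb_block(0x40080000, 0x40080000)` fills descriptor 1 with 0x40000401, which covers the whole 0x40000000 - 0x7FFFFFFF range with one entry.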
Start with identity mapping
Identity mapping is a way of mapping memory in which virtual addresses map to the same addresses of physical memory. It is not a good practice in general, because it is less secure and more confusing on complex systems, but for the beginning it is an absolutely normal and even necessary thing to do.
It is necessary because after you enable the MMU, the program counter of the CPU still points to the next instruction by its real physical address, and if there is no identity mapping, the MMU will trigger an exception. Handling this exception could be a solution, but it is probably not the way to go for a proof-of-concept application.
For Identity mapping you can initialize
TCR_EL1 registers with a small 2-byte
value and get a working example.
Configure your memory block in the descriptor as executable and writeable for
all exception levels, initialize
0xFF value (normal memory) and
refer to it as
Add translation levels
Once your identity mapping translation works, try adding the next level of translation. When it is ready, add the next one, until you reach the final level for your granule size.
Don't forget to change bit 1 of the descriptors in upper translation levels, so that they are marked as table descriptors. It is a very common source of problems.
When you have all translation levels, you can configure pages for code, read-only data and normal data differently. A write attempt to read-only memory will trigger an Exception, and you should easily notice it in your debugger.
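One possible way to give the three kinds of pages different attributes is sketched below. The policy itself is an assumption for illustration, not necessarily what LeOS does: code is read-only and executable for the kernel, read-only data is neither writable nor executable, and normal data is writable but not executable. The AP values used are 0b00 (read/write at EL1 only) and 0b10 (read-only at EL1 only). All names are hypothetical.

```c
#include <stdint.h>

#define DESC_PAGE       0x3ULL          /* valid bit plus bit 1: L3 page descriptor */
#define DESC_AF         (1ULL << 10)    /* access flag */
#define DESC_AP_RW_EL1  (0x0ULL << 6)   /* read/write, kernel only */
#define DESC_AP_RO_EL1  (0x2ULL << 6)   /* read-only, kernel only */
#define DESC_PXN        (1ULL << 53)    /* privileged execute never */
#define DESC_UXN        (1ULL << 54)    /* unprivileged execute never */

enum page_kind { PAGE_CODE, PAGE_RODATA, PAGE_DATA };

/* Build an L3 descriptor for a kernel page of the given kind. */
static uint64_t kernel_page_desc(uint64_t pa, enum page_kind kind)
{
    uint64_t d = (pa & 0x0000FFFFFFFFF000ULL) | DESC_AF | DESC_PAGE;

    switch (kind) {
    case PAGE_CODE:   return d | DESC_AP_RO_EL1 | DESC_UXN;
    case PAGE_RODATA: return d | DESC_AP_RO_EL1 | DESC_UXN | DESC_PXN;
    case PAGE_DATA:   return d | DESC_AP_RW_EL1 | DESC_UXN | DESC_PXN;
    }
    return 0;   /* unknown kind: leave the descriptor invalid */
}
```

With such a split, a stray write into a code or read-only data page raises a permission fault instead of silently corrupting the kernel.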
Map peripherals memory
Peripherals memory area should be mapped differently and the right place to
MAIR register. I ended up with the 0b00000100 value, which corresponds to Device-nGnRE memory (non-cacheable). Don't forget to use the proper attribute index inside the descriptor.
Map kernel at higher half of memory
It is a common practice to map the kernel at the higher half and applications at the lower half. It helps to isolate the kernel from user applications and even simplifies the life of the developer a bit, because by the address of some value or instruction you can easily guess where the error or exception happened.
When you implement translation tables for Kernel in upper half, don't forget to check the value of
TCR_EL1. The value inside it defines how many higher bits of your virtual address must be set to 1. You can end up in a situation where your translation tables are set up properly, but you refer to a wrong address, let's say
A debugger, a pencil and a sheet of paper are your best friends. It is useful to draw a scheme like the one above for your granule size, pretend you are an MMU and follow each step of the translation. I also wrote down the addresses of some instructions to manually put breakpoints in GDB after I changed the code.
But if the MMU does not work for you, make sure that:
- Your translation tables are aligned according to the granule size
- Proper target addresses inside descriptors are stored in translation tables at proper offsets
- Block descriptors are marked as blocks and table descriptors as table descriptors
- Descriptors contain proper access permissions for pages
- T1SZ fields correspond to the number of unset or set bits of your virtual address
- If you changed the values of these fields, you have also updated the linker script with the actual entry point of your kernel
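The "pretend you are an MMU" exercise can also be done in code, which makes several items of this checklist testable on a host machine. The sketch below walks tables stored in a small fake RAM array. It assumes the 4KB granule and the 37-bit layout from this post, ignores permissions and the choice between ttbr0_el1 and ttbr1_el1, and does no bounds checking. Every name in it is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_RAM_BASE  0x40000000ULL
#define FAKE_RAM_WORDS (4 * 512)                 /* four 4KB pages of fake RAM */
#define ADDR_MASK      0x0000FFFFFFFFF000ULL     /* descriptor bits 47..12 */

static uint64_t fake_ram[FAKE_RAM_WORDS];

/* Access fake physical memory; pa must lie inside the fake RAM (not checked). */
static uint64_t read_phys(uint64_t pa) { return fake_ram[(pa - FAKE_RAM_BASE) / 8]; }
static void write_phys(uint64_t pa, uint64_t v) { fake_ram[(pa - FAKE_RAM_BASE) / 8] = v; }

/* Walk L1 -> L2 -> L3. Returns false where the real MMU would raise a translation fault. */
static bool walk(uint64_t ttbr, uint64_t va, uint64_t *pa_out)
{
    static const unsigned shift[3] = { 30, 21, 12 };
    static const unsigned mask[3]  = { 0x7F, 0x1FF, 0x1FF };
    uint64_t table = ttbr & ADDR_MASK;

    for (int level = 0; level < 3; level++) {
        unsigned index = (unsigned)(va >> shift[level]) & mask[level];
        uint64_t desc  = read_phys(table + 8ULL * index);
        bool last      = (level == 2);

        if (!(desc & 1))
            return false;                        /* bit 0 clear: invalid descriptor */
        if (last && !(desc & 2))
            return false;                        /* an L3 descriptor must have bit 1 set */
        if (last || !(desc & 2)) {               /* page at L3, or block at L1/L2 */
            uint64_t span_mask = (1ULL << shift[level]) - 1;
            *pa_out = (desc & ADDR_MASK & ~span_mask) | (va & span_mask);
            return true;
        }
        table = desc & ADDR_MASK;                /* bit 1 set: address of the next table */
    }
    return false;
}
```

To try it, place an L1 table at 0x40000000 with descriptor 64 pointing to an L2 table, chain it to an L3 table using `write_phys`, and call `walk` with the address you expect to be mapped. A `false` result points at the level where a descriptor is missing or wrongly marked.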
One of the most useful GDB commands could be a memory examination:
This will show the content of memory at this address, or an error message if it is unreadable.
I hope it was useful. I have found several code examples of MMU programming, and some of them even worked out of the box. But when I tried to do it on my own, I faced many issues and it was hard to find practical tips on what to do, so I decided to gather them all in one place and maybe describe everything from another side, so that the reader can have a better overview of the techniques.
If you have any questions regarding the MMU, feel free to leave a comment below, I would be glad to help. Also let me know if I have missed something or if you would like to have more information on the subject in text or video format.