# Xtensa MMU Operation

As with other elements of the architecture, paged virtual memory management on Xtensa is somewhat unique. And there is similarly a lack of introductory material available. This document is an attempt to introduce the architecture at an overview/tutorial level, and to describe Zephyr's specific implementation choices.

## General TLB Operation

The Xtensa MMU operates on top of a fairly conventional TLB cache. The TLB stores virtual-to-physical translations for individual pages of memory. It is partitioned into an automatically managed 4-way set-associative bank of entries mapping 4k pages, and 3-6 "special" ways storing mappings under OS control. Some of these are for mapping pages larger than 4k, which Zephyr does not directly support. A few are for bootstrap and initialization, and will be discussed below.

Like the L1 cache, the TLB is split into separate instruction and data entries. Zephyr manages both as needed, but symmetrically. The architecture technically supports separately-virtualized instruction and data spaces, but the hardware page table refill mechanism (see below) does not, and Zephyr's memory spaces are unified regardless.

The TLB may be loaded with permissions and attributes controlling cacheability, access control based on ring (i.e. the contents of the RING field of the PS register), and toggleable write and execute access. Memory access, even with a matching TLB entry, may therefore generate Kernel/User exceptions as desired to enforce permission choices on userspace code.

Live TLB entries are tagged with an 8-bit "ASID" value derived from the ring field of the PTE that loaded them, via a simple translation specified in the RASID special register. The intent is that each non-kernel address space will get a separate ring 3 ASID set in RASID, such that you can switch between them without a TLB flush. The ASID value of ring zero is fixed at 1; it may not be changed.
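The ring-to-ASID translation can be pictured as a byte lookup in RASID. The sketch below assumes the conventional 4×8-bit RASID layout (one ASID field per ring, with ring 0 in the low byte); treat that layout as an assumption for illustration, not something this document specifies:

```python
# Assumed layout: RASID packs one 8-bit ASID per ring, ring 0 in the
# low byte.  Ring 0's ASID is architecturally fixed at 1.
def asid_for_ring(rasid: int, ring: int) -> int:
    """Return the ASID that TLB entries loaded at `ring` are tagged with."""
    assert 0 <= ring <= 3
    return (rasid >> (8 * ring)) & 0xFF

# Example: the kernel (ring 0) keeps ASID 1, while a user address
# space gets a distinct ring 3 ASID, switchable without a TLB flush.
rasid = (7 << 24) | 1
```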
(An ASID value of zero is used to tag an invalid/unmapped TLB entry at initialization, but this mechanism isn't accessible to OS code except in special circumstances, and in any case there is already an invalid attribute value that can be used in a PTE.)

## Virtually-mapped Page Tables

Xtensa has a unique (and, to someone exposed for the first time, extremely confusing) "page table" format. The simplest way to begin explaining this is just to describe the (quite simple) hardware behavior:

On a TLB miss, the hardware immediately does a single fetch (at ring 0 privilege) from RAM by adding the "desired address right shifted by 10 bits with the bottom two bits set to zero" (i.e. the virtual page number in units of 4 bytes) to the value in the PTEVADDR special register. If this load succeeds, then the word is treated as a PTE with which to fill the TLB and use for a (restarted) memory access. This is extremely simple (just one extra hardware state that does just one thing the hardware can already do), and quite fast (only one memory fetch vs. e.g. the 2-5 fetches required to walk a page table on x86).

This special "refill" fetch is otherwise identical to any other memory access, meaning it too uses the TLB to translate from a virtual to a physical address. This means that the page tables occupy a 4M region of virtual, not physical, address space, in the same memory space occupied by the running code. The 1024 pages in that range (not all of which might be mapped in physical memory) are a linear array of 1048576 4-byte PTE entries, each describing a mapping for 4k of virtual memory. Note especially that exactly one of those pages contains the 1024 PTE entries for the 4M page table itself, pointed to by PTEVADDR.

Obviously, the page table memory being virtual means that the fetch can fail: there are 1024 possible pages in a complete page table covering all of memory, and the ~16-entry TLB clearly won't contain entries mapping all of them.
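The refill fetch address computation just described is easy to write down concretely. The PTEVADDR value below is an arbitrary example, not a real platform's:

```python
PTE_SIZE = 4  # one 32-bit PTE describes one 4k page

def pte_fetch_addr(ptevaddr: int, vaddr: int) -> int:
    """Address the hardware loads from on a TLB miss for vaddr:
    PTEVADDR + (vaddr >> 10, bottom two bits cleared), which is
    equivalent to PTEVADDR + 4 * virtual_page_number."""
    return ptevaddr + ((vaddr >> 10) & ~0x3)

ptevaddr = 0x20000000  # example 4M-aligned PTEVADDR

# Virtual page 3 is described by the fourth PTE in the linear array:
assert pte_fetch_addr(ptevaddr, 0x3ABC) == ptevaddr + 3 * PTE_SIZE

# The one self-describing page: the PTEs that map the page table
# region itself live at the PTE address computed for PTEVADDR.
self_map_pte = pte_fetch_addr(ptevaddr, ptevaddr)
```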
If we are missing a TLB entry for the page translation we want (NOT for the original requested address; we already know we're missing that TLB entry), the hardware has exactly one more special trick: it throws a TLB Miss exception (there are two, one each for the instruction/data TLBs, but in Zephyr they operate identically). The job of that exception handler is simply to ensure that the TLB has an entry for the page table page we want, and the simplest way to do that is to just load the faulting PTE as an address, which will then go through the same refill process above.

This second TLB fetch in the exception handler may result in an invalid/inapplicable mapping within the 4M page table region. This is a typical/expected runtime fault, and simply indicates unmapped memory. The result is a TLB miss exception from within the TLB miss exception handler (i.e. while the EXCM bit is set). This will produce a Double Exception fault, which is handled by the OS identically to a general Kernel/User data access prohibited exception.

After the TLB refill exception, the original faulting instruction is restarted, which retries the refill process, which succeeds in fetching a new TLB entry, which is then used to service the original memory access. (And may then result in yet another exception if it turns out that the TLB entry doesn't permit the access requested, of course.)

## Special Cases

The page-tables-specified-in-virtual-memory trick works very well in practice. But it does have a chicken/egg problem with the initial state. Because everything depends on state in the TLB, something needs to tell the hardware how to find a physical address using the TLB to begin the process. Here we exploit the separate non-automatically-refilled TLB ways to store bootstrap records.

First, note that the refill process to load a PTE requires that the 4M space of PTE entries be resolvable by the TLB directly, without requiring another refill.
This 4M mapping is provided by a single page of PTE entries (which itself lives inside the 4M page table region!). This page must always be in the TLB.

Thankfully, for the data TLB Xtensa provides 3 special/non-refillable ways (ways 7-9), each with at least one 4k page mapping. We can use one of these to "pin" the top-level page table entry in place, ensuring that a refill access will always be able to find a PTE address.

But now note that the load from that PTE address for the refill is done in an exception handler, and running an exception handler requires doing a fetch via the instruction TLB. That obviously means that the page(s) containing the exception handler must never require a refill exception of their own.

Ideally we would just pin the vector/handler page in the ITLB in the same way we do for data, but somewhat inexplicably, Xtensa does not provide 4k "pinnable" ways in the instruction TLB (frankly this seems like a design flaw). Instead, we load ITLB entries for the vector handlers via the refill mechanism using the data TLB, and so need the refill mechanism for the vector page to always succeed. The way to do this is to similarly pin the page table page containing the (single) PTE for the vector page in the data TLB, such that instruction fetches always find their TLB mapping via refill, without requiring an exception.

## Initialization

Unlike most other architectures, Xtensa does not have a "disable" mode for the MMU. Virtual address translation through the TLB is active at all times. There therefore needs to be a mechanism for the CPU to execute code before the OS is able to initialize a refillable page table.

The way Xtensa resolves this (on the hardware Zephyr supports; see the note below) is to have an 8-entry set ("way 6") of 512M pages able to cover all of memory. These 8 entries are initialized as valid, with attributes specifying that they are accessible only with an ASID of 1 (i.e. the fixed ring zero / kernel ASID), writable, executable, and uncached.
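To make the bootstrap bookkeeping concrete, here is a small sketch. The PTEVADDR and vector-page addresses are invented examples and the helper names are ours, not Zephyr's; only the arithmetic comes from the text:

```python
PTEVADDR = 0x20000000   # example 4M-aligned page table region
VECBASE = 0x00002000    # example address of the exception vector page

def pte_addr(vaddr: int) -> int:
    """Virtual address of the PTE describing vaddr (the refill rule)."""
    return PTEVADDR + ((vaddr >> 10) & ~0x3)

def page_of(addr: int) -> int:
    """4k page containing addr."""
    return addr & ~0xFFF

# Pin #1: the page of PTEs mapping the 4M page table region itself.
pin_l1 = page_of(pte_addr(PTEVADDR))

# Pin #2: the page of PTEs containing the PTE for the vector page, so
# instruction-side refills for the handlers never need a nested exception.
pin_vectors = page_of(pte_addr(VECBASE))

# The way-6 bootstrap entries: eight 512M pages span a full 32-bit space.
assert 8 * 512 * 1024 * 1024 == 1 << 32
```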
So at boot the CPU relies on these TLB entries to provide a clean view of hardware memory. But that means that enabling page-level translation requires some care, as the CPU will throw an exception ("multi hit") if a memory access matches more than one live entry in the TLB. The initialization algorithm is therefore:

0. Start with a fully-initialized page table layout, including the top-level "L1" page containing the mappings for the page table itself.

1. Ensure that the initialization routine does not cross a page boundary (to prevent stray TLB refill exceptions), that it occupies a separate 4k page from the exception vectors (which we must temporarily double-map), and that it operates entirely in registers (to avoid doing memory accesses at inopportune moments).

2. Pin the L1 page table PTE into the data TLB. This creates a double-mapping condition, but it is safe as nothing will use it until we start refilling.

3. Pin the page table page containing the PTE for the TLB miss exception handler into the data TLB. This will likewise not be accessed until the double-map condition is resolved.

4. Set PTEVADDR appropriately. The CPU state needed to handle refill exceptions is now complete, but cannot be used until we resolve the double mappings.

5. Disable the initial/way6 data TLB entries first, by setting them to an ASID of zero. This is safe as the code being executed is not doing data accesses yet (including refills), and it resolves the double-mapping conditions we created above.

6. Disable the initial/way6 instruction TLB entries second. The very next instruction fetch following the invalidation of the currently-executing code page will then cause a TLB refill exception, which will work normally because we just resolved the final double-map condition. (Pedantic note: if the vector page and the currently-executing page are in different 512M way6 pages, disable the mapping for the exception handlers first so the trap from our current code can be handled.
Currently Zephyr doesn't handle this condition, as on all reasonable hardware these regions will be near each other.)

Note: there is a different variant of the Xtensa MMU architecture where the way 5/6 pages are immutable, and specify a set of unchangeable mappings from the final 384M of memory to the bottom and top of physical memory. The intent here would (presumably) be that these would be used by the kernel for all physical memory, and that the remaining memory space would be used for virtual mappings. This doesn't match Zephyr's architecture well, as we tend to assume page-level control over physical memory (e.g. .text/.rodata is cached but .data is not on SMP, etc...). And in any case we don't have any such hardware to experiment with. But with a little address translation we could support this.

## ASID vs. Virtual Mapping

The ASID mechanism in Xtensa works like other architectures, and is intended to be used similarly. The intent of the design is that at context switch time, you can simply change RASID and the page table data, and leave any existing mappings in place in the TLB under the old ASID value(s). So in the common case where you switch back, nothing needs to be flushed.

Unfortunately this runs afoul of the virtual mapping of the page table refill: data TLB entries storing the 4M page table mapping space are tagged with ASID 1 (ring 0), so they can't change when the page tables change! This region would naively have to be flushed, which is tantamount to flushing the entire TLB regardless (the TLB is much smaller than the 1024-page PTE array).

The resolution in Zephyr is to give each ASID its own PTEVADDR mapping in virtual space, such that the page tables don't overlap. This is expensive in virtual address space: assigning 4M of space to each of the 256 ASIDs (actually 254, as 0 and 1 are never used by user access) would take a full gigabyte of address space.
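The address-space cost quoted above is simple arithmetic:

```python
PT_SPAN = 4 * 1024 * 1024   # 4M of virtually-mapped PTEs per ASID
USABLE_ASIDS = 254          # 256 minus the reserved values 0 and 1

total = USABLE_ASIDS * PT_SPAN
assert total == 1_065_353_216   # ~0.99 GiB: effectively a full gigabyte
```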
Zephyr optimizes this a bit by deriving a unique sequential ASID from the hardware address of the statically allocated array of L1 page table pages.

Note, obviously, that any change to the mappings within an ASID (e.g. to re-use it for another memory domain, or just for any runtime mapping change other than mapping previously-unmapped pages) still requires a TLB flush, and always will.

## SMP/Cache Interaction

A final important note is that the hardware PTE refill fetch works like any other CPU memory access, and in particular it is governed by the cacheability attributes of the TLB entry through which it was loaded. This means that if the page table entries are marked cacheable, then the hardware TLB refill process will be downstream of the L1 data cache on the CPU. If the physical memory storing page tables has been accessed recently by the CPU (for a refill of another page mapped within the same cache line, or to change the tables) then the refill will be served from the data cache and not main memory.

This may or may not be desirable depending on access patterns. It lets the L1 data cache act as an "L2 TLB" for applications with a lot of access variability. But it also means that the TLB entries end up being stored twice in the same CPU, wasting transistors that could presumably store other useful data.

But it is also important to note that the L1 data cache on Xtensa is incoherent! The cache being used for refill reflects the last access on the current CPU only, and not the state of the underlying memory being mapped. Page table changes in the data cache of one CPU will be invisible to the data cache of another. There is no simple way of notifying another CPU of changes to page mappings beyond doing system-wide flushes on all CPUs every time a memory domain is modified.

The result is that, when SMP is enabled, Zephyr must ensure that all page table mappings in the system are set uncached. The OS makes no attempt to bolt on a software coherence layer.
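As a closing illustration of the ASID scheme described above: deriving a sequential ASID from the static L1 table array amounts to an index calculation. All names and addresses below are invented for illustration and do not reflect Zephyr's actual code:

```python
L1_ARRAY_BASE = 0x60000000   # hypothetical address of the static L1 array
L1_PAGE_SIZE = 4096          # one 4k L1 page table page per address space
PT_SPAN = 4 * 1024 * 1024    # each ASID's private 4M PTEVADDR region
PT_REGION_BASE = 0x20000000  # hypothetical base of the stacked regions

def asid_and_ptevaddr(l1_table_addr: int):
    """Derive a sequential ASID from an L1 table's position in the
    static array, plus the non-overlapping PTEVADDR that goes with it."""
    index = (l1_table_addr - L1_ARRAY_BASE) // L1_PAGE_SIZE
    asid = index + 2  # skip the reserved ASID values 0 and 1
    return asid, PT_REGION_BASE + index * PT_SPAN
```

With a layout like this, a context switch changes RASID and PTEVADDR together, and stale TLB entries belonging to other domains remain harmlessly tagged with their own ASIDs.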