ARMv8-R AArch32 cores determine the CPU start address on reset from
RVBAR (Reset Vector Base Address Register), which only stores bits
[31:5] — bits [4:0] are RES0. Any firmware or boot-loader that
programs RVBAR from the ELF entry point will silently truncate
a non-aligned address to a 32-byte boundary, causing the CPU to
begin executing at the wrong location.
Whether __start lands on a 32-byte boundary depends on the size of
code sections placed before it, which changes with Kconfig options.
This makes the failure non-deterministic: a build may work today and
break after enabling an unrelated feature like logging.
Force 32-byte alignment on z_arm_reset/__start for ARMv8-R so the
entry point survives RVBAR truncation on any SoC.
Signed-off-by: Appana Durga Kedareswara rao <appana.durga.kedareswara.rao@amd.com>
We will make use of the .exc_return member during walk_stackframe() to
know whether we have an extended stack or a standard stack.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Some ARMv6-M and ARMv8-M Baseline cores indeed support MPU
(CPU_HAS_ARM_MPU in soc Kconfig), so the exclusion should
not be based on ARMV6_M_ARMV8_M_BASELINE.
Signed-off-by: Andy Lin <andylinpersonal@gmail.com>
Extend the ARM Cortex-M coredump arch block to version 3 with metadata
that provides the offset to the callee_saved struct within k_thread.
This enables the coredump GDB stub to accurately retrieve callee-saved
registers (r4-r11) for non-faulting threads during multi-thread
debugging.
Signed-off-by: Mark Holden <mholden@meta.com>
The arch_dcache_enable() and arch_icache_enable() functions could
cause system crashes when called on caches that were already enabled.
This occurs because arch_dcache_invd_all() invalidates the entire
cache without first flushing dirty data, leading to memory corruption
when the cache was previously enabled.
This scenario happens in cache tests where test setup calls
sys_cache_data_enable(), but the SoC early init hook has already
enabled caches during boot.
Fix by checking the SCTLR register before performing cache operations:
- If D-cache is already enabled, perform clean+invalidate instead of
just invalidate to preserve dirty cache lines
- If I-cache is already enabled, perform invalidate only (no dirty
lines in I-cache)
- If cache is not enabled, proceed with normal enable sequence
This makes the enable functions safe to call multiple times without
risking data corruption or system crashes.
Signed-off-by: Appana Durga Kedareswara rao <appana.durga.kedareswara.rao@amd.com>
relocate_vector_table is called as part of z_arm_reset.
This is considered early-boot code, before XIP is available.
At this stage, the program might not have access to optimized
compiler APIs that reside in flash.
Thus, it is better for relocate_vector_table to use arch_early_memcpy.
Signed-off-by: Shreyas Shankar <s-shankar@ti.com>
Add calls to sys_trace_idle_exit before leaving idle state
to track CPU load.
Extend CPU_LOAD to CPU_AARCH32_CORTEX_R and CPU_AARCH32_CORTEX_A, so
that CPU_LOAD is supported for all CPU_CORTEX.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
ASM is notoriously harder to maintain than C and requires core-specific
adaptation, which impairs the readability of the code even more.
As for performance concerns, there is no difference in generated code
between the ASM and C versions.
ASM version:
<arch_cpu_idle>:
f57ff04f dsb sy
e320f003 wfi
f1080080 cpsie i
f57ff06f isb sy
e12fff1e bx lr
<arch_cpu_atomic_idle>:
f10c0080 cpsid i
f57ff04f dsb sy
e320f002 wfe
e3500000 cmp r0, #0
1a000000 bne 102ca8 <_irq_disabled>
f1080080 cpsie i
<_irq_disabled>:
e12fff1e bx lr
C version:
<arch_cpu_idle>:
f57ff04f dsb sy
e320f003 wfi
f1080080 cpsie i
f57ff06f isb sy
e12fff1e bx lr
<arch_cpu_atomic_idle>:
f10c0080 cpsid i
f57ff04f dsb sy
e320f002 wfe
e3500000 cmp r0, #0
112fff1e bxne lr
f1080080 cpsie i
e12fff1e bx lr
As can be seen, the C version uses 'bxne lr' to return directly in the
irq-disabled case, costing one less instruction than the ASM version. So
from this PoV, the C version not only improves readability and
maintainability but also generates better code.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
This allows to distinguish between f16 storage format support
(CONFIG_FP16) and actual f16 arithmetic capability.
CONFIG_FP16_ARITHMETIC requires either MVE float (ARMV8_1_M_MVEF) or a
Cortex-A core (CPU_CORTEX_A).
Signed-off-by: Martin Jäger <martin.jaeger@a-labs.io>
USE_SWITCH is a new feature and needs more testing before being enabled
by default. While all tests in upstream Zephyr CI passed, keeping this
config disabled helps get the majority of the work in without causing
regressions on upstream boards that are not tested in CI.
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
Fix the following issues when trying to build hello world with armclang:
```
Error: L6218E: Undefined symbol z_arm_exc_exit (referred from reset.o).
Error: L6218E: Undefined symbol z_arm_int_exit (referred from reset.o).
Error: L6218E: Undefined symbol z_arm_pendsv (referred from reset.o).
```
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
The orr fix is as reported in review:
```
The add causes a crash with IAR tools as the address loaded to r8
already has the lowest bit set, and the add causes it to be set to ARM
mode. The orr instruction works fine with both scenarios
```
`UDF 0` seems to break on IAR but `UDF #0` works for all.
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
USE_SWITCH code unconditionally applied interrupt locking, which altered
BASEPRI handling and broke expected interrupt behavior on both
Baseline and Mainline CPUs when USE_SWITCH was disabled.
This commit restores the original behavior with USE_SWITCH disabled and
fixes tests/arch/arm/arm_interrupt failures.
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
The ARM Ltd. FVP emulator (at least the variants run in Zephyr CI)
appears to have a bug with the stack alignment bit in xPSR. It's
common (it fails in the first 4-6 timer interrupts in
tests.syscalls.timeslicing) that we'll take an interrupt from a
seemingly aligned (!) stack with the bit set. If we then switch and
resume the thread from a different context later, popping the stack
goes wrong (more so than just a misalignment of four bytes: I usually
see it too low by 20 bytes) in a way that it doesn't if we return
synchronously. Presumably legacy PendSV didn't see this because it
used the unmodified exception frame.
Work around this by simply assuming all interrupted stacks were
aligned and clearing the bit. That is NOT correct in the general
case, but in practice it's enough to get tests to pass.
Signed-off-by: Andy Ross <andyross@google.com>
The exit from the SVC exception used for syscalls back into the
calling thread is done without locking. This means that the
intermediate states can be interrupted while the kernel-mode code is
still managing thread state like the mode bit, leading to mismatches.
This seems mostly robust when used with PendSV (though I'm a little
dubious), but the new arch_switch() code needs to be able to suspend
such an interrupted thread and restore it without going through a full
interrupt entry/exit again, so it needs locking for sure.
Take the lock unconditionally before exiting the call, and release it
in the thread once the magic is finished, just before calling the
handler. Then take it again before swapping stacks and dropping
privilege.
Even then there is a one-cycle race where the interrupted thread has
dropped the lock but still has privilege (the nPRIV bit is clear in
CONTROL). This thread will be resumed later WITHOUT privilege, which
means that trying to set CONTROL will fail. So there's detection of
this 1-instruction race that will skip over it.
Signed-off-by: Andy Ross <andyross@google.com>
Some toolchains don't support an __asm__(...) block at the top level
of a file and require that they live within function scope. That's
not a hardship as these two blocks were defining callable functions
anyway. Exploit the "naked" attribute to avoid wasted bytes in unused
entry/exit code.
Signed-off-by: Andy Ross <andyross@google.com>
Late-arriving clang-format-demanded changes that are too hard to split
and squash into the original patches. No behavior changes.
Signed-off-by: Andy Ross <andyross@google.com>
Some nitpicky hand-optimizations, no logic changes:
+ Shrink the assembly entry to put more of the logic into
compiler-optimizable C.
+ Split arm_m_must_switch() into two functions so that the first
doesn't look so big to the compiler. That allows it to spill (many)
fewer registers on entry and speeds up the (very) common early-exit case
where an interrupt returns without context switch.
Signed-off-by: Andy Ross <andyross@google.com>
When USE_SWITCH=y, the thread struct is now mostly degenerate. Only
the two words for ICI/IT state tracking are required. Eliminate all
the extra fields when not needed and save a bunch of SRAM.
Note a handful of spots in coredump/debug that need a location for the
new stack pointer (stored as the switch handle now) are also updated.
Signed-off-by: Andy Ross <andyross@google.com>
The new switch code no longer needs PendSV, but still calls the SVC
vector. Split them into separate files for hygiene and a few
microseconds of build time.
Signed-off-by: Andy Ross <andyross@google.com>
Micro-optimization: We don't need a full arch_irq_lock(), which is a
~6-instruction sequence on Cortex M. The lock will be dropped
unconditionally on interrupt exit, so take it unconditionally.
Signed-off-by: Andy Ross <andyross@google.com>
z_get_next_switch_handle() is a clean API, but implementing it as a
(comparatively large) callable function requires significant
entry/exit boilerplate and hides the very common "no switch needed"
early exit condition from the enclosing C code that calls it. (Most
architectures call this from assembly though and don't notice).
Provide an unwrapped version for the specific needs of non-SMP builds.
It's compatible in all other ways.
Slightly ugly, but the gains are significant (like a dozen cycles or
so).
Signed-off-by: Andy Ross <andyross@google.com>
GCC/gas has a code generation bugglet on thumb. The R7 register is
the ABI-defined frame pointer, though it's usually unused in zephyr
due to -fomit-frame-pointer (and the fact the DWARF on ARM doesn't
really need it). But when it IS enabled, which sometimes seems to
happen due to toolchain internals, GCC is unable to allow its use in
the clobber list of an asm() block (I guess it can't generate
spill/fill code without using the frame?).
There is existing protection for this problem that sets
-fomit-frame-pointer unconditionally on the two files (sched.c and
init.c) that require it. But even with that, gcc sometimes gets
kicked back into "framed mode" due to internal state. Provide a
kconfig workaround that does an explicit spill/fill on the one
test/platform where we have trouble.
(I checked, btw: an ARM clang build appears not to have this
misfeature)
Signed-off-by: Andy Ross <andyross@google.com>
ARM Cortex M has what amounts to a design bug. The architecture
inherits several unpipelined/microcoded "ICI/IT" instruction forms
that take many cycles to complete (LDM/STM and the Thumb "IT"
conditional frame are the big ones). But out of a desire to minimize
interrupt latency, the CPU is allowed to halt and resume these
instructions mid-flight while they are partially completed. The
relevant bits of state are stored in the EPSR fields of the xPSR
register (see ARMv7-M manual B1.4.2). But (and this is the design
bug) those bits CANNOT BE WRITTEN BY SOFTWARE. They can only be
modified by exception return.
This means that if a Zephyr thread takes an interrupt
mid-ICI/IT-instruction, then switches to another thread on exit, and
then that thread is resumed by a cooperative switch and not an
interrupt, the instruction will lose the state and restart from
scratch. For LDM/STM that's generally idempotent for memory (but not
MMIO!), but for IT that means that the restart will re-execute
arbitrary instructions that may not be idempotent (e.g. "addeq r0, r0, ...").
The fix is to check for this condition (which is very rare) on
interrupt exit when we are switching, and if we discover we've
interrupted such an instruction we swap the return address with a
trampoline that uses a UDF instruction to immediately trap to the
undefined instruction handler, which then recognizes the fixup address
as special and immediately returns back into the thread with the
correct EPSR value and resume PC (which have been stashed in the
thread struct). The overhead for the normal case is just a few cycles
for the test.
Signed-off-by: Andy Ross <andyross@google.com>
Integrate the new context layer, allowing it to be selected via the
pre-existing CONFIG_USE_SWITCH. Not a lot of changes, but notable
ones:
+ There was code in the MPU layer to adjust PSP on exception exit at a
stack overflow so that it remained inside the defined stack bounds.
With the new context layer though, exception exit will rewrite the
stack frame in a larger format, and needs PSP to be adjusted to make
room.
+ There was no such treatment in the PSPLIM case (the hardware prevents
the SP from going that low), so I had to add similar code to
validate PSP at exit from fault handling.
+ The various return paths for fault/svc assembly handlers need to
call out to the switch code to do the needed scheduler work. Really
almost all of these can be replaced with C now, only userspace
syscall entry (which has to "return" into the privileged stack)
needs special treatment.
+ There is a gcc bug that prevents the arch_switch() inline assembly
from building when frame pointers are enabled (which they almost
never are on ARM): it disallows you from touching r7 (the thumb
frame pointer) entirely. But it's a context switch, we need to!
Worked around by enforcing -fomit-frame-pointer even in the two
scheduler files that can swap when NO_OPTIMIZATIONS=y.
Signed-off-by: Andy Ross <andyross@google.com>
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
1. Mostly complete. Supports MPU, userspace, PSPLIM-based stack
guards, and FPU/DSP features. ARMv8-M secure mode "should" work but I
don't know how to test it.
2. Designed with an eye to uncompromising/best-in-industry cooperative
context switch performance. No PendSV exception nor hardware
stacking/unstacking, just a traditional "musical chairs" switch.
Context gets saved on process stacks only instead of split between
there and the thread struct. No branches in the core integer switch
code (and just one in the FPU bits that can't be avoided).
3. Minimal assembly use; arch_switch() itself is ALWAYS_INLINE, there
is an assembly stub for exception exit, and that's it beyond one/two
instruction inlines elsewhere.
4. Selectable at build time, interoperable with existing code. Just
use the pre-existing CONFIG_USE_SWITCH=y flag to enable it. Or turn
it off to evade regressions as this stabilizes.
5. Exception/interrupt returns in the common case need only a single C
function to be called at the tail, and then return naturally.
Effectively "all interrupts are direct now". This isn't a benefit
currently because the existing stubs haven't been removed (see #4),
but in the long term we can look at exploiting this. The boilerplate
previously required is now (mostly) empty.
6. No support for ARMv6 (Cortex M0 et al.) thumb code. The expanded
instruction encodings in ARMv7 are a big (big) win, so the older cores
really need a separate port to avoid impacting newer hardware.
Thankfully there isn't that much code to port (see #3), so this should
be doable.
Signed-off-by: Andy Ross <andyross@google.com>
In addition to literal pools, we want to avoid jump tables, which are
generally associated with the Table Branch Byte (TBB) and Table Branch
Halfword (TBH) instructions.
Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
In addition to -mslow-flash-data, we must also ensure that the assembler
does not generate literal pools. They are automatically generated by the
LDR pseudo-instruction[1]:
- If the constant can be constructed with a MOV or MVN instruction, the
assembler emits the corresponding instruction.
- Otherwise (when the value does not fit in 16 bits), the assembler
places the value in the next literal pool.
No option was found in the GNU assembler to disable literal pool
generation. Therefore, this patch explicitly uses MOVT and MOVW when the
assembler would otherwise generate a literal pool. Note that LDR must be
kept under ifdef since Cortex-M0 does not support MOVT/MOVW.
This patch only changes four occurrences of LDR. The other occurrences
do not appear to generate literal pools (likely because the literal
values are < 0xFFFF). If a literal pool is generated in the future, it
will introduce a performance penalty. No other limitations are expected.
[1]: https://developer.arm.com/documentation/dui0204/f/ \
writing-arm-assembly-language/loading-constants-into-registers/ \
loading-with-ldr-rd---const?lang=en
Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
On some SoCs, no data cache is associated with the main flash. Therefore,
all accesses to data stored in flash, especially literal pools[1], penalize
performance. Fortunately, GCC and IAR provide options (-mslow-flash-data
and --no_literal_pool) to prevent the generation of literal pools.
Unfortunately, current GCC versions (14.x) do not support -mslow-flash-data
when Thread Local Storage (TLS) variables are used. A patch is currently
under review[2][3] to address this limitation. Without this gcc patch,
using -mslow-flash-data is not very user friendly. The user must rebuild
the libc (CONFIG_PICOLIBC_USE_MODULE=y) without TLS support
(CONFIG_THREAD_LOCAL_STORAGE=n), and must ensure that the application does
not rely on thread-safe "errno".
Because of these interactions with the compiler, this option can't be
automatically selected by the SoC. Thus, this patch leaves the option
hidden. The SoC may expose it if relevant.
[1]: https://en.wikipedia.org/wiki/Literal_pool
[2]: https://gcc.gnu.org/pipermail/gcc-patches/2026-February/707887.html
[3]: https://github.com/zephyrproject-rtos/gcc/pull/65
Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Functions in assembler file pm_s2ram.S are declared with the usual:
SECTION_FUNC(TEXT, <function name>)
Note the first argument (section name) is `TEXT` in capital letters which
a define in `include/zephyr/linker/sections.h` should replace with `text`,
such that the functions are placed in section `.text.<function name>` which
matches the ".text.*" pattern in the linker script. However, this header
is not included by pm_s2ram.S: as such, the substitution never happens
and the functions go in `.TEXT.<function name>` instead! This has not
caused issues
thanks to a workaround in the Cortex-M linker script, which also has
".TEXT.*" as input section name pattern (unlike all other archs!), but is a
bug nonetheless.
Fix this issue by adding the missing include which ensures the functions
are placed in sections with the proper name.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
The eponymous function in __aeabi_read_tp.S is declared using:
SECTION_FUNC(TEXT, __aeabi_read_tp)
Note the first argument (section name) is `TEXT` in capital letters which
a define in `include/zephyr/linker/sections.h` should replace with `text`,
such that the function is placed in section `.text.__aeabi_read_tp` which
matches the ".text.*" pattern in the linker script. However, this header
is not included by __aeabi_read_tp.S: as such, the substitution never
happens and the function goes in `.TEXT.__aeabi_read_tp` instead! This
has not caused
issues thanks to a workaround in the Cortex-M linker script, which also
has ".TEXT.*" as input section name pattern (unlike all other archs!), but
is a bug nonetheless.
Fix this issue by adding the missing include which ensures the function
is placed in a section with the proper name.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
The Arm Cortex-R5F Technical Reference Manual says DMINLINE is the log2
of the minimum number of words (one word = four bytes) in a cache line.
For instance, say DMINLINE is 3, which means the cache line size is
2^3=8 words or 32 bytes, however with the current calculation, it comes
out to be 16 bytes. Therefore, we fix this calculation by correctly
calculating the number of bytes for the cache line size.
Signed-off-by: Amneesh Singh <amneesh@ti.com>
Upgrades the thread user_options to 16 bits from an 8-bit value to
provide more space for future values.
Also, as the size of this field has changed, the values for the
existing architecture specific thread options have also shifted
from the upper end of the old 8-bit field, to the upper end of
the new 16-bit field.
Fixes #101034
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
As per Zephyr guidelines re: inclusive language, the term
"master" is replaced with "primary".
Signed-off-by: Benjamin Cabé <benjamin@zephyrproject.org>
soc_per_core_init_hook() is usually called from arch_kernel_init() and
arch_secondary_cpu_init() which are C functions. As such, there is no need
to check for CONFIG_SOC_PER_CORE_INIT_HOOK since platform/hooks.h provides
a no-op function-like macro implementation if the Kconfig option is not
enabled.
Remove the Kconfig option check from all files.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
Ensure callee registers are included in the coredump.
Push callee registers onto the stack for
CONFIG_ARMV6_M_ARMV8_M_BASELINE as well
when CONFIG_EXTRA_EXCEPTION_INFO is enabled.
Effectively a complement to df6b8c3 by mholden.
Signed-off-by: Andy Lin <andylinpersonal@gmail.com>
Add CONFIG_ARM_MPU_SRAM_WRITE_THROUGH: it enables a Write-Through cache
policy for SRAM regions instead of the default Write-Back policy.
Includes corresponding MPU attribute macros for ARMv7-M and ARMv8-M
architectures. Maintains backward compatibility with existing
configurations.
Signed-off-by: Lucien Zhao <lucien.zhao@nxp.com>
Commit a763207962 ("arch: arm: dwt: use the cmsis_6 macro
unconditionally") made the cmsis_6 macros unconditional, so we can now
use the DCB macro instead of the CoreDebug macro unconditionally.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Make sure that arch.mode is set with appropriate flags before setting up
the privileged stack start.
Fixes #99895
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
Make sure bindesc are placed right after the vector table, and fix the
CI failure with sample.bindesc for fvp_baser_aemv8r/fvp_aemv8r_aarch32.
Without this change the bindesc are placed at a location that is not
mapped leading to a data abort while running the sample.
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
Allows you to relocate the vector table from Flash to ITCM/DTCM to
minimize interrupt latency. TCM offers single-cycle access compared to
multi-cycle SRAM reads and even slower flash reads. This improves exception
handling speed for real-time workloads.
Signed-off-by: Peter van der Perk <peter.vanderperk@nxp.com>
The ATOMIC_OPERATIONS_* Kconfig option is not a choice, so it does not
have a default. However, the file that determines which actual atomic
operations backend will be used does default to
ATOMIC_OPERATIONS_BUILTIN:
3e537db71e/include/zephyr/sys/atomic.h (L26-L41)
Since we want to ensure that all SoCs intentionally select the atomic
operations backend they want to use, select it at the SoC level for all
SoCs, as well as for the Cortex-M arch when the Armv8-M baseline profile
is selected.
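A per-SoC selection might look like the following Kconfig sketch (the
SoC symbol is hypothetical; only the `select` line reflects what this
change adds):

```
config SOC_SERIES_EXAMPLE
	bool
	select ATOMIC_OPERATIONS_BUILTIN
```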
Signed-off-by: Carles Cufi <carles.cufi@nordicsemi.no>
Update include of header file arm_mpu_mem_cfg.h which has been moved
to a Cortex-M/-R-agnostic include directory.
Signed-off-by: Immo Birnbaum <mail@birnbaum.immo>