This fixes a subtle race condition in the k_timer expiration
handler z_timer_expiration_handler(). There was a small window
between the point where sys_clock_announce() unlocked interrupts
and the point where that handler re-locked them, during which one
or more higher priority interrupts (or threads running on another
CPU in an SMP environment) could not only abort the k_timer's
timeout, but restart it as well. Both of these situations are
now detected in the handler, resulting in an immediate return
from it.
To make this work, every place where the k_timer internals either
add or abort the timeout is now covered by the k_timer lock.
Thus, when the handler checks whether the timeout has been
canceled while holding that lock, we know that no other thread
or ISR can be modifying the k_timer's timeout.
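For context, a minimal sketch of the handler shape this implies,
assuming the k_timer module's file-scope spinlock is named 'lock' and
the cancellation check is the z_is_timeout_handler_canceled() helper;
the normal expiry processing is elided:

```
void z_timer_expiration_handler(struct _timeout *t)
{
	struct k_timer *timer = CONTAINER_OF(t, struct k_timer, timeout);
	k_spinlock_key_t key = k_spin_lock(&lock);

	/* A higher priority interrupt, or a thread on another CPU, may have
	 * aborted (and possibly restarted) this timer after
	 * sys_clock_announce() released the timeout lock.  Both cases are
	 * detected here and turn this expiration into a no-op.
	 */
	if (z_is_timeout_handler_canceled(t)) {
		k_spin_unlock(&lock, key);
		return;
	}

	/* ... normal expiry processing on 'timer': update status,
	 * run expiry_fn ...
	 */
	k_spin_unlock(&lock, key);
}
```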
Fixes #106654
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
On SMP systems with tickless kernels, a race condition exists between
timer driver ISRs and the kernel's tick accounting. The driver updates
its hardware cycle baseline under a private lock, then calls
sys_clock_announce() which updates curr_tick under the separate
timeout_lock. In the gap between these two lock releases, any kernel
code calling sys_clock_elapsed() sees the new driver baseline but the
old curr_tick, producing inconsistent time values that can go backwards.
This affects every code path using the internal elapsed() helper:
uptime queries, timeout scheduling, timeout cancellation, remaining
time queries, and next-expiry calculations.
The root cause is two separate locks protecting state that must be
mutually consistent. Fix this by exposing the kernel's timeout_lock
to timer drivers via sys_clock_lock()/sys_clock_unlock(), and
providing sys_clock_announce_locked() which assumes the lock is
already held.
Timer drivers can now acquire the single lock, update their hardware
state, and announce ticks all under the same lock — eliminating the
race window entirely. The key is passed to sys_clock_announce_locked()
which consumes it (releasing the lock when it returns).
The existing sys_clock_announce() becomes a backward-compatible wrapper,
allowing incremental driver migration with no flag day.
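As a rough sketch of the migrated driver pattern (the parameter order
of sys_clock_announce_locked() and the helper update_hw_cycle_baseline()
are assumptions for illustration):

```
static void timer_isr(const void *arg)
{
	ARG_UNUSED(arg);

	/* Take the kernel's timeout lock, now exposed to timer drivers. */
	k_spinlock_key_t key = sys_clock_lock();

	/* Driver-private (hypothetical helper): advance the hardware cycle
	 * baseline and compute elapsed ticks since the last announcement.
	 */
	int32_t dticks = update_hw_cycle_baseline();

	/* Consumes the key and releases the lock on return, so the baseline
	 * update and the curr_tick update are observed atomically by
	 * sys_clock_elapsed() callers.
	 */
	sys_clock_announce_locked(key, dticks);
}
```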
Document that sys_clock_set_timeout(), sys_clock_elapsed(), and
sys_clock_idle_exit() are called by the kernel with the timer lock
held. Update the timer driver guide in clocks.rst accordingly.
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Add a runtime assertion in z_unpend_all_locked() to verify that
_sched_spinlock is actually held by the caller. This catches misuse
early given the function call depth involved.
Extend the availability of z_spin_is_locked() from CONFIG_SMP &&
CONFIG_TEST to also include CONFIG_ASSERT, so the check can be
used in __ASSERT() outside of test builds.
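A minimal sketch of the added check (the exact assertion message and
the body of z_unpend_all_locked() are assumptions):

```
void z_unpend_all_locked(_wait_q_t *wait_q)
{
	/* Catches callers that forgot to take the scheduler lock. */
	__ASSERT(z_spin_is_locked(&_sched_spinlock),
		 "_sched_spinlock must be held by the caller");

	/* ... unpend and ready every thread waiting on wait_q ... */
}
```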
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
When halt_thread() calls k_thread_perms_all_clear() under
_sched_spinlock, the permission cleanup can trigger k_free() on
dynamic objects. k_heap_free() then calls z_unpend_all() which
attempts to take _sched_spinlock again, causing a recursive lock.
Fix this by introducing k_heap_free_sched_locked() and
k_free_sched_locked() variants that use z_unpend_all_locked()
to operate on the wait queue without re-acquiring the scheduler
lock. The existing z_unpend_all() becomes a wrapper that takes
the lock and delegates to z_unpend_all_locked().
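A sketch of the resulting wrapper, assuming the signature stays the
same as before:

```
void z_unpend_all(_wait_q_t *wait_q)
{
	K_SPINLOCK(&_sched_spinlock) {
		z_unpend_all_locked(wait_q);
	}
}
```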
unref_check() gains a sched_locked parameter: the abort path
(clear_perms_cb) passes true to use the locked free variant,
while k_thread_perms_clear() passes false for the normal path.
Fixes #106659
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Move signal_pending_ipi() inside the K_SPINLOCK block in
z_get_next_switch_handle(). Calling it after the lock release creates a
window where a CPU can consume its own pending IPI bit via atomic_clear
in signal_pending_ipi(), then silently drop it in
arch_sched_directed_ipi() which skips the calling CPU (i == id).
In configurations where secondary CPUs have a single pinned thread and
take no timer or external interrupts, this can lead to a permanent hang:
the idle CPU can only be woken by IPIs, but no IPIs are pending and no
timeslicing IPIs will be generated since the idle thread is not sliceable.
This was reproduced when running under QEMU with the following sequence
of events observed:
CPU 0                               CPU 1
─────                               ─────
                                    Thread calls k_poll(K_MSEC(1))
                                    z_pend_curr():
                                      mark thread PENDING
                                      z_add_timeout(1ms)
                                    do_swap() to idle thread
                                    WFI
Timer tick fires
sys_clock_announce():
  slice_timeout(cpu1):
    flag_ipi(BIT(1))
signal_pending_ipi():
  MSIP[cpu1] = 1
                                    CPU1 wakes from WFI
                                    z_get_next_switch_handle():
                                      acquire _sched_spinlock
                                      next_up() → idle
                                      (thread still PENDING,
                                       timeout hasn't fired yet)
                                      release _sched_spinlock
Timer tick fires
sys_clock_announce():
  z_thread_timeout(thread):
    z_unpend_thread(thread)
    z_ready_thread(thread):
      flag_ipi(BIT(1))
                                    signal_pending_ipi():
                                      atomic_clear(pending_ipi)
                                        returns BIT(1)
                                      arch_sched_directed_ipi(BIT(1))
                                        skips self, IPI silently lost
                                    return to idle thread
                                    WFI
                                    thread still on ready queue
Such an interleaving of events is, of course, likely only reproducible in
practice in virtualized environments where (v)CPUs can be descheduled.
With signal_pending_ipi() inside the lock, next_up() and the IPI
dispatch are atomic. Either the concurrent flag_ipi lands before the
lock is acquired (and next_up sees the thread), or it lands after the
lock is released (and the caller dispatches the IPI). There is no
window where a CPU can consume its own bit for a thread it hasn't seen.
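A rough sketch of the reordering (the surrounding bookkeeping in
z_get_next_switch_handle() is elided, and 'next' is assumed to be
declared earlier in the function):

```
	K_SPINLOCK(&_sched_spinlock) {
		next = next_up();

		/* ... switch handle and per-CPU bookkeeping ... */

		/* Dispatch pending IPIs before dropping the lock: a
		 * concurrent flag_ipi() either lands before the lock is
		 * taken (next_up() sees the readied thread) or after it is
		 * released (the flagging CPU dispatches the IPI itself).
		 */
		signal_pending_ipi();
	}
```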
Similar races exist in reschedule() and z_reschedule_irqlock() as well.
Although they won't cause the same permanent hang described above, they
can result in unnecessary rescheduling latency. Fix reschedule(), and
add a TODO to z_reschedule_irqlock(), which does not currently take
the sched spinlock.
Signed-off-by: Andrew Bresticker <abrestic@meta.com>
The work queue optionally yields after every work handler to avoid
starving other threads.
When the queue is empty after a work handler completes, the work queue
thread will go to sleep on the next loop iteration anyway, so there is
no need to yield and incur an extra scheduling cost.
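A sketch of the resulting condition in the work queue loop; the flag
and field names here are assumptions, and the real code in
kernel/work.c differs in detail:

```
	if (yield_between_items && !sys_slist_is_empty(&queue->pending)) {
		/* More work is queued: yield so equal-priority threads are
		 * not starved.  With an empty queue the thread pends on the
		 * next iteration anyway, so the extra reschedule is skipped.
		 */
		k_yield();
	}
```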
Signed-off-by: Fengming Ye <frank.ye@nxp.com>
Between the point in time when sys_clock_announce() calls the
timeout handler for delayable work and the point when that handler
acquires the work queue spinlock, another thread or ISR could have
called k_work_reschedule_for_queue(). Should this occur, the timeout
that the handler is trying to process becomes stale and the
handler should not proceed any further with it.
As the workqueue spinlock is the controlling lock (it is always
held before either aborting or adding a timeout), it is safe
for the handler to call z_is_timeout_handler_canceled() once
it holds the workqueue spinlock.
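A sketch of the check, assuming the delayable-work timeout handler
keeps its current shape in kernel/work.c (handler and lock names may
differ):

```
static void work_timeout(struct _timeout *to)
{
	struct k_work_delayable *dw =
		CONTAINER_OF(to, struct k_work_delayable, timeout);
	k_spinlock_key_t key = k_spin_lock(&lock);

	if (z_is_timeout_handler_canceled(to)) {
		/* k_work_reschedule_for_queue() raced in and restarted the
		 * timeout; this expiry is stale, so do nothing with it.
		 */
		k_spin_unlock(&lock, key);
		return;
	}

	/* ... submit dw->work to its target queue ... */
	k_spin_unlock(&lock, key);
}
```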
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
The workqueue work timeout feature is supposed to abort the work
queue thread if the time to execute a work item exceeds the work
queue's configured threshold. The work thread may race against the
timeout handler responsible for aborting the thread when the two
are running on separate CPUs--particularly since the timeout handler
only locks the workqueue spinlock for part of its duration.
To get around this, two separate flags must be checked: a 'finished'
flag to indicate that the thread has finished processing the work
item, and the timeout's flag indicating whether it was removed while
its handler was running. Should either be found to be true
within the timeout handler, the thread is deemed to have completed
in time and the timeout handler proceeds no further.
Otherwise the timeout handler is deemed to have won the race and the
workqueue thread is aborted. Should the workqueue thread detect this,
it goes to sleep until it can be aborted to prevent it from handling
any more work items.
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
The routine sys_clock_announce() removes the timeout from the timeout
list and unlocks the timeout spinlock before invoking the timeout's
handler. This creates a window where another ISR (or a thread running
on another CPU) can abort or reuse the timeout before the handler
executes. When this happens, the timeout handler should bail early.
Use the dticks field to carry this state: set it to
TIMEOUT_DTICKS_ANNOUNCING after remove_timeout() (which needs
dticks = 0 to propagate remaining ticks) and before calling the
handler. In z_abort_timeout(), set TIMEOUT_DTICKS_ABORTED when the
timeout is either linked (existing behavior) or in the announcing
state (new). The z_add_timeout() path naturally overwrites dticks
with a real tick value, so re-use is also detected.
Provide z_is_timeout_handler_canceled() for handlers to check if
they should bail. This avoids adding a flags field to struct _timeout,
keeping the struct size unchanged.
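A minimal sketch of the helper under those assumptions (the return
type and exact comparison are illustrative; the marker names come
from the description above):

```
bool z_is_timeout_handler_canceled(struct _timeout *t)
{
	/* Anything other than the ANNOUNCING marker means the timeout was
	 * aborted (ABORTED marker) or reused (a real dticks value written
	 * by z_add_timeout()) since sys_clock_announce() dropped the lock.
	 */
	return t->dticks != TIMEOUT_DTICKS_ANNOUNCING;
}
```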
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
With kernel coherence enabled, it is possible that the stack has
been allocated in an uncached area. This has performance
implications since memory accesses are not cached.
This adds a Kconfig option to force the indicated stack pointer of
the allocated thread stack object to be in a cached area.
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Move the futex value validation inside the spinlock critical section
in z_impl_k_futex_wait().
Previously, a time-of-check to time-of-use (TOCTOU) race condition
existed because the futex value was evaluated before acquiring
futex_data->lock. This created a vulnerability window:
Thread A (waiter)                Thread B (waker)
─────────────────────────        ────────────────────────
atomic_get() == expected
                                 atomic_set(new_val)
                                 k_futex_wake() -> no waiters yet
k_spin_lock()
z_pend_curr()
[waits forever, wake lost]
If the waker updates the futex value and signals between the waiter's
value check and lock acquisition, the wake signal is lost. This causes
the waiting thread to block indefinitely.
Holding the lock during the evaluation ensures the value check and the
subsequent wait-queue operations are atomic relative to concurrent
wakeups. A concurrent wake must now either complete before the waiter
acquires the lock (waiter sees the updated value and returns -EAGAIN)
or arrive after (waiter is safely in the wait queue and gets woken).
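A simplified sketch of the reordered wait path (field names follow
kernel/futex.c; the real z_impl_k_futex_wait() performs additional
validation):

```
	key = k_spin_lock(&futex_data->lock);

	if ((int)atomic_get(&futex->val) != expected) {
		/* A waker already updated the value: report it. */
		k_spin_unlock(&futex_data->lock, key);
		return -EAGAIN;
	}

	/* Value still matches: pend atomically with respect to concurrent
	 * wakers; the lock is released as part of the swap.
	 */
	ret = z_pend_curr(&futex_data->lock, key, &futex_data->wait_q, timeout);
```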
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
This patch adds support for the OpenRISC 1000 (or1k) architecture: a
MIPS-like open hardware ISA which was first introduced in 2000.
The thread switching implementation uses the modern Zephyr thread "switch"
architecture.
Signed-off-by: Joel Holdsworth <jholdsworth@nvidia.com>
This allows user threads to test whether they have permission to access
an object before attempting to perform an operation on it, and to fail
gracefully if not.
Signed-off-by: Christoph Busold <cbusold@qti.qualcomm.com>
Fix several incorrect uses of the Doxygen `@retval` and `@return` commands
in kernel sources.
- Convert `@return` to structured `@retval` where functions return
discrete values.
- Replace incorrect `@retval` usage with `@return` for non-discrete
return types.
Signed-off-by: Tharaka Jayasena <9dmpires2k17.tuj@gmail.com>
When CONFIG_TIMEOUT_64BIT is not set, k_ticks_t is uint32_t. The previous
code cast left_ticks through int32_t but then stored the result back in
k_ticks_t (uint32_t), losing the sign. The subsequent ticks > 0 check was
therefore an unsigned comparison, causing a past-due wakeup (where the
subtraction wraps to a large uint32_t) to be misread as a large positive
remainder and propagated up through k_sleep() as INT_MAX ms.
Fix by retaining the signed intermediate and comparing it directly as
int32_t so negative remainders (past-due) correctly fall through to
return 0.
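A minimal sketch of the corrected pattern; the variable names are
assumptions based on this description, and the real code in the
k_sleep()/tick-sleep path differs in detail:

```
	int32_t left_ticks =
		(int32_t)(expected_wakeup_ticks - sys_clock_tick_get_32());

	/* Compare as signed: a past-due wakeup yields a negative remainder
	 * and falls through to 0 instead of wrapping to a huge unsigned
	 * value that k_sleep() would report as INT_MAX ms.
	 */
	return (left_ticks > 0) ? (k_ticks_t)left_ticks : 0;
```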
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
This function was a little clumsy, taking the scheduler lock,
releasing it, and then calling z_reschedule_unlocked() instead of the
normal locked variant of reschedule. Don't take the lock twice.
Mostly this is a code size and hygiene win. Obviously the sched lock
is not normally a performance path, but I happened to have picked this
API for my own microbenchmark in tests/benchmarks/swap and so noticed
the double-lock while staring at disassembly.
Signed-off-by: Andy Ross <andyross@google.com>
z_reschedule() is the basic kernel entry point for context switch,
wrapping z_swap(), and thence arch_switch(). It's currently defined
as a first class function for entry from other files in the kernel and
elsewhere (e.g. IPC library code).
But in practice it's actually a very thin wrapper without a lot of
logic of its own, and the context switch layers of some of the more
obnoxiously clever architectures are designed to interoperate with the
compiler's own spill/fill logic to avoid double saving. And with a
small z_reschedule() there's not a lot to work with.
Make reschedule() an inlinable static, so the compiler has more
options.
Signed-off-by: Andy Ross <andyross@google.com>
z_get_next_switch_handle() is a clean API, but implementing it as a
(comparatively large) callable function requires significant
entry/exit boilerplate and hides the very common "no switch needed"
early exit condition from the enclosing C code that calls it. (Most
architectures call this from assembly though and don't notice).
Provide an unwrapped version for the specific needs of non-SMP builds.
It's compatible in all other ways.
Slightly ugly, but the gains are significant (like a dozen cycles or
so).
Signed-off-by: Andy Ross <andyross@google.com>
Pick some low-hanging fruit on non-SMP code paths:
+ The scheduler spinlock is always taken, but as we're already in an
irqlocked state that's a noop. But the optimizer can't tell, because
arch_irq_lock() involves an asm block it can't see inside. Elide
the call when possible.
+ The z_swap_next_thread() function evaluates to just a single load of
_kernel.ready_q.cache when !SMP, but wasn't being inlined because of
function location. Move that test up into do_swap() so it's always
done correctly.
Signed-off-by: Andy Ross <andyross@google.com>
Integrate the new context layer, allowing it to be selected via the
pre-existing CONFIG_USE_SWITCH. Not a lot of changes, but notable
ones:
+ There was code in the MPU layer to adjust PSP on exception exit at a
stack overflow so that it remained inside the defined stack bounds.
With the new context layer though, exception exit will rewrite the
stack frame in a larger format, and needs PSP to be adjusted to make
room.
+ There was no such treatment in the PSPLIM case (the hardware prevents
the SP from going that low), so I had to add similar code to
validate PSP at exit from fault handling.
+ The various return paths for fault/svc assembly handlers need to
call out to the switch code to do the needed scheduler work. Really
almost all of these can be replaced with C now, only userspace
syscall entry (which has to "return" into the privileged stack)
needs special treatment.
+ There is a gcc bug that prevents the arch_switch() inline assembly
from building when frame pointers are enabled (which they almost
never are on ARM): it disallows you from touching r7 (the thumb
frame pointer) entirely. But it's a context switch, we need to!
Worked around by enforcing -fomit-frame-pointer even in the two
scheduler files that can swap when NO_OPTIMIZATIONS=y.
Signed-off-by: Andy Ross <andyross@google.com>
Signed-off-by: Sudan Landge <sudan.landge@arm.com>
Drivers supporting device deinitialization should not select
CONFIG_DEVICE_DEINIT_SUPPORT. Enabling deinit should be left up to the
application configuration.
Signed-off-by: Henrik Brix Andersen <henrik@brixandersen.dk>
Changes the loop variable type from 'int' to 'uint32_t' in the
create_free_list() routine to match the type of the 'num_blocks'
field. Otherwise, if a very large number of blocks is specified,
the conversion from 'uint32_t' to 'int' could have resulted in
a negative number. The result of this improper conversion would
be an empty free list.
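A hedged sketch of the changed loop; the field access is shown as
slab->info.num_blocks, which may differ across kernel versions:

```
	for (uint32_t i = 0U; i < slab->info.num_blocks; i++) {
		/* ... link block i into the slab's free list ... */
	}
```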
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
Not only checks that write_ptr is smaller than buffer_end, but also
checks that write_ptr + msg_size is smaller than buffer_end, to
avoid overflowing the buffer when copying data.
Signed-off-by: Flavio Ceolin <flavio@hubblenetwork.com>
Borrowing Peter Mitsis' rationale in #104283:
If someone passes a 0 block_size, then the buffer size must also be 0.
However, we iterate through the loop below num_blocks times, writing a
pointer to the buffer address. If the buffer is truly zero-sized, then
we are overwriting something else. If it is not truly zero-sized, then
we are creating a corrupted linked list as the pointer never actually
changes. This can cause problems later on when attempting to allocate
a slab because k_mem_slab_alloc() will only ever "allocate" the first
zero-sized block and act as though it was never truly consumed because
of the corrupted linked list.
Signed-off-by: Flavio Ceolin <flavio@hubblenetwork.com>
The message queue 'buffer_end' field points to the next address AFTER
the end of the buffer. When the buffer extends to the last addressable
byte, the next byte wraps to 0x0. To ensure proper evaluation of the
bounds, the __ASSERT_NO_MSG() checks must not use "< buffer_end", but
"<= buffer_end - 1".
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
This adds the ability to de-initialize a memory domain.
This requires support in the architecture layer. One use of
this is to release the resources associated with the domain.
For example, we can release allocated page tables so they can
go back to the pool of page tables to be allocated later.
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
When CONFIG_SYSTEM_CLOCK_HW_CYCLES_PER_SEC_RUNTIME_UPDATE is enabled,
the system timer frequency can change at runtime. Some timer drivers
(e.g. Cortex-M SysTick) rescale the cycle counter when the frequency
changes, which can break k_busy_wait() if the frequency changes during
the wait period.
Update k_busy_wait() to handle runtime frequency changes:
- Add busy_wait_us_to_cyc_ceil32() helper to convert microseconds to
cycles with a given frequency (rounds up to avoid returning early)
- Implement a frequency-aware busy wait loop that:
- Samples the frequency before and after reading the cycle counter to
detect concurrent frequency changes
- Rescales the start_cycles reference point when the frequency changes
to keep it in the same scale as the cycle counter
- Recomputes cycles_to_wait with the new frequency to preserve the
requested duration
- Retries sampling if a frequency change is detected mid-read
The original implementation is preserved when
CONFIG_SYSTEM_CLOCK_HW_CYCLES_PER_SEC_RUNTIME_UPDATE is not enabled.
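A simplified sketch of the frequency-aware loop; the parameter order of
busy_wait_us_to_cyc_ceil32() is an assumption, and the real
implementation also keeps the arch-specific busy-wait hook:

```
void k_busy_wait(uint32_t usec_to_wait)
{
	uint32_t freq = sys_clock_hw_cycles_per_sec();
	uint32_t start_cycles = k_cycle_get_32();
	uint32_t cycles_to_wait = busy_wait_us_to_cyc_ceil32(usec_to_wait, freq);

	for (;;) {
		/* Sample the frequency around the cycle read so a concurrent
		 * change is detected and the read is retried.
		 */
		uint32_t f_before = sys_clock_hw_cycles_per_sec();
		uint32_t current_cycles = k_cycle_get_32();
		uint32_t f_after = sys_clock_hw_cycles_per_sec();

		if (f_before != f_after) {
			continue;
		}

		if (f_before != freq) {
			/* Rescale the reference point into the new cycle
			 * scale and recompute the wait for the new rate.
			 */
			start_cycles = (uint32_t)(((uint64_t)start_cycles *
						   f_before) / freq);
			cycles_to_wait = busy_wait_us_to_cyc_ceil32(usec_to_wait,
								    f_before);
			freq = f_before;
		}

		if ((current_cycles - start_cycles) >= cycles_to_wait) {
			break;
		}
	}
}
```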
Signed-off-by: Zhaoxiang Jin <Zhaoxiang.Jin_1@nxp.com>
- Move the variable declaration and related code from kernel/timeout.c
to a new kernel/sys_clock_hw_cycles.c file. The motivation is that
both functions are part of the system clock frequency plumbing
(runtime query / update) and don’t naturally fit the responsibilities
of timeout.c, which is otherwise focused on timeout queue management
and tick announcement logic.
- Make sys_clock_hw_cycles_per_sec_runtime_get() (and its
z_impl_sys_clock_hw_cycles_per_sec_runtime_get() implementation)
visible under CONFIG_SYSTEM_CLOCK_HW_CYCLES_PER_SEC_RUNTIME_UPDATE
as well, not only under CONFIG_TIMER_READS_ITS_FREQUENCY_AT_RUNTIME.
This allows callers and time unit conversion helpers to retrieve the
current system timer frequency after runtime clock changes even when
the timer driver does not discover the rate by querying hardware.
Signed-off-by: Zhaoxiang Jin <Zhaoxiang.Jin_1@nxp.com>
Use z_abort_thread_timeout() instead of the lower-level z_abort_timeout().
The thread-flavoured version also has a stub fallback when
CONFIG_SYS_CLOCK_EXISTS=n, removing the need for preprocessor checks.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
When CONFIG_WAITQ_SCALABLE=y, wake up all threads from a post-waitq-walk
callback which is invoked while the scheduler spinlock is still held. This
solves the race condition that was worked around via the `no_wake_in_timeout`
flag in k_thread and the `is_timeout` parameter of
z_sched_wake_thread_locked(), both of which can now be dropped.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
Modify z_sched_waitq_walk() to accept an optional callback invoked after
the walk while still holding the scheduler spinlock. This can be used to
perform post-walk operations "atomically". Update all callers to work with
this new function signature.
While at it, create dedicated (private) typedefs for the callbacks and
clean up/improve the routine and callbacks' documentation.
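A rough sketch of the extended routine; the typedef names and the
K_SPINLOCK-based structure are assumptions, and only the
post-walk-callback semantics come from the description above:

```
typedef int (*waitq_walk_cb_t)(struct k_thread *thread, void *data);
typedef void (*waitq_post_walk_cb_t)(void *data);

int z_sched_waitq_walk(_wait_q_t *wait_q, waitq_walk_cb_t walk_cb,
		       waitq_post_walk_cb_t post_walk_cb, void *data)
{
	struct k_thread *thread;
	int status = 0;

	K_SPINLOCK(&_sched_spinlock) {
		_WAIT_Q_FOR_EACH(wait_q, thread) {
			status = walk_cb(thread, data);
			if (status != 0) {
				break;
			}
		}

		if (post_walk_cb != NULL) {
			/* Still under _sched_spinlock: the post-walk work
			 * is atomic with the walk itself.
			 */
			post_walk_cb(data);
		}
	}

	return status;
}
```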
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
When CONFIG_WAITQ_SCALABLE=n, the callback invoked by z_sched_waitq_walk()
is allowed to remove the thread provided as argument from the wait queue
(an operation implicitly performed when waking up a thread).
Use this to our advantage when waking threads pending on a k_event by
waking threads as part of the waitq walk callback instead of building
a list of threads to wake and performing the wake outside the callback.
When CONFIG_WAITQ_SCALABLE=n, this allows removing a pointer-sized field
from the thread structure which reduces the overhead of CONFIG_EVENTS=y.
The old implementation (build list in callback and wake outside callback)
is retained and used when CONFIG_WAITQ_SCALABLE=y since we can't modify
the wait queue as part of the walk callback in this situation. This is now
documented above the corresponding field in k_thread structure.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
z_sched_waitq_walk() used _WAIT_Q_FOR_EACH, a wrapper around the
"unsafe" SYS_DLIST_FOR_EACH_CONTAINER which does not allow detaching
elements from the list during the walk. As a result, attempting to
detach threads from the wait queue as part of the callback provided
to z_sched_waitq_walk() would result in breakage.
Introduce new _WAIT_Q_FOR_EACH_SAFE macro as wrapper around the "safe"
SYS_DLIST_FOR_EACH_CONTAINER_SAFE which allows detaching nodes from
the list during the walk, and use it inside z_sched_waitq_walk().
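A sketch of the new macro, assuming it mirrors the existing
_WAIT_Q_FOR_EACH wrapper (the base.qnode_dlist member name follows
that definition):

```
#define _WAIT_Q_FOR_EACH_SAFE(wq, thread_ptr, next_ptr)                       \
	SYS_DLIST_FOR_EACH_CONTAINER_SAFE(&((wq)->waitq), thread_ptr,         \
					  next_ptr, base.qnode_dlist)
```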
While at it:
- add documentation on the _WAIT_Q_FOR_EACH macro, including a warning
about detaching elements as part of the loop not being allowed
- add note to documentation of z_sched_waitq_walk() indicating that
the callback can safely remove the thread from wait queue as this
will no longer break the FOR_EACH loop
- add _WAIT_Q_FOR_EACH_SAFE to the list of ForEachMacros in .clang-format
NOTE: this new "safe removal inside callback" behavior is only available
when CONFIG_WAITQ_SCALABLE=n. When the option is 'y', red-black trees are
used instead of doubly-linked lists which prevent mutation of the list
while it is being walked. This limitation is explicitly documented.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
Don't acquire the _sched_spinlock in z_sched_wake_thread(). This allows
calling the function from callbacks which already own the spinlock. The
function is renamed to z_sched_wake_thread_locked() to reflect this new
behavior, and all existing callers are updated to ensure they hold the
_sched_spinlock as is now required.
Signed-off-by: Mathieu Choplain <mathieu.choplain-ext@st.com>
`k_yield()` can't be called when interrupts are disabled; update
`k_can_yield()` to reflect that.
Signed-off-by: Yong Cong Sin <ycsin@meta.com>
Signed-off-by: Yong Cong Sin <yongcong.sin@gmail.com>
There is no need to include the k_mutex priority inheritance code
when CONFIG_PRIORITY_CEILING is set to a priority level that is at
or below that of the idle thread.
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
Use `Z_HEAP_MIN_SIZE_FOR` on the system heap. This fixes allocations
failing when there is only a single small user of the heap defining
a symbol like the following, even when only allocating 16 bytes.
```
config HEAP_MEM_POOL_ADD_SIZE_{X}
int
default 64
```
Signed-off-by: Jordan Yates <jordan@embeint.com>
Embeds both an anonymous union and an anonymous structure within the
k_spinlock structure to ensure that the structure can easily have a
non-zero size.
This new option provides a cleaner way to specify that the
spinlock structure must have a non-zero size. A non-zero size
is necessary when C++ support is enabled, or when a library
or application wants to create an array of spinlocks.
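For illustration, a hedged sketch of the layout this describes; the
member set is illustrative only, and the real struct k_spinlock carries
additional Kconfig-dependent fields:

```
struct k_spinlock {
	union {
		struct {
#ifdef CONFIG_SMP
			atomic_t locked;
#endif
#ifdef CONFIG_SPIN_VALIDATE
			uintptr_t thread_cpu;
#endif
		};
		/* Ensures the structure never has zero size, as required
		 * when C++ support is enabled or when building arrays of
		 * spinlocks.
		 */
		char _unused;
	};
};
```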
Fixes #59922
Signed-off-by: Peter Mitsis <peter.mitsis@intel.com>
As per Zephyr coding guideline #59, "operands shall not be of an
inappropriate essential type". This makes sure boolean variables are
initialized with true/false, not 1/0.
Signed-off-by: Benjamin Cabé <benjamin@zephyrproject.org>