It was possible with pathological timing (see below) for the scheduler
to pick a cycle of threads on each CPU and enter the context switch
path on all of them simultaneously.
Example:
* CPU0 is idle, CPU1 is running thread A
* CPU1 makes high priority thread B runnable
* CPU1 reaches a schedule point (or returns from an interrupt) and
decides to run thread B instead
* CPU0 simultaneously takes its IPI and returns, selecting thread A
Now both CPUs enter wait_for_switch() to spin, waiting for the context
switch code on the other thread to finish and mark the thread
runnable. So we have a deadlock, each CPU is spinning waiting for the
other!
Actually, in practice this seems not to happen on existing hardware
platforms, it's only exercisable in emulation. The reason is that the
hardware IPI time is much faster than the software paths required to
reach a schedule point or interrupt exit, so CPU1 always selects the
newly scheduled thread and no deadlock appears. I tried for a bit to
make this happen with a cycle of three threads, but it's complicated
to get right and I still couldn't get the timing to hit correctly. In
qemu, though, the IPI is implemented as a Unix signal sent to the
thread running the other CPU, which is far slower and opens the window
to see this happen.
The solution is simple enough: don't store the _current thread in the
run queue until we are on the tail end of the context switch path,
after wait_for_switch() and going to reach the end in guaranteed time.
Note that this requires changing a little logic to handle the yield
case: because we can no longer rely on _current's position in the run
queue to suppress it, we need to do the priority comparison directly
based on the existing "swap_ok" flag (which has always meant
"yielded", and maybe should be renamed).
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The QUEUED state flag was managed separately from the run queue
insertion/deletion, and the logic (while AFAICT perfectly correct) was
tangled in a few places trying to keep them in sync. Put the
management of both behind a queue_thread()/dequeue_thread() API for
clarity. The ALWAYS_INLINE usage seems to be working to get the
compiler to condense the resulting multiple assignments. No behavior
change.
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The "null out the switch handle and put it back" code in the swap
implementation is a holdover from some defensive coding (not wanting
to break the case where we picked our current thread), but it hides a
subtle SMP race: when that field goes NULL, another CPU that may have
selected that thread (which is to say, our current thread) as its next
to run will be spinning on that to detect when the field goes
non-NULL. So it will get the signal to move on when we revert the
value, when clearly we are still running on the stack!
In practice this was found on x86 which poisons the switch context
such that it crashes instantly.
Instead, be firm about state and always set the switch handle of a
currently running thread to NULL immediately before it starts running:
right before entering arch_switch() and symmetrically on the interrupt
exit path.
Fixes#28105
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The z_swap_unlocked() function used a dummy spinlock for simplicity.
But this runs afouls of checking for stack-resident spinlocks
(forbidden when KERNEL_COHERENCE is set). And it's executing needless
code to release the lock anyway. Replace with a compile time NULL,
which will improve performance, correctness and code size.
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The two calls to unpend a thread from a wait queue were inexplicably*
unsynchronized, as James Harris discovered. Rework them to call the
lowest level primities so we can wrap the process inside the scheduler
lock.
Fixes#32136
* I took a brief look. What seems to have happened here is that these
were originally synchronized via an implicit from an outer caller
(remember the original Uniprocessor irq_lock() API is a recursive
lock), and they were mostly implemented in terms of middle-level
calls that were themselves locked. So those got ported over to the
newer spinlock but the outer wrapper layer got forgotten.
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
This lets the linker tell us what kind of alignment is required
for both tdata and tbss data when copying them into stack.
If they are not aligned as expected by the toolchain, generated
code would be accessing incorrect location for thread variables.
Fixes#32015
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
The linker script defines `z_mapped_size` as follows:
```
z_mapped_size = z_mapped_end - z_mapped_start;
```
This is done with the belief that precomputed values at link time will
make the code smaller and faster.
On Aarch64, symbol values are relocated and loaded relative to the PC
as those are normally meant to be memory addresses.
Now if you have e.g. `CONFIG_SRAM_BASE_ADDRESS=0x2000000000` then
`z_mapped_size` might still have a reasonable value, say 0x59334.
But, when interpreted as an address, that's very very far from the PC
whose value is in the neighborhood of 0x2000000000. That overflows the
4GB relocation range:
```
kernel/libkernel.a(mmu.c.obj): in function `z_mem_manage_init':
kernel/mmu.c:527:(.text.z_mem_manage_init+0x1c):
relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21
```
The solution is to define `Z_KERNEL_VIRT_SIZE` in terms of
`z_mapped_end - z_mapped_start` at the source code level. Given this
is used within loops that already start with `z_mapped_start` anyway,
the compiler is smart enough to combine the two occurrences and
dispense with a size counter, making the code effectively
slightly better for all while avoiding the Aarch64 relocation
overflow:
```
text data bss dec hex filename
1216 8 294936 296160 484e0 mmu.c.obj.arm64.before
1212 8 294936 296156 484dc mmu.c.obj.arm64.after
1110 8 9244 10362 287a mmu.c.obj.x86-64.before
1106 8 9244 10358 2876 mmu.c.obj.x86-64.after
```
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Some arches like x86 need all memory mapped so that they can
fetch information placed arbitrarily by firmware, like ACPI
tables.
Ensure that if this is the case, the kernel won't accidentally
clobber it by thinking the relevant virtual memory is unused.
Otherwise this has no effect on page frame management.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
If we evict enough pages to completely fill the backing store,
through APIs like k_mem_map(), z_page_frame_evict(), or
z_mem_page_out(), this will produce a crash the next time we
try to handle a page fault.
The backing store now always reserves a free storage location
for actual page faults.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Architecture layer hooks for demand paging. See
doxygen for these API definitions for more details.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Page tables created at build time may not include the
gperf data at the very end of RAM. Ensure this is mapped
properly at runtime to work around this.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Pre-allocation of paging structures is now required, such that
no allocations are ever needed when mapping memory.
Instantiation of new memory domains may still require allocations
unless a common page table is used.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
If we evict enough pages to completely fill the backing store,
through APIs like k_mem_map(), z_page_frame_evict(), or
z_mem_page_out(), this will produce a crash the next time we
try to handle a page fault.
The backing store now always reserves a free storage location
for actual page faults.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Architecture layer hooks for demand paging. See
doxygen for these API definitions for more details.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Page tables created at build time may not include the
gperf data at the very end of RAM. Ensure this is mapped
properly at runtime to work around this.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Pre-allocation of paging structures is now required, such that
no allocations are ever needed when mapping memory.
Instantiation of new memory domains may still require allocations
unless a common page table is used.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Following the idiom used for system calls, add script support to read
the initial application binary to identify which devices are defined,
and to use their offset in the device array as their unique handle
rather than the externally-defined ordinal from devicetree. The
device dependency arrays are updated to use these handles.
Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Renamed to make its semantics clearer; this function maps
*physical* memory addresses and is not equivalent to
posix mmap(), which might confuse people.
mem_map test case remains the same name as other memory
mapping scenarios will be added in the fullness of time.
Parameter names to z_phys_map adjusted slightly to be more
consistent with names used in other memory mapping functions.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
- Remove SYS_ prefix
- shorten POWER_MANAGEMENT to just PM
- DEVICE_POWER_MANAGEMENT -> PM_DEVICE
and use PM_ as the prefix for all PM related Kconfigs
Signed-off-by: Anas Nashif <anas.nashif@intel.com>
Use GEN_OFFSET_SYM macro to genarate absolute symbols for the
_callee_saved struct and use these new symbols in the assembly code.
Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Since the tracing of thread being switched in/out has the same
instrumentation points, we can roll the tracing function calls
into the one for thread stats gathering functions.
This avoids duplicating code to call another function.
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This adds the bits to gather the first thread runtime statictic:
thread execution time. It provides a rough idea of how much time
a thread is spent in active execution. Currently it is not being
used, pending following commits where it combines with the trace
points on context switch as they instrument the same locations.
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This adds the common struct fields and functions to support
the implementation of thread local storage in individual
architecture. This uses the thread stack to store TLS data.
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
The core kernel does not use this yet, but it will be later used
as part of infrastructure for memory-mapping stacks, as detailed
in #28899.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Zephyr SMP kernels need to be able to run on architectures with
incoherent caches. Naive implementation of synchronization on such
architectures requires extensive cache flushing (e.g. flush+invalidate
everything on every spin lock operation, flush on every unlock!) and
is a performance problem.
Instead, many of these systems will have access to separate "coherent"
(usually uncached) and "incoherent" regions of memory. Where this is
available, place all writable data sections by default into the
coherent region. An "__incoherent" attribute flag is defined for data
regions that are known to be CPU-local and which should use the cache.
By default, this is used for stack memory.
Stack memory will be incoherent by default, as by definition it is
local to its current thread. This requires special cache management
on context switch, so an arch API has been added for that.
Also, when enabled, add assertions to strategic places to ensure that
shared kernel data is indeed coherent. We check thread objects, the
_kernel struct, waitq's, timeouts and spinlocks. In practice almost
all kernel synchronization is built on top of these structures, and
any shared data structs will contain at least one of them.
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Signed-off-by: Anas Nashif <anas.nashif@intel.com>
Strictly speaking, any access to a mem domain or its
containing partitions should be serialized on this lock.
Architecture code may need to grab this lock if it is
using this data during, for example, context switches,
especially if they support SMP as locking interrupts
is not enough.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
When threads exited we were leaving dangling references to
them in the domain's mem_domain_q.
z_thread_single_abort() now calls into the memory domain
code via z_mem_domain_exit_thread() to take it off.
The thread setup code now invokes z_mem_domain_init_thread(),
avoiding extra checks in k_mem_domain_add_thread(), we know
the object isn't currently a member of a doamin.
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Both operands of an operator in the arithmetic conversions
performed shall have the same essential type category.
Changes are related to converting the integer constants to the
unsigned integer constants
Signed-off-by: Aastha Grover <aastha.grover@intel.com>
Fixes races where threads on another CPU are joining the
exiting thread, since it could still be running when
the joiners wake up on a different CPU.
Fixes problems where the thread object is still being
used by the kernel when the fn_abort() function is called,
preventing the thread object from being recycled or
freed back to a slab pool.
Fixes a race where a thread is aborted from one CPU while
it self-aborts on another CPU, that was currently worked
around with a busy-wait.
Precedent for doing this comes from FreeRTOS, which also
performs final thread cleanup in the idle thread.
Some logic in z_thread_single_abort() rearranged such that
when we release sched_spinlock, the thread object pointer
is never dereferenced by the kernel again; join waiters
or fn_abort() logic may free it immediately.
An assertion added to z_thread_single_abort() to ensure
it never gets called with thread == _current outside of an ISR.
Some logic has been added to ensure z_thread_single_abort()
tasks don't run more than once.
Fixes: #26486
Related to: #23063#23062
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
This code had one purpose only, feed timing information into a test and
was not used by anything else. The custom trace points unfortunatly were
not accurate and this test was delivering informatin that conflicted
with other tests we have due to placement of such trace points in the
architecture and kernel code.
For such measurements we are planning to use the tracing functionality
in a special mode that would be used for metrics without polluting the
architecture and kernel code with additional tracing and timing code.
Furthermore, much of the assembly code used had issues.
Signed-off-by: Anas Nashif <anas.nashif@intel.com>
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
* Move switched_in into the arch context switch assembly code,
which will correctly record the switched_in information.
* Add switched_in/switched_out for context switch in irq exit.
Signed-off-by: Watson Zeng <zhiwei@synopsys.com>