diff --git a/doc/reference/kernel/index.rst b/doc/reference/kernel/index.rst
index 99b2174ac90..b4f0ef27b43 100644
--- a/doc/reference/kernel/index.rst
+++ b/doc/reference/kernel/index.rst
@@ -38,6 +38,7 @@ synchronization.
    other/polling.rst
    synchronization/semaphores.rst
    synchronization/mutexes.rst
+   smp/smp.rst
 
 Data Passing
 ************
diff --git a/doc/reference/kernel/smp/smp.rst b/doc/reference/kernel/smp/smp.rst
new file mode 100644
index 00000000000..ed078f1b3fe
--- /dev/null
+++ b/doc/reference/kernel/smp/smp.rst
@@ -0,0 +1,291 @@
.. _smp_arch:

Symmetric Multiprocessing
#########################

On multiprocessor architectures, Zephyr supports the use of multiple
physical CPUs running Zephyr application code. This support is
"symmetric" in the sense that no specific CPU is treated specially by
default. Any processor is capable of running any Zephyr thread, with
access to all standard Zephyr APIs.

No special application code needs to be written to take advantage of
this feature. If there are two Zephyr application threads runnable on
a supported dual-processor device, they will both run simultaneously.

SMP configuration is controlled by the :option:`CONFIG_SMP` kconfig
variable. This must be set to "y" to enable SMP features; otherwise a
uniprocessor kernel will be built. In general the platform default
will have enabled this anywhere it is supported. When enabled, the
number of physical CPUs available is visible at build time as
:option:`CONFIG_MP_NUM_CPUS`. Likewise, its default will be the
number of available CPUs on the platform, and it is not expected that
typical apps will change it. But it is legal and supported to set
this to a smaller (though obviously not larger) number for special
purposes (e.g. for testing, or to reserve a physical CPU for running
non-Zephyr code).

Synchronization
***************

At the application level, core Zephyr IPC and synchronization
primitives all behave identically under an SMP kernel. For example,
semaphores used to implement blocking mutual exclusion continue to be
a proper application choice.

At the lowest level, however, Zephyr code has often used the
``irq_lock()``/``irq_unlock()`` primitives to implement fine-grained
critical sections using interrupt masking. These APIs continue to
work via an emulation layer (see below), but the masking technique
does not: the fact that your CPU will not be interrupted while you are
in your critical section says nothing about whether a different CPU
will be running simultaneously and inspecting or modifying the same
data!

Spinlocks
=========

SMP systems provide a more constrained ``k_spin_lock()`` primitive
that not only masks interrupts locally, as done by ``irq_lock()``, but
also atomically validates that a shared lock variable has been
modified before returning to the caller, "spinning" on the check if
needed to wait for the other CPU to exit the lock. The default Zephyr
implementation of ``k_spin_lock()`` and ``k_spin_unlock()`` is built
on top of the pre-existing ``atomic_t`` layer (itself usually
implemented using compiler intrinsics), though facilities exist for
architectures to define their own for performance reasons.

One important difference between IRQ locks and spinlocks is that the
earlier API was naturally recursive: the lock was global, so it was
legal to acquire a nested lock inside of a critical section.
Spinlocks are separable: you can have many locks for separate
subsystems or data structures, preventing CPUs from contending on a
single global resource. But that means that spinlocks must not be
used recursively. Code that holds a specific lock must not try to
re-acquire it, or it will deadlock (it is perfectly legal to nest
**distinct** spinlocks, however). A validation layer is available to
detect and report bugs like this.
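
A minimal usage sketch (the lock, counter, and function names here are
invented for illustration; the key returned by ``k_spin_lock()`` must
be passed back to the matching ``k_spin_unlock()``):

.. code-block:: c

    #include <spinlock.h>

    /* One lock per protected data structure, not one global lock */
    static struct k_spinlock counter_lock;
    static int shared_counter;

    void increment_shared_counter(void)
    {
        /* Masks local interrupts and spins until the lock is free */
        k_spinlock_key_t key = k_spin_lock(&counter_lock);

        shared_counter++;    /* critical section, safe on all CPUs */

        /* Releases the lock and restores the saved interrupt state */
        k_spin_unlock(&counter_lock, key);
    }
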
When used on a uniprocessor system, the data component of the spinlock
(the atomic lock variable) is unnecessary and elided. Except for the
recursion rule described above, spinlocks in single-CPU contexts
produce identical code to legacy IRQ locks. In fact, the entirety of
the Zephyr core kernel has now been ported to use spinlocks
exclusively.

Legacy irq_lock() emulation
===========================

For the benefit of applications written to the uniprocessor locking
API, ``irq_lock()`` and ``irq_unlock()`` continue to work compatibly
on SMP systems, with identical semantics to their legacy versions.
They are implemented as a single global spinlock, with a nesting count
and the ability to be atomically reacquired on context switch into
locked threads. The kernel will ensure that only one thread across
all CPUs can hold the lock at any time, that it is released on context
switch, and that it is re-acquired when necessary to restore the lock
state when a thread is switched in. Other CPUs will spin waiting for
the release to happen.

The overhead involved in this process has measurable performance
impact, however. Unlike uniprocessor apps, SMP apps using
``irq_lock()`` are not simply invoking a very short (often ~1
instruction) interrupt masking operation. That, and the fact that the
IRQ lock is global, means that code expected to run in an SMP context
should use the spinlock API wherever possible.

CPU Mask
********

It is often desirable for real-time applications to deliberately
partition work across physical CPUs instead of relying solely on the
kernel scheduler to decide on which threads to execute. Zephyr
provides an API, controlled by the :option:`CONFIG_SCHED_CPU_MASK`
kconfig variable, which can associate a specific set of CPUs with each
thread, indicating on which CPUs it can run.

By default, new threads can run on any CPU. Calling
``k_thread_cpu_mask_disable()`` with a particular CPU ID will prevent
that thread from running on that CPU in the future, and
``k_thread_cpu_mask_enable()`` will re-enable execution. There are
also ``k_thread_cpu_mask_clear()`` and
``k_thread_cpu_mask_enable_all()`` APIs available for convenience.
For obvious reasons, these APIs are illegal to call on a runnable
thread: the thread must be blocked or suspended, otherwise ``-EINVAL``
will be returned.

Note that when this feature is enabled, the scheduler algorithm
involved in doing the per-CPU mask test requires that the list be
traversed in full; the kernel does not keep a per-CPU run queue. That
means that the performance benefits of the
:option:`CONFIG_SCHED_SCALABLE` and :option:`CONFIG_SCHED_MULTIQ`
scheduler backends cannot be realized. CPU mask processing is
available only when :option:`CONFIG_SCHED_DUMB` is the selected
backend, and this requirement is enforced in the configuration layer.
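
As an illustrative sketch, pinning a thread to a single CPU might look
like the following (``pin_to_cpu0()`` is an invented example function;
note the suspend/resume bracketing required by the restriction above):

.. code-block:: c

    #include <kernel.h>

    void pin_to_cpu0(k_tid_t tid)
    {
        /* The mask APIs reject runnable threads, so make sure the
         * target is suspended before touching its mask.
         */
        k_thread_suspend(tid);

        k_thread_cpu_mask_clear(tid);      /* disallow every CPU... */
        k_thread_cpu_mask_enable(tid, 0);  /* ...then allow only CPU 0 */

        k_thread_resume(tid);
    }
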
SMP Boot Process
****************

A Zephyr SMP kernel begins boot identically to a uniprocessor kernel.
Auxiliary CPUs begin in a disabled state in the architecture layer.
All standard kernel initialization, including device initialization,
happens on a single CPU before other CPUs are brought online.

Just before entering the application ``main()`` function, the kernel
calls ``z_smp_init()`` to launch the SMP initialization process. This
enumerates the configured CPUs, calling into the architecture layer
using ``z_arch_start_cpu()`` for each one. This function is passed a
memory region to use as a stack on the foreign CPU (in practice it
uses the area that will become that CPU's interrupt stack), the
address of a local ``smp_init_top()`` callback function to run on that
CPU, and a pointer to a "start flag" address which will be used as an
atomic signal.

The local SMP initialization (``smp_init_top()``) on each CPU is then
invoked by the architecture layer. Note that interrupts are still
masked at this point. This routine is responsible for calling
``smp_timer_init()`` to set up any needed state in the timer driver;
on many architectures the timer is a per-CPU device and needs to be
configured specially on auxiliary CPUs. It then waits (spinning) for
the atomic "start flag" to be released by the main thread, to
guarantee that all SMP initialization is complete before any Zephyr
application code runs, and finally calls ``z_swap()`` to transfer
control to the appropriate runnable thread via the standard scheduler
API.

Interprocessor Interrupts
*************************

When running in multiprocessor environments, it is occasionally the
case that state modified on the local CPU needs to be synchronously
handled on a different processor.

One example is the Zephyr ``k_thread_abort()`` API, which cannot
return until the thread being aborted is no longer runnable. If that
thread is currently running on another CPU, this becomes difficult to
implement.

Another is low power idle. It is a firm requirement on many devices
that system idle be implemented using a low-power mode with as many
interrupts (including periodic timer interrupts) disabled or deferred
as possible. If a CPU is in such a state, and a thread becomes
runnable on another CPU, the idle CPU has no way to "wake up" to
handle the newly-runnable load.

So where possible, Zephyr SMP architectures should implement an
interprocessor interrupt (IPI). The current framework is very simple:
the architecture provides a ``z_arch_sched_ipi()`` call, which when
invoked will flag an interrupt on all CPUs except the current one
(though interrupting the current CPU too is allowed behavior); each
interrupted CPU will then invoke the ``z_sched_ipi()`` function
implemented in the scheduler. The expectation is that these APIs will
evolve over time to encompass more functionality (e.g. cross-CPU
calls), and that the scheduler-specific calls here will be implemented
in terms of a more general framework.
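
Where the mechanism exists, the architecture-side glue can be quite
small. A sketch of its general shape (``soc_raise_ipi()`` is a
hypothetical stand-in for whatever SoC-specific operation raises the
interrupt on the other cores):

.. code-block:: c

    /* Hypothetical architecture glue: flag the reserved IPI on the
     * other CPUs.  The actual trigger mechanism is SoC-specific.
     */
    void z_arch_sched_ipi(void)
    {
        soc_raise_ipi();
    }

    /* ISR attached to the IPI line on every CPU.  It only needs to
     * call into the scheduler; the interesting work (e.g. the abort
     * check) happens on interrupt exit as usual.
     */
    static void sched_ipi_isr(const void *unused)
    {
        ARG_UNUSED(unused);
        z_sched_ipi();
    }
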
Note that not all SMP architectures will have a usable IPI mechanism
(it may be missing, or simply undocumented/unimplemented). In those
cases Zephyr provides fallback behavior that is correct, but perhaps
suboptimal.

Using this, ``k_thread_abort()`` becomes only slightly more
complicated in SMP: for the case where a thread is actually running on
another CPU (we can detect this atomically inside the scheduler), we
broadcast an IPI and spin, waiting for the thread either to become
"DEAD" or to re-enter the queue (in which case we terminate it the
same way we would have in uniprocessor mode). Note that the "aborted"
check happens on any interrupt exit, so there is no special handling
needed in the IPI per se. This also allows a reasonable fallback when
no IPI is available: we can simply spin, waiting until the foreign CPU
receives any interrupt, though this may take much longer!

Likewise, idle wakeups are trivially implementable with an empty IPI
handler. If a thread is added to an empty run queue (i.e. there may
have been idle CPUs), we broadcast an IPI. A foreign CPU will then be
able to see the new thread when exiting from the interrupt and will
switch to it if appropriate.

Without an IPI, however, a low-power idle state that requires an
interrupt to exit cannot be used to synchronously run new threads.
The workaround in that case is more invasive: Zephyr will **not**
enter the system idle handler and will instead spin in its idle loop,
testing the scheduler state at high frequency (though not spinning on
it, as that would involve severe lock contention) for new threads.
The expectation is that power-constrained SMP applications will always
provide an IPI, and that this code will only be used for testing
purposes or on systems without power consumption requirements.

SMP Kernel Internals
********************

In general, Zephyr kernel code is SMP-agnostic and, like application
code, will work correctly regardless of the number of CPUs available.
But in a few areas there are notable changes in structure or behavior.

Per-CPU data
============

Many elements of the core kernel data need to be implemented for each
CPU in SMP mode. For example, the ``_current`` thread pointer
obviously needs to reflect what is running locally, as there are
multiple threads running concurrently. Likewise, a kernel-provided
interrupt stack needs to be created and assigned for each physical
CPU, as does the interrupt nesting count used to detect ISR state.

These fields are now moved into a separate ``struct _cpu`` instance
within the ``_kernel`` struct, which has a ``cpus[]`` array indexed by
ID. Compatibility fields are provided for legacy uniprocessor code
trying to access the fields of ``cpus[0]`` using the older syntax and
assembly offsets.

Note that an important requirement on the architecture layer is that
the pointer to this CPU struct be available rapidly when in kernel
context. The expectation is that ``z_arch_curr_cpu()`` will be
implemented using a CPU-provided register or addressing mode that can
store this value across arbitrary context switches or interrupts and
make it available to any kernel-mode code.
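
In rough outline, that arrangement looks like the following (an
illustrative simplification, not the kernel's exact definitions; the
``_current`` mapping in particular is schematic):

.. code-block:: c

    /* Simplified shape of the per-CPU bookkeeping */
    struct _cpu {
        int nested;                /* ISR nesting count */
        char *irq_stack;           /* this CPU's interrupt stack */
        struct k_thread *current;  /* thread running on this CPU */
    };

    struct z_kernel {
        struct _cpu cpus[CONFIG_MP_NUM_CPUS];  /* indexed by CPU ID */
        /* ... remaining global kernel state ... */
    };

    /* Kernel code reaches its own CPU's record via the fast
     * architecture-provided accessor described above.
     */
    #define _current (z_arch_curr_cpu()->current)
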
Similarly, where on a uniprocessor system Zephyr could simply create a
global "idle thread" at the lowest priority, in SMP we may need one
for each CPU. This makes the internal predicate test ``_is_idle()``
in the scheduler, which is a hot performance path, more complicated
than simply testing the thread pointer for equality with a known
static variable. In SMP mode, idle threads are distinguished by a
separate field in the thread struct.

Switch-based context switching
==============================

The traditional Zephyr context switch primitive has been ``z_swap()``.
Unfortunately, this function takes no argument specifying a thread to
switch to. The expectation has always been that the scheduler has
already made its preemption decision when its state was last modified,
and has cached the resulting "next thread" pointer in a location where
architecture context switch primitives can find it via a simple struct
offset. That technique will not work in SMP, because another CPU may
have modified scheduler state since the current CPU last exited the
scheduler (for example: it might already be running that cached
thread!).

Instead, the SMP "switch to" decision needs to be made synchronously
with the swap call, and as we don't want per-architecture assembly
code to be handling scheduler internal state, Zephyr requires a
somewhat lower-level context switch primitive for SMP systems:
``z_arch_switch()`` is always called with interrupts masked, and takes
exactly two arguments. The first is an opaque (architecture-defined)
handle to the context to which it should switch, and the second is a
pointer through which it should store the corresponding handle for the
thread being switched out.

The kernel then provides a portable ``z_swap()`` implementation on top
of this primitive, placing the relevant scheduler logic in a location
where the architecture doesn't need to understand it. Similarly, on
interrupt exit, switch-based architectures are expected to call
``z_get_next_switch_handle()`` to retrieve the next thread to run from
the scheduler, passing in an "interrupted" handle of the same opaque
type used by switch, which the kernel will then save in the
interrupted thread struct.

Note that while SMP requires :option:`CONFIG_USE_SWITCH`, the reverse
is not true. A uniprocessor architecture built with
:option:`CONFIG_SMP` set to "n" might still decide to implement its
context switching using ``z_arch_switch()``.
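
Putting the pieces together, the architecture-facing contract
described above amounts to roughly the following pair of signatures
(the comments are interpretive, not normative):

.. code-block:: c

    /* Always called with interrupts masked.  Switches to the context
     * identified by the architecture-defined opaque handle switch_to,
     * and stores the outgoing thread's handle through switched_from.
     */
    void z_arch_switch(void *switch_to, void **switched_from);

    /* Called by switch-based architectures on interrupt exit: passes
     * the "interrupted" handle to the scheduler and returns the
     * handle of the thread that should run next.
     */
    void *z_get_next_switch_handle(void *interrupted);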