kernel/sched: Address thread abort termination delay issue on SMP

It's possible for a thread to abort itself simultaneously with an
external abort from another thread.  In fact, in our test suite this
is common: ztest aborts its own spawned threads at the end of a test,
just as they tend to be exiting on their own.

When that happens, the thread marks itself DEAD and does all its
scheduler bookkeeping, but it is STILL RUNNING on its own stack until
it makes its way to its final swap.  The external context would see
that "dead" metadata and return from k_thread_abort(), allowing the
next test to reuse and respawn the same thread struct while the old
context was still running.  Obviously that's bad.

Unfortunately, this is impossible to address completely without
modifying every SMP architecture to add an API-visible hook to every
swap that signals completion.  In practice the best we can do is add a
delay.  But note the optimization: almost always, the scheduler IPI
catches the running thread and kills it from interrupt context
(i.e. on a different stack).  When that happens, we know that the
interrupted thread can never be resumed (because it's dead), so we can
elide the delay.  We only pay the cost when we actually detect a race.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
commit e06ba702d5
Author: Andy Ross, 2020-01-14 06:26:10 -08:00
Committed by: Andrew Boie
2 changed files with 37 additions and 2 deletions

@@ -59,8 +59,11 @@
 /* Thread is being aborted (SMP only) */
 #define _THREAD_ABORTING (BIT(5))
+
+/* Thread was aborted in interrupt context (SMP only) */
+#define _THREAD_ABORTED_IN_ISR (BIT(6))
 /* Thread is present in the ready queue */
-#define _THREAD_QUEUED (BIT(6))
+#define _THREAD_QUEUED (BIT(7))
 /* end - states */


@@ -15,6 +15,13 @@
 #include <stdbool.h>
 #include <kernel_internal.h>
+
+/* Maximum time between the time a self-aborting thread flags itself
+ * DEAD and the last read or write to its stack memory (i.e. the time
+ * of its next swap()). In theory this might be tuned per platform,
+ * but in practice this conservative value should be safe.
+ */
+#define THREAD_ABORT_DELAY_US 500
 #if defined(CONFIG_SCHED_DUMB)
 #define _priq_run_add z_priq_dumb_add
 #define _priq_run_remove z_priq_dumb_remove
@@ -436,7 +443,20 @@ void z_thread_single_abort(struct k_thread *thread)
 			thread->base.pended_on = NULL;
 		}
 	}
-	thread->base.thread_state |= _THREAD_DEAD;
+	u32_t mask = _THREAD_DEAD;
+
+	/* If the abort is happening in interrupt context,
+	 * that means that execution will never return to the
+	 * thread's stack and that the abort is known to be
+	 * complete. Otherwise the thread still runs a bit
+	 * until it can swap, requiring a delay.
+	 */
+	if (IS_ENABLED(CONFIG_SMP) && arch_is_in_isr()) {
+		mask |= _THREAD_ABORTED_IN_ISR;
+	}
+
+	thread->base.thread_state |= mask;
 }

 	sys_trace_thread_abort(thread);
@@ -1178,7 +1198,9 @@ void z_sched_abort(struct k_thread *thread)
 	 * it locally. Not all architectures support that, alas. If
	 * we don't have it, we need to wait for some other interrupt.
	 */
+	key = k_spin_lock(&sched_spinlock);
 	thread->base.thread_state |= _THREAD_ABORTING;
+	k_spin_unlock(&sched_spinlock, key);
 #ifdef CONFIG_SCHED_IPI_SUPPORTED
 	arch_sched_ipi();
 #endif
@@ -1202,6 +1224,16 @@ void z_sched_abort(struct k_thread *thread)
 			k_busy_wait(100);
 		}
 	}
+
+	/* If the thread self-aborted (e.g. its own exit raced with
+	 * this external abort) then even though it is flagged DEAD,
+	 * it's still running until its next swap and thus the thread
+	 * object is still in use. We have to resort to a fallback
+	 * delay in that circumstance.
+	 */
+	if ((thread->base.thread_state & _THREAD_ABORTED_IN_ISR) == 0U) {
+		k_busy_wait(THREAD_ABORT_DELAY_US);
+	}
 }
 #endif