Restart Recovery
Schedune is designed under the assumption that the Control Plane may crash, reboot, or become disconnected from its nodes at any time. Schedune guarantees Restart Resilience.
The Philosophy
- Persistence: The SQLite execution store is the source of truth for what should be running.
- Inspection: The runtime process table is the source of truth for what is actually running.
- No Guessing: Schedune will never silently relaunch, implicitly kill, or pretend a workload is healthy after a crash. If the evidence is ambiguous, the execution goes to Unknown.
Bootstrap Sequence
When the Schedune Control Plane starts, it performs a Recovery Bootstrap:
- Load Recoverable Executions: Pulls all non-terminal (Starting, Running, Degraded, Terminating, Unknown) records from the database.
- Stamp Recovery Epoch: Associates the current boot process with a new UUID epoch for auditability.
- Inspect Identities: Uses the PID and backend markers to verify the process still exists on the host.
- Classify and Transition:
  - Process exists and matches identity: Emits ExecutionRehydrated, state remains Running.
  - Process missing but was Running: Safely transitions to Unknown (it may have exited legitimately during the blackout).
  - Mid-flight during crash (e.g., Launching before a PID was assigned): Emits ERR_RECOVERY_STALE_HANDLE and transitions to Unknown.
  - Terminating process is missing: Concludes the termination completed and transitions to Terminated.
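The bootstrap steps above can be sketched in Python roughly as follows. This is a minimal illustration, not Schedune's actual implementation: the executions table, its columns (id, state, pid, recovery_epoch), and the is_alive callback are hypothetical names introduced for the example. Leaving a non-Running state unchanged when the process is found alive, and sending every other unlisted combination to Unknown, are assumptions made in the spirit of the No Guessing rule.

```python
import sqlite3
import uuid
from typing import Callable, Dict, Optional, Tuple

# Non-terminal states named in the bootstrap description.
RECOVERABLE = ("Starting", "Running", "Degraded", "Terminating", "Unknown")


def classify(state: str, pid: Optional[int], alive: bool) -> Tuple[str, Optional[str]]:
    """Return (next_state, event_or_error) for one recovered record."""
    if pid is None:
        # Crash happened mid-launch, before a PID was recorded.
        return "Unknown", "ERR_RECOVERY_STALE_HANDLE"
    if alive:
        # The text covers Running; keeping other stored states as-is is an assumption.
        return state, "ExecutionRehydrated"
    if state == "Terminating":
        # The process is gone, so the termination is taken to have completed.
        return "Terminated", None
    # Process missing and nothing proves why: do not guess.
    return "Unknown", None


def recover(
    conn: sqlite3.Connection,
    is_alive: Callable[[int], bool],
) -> Tuple[str, Dict[str, Tuple[str, Optional[str]]]]:
    """Run one recovery pass over a hypothetical 'executions' table.

    is_alive stands in for the PID and backend-marker inspection.
    """
    epoch = str(uuid.uuid4())  # new recovery epoch for this boot
    placeholders = ",".join("?" * len(RECOVERABLE))
    rows = conn.execute(
        f"SELECT id, state, pid FROM executions WHERE state IN ({placeholders})",
        RECOVERABLE,
    ).fetchall()

    results: Dict[str, Tuple[str, Optional[str]]] = {}
    for exec_id, state, pid in rows:
        alive = pid is not None and is_alive(pid)
        next_state, event = classify(state, pid, alive)
        conn.execute(
            "UPDATE executions SET state = ?, recovery_epoch = ? WHERE id = ?",
            (next_state, epoch, exec_id),
        )
        results[exec_id] = (next_state, event)
    conn.commit()
    return epoch, results
```

Passing the liveness check in as a callback keeps the classification rules testable without real processes. A bare PID check would not be enough in practice, since PIDs can be reused, which is presumably why the inspection step also relies on backend markers.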
Orphan Reconciliation (Upcoming)
Orphans are runtime processes that appear to belong to Schedune but have no durable record. Schedune surfaces orphan candidates with an explicit taxonomy (e.g., OrphanUnmanagedBackendProcess, OrphanStaleExecutionArtifact) but intentionally does not automatically kill or adopt them without explicit operator policy.
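Because this feature is still upcoming, the following Python sketch is only a guess at the shape of the idea. The two taxonomy names come from the text above, but their exact meanings are assumed here: a live marked process with no durable record is treated as OrphanUnmanagedBackendProcess, and a leftover marker with no live process as OrphanStaleExecutionArtifact. The Observation and OrphanCandidate types and the find_orphan_candidates function are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Set


class OrphanKind(Enum):
    # Names taken from the text; the meanings assigned below are assumptions.
    UNMANAGED_BACKEND_PROCESS = "OrphanUnmanagedBackendProcess"
    STALE_EXECUTION_ARTIFACT = "OrphanStaleExecutionArtifact"


@dataclass(frozen=True)
class Observation:
    execution_id: str    # id read from a hypothetical Schedune marker
    pid: Optional[int]   # None when only an artifact remains, with no live process


@dataclass(frozen=True)
class OrphanCandidate:
    kind: OrphanKind
    observation: Observation


def find_orphan_candidates(
    observed: Iterable[Observation],
    durable_ids: Set[str],
) -> List[OrphanCandidate]:
    """Surface candidates only; never kill or adopt anything here."""
    candidates: List[OrphanCandidate] = []
    for obs in observed:
        if obs.execution_id in durable_ids:
            continue  # backed by a durable record, so not an orphan
        kind = (
            OrphanKind.UNMANAGED_BACKEND_PROCESS
            if obs.pid is not None
            else OrphanKind.STALE_EXECUTION_ARTIFACT
        )
        candidates.append(OrphanCandidate(kind, obs))
    return candidates
```

The function returns a report and performs no side effects, which mirrors the stated intent that any kill or adopt action requires explicit operator policy.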