A watchdog? What is this?

Shared-memory technology allows us to split the workload among multiple programs, each doing its own task. All is well until something bad happens: one of these programs crashes (or is forcefully interrupted), the whole computer crashes, etc.

In such situations the whole framework might come to a stop, or it might be impossible to restart it (if the whole computer crashed). This is especially true if a process that no longer exists still holds the lock on either the main memory allocator or a critical data container (the top folder "/", for example).
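The stale-lock situation described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the vdsf implementation: the demo_lock record and the functions process_exists() and release_if_stale() are hypothetical names, and the sketch assumes the lock records the pid of its holder. The check relies on the POSIX behaviour of kill() with signal 0, which tests for the existence of a process without sending anything.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical lock record; NOT the actual vdsf layout. */
typedef struct {
    pid_t owner;   /* pid of the holder, 0 when the lock is free */
} demo_lock;

/* Returns true if a process with this pid currently exists. */
bool process_exists(pid_t pid)
{
    assert(pid > 0);
    /* Signal 0 sends nothing; only the existence and permission
     * checks are performed. */
    if (kill(pid, 0) == 0)
        return true;
    /* EPERM means the process exists but belongs to someone else. */
    return errno != ESRCH;
}

/* Frees the lock if its recorded owner is gone.
 * Returns true if a stale lock was released. */
bool release_if_stale(demo_lock *lock)
{
    if (lock->owner == 0)
        return false;               /* nobody holds the lock */
    if (process_exists(lock->owner))
        return false;               /* holder is still alive */
    lock->owner = 0;
    return true;
}
```

A pid check alone is not a complete answer: the operating system may reuse the pid of a dead process, so a real recovery scheme would likely need additional information about the holder.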

Less problematic but still annoying are erroneous values in the reference counters: deleted data or objects would never be removed from the shared memory, creating memory leaks. There are, of course, other issues; for example, the links of a node in a linked list might have erroneous values.
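To make the linked-list case concrete, here is a minimal sketch of a consistency check on a doubly linked ring. The demo_node type and repair_back_links() are hypothetical and do not reflect the real vdsf data structures. The sketch assumes the nodes live in an array, that links are stored as indices rather than pointers, and that the forward links are the trustworthy ones.

```c
#include <stddef.h>

/* Hypothetical node; links are array indices, since raw pointers are
 * not meaningful when processes map the memory at different addresses. */
typedef struct {
    size_t next;
    size_t prev;
} demo_node;

/* Walks the ring forward from index 0 and checks that each forward
 * link is mirrored by the matching backward link, fixing the backward
 * link when it is not. Returns the number of links repaired. */
size_t repair_back_links(demo_node *nodes, size_t count)
{
    size_t repaired = 0;
    size_t cur = 0;

    for (size_t i = 0; i < count; i++) {
        size_t nxt = nodes[cur].next;
        if (nxt >= count)
            break;      /* the forward chain itself is damaged; stop */
        if (nodes[nxt].prev != cur) {
            nodes[nxt].prev = cur;
            repaired++;
        }
        cur = nxt;
    }
    return repaired;
}
```

For a three-node ring in which one backward link was overwritten, the function returns 1 on the first pass and 0 on a second pass. Damage to the forward links is only detected here, not repaired.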

The role of the vdsf watchdog is to reduce these issues as much as possible and, eventually, to eliminate them completely. It will do this by using two techniques:

  1. At startup, the watchdog will check the VDS for all possible inconsistencies and will attempt to correct every problem it finds.
  2. It will also intercept abnormal exits in real time, as they occur, and clean up whatever mess was left behind by the crashed or interrupted process.
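One way to notice an abnormal exit as it occurs can be sketched as follows. The text does not say how the watchdog intercepts exits, so this is an assumption: the sketch supposes the monitored program is a child of the monitoring process, which lets it use the POSIX waitpid() call. The function name wait_for_crash() is hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Blocks until the given child terminates.
 * Returns the number of the signal that killed it, 0 if it exited
 * by itself, or -1 if waitpid() failed.
 * Assumption: the caller is the parent of `child`. */
int wait_for_crash(pid_t child)
{
    int status;

    if (waitpid(child, &status, 0) != child)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);   /* crashed or forcefully interrupted */
    return 0;                      /* exited via exit() or return from main */
}
```

A nonzero return value would be the point at which a cleanup pass over whatever the dead process held could be started.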

Current status

Additional crash-recovery features will be added as the work progresses; crash recovery is an important feature of this software.

Last updated on May 22, 2008.