# Dummy Worker
This is an umbrella issue for creating a dummy worker.
## Motivation
Clockwork development is currently impeded by two major obstacles:
- Workers require real GPUs
- Prototyping and testing schedulers requires numerous GPUs
## Solution

The solution is to create a "dummy" worker that emulates the behavior of a real worker without actually performing the DNN executions. One dummy worker should be able to emulate many GPUs concurrently. This will (a) enable scheduler testing at scale with few real machines; and (b) enable scheduler testing without GPUs.
## Design

The dummy worker should present the same `workerapi` interface as the real worker. However, when actions are received over the network, instead of actually executing them, it should simply "do nothing".
The dummy worker should be able to emulate multiple GPUs concurrently. Clockwork already has support for multiple GPUs on workers, and this is already integrated into the controller. The dummy worker should be able to present several GPUs (configurable), and emulate them each independently.
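As a sketch, the dummy worker could hold one independent state object per emulated GPU, with the GPU count taken from configuration. `EmulatedGPU` and `make_gpus` are illustrative names, not Clockwork's actual types:

```cpp
#include <vector>

// Hypothetical per-GPU state; each emulated GPU is independent.
struct EmulatedGPU {
  unsigned gpu_id;
  unsigned total_pages;  // size of this GPU's emulated page cache
};

// Construct one state object per configured GPU.
std::vector<EmulatedGPU> make_gpus(unsigned num_gpus, unsigned pages_per_gpu) {
  std::vector<EmulatedGPU> gpus;
  for (unsigned i = 0; i < num_gpus; i++)
    gpus.push_back(EmulatedGPU{i, pages_per_gpu});
  return gpus;
}
```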
## Implementation
The following are all requirements of the implementation:
### loadModelFromDisk

The dummy worker should respond to `loadModelFromDisk` in the same way as the real worker, so that we can run workloads agnostic to whether the worker is real or not. However, `loadModelFromDisk` shouldn't do the full model load that the real worker does. Instead, it only needs to read the model metadata (to determine the number of weights pages, the input size, and the output size) and the model's performance profile (to determine how long `loadWeights` and `infer` should spin for). `loadModelFromDisk` should produce the same errors that the real worker would produce (e.g. if a model cannot be found). Unlike the real worker, `loadModelFromDisk` is allowed to return an error if a model does not have a performance profile.
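A minimal sketch of this behavior, assuming a hypothetical `DummyModel` struct and hard-coded values standing in for the parsed metadata and profile files (the struct fields and function signature are illustrative, not Clockwork's actual API):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical metadata the dummy worker keeps per model: just enough
// to size the page cache allocation and time the emulated actions.
struct DummyModel {
  unsigned weights_pages;                 // pages needed in the page cache
  size_t input_size;                      // bytes per single-batch input
  size_t output_size;                     // bytes per single-batch output
  uint64_t load_weights_ns;               // from the performance profile
  std::map<unsigned, uint64_t> infer_ns;  // batch size -> infer duration
};

// Sketch of loadModelFromDisk: read only metadata and profile, no real
// model load. The bool flags stand in for actual filesystem checks.
DummyModel loadModelFromDisk(const std::string& model_path,
                             bool exists_on_disk, bool has_profile) {
  if (!exists_on_disk)  // same error the real worker would produce
    throw std::runtime_error("model not found: " + model_path);
  if (!has_profile)     // dummy-worker-only error, allowed by the design
    throw std::runtime_error("no performance profile: " + model_path);
  // A real implementation would parse these from the model's files.
  return DummyModel{/*weights_pages=*/5, /*input_size=*/602112,
                    /*output_size=*/4000, /*load_weights_ns=*/8000000,
                    {{1, 3000000}, {2, 5000000}, {4, 9000000}}};
}
```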
### getWorkerState

The `getWorkerState` action should behave in the same way as on the real worker, including the response data that is sent and any errors that might be sent.
### earliest and latest

All actions should respect the `earliest` and `latest` timestamps, and return the same errors that the real worker returns if actions cannot be executed in time.
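This check is the same for every action type, so it can be factored out. A minimal sketch, with illustrative names (the real worker reports richer error information than a three-way result):

```cpp
#include <cstdint>

// Illustrative outcome codes for the timing check.
enum class Timing { TooEarly, Execute, TooLate };

// Sketch of the earliest/latest admission check every dummy action
// performs (timestamps in nanoseconds). An action may not begin before
// `earliest`, and must fail with an error if `latest` has passed.
Timing check_timing(uint64_t now, uint64_t earliest, uint64_t latest) {
  if (now > latest) return Timing::TooLate;    // error, as on the real worker
  if (now < earliest) return Timing::TooEarly; // hold until earliest
  return Timing::Execute;                      // eligible to execute now
}
```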
### Page Cache

Each emulated GPU should have its own page cache. The amount of memory/pages in the page cache should be configurable using the same `ClockworkConfig` that the real worker uses. `loadWeights` and `evictWeights` actions should update the page cache as appropriate. The same logic from the real worker applies here: `loadWeights` should fail if there are insufficient free pages; `evictWeights` should fail if no weights are present; `infer` should fail if weights are not present when it begins executing; and `infer` should fail if the weights were modified during execution (e.g. if they were evicted while executing).
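A minimal per-GPU page cache sketch covering those failure cases. The real worker's page cache also tracks LRU state and page handles; here we only count free pages, and version each model's weights so that an eviction during execution can be detected by `infer`. All names are illustrative:

```cpp
#include <cstdint>
#include <map>
#include <string>

class DummyPageCache {
 public:
  explicit DummyPageCache(unsigned total_pages) : free_pages_(total_pages) {}

  // loadWeights: fails if there are insufficient free pages.
  bool load(const std::string& model, unsigned pages) {
    if (pages > free_pages_) return false;
    free_pages_ -= pages;
    resident_[model] = pages;
    ++version_[model];  // new weights instance
    return true;
  }

  // evictWeights: fails if no weights are present.
  bool evict(const std::string& model) {
    auto it = resident_.find(model);
    if (it == resident_.end()) return false;
    free_pages_ += it->second;
    resident_.erase(it);
    ++version_[model];  // invalidates in-flight infers
    return true;
  }

  bool present(const std::string& model) const {
    return resident_.count(model) != 0;
  }

  // infer records the version when it starts and re-checks on completion;
  // a mismatch means the weights were modified (evicted) mid-execution.
  uint64_t version(const std::string& model) { return version_[model]; }

 private:
  unsigned free_pages_;
  std::map<std::string, unsigned> resident_;
  std::map<std::string, uint64_t> version_;
};
```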
### Executor

Each executor should follow the same behavior as the real worker: dequeue actions one at a time based on the `earliest` timestamp, and discard actions and return an error if `latest` has passed. Unlike the real worker, each executor shouldn't run as a separate thread; a thread per executor would be overkill, especially if we want to be able to simulate dozens of GPUs per worker. Instead, I would recommend implementing the executors in the style of an event simulator, so that multiple executors can be emulated in a single thread. Clockwork's workload generator engine is implemented in this style.
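The event-simulator style might look like the following sketch: every executor (for every emulated GPU) schedules its next wake-up into one shared priority queue, and a single thread drains events in timestamp order. This is an illustration of the pattern, not Clockwork's actual engine:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// One timed callback; many executors share a single queue of these.
struct Event {
  uint64_t at;                       // nanosecond timestamp of the event
  std::function<void(uint64_t)> fn;  // callback, receives the event time
  bool operator>(const Event& o) const { return at > o.at; }
};

class EventQueue {
 public:
  void schedule(uint64_t at, std::function<void(uint64_t)> fn) {
    events_.push(Event{at, std::move(fn)});
  }
  // Drain all events up to and including `until`, in time order,
  // regardless of which executor scheduled them.
  void run_until(uint64_t until) {
    while (!events_.empty() && events_.top().at <= until) {
      Event e = events_.top();
      events_.pop();
      e.fn(e.at);
    }
  }
 private:
  // Min-heap ordered by timestamp.
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
};
```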
### loadWeights

For each GPU, there should be a dedicated executor for its `loadWeights` actions. For each specified model, `loadWeights` should execute for the duration specified in the model's performance profile. If, upon completion, the model's weights page allocation was already evicted, then the action should return an error.
### evictWeights

Like `loadWeights`, there should be a dedicated executor for `evictWeights` actions, separate from the `loadWeights` executor. `evictWeights` should only fail because of bad `earliest` and `latest` values, or because weights weren't present at all. It should be possible to evict weights even if the corresponding `loadWeights` is still executing.
### infer

Like `loadWeights`, each emulated GPU should have a dedicated executor for `infer` actions. The only difference is that `infer` actions can specify different batch sizes, so the appropriate value should be used from the model's performance profile. The dummy worker should support padded batches (e.g. if only a batch size of 8 is supported, but the worker receives 7, then it should execute as 8).
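Padded-batch selection can be sketched as a lookup over the batch sizes present in the model's performance profile, choosing the smallest supported size that fits the request (the profile map type matches the earlier hypothetical `DummyModel` sketch):

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Given the batch sizes in a model's performance profile (batch -> infer
// duration), an infer for batch b executes as the smallest supported
// batch size >= b (e.g. 7 pads up to 8).
unsigned padded_batch(const std::map<unsigned, uint64_t>& profile,
                      unsigned b) {
  auto it = profile.lower_bound(b);  // first supported size >= b
  if (it == profile.end())
    throw std::runtime_error("batch size too large for model");
  return it->first;
}
```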
If the input size of the `infer` action is zero, then the `InferResult` output should also be size zero. If the input size is non-zero, then the dummy worker will need to allocate an output to send with the `InferResult` (assuming the action succeeds).
## Telemetry
The real worker records some telemetry to files for actions and for tasks. The initial dummy worker implementation can omit telemetry; we can revisit this decision in future if it turns out to be necessary.
## Regarding CUDA calls

The dummy worker is allowed to depend on CUDA and make CUDA calls, since CUDA will work even on machines that have no GPUs. However, the dummy worker should not make use of any CUDA devices, so that it can run on machines without GPUs.