# Dummy Worker
This is an umbrella issue for creating a dummy worker.
## Motivation
Clockwork development is currently impeded by two major obstacles:
- Workers require real GPUs
- Prototyping and testing schedulers requires numerous GPUs
## Solution

The solution is to create a "dummy" worker that emulates the behavior of a real worker without actually performing the DNN executions. One dummy worker should be able to emulate many GPUs concurrently. This will (a) enable scheduler testing at scale with few real machines; and (b) enable scheduler testing without GPUs.
## Design

The dummy worker should present the same `workerapi` interface as the real worker. However, when actions are received over the network, instead of actually executing them, it should simply "do nothing".
The dummy worker should be able to emulate multiple GPUs concurrently. Clockwork already has support for multiple GPUs on workers, and this is already integrated into the controller. The dummy worker should be able to present several GPUs (configurable), and emulate them each independently.
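As a sketch, the dummy worker could hold one independent state object per emulated GPU, with the GPU count taken from configuration. `EmulatedGPU` and `make_gpus` are illustrative names, not Clockwork's actual types:

```cpp
#include <vector>

// Hypothetical per-GPU state; each emulated GPU is independent.
struct EmulatedGPU {
  unsigned gpu_id;
  unsigned total_pages;  // size of this GPU's emulated page cache
};

// Construct one state object per configured GPU.
std::vector<EmulatedGPU> make_gpus(unsigned num_gpus, unsigned pages_per_gpu) {
  std::vector<EmulatedGPU> gpus;
  for (unsigned i = 0; i < num_gpus; i++)
    gpus.push_back(EmulatedGPU{i, pages_per_gpu});
  return gpus;
}
```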
## Implementation
The following are all requirements of the implementation:
### loadModelFromDisk

The dummy worker should respond to `loadModelFromDisk` in the same way as the real worker, so that we can run workloads agnostic to whether the worker is real or not. However, `loadModelFromDisk` shouldn't do the full model load that the real worker does. Instead, it only needs to read the model metadata (to determine the number of weights pages, the input size, and the output size) and the model's performance profile (to determine how long `loadWeights` and `infer` should spin for). `loadModelFromDisk` should produce the same errors that the real worker would produce (e.g. if a model cannot be found). Unlike the real worker, `loadModelFromDisk` is allowed to return an error if a model does not have a performance profile.
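A minimal sketch of this behavior, assuming a hypothetical `DummyModel` struct and hard-coded values standing in for the parsed metadata and profile files (the struct fields and function signature are illustrative, not Clockwork's actual API):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical metadata the dummy worker keeps per model: just enough
// to size the page cache allocation and time the emulated actions.
struct DummyModel {
  unsigned weights_pages;                 // pages needed in the page cache
  size_t input_size;                      // bytes per single-batch input
  size_t output_size;                     // bytes per single-batch output
  uint64_t load_weights_ns;               // from the performance profile
  std::map<unsigned, uint64_t> infer_ns;  // batch size -> infer duration
};

// Sketch of loadModelFromDisk: read only metadata and profile, no real
// model load. The bool flags stand in for actual filesystem checks.
DummyModel loadModelFromDisk(const std::string& model_path,
                             bool exists_on_disk, bool has_profile) {
  if (!exists_on_disk)  // same error the real worker would produce
    throw std::runtime_error("model not found: " + model_path);
  if (!has_profile)     // dummy-worker-only error, allowed by the design
    throw std::runtime_error("no performance profile: " + model_path);
  // A real implementation would parse these from the model's files.
  return DummyModel{/*weights_pages=*/5, /*input_size=*/602112,
                    /*output_size=*/4000, /*load_weights_ns=*/8000000,
                    {{1, 3000000}, {2, 5000000}, {4, 9000000}}};
}
```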
### getWorkerState

The `getWorkerState` action should behave in the same way as on the real worker, including the response data that is sent and any errors that might be sent.
### earliest and latest

All actions should respect the `earliest` and `latest` timestamps, and return the same errors that the real worker returns if actions cannot be executed in time.
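This check is the same for every action type, so it can be factored out. A minimal sketch, with illustrative names (the real worker reports richer error information than a three-way result):

```cpp
#include <cstdint>

// Illustrative outcome codes for the timing check.
enum class Timing { TooEarly, Execute, TooLate };

// Sketch of the earliest/latest admission check every dummy action
// performs (timestamps in nanoseconds). An action may not begin before
// `earliest`, and must fail with an error if `latest` has passed.
Timing check_timing(uint64_t now, uint64_t earliest, uint64_t latest) {
  if (now > latest) return Timing::TooLate;    // error, as on the real worker
  if (now < earliest) return Timing::TooEarly; // hold until earliest
  return Timing::Execute;                      // eligible to execute now
}
```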
### Page Cache

Each emulated GPU should have its own page cache. The amount of memory/pages in the page cache should be configurable using the same `ClockworkConfig` that the real worker uses. `loadWeights` and `evictWeights` actions should update the page cache as appropriate. The same logic from the real worker applies here: `loadWeights` should fail if there are insufficient free pages; `evictWeights` should fail if no weights are present; `infer` should fail if weights are not present when it begins executing; and `infer` should fail if the weights were modified during execution (e.g. if they were evicted while executing).
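A minimal per-GPU page cache sketch covering those failure cases. The real worker's page cache also tracks LRU state and page handles; here we only count free pages, and version each model's weights so that an eviction during execution can be detected by `infer`. All names are illustrative:

```cpp
#include <cstdint>
#include <map>
#include <string>

class DummyPageCache {
 public:
  explicit DummyPageCache(unsigned total_pages) : free_pages_(total_pages) {}

  // loadWeights: fails if there are insufficient free pages.
  bool load(const std::string& model, unsigned pages) {
    if (pages > free_pages_) return false;
    free_pages_ -= pages;
    resident_[model] = pages;
    ++version_[model];  // new weights instance
    return true;
  }

  // evictWeights: fails if no weights are present.
  bool evict(const std::string& model) {
    auto it = resident_.find(model);
    if (it == resident_.end()) return false;
    free_pages_ += it->second;
    resident_.erase(it);
    ++version_[model];  // invalidates in-flight infers
    return true;
  }

  bool present(const std::string& model) const {
    return resident_.count(model) != 0;
  }

  // infer records the version when it starts and re-checks on completion;
  // a mismatch means the weights were modified (evicted) mid-execution.
  uint64_t version(const std::string& model) { return version_[model]; }

 private:
  unsigned free_pages_;
  std::map<std::string, unsigned> resident_;
  std::map<std::string, uint64_t> version_;
};
```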
### Executor

Each executor should follow the same behavior as the real worker: dequeue actions one at a time based on the `earliest` timestamp, and discard actions and return an error if `latest` has passed. Unlike the real worker, each executor shouldn't run as a separate thread; a thread per executor would be overkill, especially if we want to be able to simulate dozens of GPUs per worker. Instead, I would recommend implementing the executors in the style of an event simulator, so that multiple executors can be emulated in a single thread. Clockwork's workload generator engine is implemented in this style.
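The event-simulator style might look like the following sketch: every executor (for every emulated GPU) schedules its next wake-up into one shared priority queue, and a single thread drains events in timestamp order. This is an illustration of the pattern, not Clockwork's actual engine:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// One timed callback; many executors share a single queue of these.
struct Event {
  uint64_t at;                       // nanosecond timestamp of the event
  std::function<void(uint64_t)> fn;  // callback, receives the event time
  bool operator>(const Event& o) const { return at > o.at; }
};

class EventQueue {
 public:
  void schedule(uint64_t at, std::function<void(uint64_t)> fn) {
    events_.push(Event{at, std::move(fn)});
  }
  // Drain all events up to and including `until`, in time order,
  // regardless of which executor scheduled them.
  void run_until(uint64_t until) {
    while (!events_.empty() && events_.top().at <= until) {
      Event e = events_.top();
      events_.pop();
      e.fn(e.at);
    }
  }
 private:
  // Min-heap ordered by timestamp.
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
};
```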
### loadWeights

For each GPU, there should be a dedicated executor for its `loadWeights` actions. For each specified model, `loadWeights` should execute for the duration specified in the model's performance profile. If, upon completion, the model's weights page allocation was already evicted, then the action should return an error.
### evictWeights

Like `loadWeights`, there should be a dedicated executor for `evictWeights` actions, separate from the `loadWeights` executor. `evictWeights` should only fail because of bad `earliest` and `latest` values, or because weights weren't present at all. It should be possible to evict weights even if the corresponding `loadWeights` is still executing.
### infer

Like `loadWeights`, each emulated GPU should have a dedicated executor for `infer` actions. The only difference is that `infer` actions can specify different batch sizes, so the appropriate value should be used from the model's performance profile. The dummy worker should support padded batches (e.g. if only a batch size of 8 is supported, but the worker receives 7, then it should execute as 8).
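Padded-batch selection can be sketched as a lookup over the batch sizes present in the model's performance profile, choosing the smallest supported size that fits the request (the profile map type matches the earlier hypothetical `DummyModel` sketch):

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Given the batch sizes in a model's performance profile (batch -> infer
// duration), an infer for batch b executes as the smallest supported
// batch size >= b (e.g. 7 pads up to 8).
unsigned padded_batch(const std::map<unsigned, uint64_t>& profile,
                      unsigned b) {
  auto it = profile.lower_bound(b);  // first supported size >= b
  if (it == profile.end())
    throw std::runtime_error("batch size too large for model");
  return it->first;
}
```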
If the input size of the `infer` action is zero, then the `InferResult` output should also be size zero. If the input size is non-zero, then the dummy worker will need to allocate an output to send with the `InferResult` (assuming the action succeeds).
## Telemetry
The real worker records some telemetry to files for actions and for tasks. The initial dummy worker implementation can omit telemetry; we can revisit this decision in future if it turns out to be necessary.
## Regarding CUDA calls

The dummy worker is allowed to depend on CUDA and make CUDA calls, since CUDA will work even on machines that have no GPUs. However, the dummy worker should not make use of any CUDA devices, so that it can run on machines without GPUs.