Implement a generic controller that handles loading workers' current state, loading new models, and profiling models (!7) · Merge requests · cld / ml / clockwork

Merged Jonathan Mace requested to merge mace-wip-controller into master 4 years ago

This MR introduces a base controller implementation that handles model loading and setup before handing control over to the desired scheduler.

Scheduler

The Scheduler interface is a simpler version of controller logic. It only has to deal with infer requests, and it doesn't have to set up workers.

The scheduler has a method called start, which is handed the worker connections (which it uses to send actions to workers) as well as a ClockworkState object that captures the current state of all workers.

The scheduler can assume that, once it begins, no further model loading will occur. It only needs to deal with infer requests from clients, sending actions to workers, and receiving results back from workers.
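To make the division of responsibilities concrete, here is a minimal sketch of what such an interface could look like. Only the start method, the worker connections, and ClockworkState are named in this MR; every other name below (InferRequest, WorkerResult, sendInfer, clientInfer, resultFromWorker, RoundRobinScheduler, RecordingWorker) is a hypothetical stand-in, and the types are simplified placeholders rather than the real Clockwork definitions.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical placeholder types; the real Clockwork types are richer.
struct ClockworkState { int worker_count = 0; };
struct InferRequest { int model_id = 0; };
struct WorkerResult { int model_id = 0; bool success = true; };

// Assumed shape of a worker connection: something the scheduler can send actions to.
class WorkerConnection {
public:
    virtual ~WorkerConnection() = default;
    virtual void sendInfer(int model_id) = 0;
};

// Assumed shape of the scheduler interface: start, plus the two event streams
// described above (infer requests from clients, results from workers).
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual void start(std::vector<WorkerConnection*> workers,
                       const ClockworkState& state) = 0;
    virtual void clientInfer(const InferRequest& request) = 0;
    virtual void resultFromWorker(const WorkerResult& result) = 0;
};

// Toy scheduler that forwards each infer request to the next worker in turn.
class RoundRobinScheduler : public Scheduler {
public:
    void start(std::vector<WorkerConnection*> workers,
               const ClockworkState& state) override {
        workers_ = std::move(workers);
        (void)state;  // a real scheduler would consult loaded models and profiles
    }
    void clientInfer(const InferRequest& request) override {
        if (workers_.empty()) return;
        workers_[next_]->sendInfer(request.model_id);
        next_ = (next_ + 1) % workers_.size();
    }
    void resultFromWorker(const WorkerResult& result) override {
        if (result.success) ++completed_;
    }
    std::size_t completed() const { return completed_; }

private:
    std::vector<WorkerConnection*> workers_;
    std::size_t next_ = 0;
    std::size_t completed_ = 0;
};

// Test double that records which models it was asked to run.
class RecordingWorker : public WorkerConnection {
public:
    void sendInfer(int model_id) override { model_ids.push_back(model_id); }
    std::vector<int> model_ids;
};
```

Note that nothing in this sketch loads models: by the time start is called, the set of loaded models is assumed to be fixed.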

ClockworkState

ClockworkState includes all of the workers, their GPUs, the size of their memory, all loaded models, and the profiled execution and loading times.

Currently, profiling isn't implemented, so the exec and load times will show up as 0. This will be implemented soon.
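As a rough illustration of the kind of data involved, the state could be laid out along these lines. This is a sketch under assumptions, not the actual ClockworkState definition: all struct and field names are hypothetical, as is the choice to key loaded models by an integer id per worker and to store times in nanoseconds.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical per-model measurements; both stay 0 until profiling exists.
struct ModelProfile {
    std::uint64_t exec_time_ns = 0;
    std::uint64_t load_time_ns = 0;
};

// Hypothetical per-GPU record.
struct GPUState {
    unsigned id = 0;
    std::size_t memory_bytes = 0;
};

// Hypothetical per-worker record: its GPUs and the models it has loaded.
struct WorkerState {
    unsigned id = 0;
    std::vector<GPUState> gpus;
    std::map<int, ModelProfile> loaded_models;  // keyed by an assumed model id
};

// Hypothetical top-level snapshot handed to the scheduler.
struct ClockworkStateSketch {
    std::vector<WorkerState> workers;
};

// A profile counts as measured only once both times are non-zero.
inline bool isProfiled(const ModelProfile& p) {
    return p.exec_time_ns > 0 && p.load_time_ns > 0;
}

// Sum of GPU memory across every worker in the snapshot.
inline std::size_t totalGpuMemoryBytes(const ClockworkStateSketch& state) {
    std::size_t total = 0;
    for (const WorkerState& w : state.workers) {
        for (const GPUState& g : w.gpus) {
            total += g.memory_bytes;
        }
    }
    return total;
}
```

Until profiling lands, a scheduler written against a structure like this would need to treat zero-valued times as "unknown" rather than "instant".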

ControllerWithStartupPhase

The ControllerWithStartupPhase class is in src/clockwork/controller/controller.h.

This is the underlying controller implementation. To use it, pass it an instance of a scheduler that you want it to transition to once startup is complete.

The logic for the startup phase is as follows:

First, it queries workers for their currently loaded models.
It will then wait for LoadModelFromDisk requests from clients.
After 10 seconds of inactivity (i.e., once it stops receiving any new LoadModelFromDisk requests), the controller will start profiling the loaded models.
After profiling has completed, the controller will activate the scheduler, and forward all infer requests and worker results to it.
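The four steps above can be read as a small state machine. The sketch below is one way to express it; the Phase and Event names and the nextPhase and rejectsLoadModel helpers are hypothetical and are not taken from controller.h.

```cpp
// Hypothetical phases, one per step of the startup logic.
enum class Phase { QueryingWorkers, AwaitingLoads, Profiling, Scheduling };

// Hypothetical events that move the controller from one phase to the next.
enum class Event { WorkerStateReceived, InactivityTimeout, ProfilingComplete };

// Returns the phase that follows `phase` when `event` occurs.
// Events that do not apply to the current phase leave it unchanged.
inline Phase nextPhase(Phase phase, Event event) {
    switch (phase) {
        case Phase::QueryingWorkers:
            return event == Event::WorkerStateReceived ? Phase::AwaitingLoads : phase;
        case Phase::AwaitingLoads:
            return event == Event::InactivityTimeout ? Phase::Profiling : phase;
        case Phase::Profiling:
            return event == Event::ProfilingComplete ? Phase::Scheduling : phase;
        case Phase::Scheduling:
            return phase;  // terminal: the scheduler now owns request handling
    }
    return phase;
}

// Once the scheduler is active, LoadModelFromDisk is rejected.
inline bool rejectsLoadModel(Phase phase) {
    return phase == Phase::Scheduling;
}
```

The transitions only ever move forward, which matches the description: there is no path from the scheduling phase back to model loading.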

Subtle points:

The 10-second countdown only begins after at least one LoadModelFromDisk has been received. The countdown will reset if any new requests are received. It also doesn't start counting down until all pending LoadModel requests have completed.
The profiling stage isn't currently implemented, and all measurements are initialized to 0. This will come soon.
You can send infer requests during the startup phase; they will simply buffer on the controller, and time out after 10 seconds. They don't time out immediately, to prevent spamming.
After the scheduler begins, any LoadModelFromDisk commands will be rejected.
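The countdown rules in the first point are easy to get subtly wrong, so here is a minimal sketch of one interpretation. The StartupCountdown class and its method names are hypothetical. It assumes the countdown restarts from the moment the last pending load completes, and it takes the current time as a parameter so the logic can be exercised without waiting.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper tracking when the startup phase may move on to profiling.
class StartupCountdown {
public:
    using Clock = std::chrono::steady_clock;

    explicit StartupCountdown(std::chrono::seconds timeout = std::chrono::seconds(10))
        : timeout_(timeout) {}

    // A new LoadModelFromDisk request arrived: arm the countdown and reset it.
    void onLoadRequest(Clock::time_point now) {
        seen_request_ = true;
        ++pending_;
        last_activity_ = now;
    }

    // A pending load finished: counting starts from here once none remain.
    void onLoadCompleted(Clock::time_point now) {
        if (pending_ > 0) --pending_;
        last_activity_ = now;
    }

    // True only if at least one request was seen, nothing is pending,
    // and the full timeout has elapsed since the last activity.
    bool expired(Clock::time_point now) const {
        return seen_request_ && pending_ == 0 && (now - last_activity_) >= timeout_;
    }

private:
    std::chrono::seconds timeout_;
    bool seen_request_ = false;
    std::size_t pending_ = 0;
    Clock::time_point last_activity_{};
};
```

With this shape, a controller that never receives a LoadModelFromDisk request never leaves the startup phase, which is the behaviour the first subtle point describes.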

This MR includes a few bugfixes and updates here and there but shouldn't break anything.

Edited 4 years ago by Jonathan Mace