Post-Startup Model Loading
At the time of writing, Clockwork pre-loads models during the startup phase.
This is not fundamental. There are two additional circumstances we would ideally support:
- Loading a model from disk at a later point, during inference execution
- Loading a model by transferring it over the network from a different worker (see #31)
Doing this has several implications and considerations:
- Clockwork does not currently manage host-side weights memory robustly. We really should treat it like a PageCache (see #12), so that weights can be evicted when a worker's host memory fills up
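To make the PageCache-style treatment concrete, here is a hedged sketch of LRU eviction over host-side weights. All names (`WeightsCache`, `access`, etc.) are hypothetical illustrations, not Clockwork's actual API; a real worker would also need pinning for models that are mid-transfer to the GPU.

```python
from collections import OrderedDict

class WeightsCache:
    """Hypothetical LRU cache over host-side model weights (sizes in bytes)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # model_id -> size_bytes, oldest first

    def access(self, model_id, size_bytes):
        """Touch a model's weights; returns True on a hit, False on a miss.

        On a miss, least-recently-used models are evicted until the new
        weights fit -- this is the point where a real worker would read
        the weights from disk or fetch them over the network.
        """
        if model_id in self.entries:
            self.entries.move_to_end(model_id)  # mark most recently used
            return True
        while self.used + size_bytes > self.capacity and self.entries:
            _, freed = self.entries.popitem(last=False)
            self.used -= freed
        self.entries[model_id] = size_bytes
        self.used += size_bytes
        return False
```

The key design point is that eviction is driven by capacity pressure at access time, so a worker that "gets full" sheds cold models automatically rather than failing the load.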
- The Clockwork controller will need to account for the network as a resource in order to schedule network model transfers
- Interference between network model loads and network input/output transfers will also need to be considered
- Loading CUDA modules can interfere with ongoing kernel execution and memory transfers. In particular, loading code seems to require a global barrier on the GPU. If this is fundamental, then the controller should account for it. Further profiling is required to measure the extent of this interference and how predictable its duration is
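If the global-barrier behavior turns out to be fundamental, one way the controller could account for it is a simple timing model: a module load cannot begin until all in-flight GPU work has drained. The sketch below is illustrative only; the barrier assumption itself, and the profiled `load_duration`, are exactly what the profiling above would need to confirm.

```python
def estimate_load_completion(now, inflight_end_times, load_duration):
    """Estimate when a CUDA module load would finish, assuming loading
    code imposes a global barrier on the GPU.

    now                -- current time (seconds)
    inflight_end_times -- predicted end times of ongoing kernels/transfers
    load_duration      -- profiled duration of the module load itself

    The load starts only once the GPU has drained (the max of the current
    time and every in-flight end time), then runs for load_duration.
    """
    start = max([now] + list(inflight_end_times))
    return start + load_duration
```

Under this model, any work scheduled behind the load is pushed back by the drain time plus the load itself, which is the cost the controller would weigh against loading on a different worker.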