Post-OSDI Tidyup
The following should be addressed ASAP:
- Reorganize code, especially the workloads, the experiments used for the Clockwork submission, and the schedulers.
- A few compiler warnings surfaced and should be addressed
- README instructions on installing and running clockwork are out of date and missing a couple of steps
- Start using cpplint and improve code formatting in general
- Support graceful controller shutdown/restart and reuse of loaded models
- Fix broken tests
Important medium-term items include:
- #49 Handling of the network in the scheduler. Currently this is done implicitly through careful setting of parameters, but it should be easy for us to incorporate the network in a more principled way. At the least, resolve the lingering bug here.
- A livelock bug remains in the Infer4 scheduler and needs to be addressed. It tends to manifest only at scale, with 20 or more worker GPUs.
- Convert the infer4 scheduler from an eager loop to an event-simulator-style event loop; this should drastically reduce inter-thread contention. It should also let the scheduler share one core across multiple workers.
- Factor out reusable scheduler components
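
To make the event-loop item above more concrete, here is a minimal sketch of what an event-simulator-style loop could look like. It is not Clockwork's code: `EventLoop`, `schedule`, `run`, and `run_demo` are hypothetical names, and the loop advances a virtual clock rather than waiting on real time as the actual scheduler would need to. The point is only that one thread pops timestamped events in order, so several workers can share it without spinning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; none of these names come from the Clockwork codebase.
class EventLoop {
 public:
  using Callback = std::function<void()>;

  // Schedule `cb` to fire at absolute virtual time `at`.
  void schedule(uint64_t at, Callback cb) {
    assert(at >= now_ && "cannot schedule an event in the past");
    queue_.push(Event{at, next_seq_++, std::move(cb)});
  }

  uint64_t now() const { return now_; }

  // Process events in timestamp order until none remain.
  // Returns the number of events processed.
  std::size_t run() {
    std::size_t processed = 0;
    while (!queue_.empty()) {
      Event ev = queue_.top();  // copy first: the callback may push new events
      queue_.pop();
      now_ = ev.at;             // jump straight to the event time, no polling
      ev.cb();
      ++processed;
    }
    return processed;
  }

 private:
  struct Event {
    uint64_t at;   // when the event fires
    uint64_t seq;  // insertion order, used to break timestamp ties
    Callback cb;
  };
  struct Later {
    bool operator()(const Event& a, const Event& b) const {
      if (a.at != b.at) return a.at > b.at;
      return a.seq > b.seq;
    }
  };
  std::priority_queue<Event, std::vector<Event>, Later> queue_;
  uint64_t now_ = 0;
  uint64_t next_seq_ = 0;
};

// Two hypothetical workers ("w0", "w1") share a single loop and thread.
std::vector<std::string> run_demo() {
  EventLoop loop;
  std::vector<std::string> log;
  loop.schedule(30, [&] {
    log.push_back("w0@" + std::to_string(loop.now()));
    // A handler can schedule follow-up work for the same worker.
    loop.schedule(50, [&] { log.push_back("w0@" + std::to_string(loop.now())); });
  });
  loop.schedule(10, [&] { log.push_back("w1@" + std::to_string(loop.now())); });
  loop.schedule(40, [&] { log.push_back("w1@" + std::to_string(loop.now())); });
  loop.run();
  return log;
}
```

In this sketch the events of both workers interleave purely by timestamp, and there is no shared state between threads to contend on. A real conversion would also have to block until the next event's wall-clock time and accept events arriving from the network, which this sketch leaves out.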