Telemetry logging on workers
Telemetry definitions are in include/clockwork/telemetry.h. These can be changed -- for example, it may be worth adding a model_id to TaskTelemetry. If you add fields to any of the structs, also add a line such as PODS_MDR(executor_id) so that the new field is included when the telemetry is serialized.
telemetry.h defines TaskTelemetry, ExecutorTelemetry and RequestTelemetry. ExecutorTelemetry was added mostly as a placeholder and it's not actually used (feel free to use, modify, or delete this class). RequestTelemetry is only used by the clockwork alternatives, but not by clockwork itself. TaskTelemetry is used by both clockwork and the clockwork alternatives.
TaskTelemetry measurements are divided into two types.
- The chrono::time_point measurements are all host-side measurements. The main ones are enqueued, dequeued, exec_complete, and async_complete. The exec_complete timestamp is taken once all CUDA work for the task has been submitted, but not necessarily completed. The async_complete timestamp is taken after all CUDA work has completed.
- The async_wait and async_duration floats are CUDA-side measurements taken with CUDA events. Only async_duration is really used.
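To make the host-side timestamps concrete, here is a minimal sketch of how the intervals between them could be derived. TaskTelemetrySketch, HostIntervalsMs, to_ms and host_intervals are hypothetical names for illustration only; the real TaskTelemetry in telemetry.h has more fields, and the choice of std::chrono::steady_clock is an assumption.

```cpp
#include <cassert>
#include <chrono>

// Assumption: the real struct may use a different clock type.
using Clock = std::chrono::steady_clock;

// Hypothetical, simplified stand-in for TaskTelemetry.
struct TaskTelemetrySketch {
    Clock::time_point enqueued;        // task placed on an executor queue
    Clock::time_point dequeued;        // task picked up by an executor
    Clock::time_point exec_complete;   // all CUDA work submitted
    Clock::time_point async_complete;  // all CUDA work completed
    float async_wait = 0.0f;           // CUDA-side, from CUDA events
    float async_duration = 0.0f;       // CUDA-side, from CUDA events
};

// Host-side intervals, in milliseconds.
struct HostIntervalsMs {
    double queue;           // enqueued -> dequeued
    double submit;          // dequeued -> exec_complete
    double completion_lag;  // exec_complete -> async_complete
};

inline double to_ms(Clock::duration d) {
    return std::chrono::duration<double, std::milli>(d).count();
}

inline HostIntervalsMs host_intervals(const TaskTelemetrySketch& t) {
    // The four timestamps are expected to be recorded in this order.
    assert(t.enqueued <= t.dequeued);
    assert(t.dequeued <= t.exec_complete);
    assert(t.exec_complete <= t.async_complete);
    return {to_ms(t.dequeued - t.enqueued),
            to_ms(t.exec_complete - t.dequeued),
            to_ms(t.async_complete - t.exec_complete)};
}
```

Note that completion_lag is a host-side view of how long the submitted CUDA work took to finish; the CUDA-side async_duration is measured separately and the two need not match.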
Currently, every Task has telemetry (task.h). Telemetry values are recorded in various places during task execution, such as by executors (runtime.cpp) and by tasks that do asynchronous work (task.cpp).
Clockwork currently DOESN'T dump task telemetry to file -- that is what this issue is for.
Clockwork currently DOES report some telemetry values back to the controller. This is done in action.cpp on a per-action basis (e.g. for the InferAction). This is probably also the appropriate place to report the task telemetry to be logged to disk.
The clockwork alternatives had a simple telemetry logger and this can form the basis for clockwork's telemetry logger (see src/clockwork/telemetry_logger.h). The clockwork alternatives used RequestTelemetry but we probably want to just use TaskTelemetry instead (which might mean having to add a model_id or something to TaskTelemetry).
Summary of suggested modifications:
- Add a few fields to TaskTelemetry, such as model_id and maybe action_id
- Move the existing telemetry_logger.h to src/clockwork/alternatives
- Create a new telemetry_logger.h that logs TaskTelemetry instead of RequestTelemetry
- Modify ClockworkRuntime (src/clockwork/runtime.h) to contain an instance of the telemetry_logger; give it a suitable hard-coded output filename
- Modify action.cpp so that when actions complete, they queue their TaskTelemetry to the telemetry logger
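
As a rough illustration of the last three items, the sketch below shows one possible shape for the new logger: completed actions queue a record, and a background thread writes the records to a file. It is not the existing telemetry_logger.h. TaskRecord and TaskTelemetryLoggerSketch are hypothetical names, the record carries only a few of the fields discussed above, and the plain CSV output is an assumption made to keep the example self-contained (the real logger would presumably serialize the telemetry structs instead).

```cpp
#include <cassert>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical record with the proposed model_id and action_id fields.
struct TaskRecord {
    int model_id;
    int action_id;
    float async_duration;
};

class TaskTelemetryLoggerSketch {
public:
    explicit TaskTelemetryLoggerSketch(const std::string& path)
        : out_(path), writer_([this] { run(); }) {}

    ~TaskTelemetryLoggerSketch() { shutdown(); }

    // Called when an action completes; does no file I/O on the caller's thread.
    void log(const TaskRecord& r) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!done_ && "log() called after shutdown()");
            pending_.push(r);
        }
        cv_.notify_one();
    }

    // Drains queued records, flushes the file, and stops the writer thread.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        if (writer_.joinable()) writer_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (true) {
            cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
            while (!pending_.empty()) {
                TaskRecord r = pending_.front();
                pending_.pop();
                lock.unlock();  // do not hold the lock while writing
                out_ << r.model_id << ',' << r.action_id << ','
                     << r.async_duration << '\n';
                lock.lock();
            }
            if (done_) break;
        }
        out_.flush();
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<TaskRecord> pending_;
    bool done_ = false;
    std::thread writer_;  // declared last so every other member exists before it starts
};
```

In this sketch, ClockworkRuntime would own a single logger instance constructed with the hard-coded output filename, and each action would call log() once its task telemetry is final.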