Network IO handling -- controller-side
Transferring large input batches over the network is not instantaneous: a batch-size-16 transfer can take 8 ms on our 10G network.
The controller does not currently account for this delay, instead relying on a combination of generous latest timestamps and a sufficiently high schedule_ahead.
This approach has two flaws. First, the controller could do a much better job if it properly took these delays into account, and could then schedule with a much lower schedule_ahead. Second, because the controller doesn't account for transfer delays, it can run into easily avoidable situations where inputs don't arrive on a worker in time for the action to execute.
It should be straightforward for the controller to better account for network transfer delays, since they are predictable. Perhaps network transfers could be modeled as an executor in their own right.
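To illustrate why these delays are predictable, here is a minimal sketch of a transfer-time estimate, assuming a simple linear model (fixed overhead plus bytes divided by link bandwidth). The names `LinkModel`, `latest_send_time_s`, and `can_arrive_in_time` are hypothetical and not part of the existing controller, and the per-input size (a 224x224x3 float32 tensor) is an illustrative assumption chosen only because it gives a figure close to the 8 ms quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkModel:
    """Hypothetical linear model of a network link (not the controller's API)."""
    bandwidth_bits_per_s: float = 10e9  # 10G network, as stated in this issue
    fixed_overhead_s: float = 0.0       # assumed per-transfer overhead; would need measuring

    def transfer_time_s(self, num_bytes: int) -> float:
        # Time on the wire = fixed overhead + payload bits / bandwidth.
        return self.fixed_overhead_s + (num_bytes * 8) / self.bandwidth_bits_per_s


def latest_send_time_s(exec_start_s: float, num_bytes: int, link: LinkModel) -> float:
    """Latest moment the transfer can start and still finish by exec_start_s."""
    return exec_start_s - link.transfer_time_s(num_bytes)


def can_arrive_in_time(now_s: float, exec_start_s: float, num_bytes: int, link: LinkModel) -> bool:
    """True if a transfer started now would complete before the action must execute."""
    return now_s <= latest_send_time_s(exec_start_s, num_bytes, link)


# Assumed input size: one 224x224x3 float32 tensor per request (illustrative only).
INPUT_BYTES = 224 * 224 * 3 * 4
BATCH_BYTES = 16 * INPUT_BYTES

link = LinkModel()
batch_transfer_s = link.transfer_time_s(BATCH_BYTES)  # roughly 7.7 ms under these assumptions
feasible = can_arrive_in_time(0.0, 0.010, BATCH_BYTES, link)    # 10 ms of headroom: feasible
infeasible = can_arrive_in_time(0.0, 0.005, BATCH_BYTES, link)  # 5 ms of headroom: too late
```

With an estimate like this, the controller could reject or reschedule an action whose inputs cannot arrive in time, rather than padding schedule_ahead to cover the worst case. A real implementation would likely also need to model contention between concurrent transfers to the same worker, which this sketch ignores.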
This issue includes the following steps:
- Short term: ensure the current controller is more robust to these network delays
- Medium term: come up with a general design for handling network IO