cuModuleLoad - experimentation and implementation
cuModuleLoad acts as a global barrier for CUDA calls.
All CUDA calls submitted before cuModuleLoad must complete before cuModuleLoad will execute.
Any CUDA calls submitted after cuModuleLoad must wait for cuModuleLoad to complete before they will execute.
Consider the following sequence:
- A large memcpy (10 ms) is asynchronously submitted
- Immediately thereafter, cuModuleLoad is initiated (0.5 ms)
- Immediately thereafter, a GPU execution (2 ms) is asynchronously submitted
In this example, the GPU execution cannot run until cuModuleLoad has completed, which in turn cannot run until the memcpy completes. So what would normally be a 2 ms execution now completes 12.5 ms after submission.
In an abstract sense, incorporating cuModuleLoad into the central controller is straightforward -- it is a task that simultaneously requires exclusive access to all resources.
We need to experiment with this, and also check whether any worker-side mechanism is needed to enforce it (e.g. a RW-lock shared by all executors).