Clockwork
A multi-tenant managed inference server, backed by a modified version of TVM.
This README file describes the pre-requisites and steps required to build and run Clockwork. If you follow these steps but encounter errors, please e-mail the mailing list.
Clockwork is not feature complete, but we welcome contributions from others!
Mailing List: clockwork-users@googlegroups.com
Step 1: Pre-Requisites
1. NVIDIA Driver and CUDA
Make sure the NVIDIA driver and CUDA are installed and that CUDA is on your PATH. MPI cluster machines have CUDA 9 installed by default; however, Clockwork may also work with other CUDA versions. You can check whether CUDA is installed, and which version, by running nvcc --version.
Google Cloud VMs: We have tested the instructions below using Operating System "Deep Learning on Linux", Version "Deep Learning Image: Base m55 (with CUDA 10.0)". This image comes with CUDA pre-installed. If you use this image, remove conda from your PATH variable; otherwise you will see protobuf version mismatch errors.
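One way to drop conda from PATH is to filter out every PATH entry that mentions it. The sketch below is only an illustration: the helper name strip_conda_from_path is hypothetical, and it assumes that any entry containing "conda" (case-insensitive) belongs to a conda installation.

```shell
# Hypothetical helper: print a PATH-style string with every entry
# containing "conda" removed. Assumes such entries belong to conda.
strip_conda_from_path() {
  printf '%s' "$1" | tr ':' '\n' | grep -v -i 'conda' | paste -s -d ':' -
}

# Preview the result before applying it:
#   strip_conda_from_path "$PATH"
# Apply it to the current shell:
#   export PATH="$(strip_conda_from_path "$PATH")"
```

To make the change permanent, the export line would also need to go into your shell startup file, after whatever line adds conda to PATH.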
2. Required Packages
The following apt packages are pre-requisites:
apt install libtbb-dev libasio-dev libconfig++-dev libboost-all-dev g++-8 \
make cmake automake autoconf libtool curl unzip clang llvm
3. Installing Protobuf
git clone --recursive -b v3.12.0 https://github.com/protocolbuffers/protobuf.git
cd protobuf
./autogen.sh && ./configure
make -j $(nproc)
make install
/sbin/ldconfig
cd ..
4. Installing TVM
Clone our modified TVM and check out our modified branch (clockwork-v0.6):
git clone --recursive -b clockwork-v0.6 https://gitlab.mpi-sws.org/cld/ml/tvm
Build TVM
cd tvm/build
cmake ..
make -j $(nproc)
cd ..
Set the TVM_HOME environment variable and add $TVM_HOME/build to your LD_LIBRARY_PATH and DYLD_LIBRARY_PATH environment variables
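# Run from the tvm directory. TVM_HOME is expanded now, so ~/.bashrc records the absolute path.
echo "export TVM_HOME=$(pwd)" >> ~/.bashrc
# Single quotes defer expansion until ~/.bashrc is sourced, when TVM_HOME is already set.
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TVM_HOME/build' >> ~/.bashrc
echo 'export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:$TVM_HOME/build' >> ~/.bashrc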
source ~/.bashrc
Step 2: Building Clockwork
git clone --recursive https://gitlab.mpi-sws.org/cld/ml/clockwork.git
cd clockwork
mkdir build
cd build
cmake ..
make -j $(nproc)
Step 3: Environment Configuration
Clockwork is a high-performance system that depends upon predictability. There are various tweaks to your environment that will make executions more predictable. These environment modifications should be made for Clockwork's worker, controller, and client processes. Some are optional but recommended.
Check your environment
You can check your environment by running Clockwork's:
./profile [check]
1. Increase resource limits (memlock, nofile, rtprio)
Limits on the number of open files and on the amount of page-locked memory reduce the total number of DNNs Clockwork can keep in memory at any point in time. A limit of 1024 is too low; a limit of 16k or higher is acceptable.
Limits can be checked with the ulimit command (ulimit -aH lists hard limits, ulimit -a lists current limits).
Increase RLIMIT_NOFILE (number of open files) and RLIMIT_MEMLOCK (amount of page-locked memory) to unlimited:
- Open /etc/security/limits.conf
- Add the following lines:
* hard memlock unlimited
* soft memlock unlimited
* hard nofile unlimited
* soft nofile unlimited
* hard rtprio unlimited
* soft rtprio unlimited
Note: for MPI cluster machines with the default Debian distribution, you will also need to modify /etc/security/limits.d/mpiprefs.conf
- Modify /etc/systemd/user.conf and /etc/systemd/system.conf to add:
DefaultLimitNOFILE=1048576
- Restart to take effect
- Upon restarting, use Clockwork's ./profile [check] to check if the settings took effect
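If you want a quick look at the limits before running ./profile [check], a small script along the following lines may help. It is a sketch only, not what Clockwork's own check does: the helper name limit_ok is hypothetical, and the 16384 threshold is an assumption taken from the "16k or higher" guidance above.

```shell
# Hypothetical helper: succeeds if a limit is "unlimited" or at least
# the given minimum. $1 = current value, $2 = minimum acceptable value.
limit_ok() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

nofile=$(ulimit -n)    # current limit on open files
memlock=$(ulimit -l)   # current limit on page-locked memory

# Assumption: 16384 stands in for the "16k or higher" guidance.
if limit_ok "$nofile" 16384; then
  echo "nofile=$nofile (ok)"
else
  echo "nofile=$nofile (too low, raise it in /etc/security/limits.conf)"
fi

# The steps above set memlock to unlimited, so report anything else.
if [ "$memlock" = "unlimited" ]; then
  echo "memlock=unlimited (ok)"
else
  echo "memlock=$memlock (expected unlimited)"
fi
```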
2. Increase mmap limits
Clockwork uses a lot of shared objects, and we need to increase the mmap limit. As root, run
/sbin/sysctl -w vm.max_map_count=10000000
In general you can check mmap limits with:
/sbin/sysctl vm.max_map_count
This normally does not require a restart. You can check using Clockwork's ./profile [check].
3. GPU Settings
3.1 Disable CUDA JIT
Prevent CUDA from caching compiled kernels (note: the models used by Clockwork do not compile to PTX anyway, but if you choose to compile JITable models, this setting is important)
export CUDA_CACHE_DISABLE=1
3.2 Enable persistence mode
nvidia-smi -pm 1
NOTE: This must be done on every restart
3.3 Enable exclusive process mode
nvidia-smi -c 3
NOTE: This must be done on every restart
3.4 Optional: Disable auto boost
nvidia-smi --auto-boost-default=DISABLED
NOTE: This must be done on every restart
3.5 Optional: Configure GPU clocks
You can specify which clock frequencies to use. This does not override built-in temperature auto-scaling.
List available GPU clock frequencies
nvidia-smi -q -d SUPPORTED_CLOCKS
Pick a memory and graphics clock frequency (usually the highest), e.g. on volta machines:
Supported Clocks
Memory : 877 MHz
Graphics : 1380 MHz
Set the default application clock and system clock to those highest values, e.g. on volta machines:
nvidia-smi -ac 877,1380
nvidia-smi -lgc 1380
NOTE: This must be done on every restart
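Because the nvidia-smi settings above are lost on every restart, it can be convenient to keep them together in one place. The sketch below only prints the commands from this section so you can review them first; the function name gpu_setup_commands is hypothetical, and the clock values passed to it are the Volta example values, which you should replace with ones listed by nvidia-smi -q -d SUPPORTED_CLOCKS on your machine.

```shell
# Hypothetical helper: print the per-restart GPU setup commands from this
# section. $1 = memory clock in MHz, $2 = graphics clock in MHz.
gpu_setup_commands() {
  mem_clock="$1"
  gfx_clock="$2"
  echo "nvidia-smi -pm 1"                          # persistence mode
  echo "nvidia-smi -c 3"                           # exclusive process mode
  echo "nvidia-smi --auto-boost-default=DISABLED"  # optional: no auto boost
  echo "nvidia-smi -ac ${mem_clock},${gfx_clock}"  # optional: application clocks
  echo "nvidia-smi -lgc ${gfx_clock}"              # optional: lock graphics clock
}

# Example values for Volta machines, as listed above.
gpu_setup_commands 877 1380
```

Once the printed commands look right, they could be run as root, for example by piping the output to sh from a script that runs at boot.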
4. Optional: Disable CPU frequency autoscaling
Set the "performance" governor to prevent CPU clock scaling
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Step 4: Runtime Environment Variables
In addition to the environment setup above, Clockwork has several environment variables of its own.
Required: CLOCKWORK_MODEL_DIR
This is required by Clockwork's ./worker process.
CLOCKWORK_MODEL_DIR should point to a local directory containing compiled models.
Pre-compiled models used by Clockwork can be found at the clockwork-modelzoo-volta repository: https://gitlab.mpi-sws.org/cld/ml/clockwork-modelzoo-volta
The process of compiling models is not fully automated currently (please contribute!).
Recommended: CLOCKWORK_LOG_DIR
This is used by Clockwork's ./controller process.
CLOCKWORK_LOG_DIR should point to a directory where the controller can write its output request logs. Be aware that for long experiments, these files can grow to gigabytes in size.
If not specified or left blank, Clockwork's controller will write to /local. If this does not exist on your machine or is not writable, the controller will not output anything.
Please ensure the directory exists; Clockwork will not create the directory for you. Upon startup, the controller will print the location it is writing its request and action logs to.
Optional: AZURE_TRACE_DIR
This is required by Clockwork's ./client process if you are running the azure workload.
AZURE_TRACE_DIR should point to a local directory containing the AzureFunctionsDataset2019 from Microsoft Azure.
The original traces can be found by following the instructions on Microsoft's GitHub repository: https://github.com/Azure/AzurePublicDataset.
Alternatively, a repository containing the traces can be found here: https://gitlab.mpi-sws.org/cld/trace-datasets/azure-functions.
Optional: CLOCKWORK_DISABLE_INPUTS
This is used by Clockwork's ./client process.
For some experiments you will want to generate model inputs at the controller rather than sending them over the network.
Setting CLOCKWORK_DISABLE_INPUTS=1 will stop clients from sending inputs.
Optional: CLOCKWORK_CONFIG_FILE
This is used by Clockwork's ./worker process.
Clockwork has a configuration file located under config/default.cfg. You can specify your own configuration elsewhere and set its path using CLOCKWORK_CONFIG_FILE.
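As an illustration of how these variables might be set before starting the processes, here is a minimal sketch. Every path in it is hypothetical and should be replaced with locations on your own machines; the helper require_writable_dir is also hypothetical and exists only because Clockwork will not create the log directory for you.

```shell
# Hypothetical helper: succeeds only if $1 is an existing, writable directory.
require_writable_dir() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    return 0
  fi
  echo "missing or not writable: $1" >&2
  return 1
}

# Hypothetical locations; adjust for your machines.
export CLOCKWORK_MODEL_DIR="$HOME/clockwork-modelzoo-volta"    # worker
export CLOCKWORK_LOG_DIR="${TMPDIR:-/tmp}/clockwork-logs"      # controller

# The controller does not create its log directory, so create and verify it.
mkdir -p "$CLOCKWORK_LOG_DIR"
require_writable_dir "$CLOCKWORK_LOG_DIR"

# Optional: stop clients from sending inputs over the network.
# export CLOCKWORK_DISABLE_INPUTS=1
```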
Step 5: Check Environment
Check the environment is OK
From Clockwork's build directory, run:
./profile [check]
This will check your current environment settings.
Check models run
./check_model /home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model
The argument should be the path to a model, e.g. one from clockwork-modelzoo-volta as in the above example.
Testing without using GPUs
At least 3 machines are required in order to run Clockwork (1 worker, 1 controller, 1 client).
Clockwork can run without GPUs using an emulated worker.
The binaries used below exist in Clockwork's build directory.
1. Start one or more workers
./worker_dummy -n 2
Note: the -n 2 specifies it should simulate 2 GPUs. You can simulate many GPUs by increasing this number.
By default the worker will listen on port 12345. Run ./worker_dummy -h for more options.
2. Start the controller
./controller INFER4 volta01:12345
Here, volta01:12345 is the address of the worker started in step 1. INFER4 is the name of the default Clockwork scheduler.
By default, the controller will listen for client connections on port 12346. Run ./controller -h for more options.
3. Start a client
./client volta02:12346 simple
Here, volta02:12346 is the address of the controller started in step 2. simple is the name of a workload. Run ./client -h to list available workloads.
Summary
After running the above, you will see the following outputs:
Worker:
jcmace@volta01:~/clockwork/build$ ./worker_dummy -n 2
Starting Clockwork Worker
Loading Clockwork worker default config from /home/jcmace/clockwork/config/default.cfg
IO service thread listening on 0.0.0.0:12345
Received A0:GetWorkerState
Sending R0:GetWorkerState:
page_size=16777216
gpus=
GPU-0 weights_cache=21.5GB (1376 pages) io_pool=512.0MB workspace_pool=512.0MB 0 models currently on GPU
GPU-1 weights_cache=21.5GB (1376 pages) io_pool=512.0MB workspace_pool=512.0MB 0 models currently on GPU
models=
Clock Skew=0 RTT=0 LdWts=0 Inf=0 Evct=0 || Total Pending=0 Errors=0
Received A0:LoadModelFromDisk model=0 [0.0, 16847613635838.0] /home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model
Sending R0:LoadModelFromDisk input=602112 output=4000 weights=112.0 MB (7 pages) xfer=8.4 b1=3.3 b2=5.3 b4=7.5 b8=12.4 duration=14.9
Clock Skew=673306 RTT=53853 LdWts=4 Inf=2885 Evct=0 || Total Pending=2 Errors=0
Clock Skew=668080 RTT=55526 LdWts=0 Inf=4020 Evct=0 || Total Pending=2 Errors=0
Clock Skew=665855 RTT=54409 LdWts=0 Inf=3986 Evct=0 || Total Pending=1 Errors=0
Controller:
jcmace@volta02:~/clockwork/build$ ./controller INFER4 volta01:12345
Starting Clockwork Controller
Logging requests to /local/clockwork_request_log.tsv
Logging actions to /local/clockwork_action_log.tsv
ConcurrentInferAndLoadScheduler using:
default_slo=100000000
latest_delta=10000000
schedule_ahead=10000000
max_allowable_exec_time=250000000
max_batch_size=8
generate_inputs=0
max_gpus=100
IO service thread listening for clients on Connecting to worker volta01:12345
0.0.0.0:12346
Connection established
(Startup) Running ControllerStartup
(Startup-1) Bouncing LS and Infer requests until startup is complete
(Startup-2) Querying current worker state
Clockwork page_size=16777216
Worker 0 2 GPUs 0 models
GPU 0 21.5 GB (1376 pages) 0 loaded models
GPU 1 21.5 GB (1376 pages) 0 loaded models
(Startup-3) Awaiting LoadModel requests from clients
Client --> Req0:LoadModel path=/home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model
(Startup-4) LoadModelStage has begun
Worker <-- A0:LoadModelFromDisk model=0 [0.0, 16847613635839.0] /home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model
Worker --> R0:LoadModelFromDisk input=602112 output=4000 weights=112.0 MB (7 pages) xfer=8.4 b1=3.3 b2=5.3 b4=7.5 b8=12.4 duration=14.9
Client <-- Rsp0:LoadModel model_id=[0->1] input=602112 output=4000
Client --> Req1:LS
(Startup-6) LoadModelStage complete. Printing loaded models:
Clockwork page_size=16777216
Worker 0 2 GPUs 2 models
GPU 0 21.5 GB (1376 pages) 0 loaded models
GPU 1 21.5 GB (1376 pages) 0 loaded models
M-0 src=/home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model input=602112 output=4000 weights=112.0 MB (7 pages) xfer=8.4 b1=3.3 b2=5.3 b4=7.5 b8=12.4
M-1 src=/home/jcmace/clockwork-modelzoo-volta/resnet50_v2/model input=602112 output=4000 weights=112.0 MB (7 pages) xfer=8.4 b1=3.3 b2=5.3 b4=7.5 b8=12.4
(Startup-end) Transitioning to scheduler
Client <-- Rsp1:LS error 5: Controller initializing
Created 2 models
Created 2 GPUs on 1 Workers
Total GPU capacity 2752 pages (1376 per GPU).
Total model pages 14 (0% oversubscription).
* Admitting inference requests
GPU handler [ 0 ] started
GPU handler [ 1 ] started
Network Status: Client ✔✔✔✔✔ Controller ✔✔✔✔✔ Workers (inputs generated by client)
W0-GPU0 LoadW min=8.37 max=8.37 mean=8.37 e2emean=9.67 e2emax=9.67 throughput=0.0 utilization=0.00 clock=[0-0] norm_max=0.00 norm_mean=0.00
Client throughput=0.0 success=100.00% min=13.9 max=13.9 mean=13.9
Network->Workers: 31.4MB/s (361 msgs) snd, 1.4MB/s (361 msgs) rcv,
W0-GPU0 LoadW min=8.37 max=8.37 mean=8.37 e2emean=17.80 e2emax=17.80 throughput=0.1 utilization=0.00 clock=[0-0] norm_max=0.00 norm_mean=0.00
W0-GPU0 Infer min=3.32 max=3.32 mean=3.32 e2emean=4.47 e2emax=7.25 throughput=200.6 utilization=0.67 clock=[1380-1380] norm_max=3.32 norm_mean=3.32
Client throughput=400.8 success=100.00% min=3.6 max=21.8 mean=4.7
Network->Workers: 34.8MB/s (402 msgs) snd, 1.6MB/s (402 msgs) rcv,
W0-GPU1 LoadW min=8.37 max=8.37 mean=8.37 e2emean=13.73 e2emax=17.79 throughput=0.2 utilization=0.00 clock=[0-0] norm_max=0.00 norm_mean=0.00
W0-GPU1 Infer min=3.32 max=3.32 mean=3.32 e2emean=4.47 e2emax=7.13 throughput=200.1 utilization=0.66 clock=[1380-1380] norm_max=3.32 norm_mean=3.32
Client:
jcmace@volta03:~/clockwork/build$ ./client volta02:12346 simple
Running workload `simple` on volta02:12346
Client is sending inputs with requests. Set CLOCKWORK_DISABLE_INPUTS=1 to disable inputs.
Connecting to clockwork @ volta02:12346
Connection established
Found 61 models in /home/jcmace/clockwork-modelzoo-volta
Clockwork initializing, retrying Controller initializing
throughput=300.20 min=3.82 max=22.49 mean=4.97
throughput=401.83 min=3.82 max=7.74 mean=4.96
throughput=398.53 min=3.82 max=7.86 mean=5.00
Testing with GPUs
Repeat the above steps, but instead of running worker_dummy, run worker. To see the options available, run worker -h. By default, worker will use all available GPUs.
The outputs from each process should be the same.
Troubleshooting
Cannot find nvidia-ml
Currently, the CMakeLists assumes CUDA lives in either /usr/local/cuda/lib64 (the default location in Ubuntu 14.x) or /usr/lib/x86_64-linux-gnu/nvidia/current (the default location for MPI cluster machines). If you get build errors saying cannot find CUDA or cannot find nvidia-ml, then you'll need to update the include_directories and link_directories directives in the CMakeLists.txt with the CUDA location on your machine.
Undefined reference to tvm::runtime::ManagedCuda...
This probably means TVM was not built properly. Make sure you haven't modified or deleted the file build/config.cmake in the TVM repository, then run make clean and make to build TVM again.
Unable to set number of open files with ulimit
Default values are picked up from conf files, e.g. /etc/security/limits.conf, but they may be overridden by files in a subdirectory, e.g. /etc/security/limits.d/mpi.conf.
After restarting, make sure the correct ulimit values have been set by running ./profile [check].
Cannot apply memory protection
If you are loading lots of models, you might see the following:
src/clockwork/model/so.cpp:20: Check failed: lib_handle_ != nullptr: Failed to load SO /proc/26344/fd/14656/proc/26344/fd/14656: cannot apply additional memory protection after relocation: Cannot allocate memory
src/clockwork/model/so.cpp:20: Check failed: lib_handle_ != nullptr: Failed to load SO /proc/12386/fd/11804/proc/12386/fd/11804: failed to map segment from shared object
Make sure your mmap limits have been correctly set as described above. You can check by running ./profile [check].
CUDA: out of memory
Upon starting a worker process, you might see:
what(): [20:58:09] /home/jcmace/clockwork/src/clockwork/cache.cpp:237: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: out of memory
If this happens, make sure weights_cache_size is set to an appropriate value in config/default.cfg. By default, these values are configured assuming a 32GB V100 GPU. If you are using a 16GB V100 GPU, you need to reduce these values. For a 16GB GPU, 10737418240L is an appropriate value for weights_cache_size.
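The suggested value 10737418240 is 10 GiB expressed in bytes (10 × 1024 × 1024 × 1024). If you need a value for a different cache size, the arithmetic can be sketched as below; the helper name gib_to_bytes is hypothetical, and the trailing L seen in the config value appears to be the config format's suffix for 64-bit integers, so it is appended separately.

```shell
# Hypothetical helper: convert a size in GiB to bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# 10 GiB, matching the value suggested above for a 16GB GPU.
echo "weights_cache_size candidate: $(gib_to_bytes 10)L"
```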
cuModuleLoadData Error: CUDA_ERROR_OUT_OF_MEMORY
Clockwork cannot load infinitely many models. Each model requires up to 1MB for its kernels on the GPU. With thousands of models this can add up!
While loading models, the client may exit with:
what(): [21:03:51] /home/jcmace/clockwork/src/clockwork/model/cuda.cpp:80: cuModuleLoadData Error: CUDA_ERROR_OUT_OF_MEMORY
If this happens, the GPU ran out of memory. To fix it, you can:
- Use fewer models for your experiment
- Reduce the amount of GPU memory used for weights_cache_size. You can modify this in config/default.cfg on workers. Reducing weights cache will leave more memory available for kernels.
Protobuf compiler version doesn't match library version
Installing protocol buffers is annoying. The compiler version must match the library version. Check where the protoc command leads to (which protoc). Applications like conda sometimes install their own, different version of the protocol buffers compiler. If you are on a Google Cloud VM, modify your PATH variable to remove conda.
All cores exhausted
Clockwork workers require a minimum of 12 CPU cores (3 cores + 9 cores per GPU).
The Clockwork controller requires 32 cores.
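The worker requirement follows the formula 3 cores plus 9 cores per GPU, so a single-GPU worker needs 12. A rough sketch for comparing that against the cores on a machine is shown below; the helper name worker_cores_needed and the example GPU count are hypothetical.

```shell
# Hypothetical helper: cores a worker needs for $1 GPUs (3 + 9 per GPU).
worker_cores_needed() {
  echo $(( 3 + 9 * $1 ))
}

gpus=2                                  # hypothetical: GPUs on this worker
needed=$(worker_cores_needed "$gpus")
available=$(nproc)                      # cores available to this process

if [ "$available" -ge "$needed" ]; then
  echo "cores: $available available, $needed needed (ok)"
else
  echo "cores: $available available, $needed needed (too few for $gpus GPUs)"
fi
```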