Host-side paging of `cudaMallocHost` memory
For efficiency reasons, any host memory that will be transferred to the GPU should be allocated with `cudaMallocHost` (pinned memory) rather than `malloc`. However, `cudaMallocHost` is slow, and there are limits on the number of allocations.
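For reference, the per-tensor allocation pattern we want to avoid looks like the sketch below. The CUDA runtime calls are stubbed with `std::malloc`/`std::free` so the sketch builds and runs without the CUDA toolkit; the real `cudaMallocHost`/`cudaFreeHost` from `<cuda_runtime.h>` have the same shape, but each real call pins pages and is expensive.

```cpp
#include <cstddef>
#include <cstdlib>

// Stubs standing in for the CUDA runtime so this sketch compiles without a
// GPU. In real code these are cudaMallocHost/cudaFreeHost.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;

inline cudaError_t cudaMallocHost(void** ptr, std::size_t size) {
    *ptr = std::malloc(size);  // the real call also pins the pages for DMA
    return *ptr != nullptr ? cudaSuccess : 2;
}

inline cudaError_t cudaFreeHost(void* ptr) {
    std::free(ptr);
    return cudaSuccess;
}

// One pinned staging buffer per tensor: simple, but done once per weight
// tensor it hits both the speed and the allocation-count limits above.
inline void* alloc_pinned_staging(std::size_t bytes) {
    void* p = nullptr;
    return cudaMallocHost(&p, bytes) == cudaSuccess ? p : nullptr;
}
```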
For model inputs and outputs, we already reuse host memory via the `io_cache` in `memory.h`. For model weights, however, we still need a solution. In particular, we should reuse the `PageCache` class that manages device-side weights and apply it to host-side weights as well.
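The project's actual `PageCache` interface is not shown here, so the following is a minimal sketch of the idea under assumed names: a cache that hands out fixed-size pages and recycles released ones, so the number of `cudaMallocHost` calls stays bounded no matter how many weight tensors are loaded. Pinned allocation is stubbed with `std::malloc` so the sketch runs without CUDA.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical host-side page cache; the real PageCache class may differ.
// Pages are pinned once (stubbed with std::malloc here, cudaMallocHost in
// real code) and recycled on release instead of being re-pinned.
class HostPageCache {
public:
    explicit HostPageCache(std::size_t page_bytes) : page_bytes_(page_bytes) {}

    ~HostPageCache() {
        for (void* p : all_pages_) std::free(p);  // real code: cudaFreeHost
    }

    // Hand out a page, reusing a previously released one when possible.
    void* acquire() {
        if (!free_pages_.empty()) {
            void* p = free_pages_.back();
            free_pages_.pop_back();
            return p;
        }
        void* p = std::malloc(page_bytes_);       // real code: cudaMallocHost
        if (p) all_pages_.push_back(p);
        return p;
    }

    // Return a page to the cache; it stays pinned and ready for reuse.
    void release(void* page) { free_pages_.push_back(page); }

    std::size_t total_allocations() const { return all_pages_.size(); }

private:
    std::size_t page_bytes_;
    std::vector<void*> all_pages_;   // every page ever allocated
    std::vector<void*> free_pages_;  // pages currently available for reuse
};
```

With this shape, loading a second model after freeing the first performs zero new pinned allocations as long as the page counts match.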
Making this change will entail:

- Modifying the constructor of `Model` in model.h to take a `std::shared_ptr<Allocation>` rather than raw pointers.
- Modifying `transfer_weights_to_device` to copy from the allocation rather than from the raw pointers.
- Modifying `Model::loadFromDisk` (and, if it has been implemented, loading models over the network) so that model weights are loaded as pages rather than as a single blob.
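The first two bullets can be sketched as follows. The internals of `Allocation` and `Model` are assumptions here (the real types live in model.h and may differ): an `Allocation` owns a set of host pages, and the `Model` holds a `std::shared_ptr` to it so the pages stay alive as long as any model references them. The device copy is stubbed with a host buffer so the sketch runs anywhere; real code would `cudaMemcpyAsync` each page.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical Allocation: a set of pages obtained from the host page
// cache. Plain vectors stand in for pinned pages in this sketch.
struct Allocation {
    std::size_t page_bytes = 0;
    std::vector<std::vector<char>> pages;
};

class Model {
public:
    // Takes shared ownership of the allocation instead of raw pointers,
    // so the pages cannot be freed out from under the model.
    explicit Model(std::shared_ptr<Allocation> weights)
        : weights_(std::move(weights)) {}

    // Copies page by page rather than from a single raw blob pointer.
    // Real code would issue one cudaMemcpyAsync per page into device
    // memory; here "device" is a host buffer so the sketch is runnable.
    void transfer_weights_to_device(std::vector<char>& device) const {
        device.clear();
        for (const auto& page : weights_->pages)
            device.insert(device.end(), page.begin(), page.end());
    }

private:
    std::shared_ptr<Allocation> weights_;
};
```

Loading as pages (the third bullet) then means `Model::loadFromDisk` fills `Allocation::pages` one page-sized read at a time instead of reading the whole file into one buffer.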