Host-side paging of `cudaMallocHost` memory
For efficiency reasons, any host memory that will be transferred to the GPU should be allocated with `cudaMallocHost` (pinned memory) rather than `malloc`. However, `cudaMallocHost` is slow, and there are limits on the number of allocations.
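For reference, the per-tensor allocation pattern we want to avoid looks like the sketch below. The CUDA runtime calls are stubbed with `std::malloc`/`std::free` so the sketch builds and runs without the CUDA toolkit; the real `cudaMallocHost`/`cudaFreeHost` from `<cuda_runtime.h>` have the same shape, but each real call pins pages and is expensive.

```cpp
#include <cstddef>
#include <cstdlib>

// Stubs standing in for the CUDA runtime so this sketch compiles without a
// GPU. In real code these are cudaMallocHost/cudaFreeHost.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;

inline cudaError_t cudaMallocHost(void** ptr, std::size_t size) {
    *ptr = std::malloc(size);  // the real call also pins the pages for DMA
    return *ptr != nullptr ? cudaSuccess : 2;
}

inline cudaError_t cudaFreeHost(void* ptr) {
    std::free(ptr);
    return cudaSuccess;
}

// One pinned staging buffer per tensor: simple, but done once per weight
// tensor it hits both the speed and the allocation-count limits above.
inline void* alloc_pinned_staging(std::size_t bytes) {
    void* p = nullptr;
    return cudaMallocHost(&p, bytes) == cudaSuccess ? p : nullptr;
}
```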
For model inputs and outputs, we already reuse host memory via the `io_cache` in `memory.h`. For model weights, however, we still need a solution. In particular, we should reuse the `PageCache` class that manages device-side weights and apply it to host-side weights as well.
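The project's actual `PageCache` interface is not shown here, so the following is a minimal sketch of the idea under assumed names: a cache that hands out fixed-size pages and recycles released ones, so the number of `cudaMallocHost` calls stays bounded no matter how many weight tensors are loaded. Pinned allocation is stubbed with `std::malloc` so the sketch runs without CUDA.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical host-side page cache; the real PageCache class may differ.
// Pages are pinned once (stubbed with std::malloc here, cudaMallocHost in
// real code) and recycled on release instead of being re-pinned.
class HostPageCache {
public:
    explicit HostPageCache(std::size_t page_bytes) : page_bytes_(page_bytes) {}

    ~HostPageCache() {
        for (void* p : all_pages_) std::free(p);  // real code: cudaFreeHost
    }

    // Hand out a page, reusing a previously released one when possible.
    void* acquire() {
        if (!free_pages_.empty()) {
            void* p = free_pages_.back();
            free_pages_.pop_back();
            return p;
        }
        void* p = std::malloc(page_bytes_);       // real code: cudaMallocHost
        if (p) all_pages_.push_back(p);
        return p;
    }

    // Return a page to the cache; it stays pinned and ready for reuse.
    void release(void* page) { free_pages_.push_back(page); }

    std::size_t total_allocations() const { return all_pages_.size(); }

private:
    std::size_t page_bytes_;
    std::vector<void*> all_pages_;   // every page ever allocated
    std::vector<void*> free_pages_;  // pages currently available for reuse
};
```

With this shape, loading a second model after freeing the first performs zero new pinned allocations as long as the page counts match.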
Making this change will entail:

- Modifying the constructor of `Model` in model.h to take a `std::shared_ptr<Allocation>` rather than raw pointers.
- Modifying `transfer_weights_to_device` to copy from the allocation rather than from the raw pointers.
- Modifying `Model::loadFromDisk` (and, if it has been implemented, loading models over the network) so that model weights are loaded as pages rather than as a single blob.
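The first two bullets can be sketched as follows. The internals of `Allocation` and `Model` are assumptions here (the real types live in model.h and may differ): an `Allocation` owns a set of host pages, and the `Model` holds a `std::shared_ptr` to it so the pages stay alive as long as any model references them. The device copy is stubbed with a host buffer so the sketch runs anywhere; real code would `cudaMemcpyAsync` each page.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical Allocation: a set of pages obtained from the host page
// cache. Plain vectors stand in for pinned pages in this sketch.
struct Allocation {
    std::size_t page_bytes = 0;
    std::vector<std::vector<char>> pages;
};

class Model {
public:
    // Takes shared ownership of the allocation instead of raw pointers,
    // so the pages cannot be freed out from under the model.
    explicit Model(std::shared_ptr<Allocation> weights)
        : weights_(std::move(weights)) {}

    // Copies page by page rather than from a single raw blob pointer.
    // Real code would issue one cudaMemcpyAsync per page into device
    // memory; here "device" is a host buffer so the sketch is runnable.
    void transfer_weights_to_device(std::vector<char>& device) const {
        device.clear();
        for (const auto& page : weights_->pages)
            device.insert(device.end(), page.begin(), page.end());
    }

private:
    std::shared_ptr<Allocation> weights_;
};
```

Loading as pages (the third bullet) then means `Model::loadFromDisk` fills `Allocation::pages` one page-sized read at a time instead of reading the whole file into one buffer.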