Fatbin extractor

Clockwork currently has a converter binary, convert (with corresponding code in the clockwork-convert directory), that converts a TVM model description into a more efficient Clockwork representation.

Clockwork re-uses the TVM shared object verbatim. That shared object contains both host-side code and device-side CUDA code.

Clockwork's model implementation currently uses this TVM code and will continue to do so. However, rather than extracting the CUDA code from the shared object at runtime, we'd like to extract it ahead of time as part of the convert phase and store it as a separate file.
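As a rough illustration of the extraction step, the sketch below copies a length-prefixed device blob out of a byte buffer and writes it to its own file. It rests on assumptions that should be checked against the TVM version in use: that the device code is stored as a blob preceded by an 8-byte little-endian length (TVM is believed to export it under a symbol named `__tvm_dev_mblob`), and that the buffer has already been located, for example by reading the shared object from disk. The function names `read_le64`, `extract_payload`, and `write_file` are hypothetical and are not taken from src/cudafatbin.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Decode an 8-byte little-endian unsigned integer.
// ASSUMPTION: the device blob starts with such a length prefix.
std::uint64_t read_le64(const unsigned char* p) {
  std::uint64_t v = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
  }
  return v;
}

// Copy the payload that follows the length prefix.
// `available` is the number of readable bytes starting at `blob`.
std::vector<unsigned char> extract_payload(const unsigned char* blob,
                                           std::size_t available) {
  if (available < 8) {
    throw std::runtime_error("blob too short for length prefix");
  }
  const std::uint64_t n = read_le64(blob);
  if (n > available - 8) {
    throw std::runtime_error("length prefix exceeds blob size");
  }
  return std::vector<unsigned char>(blob + 8, blob + 8 + n);
}

// Store the extracted bytes as a separate file (hypothetical helper).
void write_file(const std::string& path,
                const std::vector<unsigned char>& bytes) {
  std::ofstream out(path, std::ios::binary);
  if (!out) {
    throw std::runtime_error("cannot open " + path);
  }
  out.write(reinterpret_cast<const char*>(bytes.data()),
            static_cast<std::streamsize>(bytes.size()));
}
```

The payload returned here is still a serialized module blob, not raw PTX; pulling the CUDA code out of it would need a further parsing step that this sketch does not attempt.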

This in turn lets us use nvcc to compile the CUDA code into a fatbin, avoiding runtime JIT compilation of the PTX code. Ideally, convert will pull out the CUDA code, run nvcc on it, and store the result in a file.
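A minimal sketch of how convert might invoke nvcc once the PTX has been written to disk is shown below. It assumes nvcc is on the PATH and a POSIX shell is available. The helper names (`shell_quote`, `nvcc_fatbin_command`, `compile_to_fatbin`) and the file names and architecture in the usage note are illustrative only, not part of the existing code.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes for a POSIX shell, escaping embedded quotes.
std::string shell_quote(const std::string& s) {
  std::string out = "'";
  for (char c : s) {
    if (c == '\'') {
      out += "'\\''";  // close quote, escaped quote, reopen quote
    } else {
      out += c;
    }
  }
  out += "'";
  return out;
}

// Build an nvcc command line that compiles a PTX file into a fatbin.
// `arch` is a GPU architecture name such as "sm_70" (an example value only).
std::string nvcc_fatbin_command(const std::string& ptx_path,
                                const std::string& fatbin_path,
                                const std::string& arch) {
  return "nvcc --fatbin -arch=" + arch + " " + shell_quote(ptx_path) +
         " -o " + shell_quote(fatbin_path);
}

// Run nvcc and fail loudly if it returns a non-zero status.
void compile_to_fatbin(const std::string& ptx_path,
                       const std::string& fatbin_path,
                       const std::string& arch) {
  const std::string cmd = nvcc_fatbin_command(ptx_path, fatbin_path, arch);
  if (std::system(cmd.c_str()) != 0) {
    throw std::runtime_error("nvcc failed: " + cmd);
  }
}
```

For example, `compile_to_fatbin("model.ptx", "model.fatbin", "sm_70")` would run `nvcc --fatbin -arch=sm_70 'model.ptx' -o 'model.fatbin'`. The architecture has to match the GPUs Clockwork will run on; otherwise the driver may still fall back to JIT compilation at load time.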

Some hacky example code in src/cudafatbin.cc already does some of this work; what remains is to tidy it up, invoke nvcc, and incorporate it into Clockwork's convert binary.

Edited Oct 10, 2019 by Jonathan Mace