Fatbin extractor
Currently, clockwork has a converter binary, convert (with corresponding code in the clockwork-convert directory), that converts a TVM model description into a more efficient clockwork representation.
Clockwork currently re-uses the TVM shared-object verbatim. The shared-object contains both host-side code and device-side CUDA code.
Clockwork's model implementation currently uses this TVM code and will continue to do so. However, rather than extracting the CUDA code directly from the shared-object, we'd like to extract it ahead of time as part of the convert phase and store it as a separate file.
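
As a rough illustration of the extraction step, the sketch below copies a length-prefixed device blob out of memory and writes it to a file. It rests on assumptions that are not stated above: that the TVM build in use exports the device code under the symbol __tvm_dev_mblob, and that the blob starts with an 8-byte little-endian length followed by the serialized device modules (which would still need to be unpacked to reach the PTX itself). The function names blob_length, blob_payload and write_file are hypothetical, and obtaining the pointer (for example via dlsym) is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Decode the 8-byte little-endian length prefix.
// ASSUMPTION: this is the layout of the blob behind __tvm_dev_mblob.
std::uint64_t blob_length(const unsigned char* mblob) {
    std::uint64_t nbytes = 0;
    for (std::size_t i = 0; i < sizeof(nbytes); ++i) {
        nbytes |= static_cast<std::uint64_t>(mblob[i]) << (i * 8);
    }
    return nbytes;
}

// Copy the payload that follows the length prefix into a string.
std::string blob_payload(const unsigned char* mblob) {
    const std::uint64_t nbytes = blob_length(mblob);
    const char* begin =
        reinterpret_cast<const char*>(mblob + sizeof(std::uint64_t));
    return std::string(begin, static_cast<std::size_t>(nbytes));
}

// Write raw bytes to a file, so the device code can live next to the
// converted model rather than inside the shared-object.
void write_file(const std::string& path, const std::string& bytes) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        throw std::runtime_error("cannot open " + path);
    }
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

In convert, the pointer passed to blob_payload would come from the loaded shared-object, and the output path would be chosen alongside the other converted model files.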
This in turn lets us use nvcc to compile the CUDA code into a fatbin, avoiding runtime JIT compilation of the PTX code. So ideally, convert will pull out the CUDA code, run nvcc on it, and store the result in a file.
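
The nvcc step could be sketched roughly as follows, assuming the extracted PTX has already been written to a file. The helper names (shell_quote, nvcc_fatbin_command, compile_to_fatbin) and the example architecture are hypothetical; the target architecture has to match the GPUs clockwork will run on. The sketch shells out through std::system, which is a simplification and not necessarily how convert should do it.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes for /bin/sh, escaping embedded quotes.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') {
            out += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build the nvcc command line. --fatbin asks nvcc for a device-only
// fatbin; arch (e.g. "sm_70") is an ASSUMED example value.
std::string nvcc_fatbin_command(const std::string& ptx_path,
                                const std::string& fatbin_path,
                                const std::string& arch) {
    return "nvcc --fatbin -arch=" + arch + " " + shell_quote(ptx_path) +
           " -o " + shell_quote(fatbin_path);
}

// Run nvcc and fail loudly if it returns a non-zero status.
void compile_to_fatbin(const std::string& ptx_path,
                       const std::string& fatbin_path,
                       const std::string& arch) {
    const std::string cmd = nvcc_fatbin_command(ptx_path, fatbin_path, arch);
    if (std::system(cmd.c_str()) != 0) {
        throw std::runtime_error("nvcc failed: " + cmd);
    }
}
```

Keeping command construction separate from execution makes the command easy to log and to check without nvcc installed.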
Some hacky examples in src/cudafatbin.cc already do some of this work; all that's needed is to tidy them up, invoke nvcc, and incorporate the result into clockwork's convert binary.