Understanding explicit OpenCL memory migration between devices

Lately someone emailed me asking how OpenCL memory migration works. To be specific, when and how to use the clEnqueueMigrateMemObjects API when using more than one device.

Yeah, the description in Khronos's documentation[1] is very unhelpful, and this API isn't used often either. It also took me quite some time to understand what the documentation is talking about. But first, let's read the documentation. The API description reads:

Enqueues a command to indicate which device a set of memory objects should be associated with.

What are you even talking about!? There are too many questions here. Why "indicate"? Why is the association merely indicated rather than assigned? And when do I actually need this? These are good questions, and the answers are below, in the notes section. Quote:

Typically, memory objects are implicitly migrated to a device for which enqueued commands, using the memory object, are targeted. clEnqueueMigrateMemObjects allows this migration to be explicitly performed ahead of the dependent commands. This allows a user to preemptively change the association of a memory object, through regular command queue scheduling, in order to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed potentially hiding transfer latencies.

People with some experience in low-level programming are probably thinking "yeah, this makes sense" right now. Good for you. But I feel it deserves more explanation, so here is my attempt, to the best of my knowledge.

The explanation

OpenCL contexts are like resource managers. We developers allocate memory and command queues through a context. A context can manage resources of multiple devices at the same time, and it allows sharing resources between them. For example, on a dual-GPU system, both GPUs can be used under the same context. On TI's DSPC8681, all 4 DSP processors can be under the same context. Or an integrated GPU and a discrete GPU can be grouped together. There's really no limitation on the combination; a context is just a resource manager. For the sake of this post, let's say we have a context ctx, two GPU devices gpu0 and gpu1, and two command queues cmdq0 and cmdq1. Also, I'm going to use the C++ wrapper instead of the C API.

vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
auto gpu0 = devices[0];
auto gpu1 = devices[1];
cl::Context ctx({gpu0, gpu1});
cl::CommandQueue cmdq0(ctx, gpu0);
cl::CommandQueue cmdq1(ctx, gpu1);

Now, we can allocate a buffer on the context ctx by constructing a cl::Buffer object. Notice that it does not take a device as an argument; it only takes a context. This means that at this point OpenCL does not know (nor care) which device the buffer should be allocated on.

cl::Buffer buffer(ctx, CL_MEM_READ_WRITE, sizeof(int)*10);

Thus OpenCL can only actually allocate the memory on a device when we attempt to read/write it, map it, or bind it to a kernel and execute that kernel. These operations are bound to a command queue, and thus to a physical device. Through the command queue, OpenCL can establish a connection between the buffer and the device. This is what the documentation means by "association" - which device the buffer is physically allocated on.

// Write to the buffer. This associates the buffer with the device `gpu0`.
cmdq0.enqueueWriteBuffer(buffer, CL_FALSE, 0, sizeof(int)*10, some_data);
increment.setArg(0, buffer);
cmdq0.enqueueNDRangeKernel(increment, cl::NullRange, cl::NDRange(10));

What happens when we wish to use the same buffer on gpu1? Remember, we can bind buffers to all devices in the same context, so executing a kernel on gpu1 must work even if the buffer currently lives on gpu0. Yet gpu1 may not be able to access gpu0's memory for many reasons - in fact, gpu1 being able to directly access gpu0's memory is very unlikely, since most commonly the two GPUs sit on physically different cards. What OpenCL does is an implicit migration. Before executing the kernel, it finds that the buffer exists on gpu0 and moves it to gpu1. With the buffer residing in gpu1's VRAM, the kernel can execute successfully.

// this won't cause any error even if the buffer is on GPU0.
// OpenCL will implicitly migrate the buffer to GPU1.
decrement.setArg(0, buffer);
cmdq1.enqueueNDRangeKernel(decrement, cl::NullRange, cl::NDRange(10));

That's implicit migration. But what if we wish to proactively migrate the buffer, say, to reduce latency by doing other work while the migration is going on? This is where clEnqueueMigrateMemObjects comes in. It explicitly migrates the memory object to a device. Note that this operation may be a no-op: the buffer could already be on the device, or the vendor may decide it's better not to migrate the buffer (e.g. both devices can access the same memory).

// Nothing is done since `buffer` is already on `gpu1`.
// Passing 0 as the flags migrates the objects to the device associated with
// the command queue. (CL_MIGRATE_MEM_OBJECT_HOST would migrate them to the host.)
cmdq1.enqueueMigrateMemObjects({buffer}, 0);

One practical-ish use is when some CPU-side computation is needed to set up a kernel execution, but we already know a migration will be needed. Also, say we need to wait for the kernel to finish so the result can be handed back to the CPU (maybe the next step is faster on a CPU, events or not), so we can't hide latency by doing work after enqueueing the kernel. Instead of letting OpenCL migrate the buffer right before kernel execution, we can migrate it early and do our CPU-side setup in the meantime, effectively overlapping CPU compute and data transfer, and thereby lowering the latency.

// `buffer` is something gpu0 computed, and we wish to run finite element analysis
// on gpu1. We can start the migration early (flags = 0 targets the queue's device).
cmdq1.enqueueMigrateMemObjects({buffer}, 0);

// Do some work on CPU to not waste the transfer time.
int iteration = estimate_iteration(<simulation_parameters>);
float epsilon = estimate_epsilon(<simulation_parameters>);

fea_kernel.setArg(0, buffer);
fea_kernel.setArg(1, iteration);
fea_kernel.setArg(2, epsilon);
// Without explicit migration, the kernel has to wait for the migration to finish.
// But we hid the latency by doing it early and spend time computing on the CPU.
cmdq1.enqueueNDRangeKernel(fea_kernel, cl::NullRange, cl::NDRange(10));
// Wait for the kernel to finish and map the result for reading. CL_TRUE means blocking.
float* result = (float*)cmdq1.enqueueMapBuffer(buffer, CL_TRUE,
    CL_MAP_READ, 0, sizeof(float)*result_size);

for (int i = 0; i < result_size; i++) {
    cout << result[i] << endl;
}
cmdq1.enqueueUnmapMemObject(buffer, result); // release the mapping when done

In my experience, I don't see that many applications that can both utilize multiple GPUs and also need the CPU to do some of the work. I guess this is why explicit migration sees so little use.

Notable points

There are several things about explicit migration that may be confusing. First: you don't need to explicitly migrate a buffer back after you explicitly migrated it away. OpenCL can still detect that it's not on the right device and migrate it implicitly.

cmdq1.enqueueMigrateMemObjects({buffer}, 0); // 0: migrate to the queue's device
decrement.setArg(0, buffer);
cmdq1.enqueueNDRangeKernel(decrement, cl::NullRange, cl::NDRange(10));

// You don't need to explicitly migrate the buffer back, as OpenCL will implicitly migrate it.
// cmdq0.enqueueMigrateMemObjects({buffer}, 0); // not needed
increment.setArg(0, buffer);
cmdq0.enqueueNDRangeKernel(increment, cl::NullRange, cl::NDRange(10));

Second: you don't need to re-bind the buffer to a kernel after migration. After all, it is the same buffer under the same context. OpenCL will detect that it has been migrated and will migrate it back again when needed. Do be aware that migrating back and forth often causes a severe performance hit; for data that doesn't change, just make separate buffers and upload them to the individual devices.

increment.setArg(0, buffer);
cmdq1.enqueueMigrateMemObjects({buffer}, 0);
...
// You don't need to re-bind the buffer to the kernel.
cmdq0.enqueueNDRangeKernel(increment, cl::NullRange, cl::NDRange(10));

Performance considerations

Migration, explicit or not, can be a costly operation. I advise avoiding migration as much as possible. One also has to be aware that implicit migration can happen: if a buffer is shared among multiple devices, it may trigger a migration each time a kernel is executed, dragging performance way down. In such a case, if the data is constant, it's better to just duplicate that buffer. If the data is changing, try reordering the kernels or duplicating buffers to reduce the number of migrations.

Furthermore, the migration process is implementation-dependent. The SDK for your device may decide the only option is to copy the contents from GPU memory to main memory and then transfer them to the other GPU. Or it may detect a special inter-device communication channel and do the migration directly, like DMA straight over PCIe (AFAIK AMD has supported that through DirectGMA). It may also decide no migration is needed at all, like when migrating from an iGPU to the CPU, or from one sub-device to another sub-device that lives on the same parent device.

In general, if you are attempting to hide latency, explicit migration is an option, even if there is some added overhead when the underlying devices do share memory. The overhead of a no-op is negligible compared to the latency hidden when the migration is actually needed.

Conclusion

You don't need to explicitly migrate memory objects; OpenCL will do it for you. But if the conditions are met and you need the reduced latency, yes, explicit migration can be helpful. Also be aware that migration can be a costly operation; avoid it when possible. Sometimes using more RAM or doing more compute is better than migrating. As always with performance, you only know once you measure.

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict
  • marty1885 \at protonmail.com
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df