
[Offload] Implement the remaining initial Offload API #122106


Merged: 15 commits, merged Apr 22, 2025

Conversation

@callumfare (Contributor) commented Jan 8, 2025

Implement the complete initial version of the Offload API, to the extent that it is usable for simple offloading programs. Tested with a basic SYCL program.

As far as possible, these are simple wrappers over existing functionality in the plugins.

  • Allocating and freeing memory (host, device, shared).
  • Creating a program
  • Creating a queue (wrapper over asynchronous stream resource)
  • Enqueuing memcpy operations
  • Enqueuing kernel executions
  • Waiting on (optional) output events from the enqueue operations
  • Waiting on a queue to finish
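Under the assumption that the entry points behave as listed, the operations above compose into a typical host-side flow. This is a pseudocode-style C sketch only (error handling omitted, device/kernel setup elided); exact signatures may change while the API is unstable:

```c
// Hedged sketch: names follow the .td definitions in this PR, but the
// surrounding setup (Device, Kernel, LaunchArgs, host buffers) is assumed.
void *DevBuf;
olMemAlloc(Device, OL_ALLOC_TYPE_DEVICE, Size, &DevBuf);

ol_queue_handle_t Queue;
olCreateQueue(Device, &Queue);

// Copy inputs, launch, copy results back; EventOut is optional (NULL here).
olEnqueueDataWrite(Queue, HostSrc, DevBuf, Size, NULL);
olEnqueueKernelLaunch(Queue, Kernel, &LaunchArgs, NULL);
olEnqueueDataRead(Queue, DevBuf, HostDst, Size, NULL);

olFinishQueue(Queue);  // block until the enqueued work completes

olReleaseQueue(Queue); // drop the initial reference
olMemFree(Device, OL_ALLOC_TYPE_DEVICE, DevBuf);
```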

Objects created with the API have reference counting semantics to handle their lifetime. They are created with an initial reference count of 1, which can be incremented and decremented with retain and release functions. They are freed when their reference count reaches 0. Platform and device objects are not reference counted, as they are expected to persist as long as the library is in use, and it's not meaningful for users to create or destroy them.

Tests have been added to offload.unittests, including device code for testing program and kernel related functionality.

The API should still be considered unstable and it's very likely we will need to change the existing entry points.

@jplehr self-requested a review January 8, 2025 15:17
@github-actions (bot) commented Feb 4, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@callumfare changed the title from "Draft: Implement the remaining initial Offload API" to "[Offload] Implement the remaining initial Offload API" on Feb 5, 2025
@callumfare marked this pull request as ready for review February 5, 2025 13:04
@llvmbot (Member) commented:

@llvm/pr-subscribers-offload

Author: Callum Fare (callumfare)

Changes

Implement the complete initial version of the Offload API, to the extent that it is usable for simple offloading programs. Tested with a basic SYCL program.


Patch is 127.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122106.diff

31 Files Affected:

  • (modified) offload/liboffload/API/Common.td (+20)
  • (added) offload/liboffload/API/Enqueue.td (+82)
  • (added) offload/liboffload/API/Event.td (+41)
  • (added) offload/liboffload/API/Kernel.td (+62)
  • (added) offload/liboffload/API/Memory.td (+48)
  • (modified) offload/liboffload/API/OffloadAPI.td (+6)
  • (added) offload/liboffload/API/Program.td (+44)
  • (added) offload/liboffload/API/Queue.td (+52)
  • (modified) offload/liboffload/include/generated/OffloadAPI.h (+806)
  • (modified) offload/liboffload/include/generated/OffloadEntryPoints.inc (+913)
  • (modified) offload/liboffload/include/generated/OffloadFuncs.inc (+42)
  • (modified) offload/liboffload/include/generated/OffloadImplFuncDecls.inc (+64)
  • (modified) offload/liboffload/include/generated/OffloadPrint.hpp (+325)
  • (modified) offload/liboffload/src/OffloadImpl.cpp (+358)
  • (modified) offload/plugins-nextgen/common/include/GlobalHandler.h (+3-2)
  • (modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+28)
  • (modified) offload/plugins-nextgen/host/src/rtl.cpp (+2-2)
  • (modified) offload/tools/offload-tblgen/PrintGen.cpp (+33-2)
  • (modified) offload/tools/offload-tblgen/RecordTypes.hpp (+2)
  • (modified) offload/unittests/OffloadAPI/CMakeLists.txt (+11-1)
  • (modified) offload/unittests/OffloadAPI/common/Fixtures.hpp (+17-1)
  • (added) offload/unittests/OffloadAPI/enqueue/olEnqueueDataCopy.cpp (+36)
  • (added) offload/unittests/OffloadAPI/enqueue/olEnqueueDataRead.cpp (+29)
  • (added) offload/unittests/OffloadAPI/enqueue/olEnqueueDataWrite.cpp (+23)
  • (added) offload/unittests/OffloadAPI/memory/olMemAlloc.cpp (+45)
  • (added) offload/unittests/OffloadAPI/memory/olMemFree.cpp (+47)
  • (modified) offload/unittests/OffloadAPI/platform/olPlatformInfo.hpp (+1)
  • (added) offload/unittests/OffloadAPI/queue/olCreateQueue.cpp (+28)
  • (added) offload/unittests/OffloadAPI/queue/olFinishQueue.cpp (+17)
  • (added) offload/unittests/OffloadAPI/queue/olReleaseQueue.cpp (+21)
  • (added) offload/unittests/OffloadAPI/queue/olRetainQueue.cpp (+18)
diff --git a/offload/liboffload/API/Common.td b/offload/liboffload/API/Common.td index 5b19d1d47129ef0..7fedb2002f157ea 100644 --- a/offload/liboffload/API/Common.td+++ b/offload/liboffload/API/Common.td@@ -62,6 +62,26 @@ def : Handle { let desc = "Handle of context object"; } +def : Handle {+ let name = "ol_queue_handle_t";+ let desc = "Handle of queue object";+}++def : Handle {+ let name = "ol_event_handle_t";+ let desc = "Handle of event object";+}++def : Handle {+ let name = "ol_program_handle_t";+ let desc = "Handle of program object";+}++def : Handle {+ let name = "ol_kernel_handle_t";+ let desc = "Handle of kernel object";+}+ def : Enum { let name = "ol_errc_t"; let desc = "Defines Return/Error codes"; diff --git a/offload/liboffload/API/Enqueue.td b/offload/liboffload/API/Enqueue.td new file mode 100644 index 000000000000000..d9215e8175ef8a3 --- /dev/null+++ b/offload/liboffload/API/Enqueue.td@@ -0,0 +1,82 @@+//===-- Enqueue.td - Enqueue definitions for Offload -------*- tablegen -*-===//+//+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.+// See https://llvm.org/LICENSE.txt for license information.+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception+//+//===----------------------------------------------------------------------===//+//+// This file contains Offload API definitions related to enqueable operations+//+//===----------------------------------------------------------------------===//++def : Function {+ let name = "olEnqueueDataWrite";+ let desc = "Enqueue a write operation from host to device memory";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,+ Param<"void*", "SrcPtr", "host pointer to copy from", PARAM_IN>,+ Param<"void*", "DstPtr", "device pointer to copy to", PARAM_IN>,+ Param<"size_t", "Size", "size in bytes of data to copy", PARAM_IN>,+ Param<"ol_event_handle_t*", "EventOut", "optional recorded event for the enqueued operation", 
PARAM_OUT_OPTIONAL>+ ];+ let returns = [];+}++def : Function {+ let name = "olEnqueueDataRead";+ let desc = "Enqueue a read operation from device to host memory";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,+ Param<"void*", "SrcPtr", "device pointer to copy from", PARAM_IN>,+ Param<"void*", "DstPtr", "host pointer to copy to", PARAM_IN>,+ Param<"size_t", "Size", "size in bytes of data to copy", PARAM_IN>,+ Param<"ol_event_handle_t*", "EventOut", "optional recorded event for the enqueued operation", PARAM_OUT_OPTIONAL>+ ];+ let returns = [];+}++def : Function {+ let name = "olEnqueueDataCopy";+ let desc = "Enqueue a write operation between device allocations";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,+ Param<"void*", "SrcPtr", "device pointer to copy from", PARAM_IN>,+ Param<"void*", "DstPtr", "device pointer to copy to", PARAM_IN>,+ Param<"ol_device_handle_t", "DstDevice", "device that the destination pointer is resident on", PARAM_IN>,+ Param<"size_t", "Size", "size in bytes of data to copy", PARAM_IN>,+ Param<"ol_event_handle_t*", "EventOut", "optional recorded event for the enqueued operation", PARAM_OUT_OPTIONAL>+ ];+ let returns = [];+}+++def : Struct {+ let name = "ol_kernel_launch_size_args_t";+ let desc = "Size-related arguments for a kernel launch.";+ let members = [+ StructMember<"size_t", "Dimensions", "Number of work dimensions">,+ StructMember<"size_t", "NumGroupsX", "Number of work groups on the X dimension">,+ StructMember<"size_t", "NumGroupsY", "Number of work groups on the Y dimension">,+ StructMember<"size_t", "NumGroupsZ", "Number of work groups on the Z dimension">,+ StructMember<"size_t", "GroupSizeX", "Size of a work group on the X dimension.">,+ StructMember<"size_t", "GroupSizeY", "Size of a work group on the Y dimension.">,+ StructMember<"size_t", "GroupSizeZ", "Size of a work group on the Z dimension.">+ ];+}++def : 
Function {+ let name = "olEnqueueKernelLaunch";+ let desc = "Enqueue a kernel launch with the specified size and parameters";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,+ Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>,+ Param<"const ol_kernel_launch_size_args_t*", "LaunchSizeArgs", "pointer to the struct containing launch size parameters", PARAM_IN>,+ Param<"ol_event_handle_t*", "EventOut", "optional recorded event for the enqueued operation", PARAM_OUT_OPTIONAL>+ ];+ let returns = [];+}diff --git a/offload/liboffload/API/Event.td b/offload/liboffload/API/Event.td new file mode 100644 index 000000000000000..db90a7c8e2be43b --- /dev/null+++ b/offload/liboffload/API/Event.td@@ -0,0 +1,41 @@+//===-- Event.td - Event definitions for Offload -----------*- tablegen -*-===//+//+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.+// See https://llvm.org/LICENSE.txt for license information.+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception+//+//===----------------------------------------------------------------------===//+//+// This file contains Offload API definitions related to the event handle+//+//===----------------------------------------------------------------------===//++def : Function {+ let name = "olRetainEvent";+ let desc = "Increment the reference count of the given event";+ let details = [];+ let params = [+ Param<"ol_event_handle_t", "Event", "handle of the event", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olReleaseEvent";+ let desc = "Decrement the reference count of the given event";+ let details = [];+ let params = [+ Param<"ol_event_handle_t", "Event", "handle of the event", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olWaitEvent";+ let desc = "Wait for the event to be complete";+ let details = [];+ let params = [+ Param<"ol_event_handle_t", "Event", "handle of the event", PARAM_IN>+ 
];+ let returns = [];+}diff --git a/offload/liboffload/API/Kernel.td b/offload/liboffload/API/Kernel.td new file mode 100644 index 000000000000000..4c8c84e9c71de89 --- /dev/null+++ b/offload/liboffload/API/Kernel.td@@ -0,0 +1,62 @@+def : Function {+ let name = "olCreateKernel";+ let desc = "";+ let details = [];+ let params = [+ Param<"ol_program_handle_t", "Program", "handle of the program", PARAM_IN>,+ Param<"const char*", "KernelName", "name of the kernel entry point in the program", PARAM_IN>,+ Param<"ol_kernel_handle_t*", "Kernel", "output pointer for the created kernel", PARAM_OUT>+ ];+ let returns = [];+}++def : Function {+ let name = "olRetainKernel";+ let desc = "Increment the reference count of the given kernel";+ let details = [];+ let params = [+ Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olReleaseKernel";+ let desc = "Decrement the reference count of the given kernel";+ let details = [];+ let params = [+ Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olSetKernelArgValue";+ let desc = "Set the value of a single kernel argument at the given index";+ let details = [+ "The implementation will construct and lay out the backing storage for the kernel arguments."+ "The effects of calls to this function on a kernel are lost if olSetKernelArgsData is called."+ ];+ let params = [+ Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>,+ Param<"uint32_t", "Index", "index of the argument", PARAM_IN>,+ Param<"size_t", "Size", "size of the argument data", PARAM_IN>,+ Param<"void*", "ArgData", "pointer to the argument data", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olSetKernelArgsData";+ let desc = "Set the entire argument data for a kernel";+ let details = [+ "Previous calls to olSetKernelArgValue on the same kernel are invalidated by this function"+ "The data 
pointed to by ArgsData is assumed to be laid out correctly according to the requirements of the backend API"+ ];+ let params = [+ Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>,+ Param<"void*", "ArgsData", "pointer to the argument data", PARAM_IN>,+ Param<"size_t", "ArgsDataSize", "size of the argument data", PARAM_IN>+ ];+ let returns = [];+}diff --git a/offload/liboffload/API/Memory.td b/offload/liboffload/API/Memory.td new file mode 100644 index 000000000000000..2c3f4c83980d030 --- /dev/null+++ b/offload/liboffload/API/Memory.td@@ -0,0 +1,48 @@+//===-- Memory.td - Memory definitions for Offload ---------*- tablegen -*-===//+//+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.+// See https://llvm.org/LICENSE.txt for license information.+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception+//+//===----------------------------------------------------------------------===//+//+// This file contains Offload API definitions related to memory allocations+//+//===----------------------------------------------------------------------===//++def : Enum {+ let name = "ol_alloc_type_t";+ let desc = "Represents the type of allocation made with olMemAlloc";+ let etors = [+ Etor<"HOST", "Host allocation">,+ Etor<"DEVICE", "Device allocation">,+ Etor<"SHARED", "Shared allocation">+ ];+}++def : Function {+ let name = "olMemAlloc";+ let desc = "Creates a memory allocation on the specified device";+ let params = [+ Param<"ol_device_handle_t", "Device", "handle of the device to allocate on", PARAM_IN>,+ Param<"ol_alloc_type_t", "Type", "type of the allocation", PARAM_IN>,+ Param<"size_t", "Size", "size of the allocation in bytes", PARAM_IN>,+ Param<"void**", "AllocationOut", "output for the allocated pointer", PARAM_OUT>+ ];+ let returns = [+ Return<"OL_ERRC_INVALID_SIZE", [+ "`Size == 0`"+ ]>+ ];+}++def : Function {+ let name = "olMemFree";+ let desc = "Frees a memory allocation previously made by olMemAlloc";+ let 
params = [+ Param<"ol_device_handle_t", "Device", "handle of the device to allocate on", PARAM_IN>,+ Param<"ol_alloc_type_t", "Type", "type of the allocation", PARAM_IN>,+ Param<"void*", "Address", "address of the allocation to free", PARAM_IN>,+ ];+ let returns = [];+}diff --git a/offload/liboffload/API/OffloadAPI.td b/offload/liboffload/API/OffloadAPI.td index 8a0c3c405812232..f2822b93e6bf8f6 100644 --- a/offload/liboffload/API/OffloadAPI.td+++ b/offload/liboffload/API/OffloadAPI.td@@ -13,3 +13,9 @@ include "APIDefs.td" include "Common.td" include "Platform.td" include "Device.td" +include "Memory.td"+include "Queue.td"+include "Event.td"+include "Enqueue.td"+include "Program.td"+include "Kernel.td"diff --git a/offload/liboffload/API/Program.td b/offload/liboffload/API/Program.td new file mode 100644 index 000000000000000..684a6581320f8d5 --- /dev/null+++ b/offload/liboffload/API/Program.td@@ -0,0 +1,44 @@+//===-- Program.td - Program definitions for Offload -------*- tablegen -*-===//+//+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.+// See https://llvm.org/LICENSE.txt for license information.+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception+//+//===----------------------------------------------------------------------===//+//+// This file contains Offload API definitions related to the program handle+//+//===----------------------------------------------------------------------===//++def : Function {+ let name = "olCreateProgram";+ let desc = "";+ let details = [];+ let params = [+ Param<"ol_device_handle_t", "Device", "handle of the device", PARAM_IN>,+ Param<"void*", "ProgData", "pointer to the program binary data", PARAM_IN>,+ Param<"size_t", "ProgDataSize", "size of the program binary in bytes", PARAM_IN>,+ Param<"ol_program_handle_t*", "Queue", "output pointer for the created program", PARAM_OUT>+ ];+ let returns = [];+}++def : Function {+ let name = "olRetainProgram";+ let desc = "Create a queue for the given 
device";+ let details = [];+ let params = [+ Param<"ol_program_handle_t", "Program", "handle of the program", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olReleaseProgram";+ let desc = "Create a queue for the given device";+ let details = [];+ let params = [+ Param<"ol_program_handle_t", "Program", "handle of the program", PARAM_IN>+ ];+ let returns = [];+}diff --git a/offload/liboffload/API/Queue.td b/offload/liboffload/API/Queue.td new file mode 100644 index 000000000000000..5629fa40d56d5f3 --- /dev/null+++ b/offload/liboffload/API/Queue.td@@ -0,0 +1,52 @@+//===-- Queue.td - Queue definitions for Offload -----------*- tablegen -*-===//+//+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.+// See https://llvm.org/LICENSE.txt for license information.+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception+//+//===----------------------------------------------------------------------===//+//+// This file contains Offload API definitions related to the queue handle+//+//===----------------------------------------------------------------------===//++def : Function {+ let name = "olCreateQueue";+ let desc = "Create a queue for the given device";+ let details = [];+ let params = [+ Param<"ol_device_handle_t", "Device", "handle of the device", PARAM_IN>,+ Param<"ol_queue_handle_t*", "Queue", "output pointer for the created queue", PARAM_OUT>+ ];+ let returns = [];+}++def : Function {+ let name = "olRetainQueue";+ let desc = "Create a queue for the given device";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olReleaseQueue";+ let desc = "Create a queue for the given device";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>+ ];+ let returns = [];+}++def : Function {+ let name = "olFinishQueue";+ let desc = "Wait for the enqueued work on a queue to 
complete";+ let details = [];+ let params = [+ Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>+ ];+ let returns = [];+}diff --git a/offload/liboffload/include/generated/OffloadAPI.h b/offload/liboffload/include/generated/OffloadAPI.h index 11fcc96625ab8dd..950c0e37ae67c4e 100644 --- a/offload/liboffload/include/generated/OffloadAPI.h+++ b/offload/liboffload/include/generated/OffloadAPI.h@@ -85,6 +85,22 @@ typedef struct ol_device_handle_t_ *ol_device_handle_t; /// @brief Handle of context object typedef struct ol_context_handle_t_ *ol_context_handle_t; +///////////////////////////////////////////////////////////////////////////////+/// @brief Handle of queue object+typedef struct ol_queue_handle_t_ *ol_queue_handle_t;++///////////////////////////////////////////////////////////////////////////////+/// @brief Handle of event object+typedef struct ol_event_handle_t_ *ol_event_handle_t;++///////////////////////////////////////////////////////////////////////////////+/// @brief Handle of program object+typedef struct ol_program_handle_t_ *ol_program_handle_t;++///////////////////////////////////////////////////////////////////////////////+/// @brief Handle of kernel object+typedef struct ol_kernel_handle_t_ *ol_kernel_handle_t;+ /////////////////////////////////////////////////////////////////////////////// /// @brief Defines Return/Error codes typedef enum ol_errc_t { @@ -460,6 +476,459 @@ OL_APIEXPORT ol_result_t OL_APICALL olGetDeviceInfoSize( // [out] pointer to the number of bytes required to store the query size_t *PropSizeRet); +///////////////////////////////////////////////////////////////////////////////+/// @brief Represents the type of allocation made with olMemAlloc+typedef enum ol_alloc_type_t {+ /// Host allocation+ OL_ALLOC_TYPE_HOST = 0,+ /// Device allocation+ OL_ALLOC_TYPE_DEVICE = 1,+ /// Shared allocation+ OL_ALLOC_TYPE_SHARED = 2,+ /// @cond+ OL_ALLOC_TYPE_FORCE_UINT32 = 0x7fffffff+ /// @endcond++} 
ol_alloc_type_t;++///////////////////////////////////////////////////////////////////////////////+/// @brief Creates a memory allocation on the specified device+///+/// @details+///+/// @returns+/// - ::OL_RESULT_SUCCESS+/// - ::OL_ERRC_UNINITIALIZED+/// - ::OL_ERRC_DEVICE_LOST+/// - ::OL_ERRC_INVALID_SIZE+/// + `Size == 0`+/// - ::OL_ERRC_INVALID_NULL_HANDLE+/// + `NULL == Device`+/// - ::OL_ERRC_INVALID_NULL_POINTER+/// + `NULL == AllocationOut`+OL_APIEXPORT ol_result_t OL_APICALL olMemAlloc(+ // [in] handle of the device to allocate on+ ol_device_handle_t Device,+ // [in] type of the allocation+ ol_alloc_type_t Type,+ // [in] size of the allocation in bytes+ size_t Size,+ // [out] output for the allocated pointer+ void **AllocationOut);++///////////////////////////////////////////////////////////////////////////////+/// @brief Frees a memory allocation previously made by olMemAlloc+///+/// @details+///+/// @returns+/// - ::OL_RESULT_SUCCESS+/// - ::OL_ERRC_UNINITIALIZED+/// - ::OL_ERRC_DEVICE_LOST+/// - ::OL_ERRC_INVALID_NULL_HANDLE+/// + `NULL == Device`+/// - ::OL_ERRC_INVALID_NULL_POINTER+/// + `NULL == Address`+OL_APIEXPORT ol_result_t OL_APICALL olMemFree(+ // [in] handle of the device to allocate on+ ol_device_handle_t Device,+ // [in] type of the allocation+ ol_alloc_type_t Type,+ // [in] address of the allocation to free+ void *Address);++///////////////////////////////////////////////////////////////////////////////+/// @brief Create a queue for the given device+///+/// @details+///+/// @returns+/// - ::OL_RESULT_SUCCESS+/// - ::OL_ERRC_UNINITIALIZED+/// - ::OL_ERRC_DEVICE_LOST+/// - ::OL_ERRC_INVALID_NULL_HANDLE+/// + `NULL == Device`+/// - ::OL_ERRC_INVALID_NULL_POINTER+/// + `NULL == Queue`+OL_APIEXPORT ol_result_t OL_APICALL olCreateQueue(+ // [in] handle of the device+ ol_device_handle_t Device,+ // [out] output pointer for the created queue+ ol_queue_handle_t 
*Queue);++///////////////////////////////////////////////////////////////////////////////+/// @brief Create a queue for the given device+///+/// @details+///+/// @returns+/// - ::OL_RESULT_SUCCESS+/// - ::OL_ERRC_UNINITIALIZED+/// - ::OL_ERRC_DEVICE_LOST+/// - ::OL_ERRC_INVALID_NULL_HANDLE+/// + `NULL == Queue`+/// - ::OL_ERRC_INVALID_NULL_POINTER+OL_APIEXPORT ol_result_t OL_APICALL olRetainQueue(+ // [in] handle of the queue+ ol_queue_handle_t Queue);++///////////////////////////////////////////////////////////////////////////////+/// @brief Create a queue for the given device+///+/// @details+///+/// @returns+/// - ::OL_RESULT_SUCCESS+/// - ::OL_ERRC_UNINITIALIZED+/// - ::OL_ERRC_DEVICE_LOST+/// - ::OL_ERRC_INVALID_NULL_HANDLE+/// + `NULL == Queue`+/// - ::OL_ERRC_INVALID_NULL_POINTER+OL_APIEXPORT ol_result_t OL_APICALL olReleaseQueue(+ // [in] handle of the queue+ ol_queue_handle_t Queue);++///////////////////////////////////////////////////////////////////////////////+/// @brief Wait for the enq... [truncated] 
@jplehr (Contributor) left a comment:

Apologies if these questions were already discussed in a meeting or so.

@jdoerfert (Member) left a comment:

I left a bunch of comments in addition to what we discussed earlier.
There were also some copy-n-paste errors in the td files.

@callumfare force-pushed the offload_new_api_continued branch from b468b34 to 44122e1 on February 7, 2025 13:56
@jhuber6 (Contributor) commented:

What is the plan for testing things on the GPU? I'm thinking it would be best to store some C/C++ code that tests the desired functionality, build it separately under a specific name, then mark it as a dependency of the unit test target and just open it as a file. I.e.

#include <gpuintrin.h>

extern "C" __gpu_kernel void kernel() { some_test(); }

Another thing we need to fix in the libomptarget runtime is that the images are assumed to live throughout the entire execution; we really need to copy them internally and maintain their lifetime ourselves, now that they are going to come from external users instead of a constant pointer in an ELF.

}

ol_impl_result_t olCreateKernel_impl(ol_program_handle_t Program,
const char *KernelName,
Member:

Internally, we likely should use StringRefs.

Contributor (Author):

The interfaces of the implementation functions are identical to the actual entry points, so the string types are kept as const char* for consistency. The tablegen backend could convert them to StringRef though if we wanted. Elsewhere we should be consistently using StringRef.


auto Err = KernelImpl->init(Device, *Program->Image);
if (Err)
return {OL_ERRC_UNKNOWN, "Could not initialize the kernel"};
Member:

I guess later we want to forward Err one way or another.

Comment on lines 1092 to 923
if (NULL == ArgumentsData) {
return OL_ERRC_INVALID_NULL_POINTER;
}
Contributor:

Suggested change:

    if (NULL == ArgumentsData) {
      return OL_ERRC_INVALID_NULL_POINTER;
    }

becomes

    if (!ArgumentsData)
      return OL_ERRC_INVALID_NULL_POINTER;
Contributor (Author):

These checks appear in the auto-generated documentation, where I think it's better to have it explicitly spelled out as NULL == x

@callumfare (Contributor, Author) commented:

I believe I've addressed most of the comments now, so I've closed them; the remaining few I'll look at tomorrow. Let me know if I've missed anything major. Since there are no plugin changes required now, I've brought across the changes from the other draft PR and squashed everything together.

The only thing I haven't addressed is the indirection of the handle types (e.g. ol_device_impl_t vs GenericDeviceTy) due to a few issues:

  • The ol_device_impl_t would have to be 1:1 with GenericDeviceTy, and therefore the host device would need to be represented with the actual device from the host plugin, rather than the current placeholder value. There are actually 4 hard-coded devices in the plugin, but even ignoring that it also means that LIBOMPTARGET_PLUGINS_TO_BUILD must contain host or olMemcpy won't work at all. I don't know if this requirement would break CI or other important configurations.

  • Queues and events are represented as void* and require additionally storing the device to be usable. The plugins would ideally have a GenericQueueTy and GenericEventTy to eliminate the indirection required here.

  • Reference counts would also have to be done differently, probably as a map of handle pointers to their reference count. There would obviously be a bit of a performance penalty from this but I wouldn't expect it to be unreasonable.

Given the limitations I think it would be better to keep the handle types for now and work towards eliminating them in the future, with the necessary plugin changes.

@@ -104,3 +104,15 @@ def : Function {
Return<"OL_ERRC_INVALID_DEVICE">
];
}

def : Function {
let name = "olGetHostDevice";
Contributor:

I still feel like this should be a kind when traversing the device list.

Contributor (Author):

This is doable but requires one of:

  • Having the host device appear in each platform (the actual handle would be identical)
  • Having a separate host platform with just the host device

I'm not sure which would be preferable. It's a bit clunky either way so having this function could be nice as a shortcut, although I don't know if we want to pollute the API with loads of helper functions.

Contributor:

In my mind, the platform is an implementation detail of the device. Something you can query, but the API just iterates over devices.

Contributor (Author):

After our discussion in the last call I think we were leaning to keeping the idea of a user-visible platform object rather than just a string or enum property of a device.

There are two ways I can think of doing this:

  • Keep the API as is. A user explicitly iterates over platforms, and then fetches devices from those platforms. The platform is a handle that can be queried for information like vendor name etc. Users that don't care about the platforms can just iterate over all of them to get all devices. We would need to add a special host platform to contain the host device.

  • Make devices the top-level object that the user interacts with. This essentially means that (as in libomptarget) the user iterates over a flat list of devices. There would still be a platform object that the user can query for from a device (using olGetDeviceInfo). This platform handle can then be queried for specific properties. We would need to add a special host platform to contain the host device, but it wouldn't really be user visible unless they go querying for it.

I slightly prefer the first option but I'm happy either way, it's just a small mechanical change in how the API is used to get to the list of available devices.

Is one of these options more preferable? Or is there a better way to go about this?

Contributor:

I prefer the second, since I think we should really think of this as abstracting over devices, not vendors. However, it definitely makes some things more difficult, because then we need to imagine how to handle things that are incompatible between devices. For example, you can't make an async object for an AMD kernel and then try to schedule an NVIDIA kernel on it.

A possible solution is to allow iterating devices with a filter on it, that's kind of what HSA does. You iterate all the devices and check if it's a GPU or a CPU or whatever.

Contributor:

Yeah, I'm not a fan of it. I like how HSA does it with a callback better. OpenMP mostly uses it by more or less getting all devices that match an ELF or LLVM IR file.

Contributor (Author):

Alternatively we could keep the filter callbacks but split the logic back out to olGetPlatforms and olGetDevices

ContributorAuthor

Would HSA-like olIteratePlatforms and olIterateDevices be better? The latter taking a platform argument. We could still lazily initialize platforms so just iterating over them doesn't init all the plugins and load dependent runtimes.

Contributor

Yeah, probably fine.

ContributorAuthor

I've pushed another update based on our last discussion. There's a single HSA style olDeviceIterate function and platforms (if needed) can be queried from the device handle. I've also finally moved the host device into the list of regular devices and removed olGetHostDevice.
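For illustration, the HSA-style iteration this thread converged on could look like the following mock. The handle layout, device list, and callback signature here are all invented for the sketch; only the olIterateDevices name comes from the discussion above.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical handle type; the real layout in the PR is opaque.
struct ol_device_impl {
  int Id;
  const char *PlatformName;
};
using ol_device_handle_t = ol_device_impl *;

// HSA-style iteration: the callback returns false to stop early.
using ol_device_iterate_cb_t =
    std::function<bool(ol_device_handle_t Device, void *UserData)>;

// Invented device list; includes the host device, as in the final design.
static std::vector<ol_device_impl> Devices = {
    {0, "Host"}, {1, "AMDGPU"}, {2, "CUDA"}};

// Sketch of olIterateDevices: visits every device, host included.
void olIterateDevices(const ol_device_iterate_cb_t &Cb, void *UserData) {
  for (auto &D : Devices)
    if (!Cb(&D, UserData))
      return;
}
```

A user who only wants GPU devices would filter inside the callback, which is the pattern the HSA comparison above alludes to.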

def : Struct {
let name = "ol_kernel_launch_size_args_t";
let desc = "Size-related arguments for a kernel launch.";
let members = [
Contributor

How does this handle dynamic shared memory?

ContributorAuthor

I've added dynamic shared memory to this struct now. I left it out originally because I'm not sure of a good way to test it in a device agnostic way. Is there something in gpuintrin.h that could help?
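As a sketch, a C projection of this struct with the dynamic shared memory member might look like the following. The field names are assumptions for illustration, not the PR's final layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C projection of ol_kernel_launch_size_args_t. The exact
 * member names and ordering in the PR may differ. */
typedef struct ol_kernel_launch_size_args_t {
  size_t Dimensions;      /* number of work dimensions (1-3) */
  size_t NumGroupsX;      /* grid size, i.e. number of work-groups */
  size_t NumGroupsY;
  size_t NumGroupsZ;
  size_t GroupSizeX;      /* work-group (thread-block) size */
  size_t GroupSizeY;
  size_t GroupSizeZ;
  size_t DynSharedMemory; /* dynamic shared-memory bytes, added per review */
} ol_kernel_launch_size_args_t;
```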

let name = "ol_alloc_type_t";
let desc = "Represents the type of allocation made with olMemAlloc.";
let etors = [
Etor<"HOST", "Host allocation">,
Contributor

These names suck, they roughly equate to CUDA's host, managed, and device memory. Honestly we should take this opportunity to change them to something more understandable.

In this context, I believe host is 'pinned' memory that always resides on the host, managed is memory that can migrate in the unified memory context, and device is memory that only exists on the GPU. HSA has coarse-grained and fine-grained: coarse-grained is only accessible to one 'agent' (i.e. GPU), while fine-grained is likely pinned. They also have their SVM API, which I believe is closer to managed. Naming things is hard, unfortunately.

ContributorAuthor

I don't disagree but I'm struggling to come up with better names. Maybe we can discuss this in the next call and come to a decision.

Contributor

I'd probably just go with pinned, migratable, and device if nothing else. But not going to bikeshed here.


I don't think it should be called pinned. The memory is allocated on the host, is accessible by the device, and does not migrate to the device, i.e. it stays on the host. In non-offload terms, pinned could mean pinned on the host and not paged out.
Also, the "migratable" memory is accessible by both host and device, and the implementation may not actually migrate it but instead allocate it in a location where both host and device can access it.

let name = "olMemFree";
let desc = "Frees a memory allocation previously made by olMemAlloc.";
let params = [
Param<"ol_device_handle_t", "Device", "handle of the device to allocate on", PARAM_IN>,
Contributor

This having a device parameter is confounding. If we allocate pinned memory, who owns it? The CPU, the GPU? Both? This is likely something that could be managed internally, but I'm a little wary of putting more pointer trees in runtimes.

ContributorAuthor

callumfare commented Mar 7, 2025

Yeah that's a good point. We do need some way of dispatching to the correct plugin and like you said it's probably not ideal to track this for all pointers.

The UR equivalent (urUSMFree) takes a context parameter, where the context is a group of one or more devices from the same platform. Host allocations belong to a context rather than any specific device.

I guess the closest we have right now is the platform. The user will always know the platform of the device that did the allocation and can keep a hold of it for the olMemFree call. But by extension that means they could also just track the device itself anyway.

I think in terms of ownership it always makes sense for the GPU to own the allocation, because the pinned memory is always managed by a specific device driver and (presumably) won't be compatible between GPUs from different plugins. Likewise, shared memory can't migrate between arbitrary devices.

The way olMemAlloc is designed, I'm not sure it's meaningful to do, for example, a OL_ALLOC_TYPE_SHARED allocation on the special host device. I'm not sure how to reconcile these different kinds and the existence of the host device as a valid allocation target.

Maybe we need to add the concept of a context, which contains one or more devices and has ownership of the allocations, and are restricted to containing the host device and one or more regular devices from a single plugin/platform. The host device could be implicit. Shared memory should always be migratable between the devices, pinned memory should be accessible on every device. Memcpy could take contexts parameters instead of device parameters, and it would still be able to handle copies across different contexts to support the use case of copies between different GPUs etc.

Fundamentally, we have to track additional information that the underlying APIs don't need to worry about: which platform is managing the memory.

It might be possible to allow contexts to contain any devices but you'd need to be clever with handling pinned memory, and you have problems like device pointers maybe not being unique. I don't know if that would be worth the effort or even solvable.

I'm about to be away for a couple of weeks, but I think this direction is worth discussing, even if we kick the can down the road a bit and leave it for a future PR.
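To make the proposal concrete, here is a small self-contained mock of such a context. Nothing in this sketch exists in the PR; the names and behavior are purely illustrative of the idea that a context groups devices from one platform and owns every allocation made through it.

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>

// Invented stand-in for a device handle, tagged with its platform.
struct MockDevice {
  int PlatformId;
};

// Mock of the *proposed* context: owns allocations, restricted to one
// platform (the host device would be implicitly a member).
class MockContext {
  int PlatformId;
  std::unordered_map<void *, size_t> Allocations; // ptr -> size
public:
  explicit MockContext(int Platform) : PlatformId(Platform) {}

  // A device may only join a context from its own platform.
  bool accepts(const MockDevice &D) const {
    return D.PlatformId == PlatformId;
  }

  void *alloc(size_t Size) {
    void *P = std::malloc(Size);
    Allocations[P] = Size;
    return P;
  }

  // Free dispatches purely on context ownership; no device arg needed.
  bool free(void *P) {
    auto It = Allocations.find(P);
    if (It == Allocations.end())
      return false; // not owned by this context
    std::free(P);
    Allocations.erase(It);
    return true;
  }
};
```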

Contributor

I'm not an expert, but I think in theory you could have HMM configured for the same memory region between separate GPUs. The HSA API thinks about things in terms of memory pools and keeps some RB trees to see if a pointer is resident in some allocated range, IIUC. The free function just takes a pointer and then determines the memory pool / device that owns it. They also have the concept of 'setting ownership', i.e. when you allocate any memory you need to explicitly enable access for the device you want to touch it from.

I think in general we shouldn't force the device here since it'd get really complicated. We could pass the type like the CUDA APIs do if we really needed to, but ideally it'd just be a pointer.

ContributorAuthor

I've removed the Type and Device args from olMemFree, allocation info is now just tracked with a DenseMap. There's maybe a better way for the plugins to manage it themselves but this seems like an ok first step.
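A rough stand-in for that tracking approach, using std::unordered_map plus a mutex in place of llvm::DenseMap. The AllocKind names and function signatures are invented; this only illustrates the shape of the bookkeeping, including the locking cost raised later in this thread.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Invented allocation-kind tag; the real code tracks equivalent info.
enum class AllocKind { Host, Device, Shared };

static std::mutex AllocInfoLock;
static std::unordered_map<void *, AllocKind> AllocInfo;

void trackAlloc(void *Ptr, AllocKind Kind) {
  std::lock_guard<std::mutex> G(AllocInfoLock);
  AllocInfo[Ptr] = Kind;
}

// olMemFree-style lookup: recover the kind from the pointer alone so the
// call can be routed to the right plugin path. Returns false if untracked.
bool lookupAndErase(void *Ptr, AllocKind &KindOut) {
  std::lock_guard<std::mutex> G(AllocInfoLock);
  auto It = AllocInfo.find(Ptr);
  if (It == AllocInfo.end())
    return false;
  KindOut = It->second;
  AllocInfo.erase(It);
  return true;
}
```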


> There's maybe a better way for the plugins to manage it themselves but this seems like an ok first step.

But, without any sort of handle in the free function, you won't know how to dispatch the call to a plugin. In UR, every function has a handle pointer precisely so that it's always possible to route the call to the appropriate underlying adapter/plugin function.

To me this looks like we are trading off performance and memory (maintaining a DenseMap of all allocations won't be cheap and will likely require a mutex) against... convenience of using this API in SYCL/OpenMP implementations?

I don't think that's the right trade-off. If using a device here would be complicated, maybe we should require a platform handle?
The abstraction used across SYCL, OpenCL and Intel Level Zero for this purpose is context.

Contributor

A platform handle would be reasonable, but you'd still need to track what kind of allocation it is unless we want to push that onto the user. HSA handles that implicitly; CUDA forces you to remember if it's managed, host, or device.

ContributorAuthor

callumfare commented Apr 14, 2025

With CUDA you can call cuPointerGetAttribute to determine whether to call cuMemFree or cuMemFreeHost. That's how it's implemented in Unified Runtime.
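For reference, a simplified sketch of that dispatch using the real CUDA driver API. Error handling is elided, and Unified Runtime's actual implementation also consults CU_POINTER_ATTRIBUTE_IS_MANAGED; this only shows the basic shape.

```cuda
#include <cuda.h>

// Query the memory type of the pointer, then pick the matching free call.
CUresult freeByPointerKind(void *Ptr) {
  unsigned MemType = 0; // holds a CUmemorytype value
  CUresult Res = cuPointerGetAttribute(
      &MemType, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr)Ptr);
  if (Res != CUDA_SUCCESS)
    return Res;
  if (MemType == CU_MEMORYTYPE_HOST)
    return cuMemFreeHost(Ptr);        // pinned host allocation
  return cuMemFree((CUdeviceptr)Ptr); // device or managed allocation
}
```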

ContributorAuthor

I could reinstate the device and kind parameters to remove the map but anything else will require a plugin change:

  • The MemoryManager currently belongs to the device, not the platform
  • dataDelete would need to be changed or an overload added to remove the TargetAllocTy argument and let the plugins deduce it if necessary

I'd rather avoid plugin changes in this PR and just expose the existing functionality, to avoid dragging out the review process any longer. I think we're all on the same page that the API is a work in progress so we can make breaking changes to this function after this PR.

From the Codeplay/Intel side of things, probably our highest priority TODO after this is investigating any gaps between memory capabilities in UR and offload, including whether the equivalent of UR's contexts is needed, which ties into all of this.

@callumfare force-pushed the offload_new_api_continued branch 2 times, most recently from f007522 to 5ff3f69 on April 1, 2025 12:52
@jhuber6
Copy link
Contributor

Do we have an interface to iterate symbols? Or do we need to just know the name beforehand.

@callumfare
Copy link
ContributorAuthor

> Do we have an interface to iterate symbols? Or do we need to just know the name beforehand.

There's the generated OffloadFuncs.inc file that provides a macro interface to each of the entry points. It's not actually used for anything just now but I figured it would be useful eventually.

@callumfare force-pushed the offload_new_api_continued branch from 5ff3f69 to a8a901f on April 15, 2025 14:57
Contributor

@jhuber6 left a comment

I think this is fine for now, we can hopefully do some follow-ups in a patch that's less massive. Thanks for sticking with this, at some point I'll need to get some hands-on experience with this so I can make some better judgements. I'm hoping to rewrite the libc loader in terms of this API soonish.

@callumfare
Copy link
ContributorAuthor

@jhuber6 Thanks! I've pushed some small fixes. Would you mind landing this PR when you get a chance? I don't have write access. No new libraries or lit tests this time so hopefully won't be as much trouble as the last one.

@jhuber6
Copy link
Contributor

> @jhuber6 Thanks! I've pushed some small fixes. Would you mind landing this PR when you get a chance? I don't have write access. No new libraries or lit tests this time so hopefully won't be as much trouble as the last one.

I'll probably want to try testing it locally first. How do you test it?

@callumfare
Copy link
ContributorAuthor

callumfare commented Apr 21, 2025

> I'll probably want to try testing it locally first. How do you test it?

From the runtime build dir:

ninja LibomptUnitTests
./offload/unittests/OffloadAPI/offload.unittests --platform=AMDGPU

You can set OFFLOAD_TRACE=1 to see the API calls.

@jhuber6
Copy link
Contributor

Worked when I tested it. But in the future there needs to be a target like ninja check-offload-unit that runs the tests and auto-detects the GPUs on the system.

@jhuber6 merged commit 800d949 into llvm:main on Apr 22, 2025
9 checks passed
7 participants: @callumfare, @llvmbot, @jhuber6, @jdoerfert, @jplehr, @pbalcer, @RaviNarayanaswamy