GPU Compute with SYCL/oneAPI
Why SYCL
CUDA only targets NVIDIA GPUs. SYCL is an open standard (Khronos Group) that targets Intel GPUs, NVIDIA GPUs (via CUDA backend), AMD GPUs (via ROCm/HIP backend), and even FPGAs — all from a single C++ codebase. Intel's oneAPI implementation (DPC++) is the most mature SYCL compiler, but AdaptiveCpp (formerly hipSYCL) and ComputeCpp also exist.
The tradeoff: SYCL is marginally more verbose than CUDA, and the multi-backend story works better for standard operations than for advanced features like warp-level primitives. NVIDIA-specific optimizations (tensor cores, cooperative groups) are easier to reach through CUDA. But if your organization targets multiple GPU vendors — or may need to in the future — SYCL avoids vendor lock-in.
SYCL Concepts
Queue
A sycl::queue is the connection between your code and a device. Every kernel submission and memory operation goes through a queue:
#include <sycl/sycl.hpp>
// select the default device (usually a GPU if available)
sycl::queue q;
// explicitly select a GPU
sycl::queue q_gpu{sycl::gpu_selector_v};
// select CPU (useful for debugging)
sycl::queue q_cpu{sycl::cpu_selector_v};
std::cout << "Running on: "
<< q.get_device().get_info<sycl::info::device::name>() << "\n";Buffers and Accessors
SYCL's buffer/accessor model manages data dependencies automatically. A sycl::buffer owns data; a sycl::accessor provides typed access within a kernel and declares the access mode (read, write, read_write):
std::vector<float> host_data(1024, 1.0f);
{
sycl::buffer<float, 1> buf(host_data.data(), sycl::range<1>(1024));
q.submit([&](sycl::handler& h) {
auto acc = buf.get_access<sycl::access::mode::read_write>(h);
h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) {
acc[idx] *= 2.0f;
});
});
} // buffer destructor blocks until kernel completes and copies data back
// host_data is now updatedBuffers are safer than raw pointers because the runtime tracks data dependencies between kernels. If kernel B reads a buffer that kernel A writes, the runtime automatically inserts a dependency. No manual synchronization needed.
Events
sycl::event provides explicit synchronization when you need it:
sycl::event e = q.submit([&](sycl::handler& h) {
// kernel work
});
e.wait(); // block until this specific submission completes
// or chain dependencies
q.submit([&](sycl::handler& h) {
h.depends_on(e); // wait for previous kernel
// next kernel
});Writing a SYCL Kernel
Element-wise multiply — the same operation as the CUDA recipe for comparison:
#include <sycl/sycl.hpp>
#include <vector>
int main() {
const int n = 1 << 20;
std::vector<float> a(n, 2.0f), b(n, 3.0f), c(n);
sycl::queue q{sycl::gpu_selector_v};
{
sycl::buffer buf_a(a.data(), sycl::range<1>(n));
sycl::buffer buf_b(b.data(), sycl::range<1>(n));
sycl::buffer buf_c(c.data(), sycl::range<1>(n));
q.submit([&](sycl::handler& h) {
auto acc_a = buf_a.get_access<sycl::access::mode::read>(h);
auto acc_b = buf_b.get_access<sycl::access::mode::read>(h);
auto acc_c = buf_c.get_access<sycl::access::mode::write>(h);
h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
acc_c[idx] = acc_a[idx] * acc_b[idx];
});
});
} // buffers sync back to host here
// c now contains element-wise products
}Compared to the CUDA version: no cudaMalloc/cudaMemcpy, no manual grid/block calculation. The SYCL runtime chooses the work-group size. You can override it with sycl::nd_range when you need explicit control:
h.parallel_for(
sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(256)),
[=](sycl::nd_item<1> item) {
int idx = item.get_global_id(0);
if (idx < n) acc_c[idx] = acc_a[idx] * acc_b[idx];
});Unified Shared Memory (USM)
USM provides pointer-based programming closer to CUDA's memory model. Three allocation types:
// Device allocation — accessible only on the device
float* d_ptr = sycl::malloc_device<float>(n, q);
// Shared allocation — accessible from both host and device, migrated automatically
float* s_ptr = sycl::malloc_shared<float>(n, q);
// Host allocation — on host but accessible from device (slow device access)
float* h_ptr = sycl::malloc_host<float>(n, q);With USM, kernels use raw pointers instead of accessors:
float* a = sycl::malloc_shared<float>(n, q);
float* b = sycl::malloc_shared<float>(n, q);
float* c = sycl::malloc_shared<float>(n, q);
// initialize on host
for (int i = 0; i < n; i++) { a[i] = 2.0f; b[i] = 3.0f; }
q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
c[idx] = a[idx] * b[idx];
}).wait();
sycl::free(a, q);
sycl::free(b, q);
sycl::free(c, q);When to use USM vs buffers:
- USM when porting CUDA code (similar pointer semantics), when you need fine-grained control over data placement, or when the buffer/accessor verbosity is a barrier.
- Buffers when you want the runtime to manage dependencies automatically, when correctness is more important than squeezing out the last bit of control.
oneAPI DPC++ and icpx
Intel's oneAPI DPC++ compiler (icpx) is the primary SYCL implementation. Install via the oneAPI Base Toolkit:
# Intel APT repository (Ubuntu)
sudo apt install intel-oneapi-compiler-dpcpp-cpp
# set up environment
source /opt/intel/oneapi/setvars.sh
# compile
icpx -fsycl main.cpp -o mainCMake integration:
cmake_minimum_required(VERSION 3.20)
project(myproject LANGUAGES CXX)
find_package(IntelSYCL REQUIRED)
# or use the compiler directly:
# set(CMAKE_CXX_COMPILER icpx)
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl")
add_executable(myapp main.cpp)
add_sycl_to_target(TARGET myapp SOURCES main.cpp)Backend Portability
The promise of SYCL: one source, multiple backends.
Intel GPUs (default):
icpx -fsycl main.cpp -o mainNVIDIA GPUs:
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda main.cpp -o mainAMD GPUs (via AdaptiveCpp):
acpp --acpp-targets="hip:gfx90a" main.cpp -o mainIn practice, basic operations (parallel_for, reductions, memory management) work across all backends. Advanced features (sub-groups, specific memory hierarchies) may behave differently or have performance variations across backends. Always profile on the target hardware.
oneMKL
Intel's oneAPI Math Kernel Library provides SYCL interfaces for BLAS, LAPACK, FFT, RNG, and sparse linear algebra. It works across backends:
#include <oneapi/mkl.hpp>
sycl::queue q{sycl::gpu_selector_v};
// GEMM: C = alpha * A * B + beta * C
oneapi::mkl::blas::column_major::gemm(
q,
oneapi::mkl::transpose::nontrans,
oneapi::mkl::transpose::nontrans,
m, n, k,
1.0f, // alpha
d_A, lda,
d_B, ldb,
0.0f, // beta
d_C, ldc
).wait();oneMKL dispatches to the optimal backend library automatically — MKL on Intel CPUs, cuBLAS on NVIDIA GPUs, rocBLAS on AMD GPUs. This is the strongest argument for SYCL in scientific computing: write the linear algebra once, and it runs optimally everywhere.
When to Choose SYCL Over CUDA
Choose SYCL when:
- Your deployment targets multiple GPU vendors (Intel + NVIDIA, or NVIDIA + AMD)
- Your organization is standardizing on Intel's oneAPI ecosystem
- Long-term portability matters more than squeezing the last 5% of NVIDIA-specific performance
- You want a single C++ codebase without
#ifdefblocks for each GPU vendor
Choose CUDA when:
- You only target NVIDIA GPUs and always will
- You need NVIDIA-specific features (tensor cores, NCCL, cuDNN)
- Your team already has deep CUDA expertise
- You need the richest ecosystem of third-party libraries and community examples
The pragmatic middle ground: write performance-critical kernels in CUDA if you need them, and use SYCL for the rest of the application. Or use vendor math libraries (oneMKL, cuBLAS) through SYCL's dispatch layer and avoid writing kernels at all.