hpc
Experimental
This skill is experimental. Recipes and structure may change.
Context skill for high-performance and distributed computing in Python and C++.
Requirements
- C++17 compiler (GCC 10+, Clang 14+, or MSVC 19.30+)
- CUDA Toolkit (optional, for GPU compute recipes)
- Python 3.11+
- uv (Python package and project manager)
- MPI implementation (optional, for cluster recipes: OpenMPI or MPICH)
Philosophy
Performance work is measurement-driven. Every recipe follows the same pattern: establish a baseline, apply the technique, measure the improvement. Techniques that can't be measured shouldn't be applied. Recipes cover threading, SIMD, GPU compute, MPI clustering, and distributed Python with Ray — from laptop to multi-node cluster.
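The baseline, apply, measure pattern can be sketched in a few lines of Python. This is a minimal illustration, not code taken from any recipe: the helper names (`measure`, `compare`) and the sum-of-squares workload are assumptions chosen for the example. The candidate is not assumed to be faster; the measurement decides.

```python
import statistics
import time


def measure(fn, *args, repeats=5):
    """Return the median wall-clock time in seconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def baseline_sum_squares(n):
    """Baseline (hypothetical workload): explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def candidate_sum_squares(n):
    """Candidate technique (hypothetical): built-in sum over a generator."""
    return sum(i * i for i in range(n))


def compare(baseline, candidate, *args, repeats=5):
    """Time both versions and return (t_baseline, t_candidate, speedup)."""
    # Check correctness first: a faster wrong answer is not an improvement.
    if baseline(*args) != candidate(*args):
        raise ValueError("candidate result differs from baseline")
    t_base = measure(baseline, *args, repeats=repeats)
    t_cand = measure(candidate, *args, repeats=repeats)
    # Guard against a zero reading on very small workloads.
    speedup = t_base / t_cand if t_cand > 0 else float("inf")
    return t_base, t_cand, speedup


if __name__ == "__main__":
    t_base, t_cand, speedup = compare(
        baseline_sum_squares, candidate_sum_squares, 200_000
    )
    print(f"baseline {t_base:.4f}s, candidate {t_cand:.4f}s, speedup {speedup:.2f}x")
```

The median is used rather than the mean so that a single noisy run does not skew the result. A speedup below 1.0 means the candidate should be dropped.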
Recipes
- C++ Threading Primitives — std::thread, std::async, std::jthread (C++20); thread-safe patterns
- OpenMP — pragma-based loop/section parallelism; reduction patterns
- Intel TBB — task graphs, parallel_for, pipeline, flow graph
- SIMD with AVX2/NEON — intrinsics, auto-vectorization, measurement
- GPU Compute with CUDA — kernels, memory hierarchy, streams, Nsight
- GPU Compute with SYCL/oneAPI — write-once GPU code for Intel/NVIDIA/AMD
- MPI Fundamentals — point-to-point, collectives, MPI+OpenMP hybrid
- Python Parallelism — multiprocessing, threading, asyncio; GIL
- Distributed Compute with Ray — tasks, actors, placement groups, clusters
- HPC Project Conventions — CMake setup, profiling tools, benchmarking
- Thrust: GPU Parallel Algorithms — STL-style sort/reduce/scan/transform on GPU, device_vector, fancy iterators, stream-async
- Kokkos: Performance Portability — write-once C++ for CUDA/HIP/OpenMP/SYCL, Views, parallel_for/reduce/scan, thread teams
- NVIDIA Warp (Python GPU) — Python kernels compiled to PTX, tiles, streams, autodiff, simulation loops