{"id":479431,"date":"2026-05-12T10:15:02","date_gmt":"2026-05-12T10:15:02","guid":{"rendered":"https:\/\/savepearlharbor.com\/?p=479431"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=479431","title":{"rendered":"cuda-core 1.0: writing CUDA kernels in Python without C++ (well, almost)"},"content":{"rendered":"<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<p>On May 11, 2026, NVIDIA released <a href=\"https:\/\/github.com\/NVIDIA\/cuda-python\/releases\/tag\/cuda-core-v1.0.0\" rel=\"noopener noreferrer nofollow\">cuda-core v1.0.0<\/a>, the first stable release of a library that gives Python developers direct access to the CUDA Runtime without heavyweight C++ bindings.<\/p>\n<p>We took three GPUs (RTX 4090, RTX 3090, A100 80 GB) and tested the library on each.<\/p>\n<p>cuda-core is a Pythonic wrapper over the <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-runtime-api\/\" rel=\"noopener noreferrer nofollow\">CUDA 
Runtime<\/a>. It fills the niche previously occupied by <code>pycuda<\/code> or hand-written calls through <code>ctypes<\/code>: compiling kernels directly from Python, managing GPU memory, and launching kernels without C++ extensions. Version 1.0.0 freezes the public API, so the library can now be used as a production dependency.<\/p>\n<hr\/>\n<p>The CUDA Python ecosystem has two layers:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Layer<\/p>\n<\/th>\n<th>\n<p align=\"left\">Package<\/p>\n<\/th>\n<th>\n<p align=\"left\">What it does<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Low<\/p>\n<\/td>\n<td>\n<p align=\"left\"><code>cuda-bindings<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">1:1 mapping of the C API (cudaMemcpy and 
so on)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">High<\/p>\n<\/td>\n<td>\n<p align=\"left\"><code>cuda-core<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">Pythonic API: Device, Stream, Program, Buffer<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>Who benefits:<\/p>\n<ul>\n<li>\n<p>Researchers prototyping custom operations for model training<\/p>\n<\/li>\n<li>\n<p>Inference engineers optimizing hot paths for specific hardware<\/p>\n<\/li>\n<li>\n<p>Authors of ML libraries who need GPU access without a C++ build system<\/p>\n<\/li>\n<\/ul>\n<h3>What's new in v1.0.0<\/h3>\n<h4>Kernel compilation cache<\/h4>\n<p>Compiling a CUDA kernel through NVRTC takes 10\u201350 ms. 
Paid again for every kernel at each application start, this adds up to seconds. v1.0.0 adds two classes:<\/p>\n<ul>\n<li>\n<p><code>InMemoryProgramCache<\/code>: an in-process LRU cache that lives as long as the process<\/p>\n<\/li>\n<li>\n<p><code>FileStreamProgramCache<\/code>: a persistent on-disk cache that survives between runs<\/p>\n<\/li>\n<\/ul>\n<pre><code class=\"python\">from cuda.core import Device, Program, ProgramOptions, ObjectCode\nfrom cuda.core.utils import InMemoryProgramCache\n\nd = Device(0)\nd.set_current()\n\ncuda_src = r\"\"\"\nextern \"C\" __global__ void saxpy(\n    float alpha, const float* __restrict__ x,\n    const float* __restrict__ y, float* out, int n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx &lt; n) out[idx] = alpha * x[idx] + y[idx];\n}\n\"\"\"\n\narch = f'sm_{d.compute_capability.major}{d.compute_capability.minor}'\ncache = InMemoryProgramCache()\ncache_key = f'saxpy_{arch}'\n\n# First run: compile and store in the cache (~15 ms)\nif cache.get(cache_key) is None:\n    prog = Program(cuda_src, 'c++', options=ProgramOptions(arch=arch))\n    cache[cache_key] = prog.compile('cubin')\n\n# All later runs: 
near-instant (~0.05 ms, 300\u00d7 faster)\nkernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel('saxpy')<\/code><\/pre>\n<h4>Green Contexts: splitting a GPU between tasks<\/h4>\n<p>Green Contexts are a mechanism for dividing a GPU's SM resources between tasks without physical isolation. An RTX 4090 has 128 SMs; they can be carved into partitions so that tasks run in parallel, each with guaranteed resources.<\/p>\n<p>A typical scenario: two independent inference requests have to run on the same GPU 
without crowding each other out of the SMs. Without Green Contexts the tasks compete for the resource in an uncontrolled way; with them, each task gets its own dedicated slice.<\/p>\n<pre><code class=\"python\">import numpy as np\nimport ctypes\nimport threading\n\nfrom cuda.core import (\n    Device, SMResourceOptions, ContextOptions,\n    Program, ProgramOptions, LaunchConfig, launch, ObjectCode,\n    DeviceMemoryResource, DeviceMemoryResourceOptions\n)\nfrom cuda.core.utils import InMemoryProgramCache\nfrom cuda.bindings import runtime as cudart\n\nN = 5_000_000\n\nd = Device(0)\nd.set_current()\n\ntotal_sm = d.resources.sm.sm_count\nprint(f\"GPU: {d.name}, Total SMs: {total_sm}\")  # Total SMs: 128\n\n# Split the 128 SMs into two partitions of 64\nsm_opts = SMResourceOptions(count=[total_sm \/\/ 2, total_sm \/\/ 2])\ngroups, _ = d.resources.sm.split(sm_opts)\n\nctx_a = d.create_context(ContextOptions([groups[0]]))\nctx_b = d.create_context(ContextOptions([groups[1]]))\nprint(f\"Context A: {ctx_a.resources.sm.sm_count} SMs, is_green={ctx_a.is_green}\")\nprint(f\"Context B: {ctx_b.resources.sm.sm_count} SMs, is_green={ctx_b.is_green}\")\n# Context A: 64 SMs, is_green=True\n# Context B: 64 SMs, is_green=True\n\ncuda_src = r\"\"\"\nextern \"C\" __global__ void saxpy(\n    float alpha, const float* __restrict__ x,\n    const float* __restrict__ y, float* out, int n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx &lt; n) out[idx] = alpha * 
x[idx] + y[idx];\n}\n\"\"\"\n\narch = f\"sm_{d.compute_capability.major}{d.compute_capability.minor}\"\ncache = InMemoryProgramCache()\ncache_key = f\"saxpy_{arch}\"\nif cache.get(cache_key) is None:\n    prog = Program(cuda_src, \"c++\", options=ProgramOptions(arch=arch))\n    cache[cache_key] = prog.compile(\"cubin\")\n\n\ndef run_task(ctx, task_name, alpha, fill_x, fill_y):\n    d.set_current()  # each thread needs the primary context\n    # a stream created through a green context is bound to its SM partition\n    stream = ctx.create_stream()\n    mr = DeviceMemoryResource(0, DeviceMemoryResourceOptions())\n    kernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel(\"saxpy\")\n\n    size = N * 4  # float32\n    buf_x = mr.allocate(size, stream=stream)\n    buf_y = mr.allocate(size, stream=stream)\n    buf_o = mr.allocate(size, stream=stream)\n\n    x_h = np.full(N, fill_x, dtype=np.float32)\n    y_h = np.full(N, fill_y, dtype=np.float32)\n    cudart.cudaMemcpy(buf_x.handle, x_h.ctypes.data, size, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n    cudart.cudaMemcpy(buf_y.handle, y_h.ctypes.data, size, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n\n    cfg = LaunchConfig(grid=(N + 255) \/\/ 256, block=256)\n    launch(stream, cfg, kernel,\n           np.float32(alpha), buf_x.handle, buf_y.handle, buf_o.handle, ctypes.c_int(N))\n    stream.sync()\n\n    out_h = np.empty(N, dtype=np.float32)\n    cudart.cudaMemcpy(out_h.ctypes.data, buf_o.handle, size, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)\n    print(f\"  {task_name}: result={out_h[0]:.1f}\")\n\n    buf_x.close(); buf_y.close(); buf_o.close()\n    stream.close()\n\n\n# Launch two kernels 
in parallel, each on its own 64 SMs\nt_a = threading.Thread(target=run_task, args=(ctx_a, \"Task-A (64 SM)\", 3.0, 2.0, 1.0))\nt_b = threading.Thread(target=run_task, args=(ctx_b, \"Task-B (64 SM)\", 5.0, 4.0, 2.0))\nt_a.start(); t_b.start()\nt_a.join(); t_b.join()\n# Task-A (64 SM): result=7.0\n# Task-B (64 SM): result=22.0\n\nctx_a.close(); ctx_b.close()<\/code><\/pre>\n<p>Availability: Ada Lovelace (RTX 40xx, sm_89) and newer, CUDA 12.4+.<\/p>\n<h4>Extended NVML<\/h4>\n<p>The <code>system<\/code> module gained real-time GPU monitoring:<\/p>\n<pre><code class=\"python\">from cuda.core import system, Device\n\nprint(f\"Devices: {system.get_num_devices()}\")\n\nd = Device(0)\nd.set_current()\nsd = d.to_system_device()\n\nutil = sd.utilization\nmem = sd.memory_info\nprint(f\"{sd.name}: GPU {util.gpu}%, MEM {util.memory}%\")\n# Convert bytes to MB (1024**2 bytes each)\nprint(f\"Memory: {mem.used \/\/ 1024**2} MB \/ {mem.total \/\/ 1024**2} MB\")<\/code><\/pre>\n<p>Also added: MIG mode, NVLink (version, 
state), and a list of running compute processes with their memory usage.<\/p>\n<h4>Faster StridedMemoryView for PyTorch<\/h4>\n<p><code>StridedMemoryView<\/code> gained a fast path through AOT Inductor: views over PyTorch tensors are built 7\u201320\u00d7 faster, with no data copy.<\/p>\n<h3>Complete working example: SAXPY on the GPU<\/h3>\n<p>The full cycle: initialization \u2192 compilation \u2192 memory \u2192 launch \u2192 verification. Tested on an RTX 4090 with CUDA 13.0.<\/p>\n<pre><code class=\"python\">import numpy as np\nimport ctypes\n\nfrom cuda.core import (\n    Device, DeviceMemoryResource, DeviceMemoryResourceOptions,\n    Program, ProgramOptions, LaunchConfig, launch, ObjectCode\n)\nfrom cuda.core.utils import InMemoryProgramCache\nfrom cuda.bindings import runtime as cudart\n\nN = 10_000_000\nFLOAT_BYTES = 4\n\n# 1. 
Device and stream\nd = Device(0)\nd.set_current()\nstream = d.create_stream()\nprint(f\"GPU: {d.name}, SM: {d.properties.multiprocessor_count}\")\n\n# 2. Memory pool\nmr = DeviceMemoryResource(0, DeviceMemoryResourceOptions())\n\n# 3. Kernel: SAXPY = alpha * X + Y\ncuda_src = r\"\"\"\nextern \"C\" __global__ void saxpy(\n    float alpha, const float* __restrict__ x,\n    const float* __restrict__ y, float* out, int n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx &lt; n) out[idx] = alpha * x[idx] + y[idx];\n}\n\"\"\"\n\n# 4. Compilation with a cache\narch = f'sm_{d.compute_capability.major}{d.compute_capability.minor}'\ncache = InMemoryProgramCache()\ncache_key = f'saxpy_{arch}'\nif cache.get(cache_key) is None:\n    prog = Program(cuda_src, 'c++', options=ProgramOptions(arch=arch))\n    cache[cache_key] = prog.compile('cubin')\nkernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel('saxpy')\n\n# 5. GPU memory\nsize_bytes = N * FLOAT_BYTES\nbuf_x   = mr.allocate(size_bytes, stream=stream)\nbuf_y   = mr.allocate(size_bytes, stream=stream)\nbuf_out = mr.allocate(size_bytes, stream=stream)\n\n# 6. H\u2192D\nx_host = np.ones(N, dtype=np.float32) * 2.0\ny_host = np.ones(N, dtype=np.float32) * 1.0\nalpha  = np.float32(3.0)\ncudart.cudaMemcpy(buf_x.handle, x_host.ctypes.data, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\ncudart.cudaMemcpy(buf_y.handle, y_host.ctypes.data, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n\n# 7. Launch\ncfg = LaunchConfig(grid=(N + 255) \/\/ 256, block=256)\nlaunch(stream, cfg, kernel, alpha, buf_x.handle, buf_y.handle, buf_out.handle, ctypes.c_int(N))\n\n# 8. 
D\u2192H + verification\nstream.sync()\nout_host = np.empty(N, dtype=np.float32)\ncudart.cudaMemcpy(out_host.ctypes.data, buf_out.handle, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)\nassert np.allclose(out_host, 3.0 * x_host + y_host)\nprint(f\"SAXPY OK: {out_host[0]:.1f}\")  # 7.0\n\nbuf_x.close(); buf_y.close(); buf_out.close()\nstream.close()<\/code><\/pre>\n<p>Kernel time for 10M elements is ~0.13 ms, an effective bandwidth of ~920 GB\/s (91% of the RTX 4090 peak).<\/p>\n<p>Full code for the examples and benchmarks: <a href=\"https:\/\/github.com\/IntelionCloud\/research-examples\/cuda-core\/\" rel=\"noopener noreferrer nofollow\">github.com\/IntelionCloud\/research-examples\/cuda-core\/<\/a><\/p>\n<h3>Benchmark: memory bandwidth<\/h3>\n<p>Real SAXPY measurements on cuda-core v1.0.0 (100M float32 elements, 5 runs, with warmup):<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">GPU<\/p>\n<\/th>\n<th>\n<p 
align=\"left\">Theoretical BW<\/p>\n<\/th>\n<th>\n<p align=\"left\">Measured BW<\/p>\n<\/th>\n<th>\n<p align=\"left\">Efficiency<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><a href=\"https:\/\/gpuark.com\/ru\/gpu\/nvidia-a100-pcie-80-gb\/\" rel=\"noopener noreferrer nofollow\">A100 80GB PCIe<\/a><\/p>\n<\/td>\n<td>\n<p align=\"left\">1935 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">1584 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">82%<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><a href=\"https:\/\/gpuark.com\/ru\/gpu\/nvidia-geforce-rtx-4090\/\" rel=\"noopener noreferrer nofollow\">RTX 4090<\/a> (24 GB)<\/p>\n<\/td>\n<td>\n<p align=\"left\">1008 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">929 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">92%<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><a href=\"https:\/\/gpuark.com\/ru\/gpu\/nvidia-geforce-rtx-3090\/\" rel=\"noopener noreferrer nofollow\">RTX 3090<\/a><\/p>\n<\/td>\n<td>\n<p align=\"left\">936 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">848 GB\/s<\/p>\n<\/td>\n<td>\n<p align=\"left\">91%<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>All three GPUs used 100% of their SMs: 100M elements with block=256 gives 390,625 blocks, more than enough to load 108 SMs (A100), 128 SMs (RTX 4090), and 82 SMs (RTX 3090). 
The differing efficiency figures reflect memory architecture rather than SM utilization: GDDR6X (RTX 4090, 3090) is easily saturated by simple sequential streaming, while HBM2e (A100) is optimized for massively parallel access and matrix operations, which SAXPY does not exercise. (SAXPY stands for Single-precision A\u00b7X Plus Y, an operation from the Basic Linear Algebra Subprograms standard of the 1970s.) 
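<\/p>\n<p>To make the efficiency column easy to re-derive, here is a minimal sketch of the arithmetic. It assumes that SAXPY moves 12 bytes per element (two float32 reads and one float32 write) and that 1 GB = 10<sup>9<\/sup> bytes; the article does not state its traffic model, and the helper names <code>effective_bandwidth_gbs<\/code> and <code>efficiency_pct<\/code> are hypothetical, not part of cuda-core.<\/p>\n<pre><code class=\"python\"># Assumed traffic model: each SAXPY element reads x and y and writes out,\n# i.e. three float32 values (12 bytes) cross the memory bus per element.\nBYTES_PER_ELEMENT = 3 * 4\n\n\ndef effective_bandwidth_gbs(n_elements, kernel_ms):\n    '''Effective bandwidth in GB\/s, with 1 GB = 1e9 bytes.'''\n    return n_elements * BYTES_PER_ELEMENT \/ (kernel_ms * 1e-3) \/ 1e9\n\n\ndef efficiency_pct(measured_gbs, peak_gbs):\n    '''Measured bandwidth as a percentage of the theoretical peak.'''\n    return 100.0 * measured_gbs \/ peak_gbs\n\n\n# (theoretical peak, measured) in GB\/s, copied from the benchmark table\nresults = {\n    'A100 80GB PCIe': (1935, 1584),\n    'RTX 4090': (1008, 929),\n    'RTX 3090': (936, 848),\n}\nfor gpu, (peak, measured) in results.items():\n    print(f'{gpu}: {efficiency_pct(measured, peak):.0f}% of peak')\n\n# The 10M-element run reported at ~0.13 ms per kernel\nprint(f'{effective_bandwidth_gbs(10_000_000, 0.13):.0f} GB\/s')<\/code><\/pre>\n<p>Under that assumption the 10M-element run at ~0.13 ms works out to roughly 920 GB\/s, consistent with the figure quoted for the RTX 4090.<\/p>\n<p>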
On transformer attention and large matmuls the picture would be different.<\/p>\n<h3>cuda-core vs pycuda<\/h3>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Feature<\/p>\n<\/th>\n<th>\n<p align=\"left\">pycuda<\/p>\n<\/th>\n<th>\n<p align=\"left\">cuda-core v1.0.0<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Status<\/p>\n<\/td>\n<td>\n<p align=\"left\">Third-party, unofficial<\/p>\n<\/td>\n<td>\n<p align=\"left\">Official NVIDIA<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Stable API<\/p>\n<\/td>\n<td>\n<p align=\"left\">No<\/p>\n<\/td>\n<td>\n<p align=\"left\">Yes, since 1.0.0<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">CUDA 12\/13<\/p>\n<\/td>\n<td>\n<p align=\"left\">Partial support<\/p>\n<\/td>\n<td>\n<p align=\"left\">Full<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Compilation cache<\/p>\n<\/td>\n<td>\n<p align=\"left\">No<\/p>\n<\/td>\n<td>\n<p align=\"left\">InMemory + File<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Green Contexts \/ SM partition<\/p>\n<\/td>\n<td>\n<p align=\"left\">No<\/p>\n<\/td>\n<td>\n<p align=\"left\">Yes (sm_89+)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">PyTorch StridedMemoryView<\/p>\n<\/td>\n<td>\n<p 
align=\"left\">No<\/p>\n<\/td>\n<td>\n<p align=\"left\">7\u201320\u00d7 faster<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Python 3.12\/3.13<\/p>\n<\/td>\n<td>\n<p align=\"left\">Problematic<\/p>\n<\/td>\n<td>\n<p align=\"left\">Supported<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<h3>Limitations<\/h3>\n<ul>\n<li>\n<p>CUDA 12+ is required<\/p>\n<\/li>\n<li>\n<p>Green Contexts: Ada Lovelace (sm_89) and newer, CUDA 12.4+<\/p>\n<\/li>\n<li>\n<p>Process checkpointing: Linux only<\/p>\n<\/li>\n<\/ul>\n<h3>Summary<\/h3>\n<p>cuda-core v1.0.0 is the official, and now stable, Python interface to CUDA. The compilation cache removes cold-start latency, Green Contexts give deterministic GPU partitioning between tasks, and NVML provides monitoring from the same library. 
For anyone who has been writing custom kernels through pycuda or ctypes, it is time to migrate.<\/p>\n<pre><code class=\"bash\"># CUDA 12 (driver 525+)\npip install \"cuda-core[cu12]\"\n\n# CUDA 13 (driver 580+)\npip install \"cuda-core[cu13]\"<\/code><\/pre>\n<\/div>\n<p>Link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/1034172\/\">https:\/\/habr.com\/ru\/articles\/1034172\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>On May 11, 2026, NVIDIA released cuda-core v1.0.0, the first stable release of a library that gives Python developers direct access to the CUDA Runtime without heavyweight C++ bindings. We took three GPUs (RTX 4090, RTX 3090, A100 80 GB) and 
tested the library on each. cuda-core is a Pythonic wrapper over the CUDA Runtime. It fills the niche previously occupied by pycuda or hand-written calls through ctypes: compiling kernels directly from Python, managing GPU memory, and launching kernels without C++ extensions. 
Version 1.0.0 freezes the public API, so the library can now be used as a production dependency. The CUDA Python ecosystem has two layers: a low layer, cuda-bindings (1:1 mapping of the C API: cudaMemcpy and so on), and a high layer, cuda-core (Pythonic API: Device, Stream, Program, Buffer). Who benefits: researchers prototyping custom operations for model training; inference engineers optimizing hot paths for specific hardware; authors of ML libraries who need 
GPU-\u0434\u043e\u0441\u0442\u0443\u043f \u0431\u0435\u0437 C++ build system\u0427\u0442\u043e \u043d\u043e\u0432\u043e\u0433\u043e \u0432 v1.0.0\u041a\u044d\u0448 \u043a\u043e\u043c\u043f\u0438\u043b\u044f\u0446\u0438\u0438 \u044f\u0434\u0435\u0440\u041a\u043e\u043c\u043f\u0438\u043b\u044f\u0446\u0438\u044f CUDA-\u044f\u0434\u0440\u0430 \u0447\u0435\u0440\u0435\u0437 NVRTC \u0437\u0430\u043d\u0438\u043c\u0430\u0435\u0442 10\u201350 \u043c\u0441. \u041f\u0440\u0438 \u043a\u0430\u0436\u0434\u043e\u043c \u0441\u0442\u0430\u0440\u0442\u0435 \u043f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u044f \u044d\u0442\u043e \u0441\u043a\u043b\u0430\u0434\u044b\u0432\u0430\u0435\u0442\u0441\u044f \u0432 \u0441\u0435\u043a\u0443\u043d\u0434\u044b. v1.0.0 \u0434\u043e\u0431\u0430\u0432\u043b\u044f\u0435\u0442 \u0434\u0432\u0430 \u043a\u043b\u0430\u0441\u0441\u0430:InMemoryProgramCache \u2014 in-process LRU-\u043a\u044d\u0448 \u043d\u0430 \u0432\u0440\u0435\u043c\u044f \u0436\u0438\u0437\u043d\u0438 \u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0430FileStreamProgramCache \u2014 \u043f\u0435\u0440\u0441\u0438\u0441\u0442\u0435\u043d\u0442\u043d\u044b\u0439 \u043a\u044d\u0448 \u043d\u0430 \u0434\u0438\u0441\u043a\u0435 \u043c\u0435\u0436\u0434\u0443 \u0437\u0430\u043f\u0443\u0441\u043a\u0430\u043c\u0438from cuda.core import Device, Program, ProgramOptions, ObjectCodefrom cuda.core.utils import InMemoryProgramCached = Device(0)d.set_current()cuda_src = r&#187;&#187;&#187;extern &#171;C&#187; __global__ void saxpy(    float alpha, const float* __restrict__ x,    const float* __restrict__ y, float* out, int n) {    int idx = blockIdx.x * blockDim.x + threadIdx.x;    if (idx &lt; n) out[idx] = alpha * x[idx] + y[idx];}&#187;&#187;&#187;arch = f&#8217;sm_{d.compute_capability.major}{d.compute_capability.minor}&#8217;cache = InMemoryProgramCache()cache_key = f&#8217;saxpy_{arch}&#8217;# \u041f\u0435\u0440\u0432\u044b\u0439 \u0437\u0430\u043f\u0443\u0441\u043a \u2014 
\u043a\u043e\u043c\u043f\u0438\u043b\u0438\u0440\u0443\u0435\u043c \u0438 \u043a\u043b\u0430\u0434\u0451\u043c \u0432 \u043a\u044d\u0448 (~15 \u043c\u0441)if cache.get(cache_key) is None:    prog = Program(cuda_src, &#8216;c++&#8217;, options=ProgramOptions(arch=arch))    cache[cache_key] = prog.compile(&#8216;cubin&#8217;)# \u0412\u0441\u0435 \u043f\u043e\u0441\u043b\u0435\u0434\u0443\u044e\u0449\u0438\u0435 \u2014 \u043c\u0433\u043d\u043e\u0432\u0435\u043d\u043d\u043e (~0.05 \u043c\u0441, 300\u00d7 \u0431\u044b\u0441\u0442\u0440\u0435\u0435)kernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel(&#8216;saxpy&#8217;)Green Contexts: \u0434\u0435\u043b\u0438\u043c GPU \u043c\u0435\u0436\u0434\u0443 \u0437\u0430\u0434\u0430\u0447\u0430\u043c\u0438Green Contexts \u2014 \u043c\u0435\u0445\u0430\u043d\u0438\u0437\u043c \u0440\u0430\u0437\u0434\u0435\u043b\u0435\u043d\u0438\u044f SM-\u0440\u0435\u0441\u0443\u0440\u0441\u043e\u0432 GPU \u043c\u0435\u0436\u0434\u0443 \u0437\u0430\u0434\u0430\u0447\u0430\u043c\u0438 \u0431\u0435\u0437 \u0444\u0438\u0437\u0438\u0447\u0435\u0441\u043a\u043e\u0439 \u0438\u0437\u043e\u043b\u044f\u0446\u0438\u0438. 
The RTX 4090 has 128 SMs; they can be carved into partitions so that tasks run in parallel, each with guaranteed resources.<\/p>\n<p>A typical scenario: two independent inference requests must share one GPU without evicting each other from the SMs. Without Green Contexts, the tasks compete for that resource in an uncontrolled way. 
With them, each task gets its own dedicated slice.<\/p>\n<pre><code>import numpy as np\nimport ctypes\nimport threading\nfrom cuda.core import (\n    Device, SMResourceOptions, ContextOptions,\n    Program, ProgramOptions, LaunchConfig, launch, ObjectCode,\n    DeviceMemoryResource, DeviceMemoryResourceOptions)\nfrom cuda.core.utils import InMemoryProgramCache\nfrom cuda.bindings import runtime as cudart\n\nN = 5_000_000\nd = Device(0)\nd.set_current()\n\ntotal_sm = d.resources.sm.sm_count\nprint(f&quot;GPU: {d.name}, Total SMs: {total_sm}&quot;)  # Total SMs: 128\n\n# Split the 128 SMs into two partitions of 64\nsm_opts = SMResourceOptions(count=[total_sm \/\/ 2, total_sm \/\/ 2])\ngroups, _ = d.resources.sm.split(sm_opts)\nctx_a = d.create_context(ContextOptions([groups[0]]))\nctx_b = d.create_context(ContextOptions([groups[1]]))\nprint(f&quot;Context A: {ctx_a.resources.sm.sm_count} SMs, is_green={ctx_a.is_green}&quot;)\nprint(f&quot;Context B: {ctx_b.resources.sm.sm_count} SMs, is_green={ctx_b.is_green}&quot;)\n# Context A: 64 SMs, is_green=True\n# Context B: 64 SMs, is_green=True\n\ncuda_src = r&quot;&quot;&quot;\nextern &quot;C&quot; __global__ void saxpy(\n    float alpha, const float* __restrict__ x,\n    const float* __restrict__ y, float* out, int n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx &lt; n) out[idx] = alpha * x[idx] + y[idx];\n}\n&quot;&quot;&quot;\n\narch = f&quot;sm_{d.compute_capability.major}{d.compute_capability.minor}&quot;\ncache = InMemoryProgramCache()\ncache_key = f&quot;saxpy_{arch}&quot;\nif cache.get(cache_key) is None:\n    prog = Program(cuda_src, &quot;c++&quot;, options=ProgramOptions(arch=arch))\n    cache[cache_key] = prog.compile(&quot;cubin&quot;)\n\ndef run_task(ctx, task_name, alpha, fill_x, fill_y):\n    d.set_current()  # every thread needs the primary context\n    # a stream created through a green context is bound to its SM partition\n    stream = ctx.create_stream()\n    mr = DeviceMemoryResource(0, DeviceMemoryResourceOptions())\n    kernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel(&quot;saxpy&quot;)\n\n    size = N * 4  # float32\n    buf_x = mr.allocate(size, stream=stream)\n    buf_y = mr.allocate(size, stream=stream)\n    buf_o = mr.allocate(size, stream=stream)\n\n    x_h = np.full(N, fill_x, dtype=np.float32)\n    y_h = np.full(N, fill_y, dtype=np.float32)\n    cudart.cudaMemcpy(buf_x.handle, x_h.ctypes.data, size, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n    cudart.cudaMemcpy(buf_y.handle, y_h.ctypes.data, size, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n\n    cfg = LaunchConfig(grid=(N + 255) \/\/ 256, block=256)\n    launch(stream, cfg, kernel,\n           np.float32(alpha), buf_x.handle, buf_y.handle, buf_o.handle, ctypes.c_int(N))\n    stream.sync()\n\n    out_h = np.empty(N, dtype=np.float32)\n    cudart.cudaMemcpy(out_h.ctypes.data, buf_o.handle, size, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)\n    print(f&quot;  {task_name}: result={out_h[0]:.1f}&quot;)\n\n    buf_x.close(); buf_y.close(); buf_o.close()\n    stream.close()\n\n# Launch two kernels in parallel, each on its own 64 SMs\nt_a = threading.Thread(target=run_task, args=(ctx_a, &quot;Task-A (64 SM)&quot;, 3.0, 2.0, 1.0))\nt_b = threading.Thread(target=run_task, args=(ctx_b, &quot;Task-B (64 SM)&quot;, 5.0, 4.0, 2.0))\nt_a.start(); t_b.start()\nt_a.join(); t_b.join()\n# Task-A (64 SM): result=7.0\n# Task-B (64 SM): result=22.0\n\nctx_a.close(); ctx_b.close()\n<\/code><\/pre>\n<p>Availability: Ada Lovelace (RTX 40xx, sm_89) and newer, CUDA 12.4+.<\/p>\n<p>Extended NVML<\/p>\n<p>The <code>system<\/code> module gained real-time GPU monitoring:<\/p>\n<pre><code>from cuda.core import system, Device\n\nprint(f&quot;Devices: {system.get_num_devices()}&quot;)\nd = Device(0)\nd.set_current()\n\nsd = d.to_system_device()\nutil = sd.utilization\nmem  = sd.memory_info\nprint(f&quot;{sd.name}: GPU {util.gpu}%, MEM {util.memory}%&quot;)\n# Convert bytes to MB (1024**2 bytes per MB)\nprint(f&quot;Memory: {mem.used \/\/ 1024**2} MB \/ {mem.total \/\/ 1024**2} MB&quot;)\n<\/code><\/pre>\n<p>Also added: MIG mode, NVLink (version, state), and a list of running compute processes with their memory usage.<\/p>\n<p>Faster StridedMemoryView for PyTorch<\/p>\n<p>StridedMemoryView gained a fast path through AOT Inductor: building a view over a PyTorch tensor is 7\u201320\u00d7 faster, with no data copy.<\/p>\n<p>Full 
working example: SAXPY on the GPU<\/p>\n<p>The full cycle: initialization \u2192 compilation \u2192 memory \u2192 launch \u2192 verification. Tested on an RTX 4090 with CUDA 13.0.<\/p>\n<pre><code>import numpy as np\nimport ctypes\nfrom cuda.core import (\n    Device, DeviceMemoryResource, DeviceMemoryResourceOptions,\n    Program, ProgramOptions, LaunchConfig, launch, ObjectCode)\nfrom cuda.core.utils import InMemoryProgramCache\nfrom cuda.bindings import runtime as cudart\n\nN = 10_000_000\nFLOAT_BYTES = 4\n\n# 1. Device and stream\nd = Device(0)\nd.set_current()\nstream = d.create_stream()\nprint(f&quot;GPU: {d.name}, SM: {d.properties.multiprocessor_count}&quot;)\n\n# 2. Memory pool\nmr = DeviceMemoryResource(0, DeviceMemoryResourceOptions())\n\n# 3. Kernel: SAXPY = alpha * X + Y\ncuda_src = r&quot;&quot;&quot;\nextern &quot;C&quot; __global__ void saxpy(\n    float alpha, const float* __restrict__ x,\n    const float* __restrict__ y, float* out, int n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx &lt; n) out[idx] = alpha * x[idx] + y[idx];\n}\n&quot;&quot;&quot;\n\n# 4. Compilation with the cache\narch = f'sm_{d.compute_capability.major}{d.compute_capability.minor}'\ncache = InMemoryProgramCache()\ncache_key = f'saxpy_{arch}'\nif cache.get(cache_key) is None:\n    prog = Program(cuda_src, 'c++', options=ProgramOptions(arch=arch))\n    cache[cache_key] = prog.compile('cubin')\nkernel = ObjectCode.from_cubin(cache.get(cache_key)).get_kernel('saxpy')\n\n# 5. GPU memory\nsize_bytes = N * FLOAT_BYTES\nbuf_x   = mr.allocate(size_bytes, stream=stream)\nbuf_y   = mr.allocate(size_bytes, stream=stream)\nbuf_out = mr.allocate(size_bytes, stream=stream)\n\n# 6. H\u2192D copy\nx_host = np.ones(N, dtype=np.float32) * 2.0\ny_host = np.ones(N, dtype=np.float32) * 1.0\nalpha  = np.float32(3.0)\ncudart.cudaMemcpy(buf_x.handle, x_host.ctypes.data, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\ncudart.cudaMemcpy(buf_y.handle, y_host.ctypes.data, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)\n\n# 7. Launch\ncfg = LaunchConfig(grid=(N + 255) \/\/ 256, block=256)\nlaunch(stream, cfg, kernel, alpha, buf_x.handle, buf_y.handle, buf_out.handle, ctypes.c_int(N))\n\n# 8. D\u2192H copy + verification\nstream.sync()\nout_host = np.empty(N, dtype=np.float32)\ncudart.cudaMemcpy(out_host.ctypes.data, buf_out.handle, size_bytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)\nassert np.allclose(out_host, 3.0 * x_host + y_host)\nprint(f&quot;SAXPY OK: {out_host[0]:.1f}&quot;)  # 7.0\n\nbuf_x.close(); buf_y.close(); buf_out.close()\nstream.close()\n<\/code><\/pre>\n<p>Kernel time for 10M elements is ~0.13 ms. With 120 MB of traffic (two reads and one write of 40 MB each), that is an effective bandwidth of ~920 GB\/s (91% of the RTX 4090 peak).<\/p>\n<p>Full code for the examples and benchmarks: github.com\/IntelionCloud\/research-examples\/cuda-core\/<\/p>\n<p>Benchmark: memory bandwidth<\/p>\n<p>Real SAXPY measurements on cuda-core v1.0.0 (100M float32 elements, 5 runs, with warmup):<\/p>\n<table>\n<tbody>\n<tr>\n<th>GPU<\/th>\n<th>Theoretical BW<\/th>\n<th>Measured BW<\/th>\n<th>Efficiency<\/th>\n<\/tr>\n<tr>\n<td>A100 80GB PCIe<\/td>\n<td>1935 GB\/s<\/td>\n<td>1584 GB\/s<\/td>\n<td>82%<\/td>\n<\/tr>\n<tr>\n<td>RTX 4090 (24 GB)<\/td>\n<td>1008 GB\/s<\/td>\n<td>929 GB\/s<\/td>\n<td>92%<\/td>\n<\/tr>\n<tr>\n<td>RTX 3090<\/td>\n<td>936 GB\/s<\/td>\n<td>848 GB\/s<\/td>\n<td>91%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>All three GPUs used 100% of their SMs: with 100M elements and block=256 there are 390,625 blocks, far more than needed to load 108 SMs (A100), 128 SMs (RTX 4090), and 82 SMs (RTX 3090). The differing efficiency percentages reflect memory architecture, not SM utilization: GDDR6X (RTX 4090, 3090) saturates well under simple sequential streaming, whereas HBM2e (A100) is optimized for massively parallel access and matrix operations, which SAXPY does not 
exercise (SAXPY = Single-precision A\u00b7X Plus Y, an operation from the Basic Linear Algebra Subprograms standard, 1970s). On transformer attention and large matmuls the picture would be different.<\/p>\n<p>cuda-core vs pycuda<\/p>\n<table>\n<tbody>\n<tr>\n<th>Characteristic<\/th>\n<th>pycuda<\/th>\n<th>cuda-core v1.0.0<\/th>\n<\/tr>\n<tr>\n<td>Status<\/td>\n<td>Third-party, unofficial<\/td>\n<td>Official NVIDIA<\/td>\n<\/tr>\n<tr>\n<td>Stable API<\/td>\n<td>No<\/td>\n<td>Yes, since 1.0.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>CUDA 12\/13: Partial&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-479431","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/479431","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=479431"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/post
s\/479431\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=479431"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=479431"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=479431"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}