CUDA入门¶

这并不是一个严格意义上的CUDA入门，只是希望帮助大家更好的去理解CUDA代码的并行性以及其运行在GPU上这个事实

与运行在CPU上的C代码区别¶

GPU有大量核心，天生具有并行性，因此作为GPU的编程语言，CUDA的编程更类似于并行编程。

CUDA函数主要特点如下:

我们需要指定这个函数使用的线程数目func<<<gridDim, blockDim>>>(...)，这样指定后，GPU便会将这这些线程分布到其众多核心上并行执行，CUDA函数所干的事情就是安排好每个线程做什么。一般来说，若干线程组成一个block，若干block又组成一个grid，总线程数目就是blockDim * gridDim（注：blockDim和gridDim可以是三维的，为了解释方便我们就认为其是一个数了）。
由于有很多个线程同时执行这个函数，线程与线程之间的执行顺序是完全随机的，我们不能假设线程0，线程1之间可以同时执行完某一部分代码，也不能假定他们有什么特定的先后关系。这就需要我们在CUDA函数中添加同步点，来保证每个线程都可以看到正确的结果。
其只能使用CUDA指针与特定类型的函数，这个也非常好理解，GPU能直接访问的只有它自己的显存，CPU上的内存是不好直接用的。同时如果存在CUDA内部的函数调用，它也只能调用带有__device__属性的CUDA函数以及带有__global__属性的CUDA函数，不能调用C函数。
代码中在调用CUDA函数后，CPU不会等待CUDA函数执行完毕，而是会直接继续执行后面的代码，如果我们没有进行同步，直接调用函数结果的话结果就会出错。

除了这些，CUDA函数的编写方式基本与C函数相差无几。　　

小试牛刀¶

下面的代码使用GPU实现了对数组求和的功能，但是其中有些小bug需要大家来找一下。程序正确运行之后可以对比两个计时点得到的计时，体会正常的C函数与CUDA函数的不同。

编译方式：

nvcc reduce.cu -o reduce

如果代码看不懂可以问问开源的模型

// reduce.cu
#include <cuda_runtime.h>
#include <iostream>
#include <time.h>



__global__ void reductionKernel(float* input, float* output, int size) {
    extern __shared__ float sharedData[];

    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    sharedData[tid] = (index < size) ? input[index] : 0;
    // __syncthreads();

    // Perform reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sharedData[tid] += sharedData[tid + stride];
        }
        // __syncthreads();
    }

    // Write result for this block to global memory
    if (tid == 0) {
        output[blockIdx.x] = sharedData[0];
    }
}

void checkCudaError(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << msg << " - " << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }
}

void reduction_gpu(float* input, float* output, int size) {
    int blockSize = 256; // Number of threads per block
    int gridSize = (size + blockSize - 1) / blockSize;

    float* d_input, * d_output;
    checkCudaError(cudaMalloc(&d_input, size * sizeof(float)), "Failed to allocate device memory for input");
    checkCudaError(cudaMalloc(&d_output, gridSize * sizeof(float)), "Failed to allocate device memory for output");

    checkCudaError(cudaMemcpy(d_input, input, size * sizeof(float), cudaMemcpyHostToDevice), "Failed to copy data to device");

    // Launch the reduction kernel using cudaLaunchKernel
    clock_t start = clock();
    void* kernelArgs[] = { &input, &output, &size };
    // void* kernelArgs[] = { &d_input, &d_output, &size };
    checkCudaError(cudaLaunchKernel((void*)reductionKernel, 
                                    dim3(gridSize), dim3(blockSize), 
                                    kernelArgs, blockSize * sizeof(float), 0), 
                                    "Failed to launch reduction kernel");

    clock_t end = clock();
    double duration_kernel = (double)(end - start) / (CLOCKS_PER_SEC);
    printf("reduction_kernel: %.9lf\n", duration_kernel);

    // Check for any errors during kernel execution
    checkCudaError(cudaGetLastError(), "Kernel execution failed");
    // Wait for the kernel to complete 
    cudaDeviceSynchronize();

    end = clock();
    duration_kernel = (double)(end - start) / (CLOCKS_PER_SEC);
    printf("reduction_kernel after sync: %.9lf\n", duration_kernel);  
    // Copy the partial results back to the host
    checkCudaError(cudaMemcpy(output, d_output, gridSize * sizeof(float), cudaMemcpyDeviceToHost), "Failed to copy data back to host");
    // Perform the final reduction on the host
    float finalSum = 0.0f;
    for (int i = 0; i < gridSize; ++i) {
        finalSum += output[i];
    }

    std::cout << "Total sum: " << finalSum << std::endl;

    cudaFree(d_input);
    cudaFree(d_output);
}

int main() {
    const int size = 1024 * 1024 * 16;
    float* input = new float[size];
    float* output = new float[(size + 255) / 256];

    // Initialize input data
    for (int i = 0; i < size; ++i) {
        input[i] = static_cast<float>(2);
    }

    reduction_gpu(input, output, size);

    delete[] input;
    delete[] output;

    return 0;
}

学习&拓展：¶

参照PKU超算队的CUDA文档学习CUDA的基本写法
使用CUDA实现一个矩阵乘法，并可以测试其性能
block之内的线程如何进行同步，block之间的线程如何进行同步（可以考虑摸不到的H100）
了解if分支带来的warp divergence现象，学习如何避免这种现象
了解学习GPU上shared memory的作用，以及其作用域（grid，block or thread）
根据教程使用分块+sharedmemory 优化自己写的矩阵乘法，如果有兴趣可以对比下自己的实现与cublas实现的性能差距，测试矩阵使用方阵即可。

一些链接¶

CUDA环境安装 WSL2

CUDA环境安装 ubuntu

CUDA环境安装 Mac

PKU CUDA入门(这里系统介绍了CUDA的一些概念，适合新手学习如何编写CUDA程序)

CUDA Best Parctice: control flow

CUDA Best Parctice: memory

GEMM 分块优化