NVIDIA CUDA Programming Hands-On, Part 1: Environment Check and Getting Started

Posted by 黄生 on 2025/06/05 17:03:12


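I am using the environment provided by the Huawei Cloud developer AI Gallery. Running any Notebook example opens the environment, and from there you can switch to the limited-time free GPU environment.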

Stage 1: Environment check

1. Verify the CUDA driver installation

nvidia-smi

Output:

Thu Jun  5 16:57:18 2025       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:0D.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. Check the CUDA toolkit

nvcc --version

Output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
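
Note that the two commands report different numbers. The "CUDA Version: 11.4" shown by nvidia-smi is the highest CUDA version the installed driver supports, while nvcc reports the version of the installed toolkit (9.0 here). The same two values can be read from inside a program. The following is a minimal sketch, not part of the original walkthrough; the helper names versionMajor and versionMinor are made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 11040 for 11.4).
// These two helpers are hypothetical names introduced for this sketch.
static int versionMajor(int v) { return v / 1000; }
static int versionMinor(int v) { return (v % 1000) / 10; }

int main() {
    int driverVersion = 0;   // highest CUDA version the installed driver supports
    int runtimeVersion = 0;  // version of the CUDA runtime this binary is linked against

    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Driver supports up to CUDA %d.%d\n",
           versionMajor(driverVersion), versionMinor(driverVersion));
    printf("Runtime version: CUDA %d.%d\n",
           versionMajor(runtimeVersion), versionMinor(runtimeVersion));

    // A runtime newer than what the driver supports generally cannot run kernels.
    if (runtimeVersion > driverVersion) {
        printf("Warning: the runtime is newer than the driver supports.\n");
    }
    return 0;
}
```

An older toolkit running on a newer driver, as in this environment, is the supported direction, which is why the programs below compile with nvcc 9.0 and still run.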

3. Verify that the CUDA samples run

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery

The samples are not installed in this environment, so this step could not be run here.

Stage 2: A first CUDA program

1. Create the file hello.cu

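#include <stdio.h>

// Kernel: runs on the GPU, once per launched thread.
__global__ void helloFromGPU() {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main() {
    printf("Hello World from CPU!\n");

    // Launch 1 block of 5 threads.
    helloFromGPU<<<1, 5>>>();
    // Wait for the kernel to finish so its printf output is flushed before exit.
    cudaDeviceSynchronize();

    return 0;
}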

2. Compile and run

nvcc hello.cu -o hello
./hello

Expected output:

Hello World from CPU!
Hello World from GPU thread 0!
Hello World from GPU thread 1!
Hello World from GPU thread 2!
Hello World from GPU thread 3!
Hello World from GPU thread 4!
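
If only the CPU line appears, the kernel most likely failed to launch, and hello.cu has no way of saying so because it ignores every return value. The variant below is a sketch of one common way to surface such failures; cudaOk is a hypothetical helper introduced here, not something from the original program or the CUDA API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original article): reports a failed
// CUDA runtime call on stderr and returns false, returns true on success.
static bool cudaOk(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void helloFromGPU() {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main() {
    printf("Hello World from CPU!\n");

    helloFromGPU<<<1, 5>>>();

    // The launch itself returns nothing; launch errors are read back here.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return 1;

    // Errors raised while the kernel runs surface at the next synchronizing call.
    if (!cudaOk(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

    return 0;
}
```

It compiles and runs the same way as hello.cu; on a working setup the output should be unchanged, and on a broken one the error string indicates what went wrong.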

Stage 3: Querying device information

1. Create device_info.cu

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    // Number of CUDA-capable devices visible to this process.
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        // totalGlobalMem and sharedMemPerBlock are byte counts.
        printf("  Global memory: %.2f GB\n", prop.totalGlobalMem/1024.0/1024.0/1024.0);
        printf("  Shared memory per block: %.2f KB\n", prop.sharedMemPerBlock/1024.0);
        printf("  Warp size: %d threads\n", prop.warpSize);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Multiprocessor count: %d\n", prop.multiProcessorCount);
    }

    return 0;
}

2. Compile and run

nvcc device_info.cu -o device_info
./device_info

On a Tesla P100 you should see something like:

Device 0: Tesla P100-PCIE-16GB
  Compute capability: 6.0
  Global memory: 15.90 GB
  Shared memory per block: 48.00 KB
  Warp size: 32 threads
  Max threads per block: 1024
  Max threads per multiprocessor: 2048
  Multiprocessor count: 56
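
These figures can be combined into a rough picture of how much parallel work the GPU can hold at once. With the values above, each multiprocessor can keep 2048 / 32 = 64 warps resident, and the whole device 2048 × 56 = 114688 threads. The sketch below derives the same numbers from cudaDeviceProp; it assumes device 0, and the two helper functions are hypothetical names introduced for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helpers for this sketch: simple arithmetic on device properties.
static int warpsPerMultiprocessor(int maxThreadsPerSM, int warpSize) {
    return maxThreadsPerSM / warpSize;
}

static int maxResidentThreads(int maxThreadsPerSM, int smCount) {
    return maxThreadsPerSM * smCount;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0 exists
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Warps per multiprocessor: %d\n",
           warpsPerMultiprocessor(prop.maxThreadsPerMultiProcessor, prop.warpSize));
    printf("Max resident threads on the device: %d\n",
           maxResidentThreads(prop.maxThreadsPerMultiProcessor, prop.multiProcessorCount));
    return 0;
}
```

These are upper bounds on resident threads, not a performance prediction; how many threads a given kernel actually keeps resident also depends on its register and shared memory use.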

Stage 4: Simple vector addition

1. Create vector_add.cu

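#include <stdio.h>
#include <cuda_runtime.h>

#define N 5

// Each thread adds one pair of elements.
__global__ void vectorAdd(int *a, int *b, int *c) {
    // Global index of this thread across all blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads that fall past the end of the arrays.
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main() {
    // Host (CPU) arrays.
    int a[N] = {1, 2, 3, 4, 5};
    int b[N] = {10, 20, 30, 40, 50};
    int c[N] = {0};

    // Device (GPU) buffers.
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));

    // Copy the inputs to the device.
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Launch 1 block of N threads, one thread per element.
    vectorAdd<<<1, N>>>(d_a, d_b, d_c);

    // Copy the result back; this call waits for the kernel to finish first.
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    // Release device memory.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}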

2. Compile and run

nvcc vector_add.cu -o vector_add
./vector_add

Output:

1 + 10 = 11
2 + 20 = 22
3 + 30 = 33
4 + 40 = 44
5 + 50 = 55
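
A single block of N threads only works while N stays within the per-block thread limit (1024 on the P100, per the device query). For larger arrays the usual approach is to launch several blocks and round the block count up. The following is a sketch under assumed values, not the author's code: the size 1000, the block size 256, and the helper blocksFor are all chosen here for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
static int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Same kernel as before, but the length is a parameter instead of a macro.
__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {  // the last block may contain threads past the end
        c[tid] = a[tid] + b[tid];
    }
}

int main() {
    const int n = 1000;              // assumed size for this sketch
    const int threadsPerBlock = 256; // assumed block size
    const size_t bytes = n * sizeof(int);

    int *a = (int *)malloc(bytes);
    int *b = (int *)malloc(bytes);
    int *c = (int *)malloc(bytes);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 10 * i; }

    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    vectorAdd<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Pick up launch errors, then copy back (the copy waits for the kernel).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    } else {
        int mismatches = 0;
        for (int i = 0; i < n; i++) {
            if (c[i] != 11 * i) mismatches++;  // a[i] + b[i] = i + 10 * i
        }
        printf("%d mismatches out of %d elements\n", mismatches, n);
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return err == cudaSuccess ? 0 : 1;
}
```

With these values the launch uses 4 blocks of 256 threads, 1024 threads in total, so the bounds check in the kernel is what keeps the last 24 threads from writing past the end of the arrays.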
