NVIDIA Driver Version and CUDA Version Compatibility

Driver Version

Running the nvidia-smi command shows the NVIDIA driver version installed on the host. In the output below, the driver version on this host is 470.141.03:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
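If a script needs just the driver version rather than the whole status table, nvidia-smi's query mode avoids parsing the table output. A minimal Python sketch (assumes nvidia-smi is on PATH):

import subprocess

# Ask nvidia-smi for only the driver version instead of the full status table.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # e.g. 470.141.03 (one line per GPU)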

CUDA Version

Compatibility across minor versions

Starting with CUDA 11.0, NVIDIA supports forward compatibility across minor versions (minor version compatibility) with no extra configuration. Any host driver that supports CUDA >= 11.0 (driver version >= 450.80.02) can run all CUDA 11.x applications; likewise, any driver that supports CUDA >= 12.0 (driver version >= 525.60.13) can run all CUDA 12.x applications.
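The rule boils down to a minimum driver version per CUDA major release. A minimal sketch encoding the two thresholds quoted above (the helper name driver_supports is just for illustration):

# Minimum host driver required for native support of each CUDA major release
# (values taken from the paragraph above).
MIN_DRIVER = {
    11: (450, 80, 2),    # CUDA 11.x needs driver >= 450.80.02
    12: (525, 60, 13),   # CUDA 12.x needs driver >= 525.60.13
}

def driver_supports(driver_version: str, cuda_major: int) -> bool:
    """True if the driver natively runs every CUDA `cuda_major`.x application."""
    parts = tuple(int(p) for p in driver_version.split("."))
    return parts >= MIN_DRIVER[cuda_major]

print(driver_supports("470.141.03", 11))   # True  - covers all CUDA 11.x apps
print(driver_supports("470.141.03", 12))   # False - needs forward compatibility (see below)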

Backward compatibility across all versions (backward compatible)

The CUDA Version that nvidia-smi prints next to the driver version is the highest CUDA version the current driver supports; for example, driver 470.141.03 above supports CUDA versions up to 11.8. All CUDA versions are backward compatible, so applications built with any CUDA version <= 11.x run normally on this driver (together with the minor version compatibility described above).
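The same number can be read programmatically from the driver library itself: cuDriverGetVersion in libcuda returns the highest supported CUDA version encoded as major*1000 + minor*10 (11080 for 11.8; the same value appears in the PyTorch warning in the next section). A minimal ctypes sketch:

import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")

# cuDriverGetVersion reports the highest CUDA version the installed driver
# supports, encoded as major*1000 + minor*10 (e.g. 11080 for CUDA 11.8).
version = ctypes.c_int()
ret = libcuda.cuDriverGetVersion(ctypes.byref(version))
assert ret == 0, f"cuDriverGetVersion failed with CUresult {ret}"

major, minor = version.value // 1000, (version.value % 1000) // 10
print(f"Driver supports CUDA <= {major}.{minor}")   # 11.8 on driver 470.141.03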

Forward compatibility across major versions (forward compatible)

Forward compatibility across major versions means that a host driver whose highest supported CUDA version is 11.x can run CUDA 12.x applications. It is only supported on NVIDIA Data Center GPUs such as the V100, A100, and H100.

By default, an application built against a CUDA version from a higher major release than the host driver supports cannot run. Suppose the highest CUDA version the host supports is 11.8; a CUDA 12.x application will then fail on that machine. For example, PyTorch reports The NVIDIA driver on your system is too old:

(cu121) [root@VM workspace]$ python
Python 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/miniconda3/envs/cu121/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/conda/conda-bld/pytorch_1720538459595/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> 

To avoid frequent host driver upgrades (which require rebooting the machine), NVIDIA has supported forward compatibility across major versions since CUDA 10.0 through the cuda-compat package, which lets applications built with a newer CUDA version run on an older driver. The cuda-compat package installs to /usr/local/cuda-xx.x/compat/ by default and is enabled via the LD_LIBRARY_PATH environment variable. For example, after running export LD_LIBRARY_PATH=/usr/local/cuda-12.1/compat:"$LD_LIBRARY_PATH", CUDA 12.x applications run normally on driver 470.141.03 (together with the minor version compatibility described above).

(cu121) [root@VM workspace]$ rpm -ivh https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-compat-12-1-530.30.02-1.x86_64.rpm

(cu121) [root@VM workspace]$ ll /usr/local/cuda-12.1/compat/
total 144692
lrwxrwxrwx 1 root root       28 Feb 22  2023 libcudadebugger.so.1 -> libcudadebugger.so.530.30.02
-rwxr-xr-x 1 root root 10488824 Feb 22  2023 libcudadebugger.so.530.30.02
lrwxrwxrwx 1 root root       12 Feb 22  2023 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Feb 22  2023 libcuda.so.1 -> libcuda.so.530.30.02
-rwxr-xr-x 1 root root 29900840 Feb 22  2023 libcuda.so.530.30.02
lrwxrwxrwx 1 root root       27 Feb 22  2023 libnvidia-nvvm.so -> libnvidia-nvvm.so.530.30.02
lrwxrwxrwx 1 root root       27 Feb 22  2023 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.530.30.02
-rwxr-xr-x 1 root root 85979712 Feb 22  2023 libnvidia-nvvm.so.530.30.02
lrwxrwxrwx 1 root root       37 Feb 22  2023 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.530.30.02
-rwxr-xr-x 1 root root 21784224 Feb 22  2023 libnvidia-ptxjitcompiler.so.530.30.02

(cu121) [root@VM workspace]$ LD_LIBRARY_PATH=/usr/local/cuda-12.1/compat:$LD_LIBRARY_PATH python
Python 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'12.1'
>>> torch.cuda.is_available()
True
>>>

(cu124) [root@VM workspace]$ LD_LIBRARY_PATH=/usr/local/cuda-12.1/compat:$LD_LIBRARY_PATH python
Python 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'12.4'
>>> torch.cuda.is_available()
True
>>> 

(cu121) [root@VM workspace]$ LD_LIBRARY_PATH=/usr/local/cuda-12.1/compat:$LD_LIBRARY_PATH nvidia-smi | head -n8
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
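To confirm that a process is really picking up the compat libcuda rather than the one shipped with the 470 driver, you can check which file the dynamic linker actually mapped. A minimal Linux-only sketch (paths as in the listing above):

import ctypes

# Force the CUDA driver library to load; LD_LIBRARY_PATH decides whether the
# copy under /usr/local/cuda-12.1/compat/ is the one that gets picked up.
ctypes.CDLL("libcuda.so.1")

# /proc/self/maps lists every shared object mapped into this process.
with open("/proc/self/maps") as maps:
    libcuda_paths = sorted({line.split()[-1] for line in maps if "libcuda.so" in line})

# With the compat package on LD_LIBRARY_PATH this prints
# /usr/local/cuda-12.1/compat/libcuda.so.530.30.02;
# without it, the driver's own libcuda.so.470.141.03.
print("\n".join(libcuda_paths))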

References

  1. https://docs.nvidia.com/deploy/cuda-compatibility
  2. https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/
  3. https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/