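Symptom

Lately, checking the GPU on this machine often fails with a driver version mismatch. Each time I uninstalled and reinstalled the driver, the error came back about a day later: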
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

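Root cause analysis

I checked the versions of the installed library packages and found that none of them matched the driver version I had installed. I asked around, and nobody had manually updated the kernel or the libraries.

# Check the driver version currently in use (the loaded kernel module)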
cat /proc/driver/nvidia/version
-----------------------------------------------------------------------------------------
NVRM version: NVIDIA UNIX x86_64 Kernel Module  515.105.01  Mon Feb 27 12:49:44 UTC 2023
GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) 
-----------------------------------------------------------------------------------------
# Check the versions of the installed NVIDIA library packages
dpkg --list | grep nvidia-
-----------------------------------------------------------------------------------------------------------------------------------------------------------
ii  libnvidia-compute-495:amd64               510.108.03-0ubuntu0.22.04.1             amd64        Transitional package for libnvidia-compute-510
ii  libnvidia-compute-510:amd64               525.147.05-0ubuntu2.22.04.1             amd64        Transitional package for libnvidia-compute-535
ii  libnvidia-compute-535:amd64               535.183.01-0ubuntu0.22.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-container-tools                 1.15.0-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                1.15.0-1                                amd64        NVIDIA container runtime library
ii  libnvidia-ml-dev:amd64                    11.5.50~11.5.1-1ubuntu1                 amd64        NVIDIA Management Library (NVML) development files
ii  nvidia-container-runtime                  3.14.0-1                                all          NVIDIA Container Toolkit meta-package
ii  nvidia-container-toolkit                  1.15.0-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base             1.15.0-1                                amd64        NVIDIA Container Toolkit Base
ii  nvidia-cuda-dev:amd64                     11.5.1-1ubuntu1                         amd64        NVIDIA CUDA development files
ii  nvidia-cuda-gdb                           11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA CUDA Debugger (GDB)
rc  nvidia-cuda-toolkit                       11.5.1-1ubuntu1                         amd64        NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc                   11.5.1-1ubuntu1                         all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-docker2                            2.14.0-1                                all          NVIDIA Container Toolkit meta-package
ii  nvidia-fabricmanager-515                  525.147.05-0ubuntu2.22.04.1             amd64        Fabric Manager for NVSwitch based systems. (transitional package)
ii  nvidia-fabricmanager-535                  535.183.01-0ubuntu0.22.04.1             amd64        Fabric Manager for NVSwitch based systems.
ii  nvidia-opencl-dev:amd64                   11.5.1-1ubuntu1                         amd64        NVIDIA OpenCL development files
ii  nvidia-profiler                           11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-visual-profiler                    11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA Visual Profiler for CUDA and OpenCL
-----------------------------------------------------------------------------------------------------------------------------------------------------------
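
The comparison above can be scripted. What follows is a minimal sketch, not a definitive tool: the function names kernel_module_version, userspace_library_version and check_versions are my own, and treating the highest libnvidia-compute-* package version as "the" library version is a heuristic that happens to fit the output shown here (the lower-numbered entries are transitional packages).

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module version with the
# installed user-space library version. All function names are hypothetical.

# Read the text of /proc/driver/nvidia/version on stdin and print the
# first dotted version number found on the "NVRM version:" line.
kernel_module_version() {
    grep '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Read `dpkg --list` output on stdin and print the highest upstream version
# among installed (ii) libnvidia-compute-* packages. Assumption: the highest
# version is the real library; lower ones are transitional packages.
userspace_library_version() {
    awk '$1 == "ii" && $2 ~ /^libnvidia-compute-[0-9]+/ { split($3, v, "-"); print v[1] }' \
        | sort -V | tail -n 1
}

# Print a verdict; return non-zero when the two versions differ.
check_versions() {
    if [ "$1" = "$2" ]; then
        echo "OK: kernel module and libraries are both $1"
    else
        echo "MISMATCH: kernel module $1, libraries $2"
        return 1
    fi
}

# Usage on the affected machine:
#   kmod=$(kernel_module_version < /proc/driver/nvidia/version)
#   lib=$(dpkg --list | userspace_library_version)
#   check_versions "$kmod" "$lib"
```

With the output shown above, this would report kernel module 515.105.01 against libraries 535.183.01, which is the mismatch that nvidia-smi complains about.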

The system log shows a scheduled automatic update job. Right after this job started, the driver broke:

Jun 28 06:11:38 ubuntu-server-A203 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.007373] NVRM: API mismatch: the client has the version 535.183.01, but
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.007373] NVRM: this kernel module has the version 515.105.01.  Please
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.007373] NVRM: make sure that this kernel module and all NVIDIA driver
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.007373] NVRM: components have the same version.
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.135308] NVRM: API mismatch: the client has the version 535.183.01, but
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.135308] NVRM: this kernel module has the version 515.105.01.  Please
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.135308] NVRM: make sure that this kernel module and all NVIDIA driver
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.135308] NVRM: components have the same version.
Jun 28 06:11:45 ubuntu-server-A203 kernel: [2646161.264986] NVRM: API mismatch: the client has the version 535.183.01, but
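
To pull the two conflicting versions out of a long log without reading it line by line, something like the following sketch can help. The function name mismatch_versions is my own, and the sed patterns assume the exact NVRM wording shown in the excerpt above.

```shell
#!/bin/sh
# Sketch: summarize "NVRM: API mismatch" messages from a kernel log.
# mismatch_versions is a hypothetical helper name.

# Read log lines on stdin and print the distinct client (user-space)
# and kernel module versions mentioned in NVRM mismatch messages.
mismatch_versions() {
    sed -n \
        -e 's/.*NVRM: API mismatch: the client has the version \([0-9.]*[0-9]\).*/client=\1/p' \
        -e 's/.*NVRM: this kernel module has the version \([0-9.]*[0-9]\).*/kernel=\1/p' \
        | sort -u
}

# Usage on the affected machine:
#   journalctl -k --no-pager | mismatch_versions
# or, assuming the kernel messages are also written to a syslog file:
#   mismatch_versions < /var/log/syslog
```

For the excerpt above, the summary is a client version of 535.183.01 against a kernel module version of 515.105.01: the user-space libraries were upgraded while the old kernel module was still loaded.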

Solution

Turn off the scheduled updates, then reinstall the driver.

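# Show the scheduled apt timers
systemctl list-timers apt-daily.timer apt-daily-upgrade.timer
# Stop the timers (apt-daily-upgrade.timer is the one that runs "Daily apt upgrade and clean activities")
sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer
# Prevent the timers from starting at boot
sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer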

Verification
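
One way to check the result is sketched below; treat it as an assumption about what a reasonable check looks like, not as the exact steps I ran. The helper names is_off and verify_fix are hypothetical. The check covers apt-daily-upgrade.timer as well as apt-daily.timer, because on Ubuntu "Daily apt upgrade and clean activities" (the line in the log) is the description of apt-daily-upgrade.service.

```shell
#!/bin/sh
# Sketch: confirm the apt timers are off and the driver responds again.
# is_off and verify_fix are hypothetical helper names.

# Succeed when a state string printed by `systemctl is-enabled`
# means the unit will not be started automatically at boot.
is_off() {
    case "$1" in
        disabled|masked) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the state of both apt timers, then ask the driver for its version.
verify_fix() {
    for t in apt-daily.timer apt-daily-upgrade.timer; do
        state=$(systemctl is-enabled "$t" 2>/dev/null)
        if is_off "$state"; then
            echo "$t: $state"
        else
            echo "$t: still active at boot ($state)"
        fi
    done
    # Fails with the NVML mismatch error if the problem is still present.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Usage on the affected machine, after reinstalling the driver:
#   verify_fix
```

If the version printed by nvidia-smi matches the one in /proc/driver/nvidia/version and both timers are reported as disabled, the mismatch should not come back on the next day.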

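Conclusion

The system runs a daily automatic update that upgrades packages and cleans up old ones. It replaced the libraries matching my installed driver version (the logs show the libraries at 535.183.01 while the loaded kernel module was still 515.105.01), and that broke the driver on my machine. Turning off the daily automatic update fixes the problem.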
