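torch.utils.cpp_extension.load hangs with no response
Problem description
While running an experiment today I hit the problem named in the title. The relevant code is: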
### chamfer_3D.py
import importlib
import os

chamfer_found = importlib.find_loader("chamfer_3D") is not None
if not chamfer_found:
    ## Cool trick from https://github.com/chrdiller
    print("Jitting Chamfer 3D")
    from torch.utils.cpp_extension import load
    chamfer_3D = load(name="chamfer_3D",
                      sources=[
                          "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer_cuda.cpp"]),
                          "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer3D.cu"]),
                      ])
    print("Loaded JIT 3D CUDA chamfer distance")
else:
    import chamfer_3D
    print("Loaded compiled 3D CUDA chamfer distance")
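This code imports the chamfer_3D package directly if it is found in the Python environment; otherwise it calls torch.utils.cpp_extension.load, which compiles the C++/CUDA sources just in time and loads the result as a module.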
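For newer Python versions the same "is it already installed?" check can be written with importlib.util.find_spec, because importlib.find_loader is deprecated and was removed in Python 3.12. The following is only a minimal sketch; the helper name is_importable is made up for illustration and does not appear in the original project.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module called `module_name` can be found."""
    # find_spec returns None when no module of that name is on the import path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "chamfer_3D" is only found if the extension was compiled and installed beforehand.
    print("os:", is_importable("os"))
    print("chamfer_3D:", is_importable("chamfer_3D"))
```

If is_importable("chamfer_3D") is False, the script would fall through to the JIT path, exactly as in the snippet above.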
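Since the chamfer_3D package is not installed, the program takes the load branch, and then it hangs: nothing is printed for a long time. The terminal looks like this: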
> (atlasnet) user@ubuntu: ~/chamfer3D$ python chamfer_3D.py
Jitting Chamfer 3D
Forcing the program to stop with Ctrl+C prints the following:
> (atlasnet) user@ubuntu: ~/chamfer3D$ python chamfer_3D.py
Jitting Chamfer 3D
^CTraceback (most recent call last):
  File "dist_chamfer_3D.py", line 15, in <module>
    "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer3D.cu"]),
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 974, in load
    keep_intermediates=keep_intermediates)
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1183, in _jit_compile
    baton.wait()
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/file_baton.py", line 49, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
Problem analysis
The cause is a mutual-exclusion lock. The traceback ends in baton.wait(), and the code around that call is:
### torch/utils/cpp_extension.py
if version != old_version:
    baton = FileBaton(os.path.join(build_directory, 'lock'))
    if baton.try_acquire():
        try:
            with GeneratedFileCleaner(keep_intermediates=keep_intermediates) as clean_ctx:
                if IS_HIP_EXTENSION and (with_cuda or with_cudnn):
                    hipify_python.hipify(
                        project_directory=build_directory,
                        output_directory=build_directory,
                        includes=os.path.join(build_directory, '*'),
                        extra_files=[os.path.abspath(s) for s in sources],
                        show_detailed=verbose,
                        is_pytorch_extension=True,
                        clean_ctx=clean_ctx
                    )
                _write_ninja_file_and_build_library(
                    name=name,
                    sources=sources,
                    extra_cflags=extra_cflags or [],
                    extra_cuda_cflags=extra_cuda_cflags or [],
                    extra_ldflags=extra_ldflags or [],
                    extra_include_paths=extra_include_paths or [],
                    build_directory=build_directory,
                    verbose=verbose,
                    with_cuda=with_cuda)
        finally:
            baton.release()
    else:
        baton.wait()
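From this code it is fairly clear that PyTorch's cpp_extension takes a lock while it builds an extension, and that the lock is implemented by creating a file named "lock" in the build directory. If a process finds that the "lock" file already exists, it assumes another process is working on the same extension, so baton.try_acquire() returns False and the process enters wait() until the lock is released.
Because of the lock, no other process can build this extension at the same time. If an earlier run was killed after it acquired the lock but before it could release it, the lock file is left behind indefinitely and every later run blocks on it. I believe that is what caused the hang at "Jitting Chamfer 3D" here.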
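The stale-lock scenario can be reproduced in a few lines. This is a simplified sketch of the idea, not PyTorch's actual FileBaton: it assumes the lock is taken by atomically creating a file that must not already exist, and the function names try_acquire, release and demo are chosen only for this illustration.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Try to create the lock file atomically; return a file descriptor or None."""
    try:
        # O_CREAT | O_EXCL makes os.open fail if the file already exists.
        return os.open(lock_path, os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return None


def release(fd, lock_path):
    """Close the descriptor and delete the lock file."""
    os.close(fd)
    os.remove(lock_path)


def demo():
    with tempfile.TemporaryDirectory() as build_dir:
        lock_path = os.path.join(build_dir, "lock")

        # Simulate a first process that takes the lock and is then killed:
        # its descriptor is closed, but release() never runs, so the file remains.
        fd = try_acquire(lock_path)
        os.close(fd)

        # A later process cannot take the lock; in the real code it would wait.
        blocked = try_acquire(lock_path) is None

        # Deleting the stale file by hand makes the lock available again.
        os.remove(lock_path)
        fd = try_acquire(lock_path)
        recovered = fd is not None
        release(fd, lock_path)
        return blocked, recovered


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that a file-based lock has no owner the operating system can clean up after: once the holder dies, only removing the file unblocks the waiters.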
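Solution
The first step is to find where the lock file is.
Open torch/utils/cpp_extension.py in the installed library and set a breakpoint at line 1156 (the line number in this PyTorch version), which is this statement: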
baton = FileBaton(os.path.join(build_directory, 'lock'))
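When execution stops there, inspect the value of build_directory; the lock file should be in that directory. Delete the lock file from that folder, and the program no longer hangs on the next run.
On Windows with PyCharm, setting a breakpoint and inspecting a variable is straightforward, so here is how to do the same on Linux with pdb: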
(atlasnet) zhangwenyuan@ubuntu:~/atlas/AtlasNet/auxiliary/ChamferDistancePytorch/chamfer3D$ cd ~/atlas/AtlasNet
(atlasnet) zhangwenyuan@ubuntu:~/atlas/AtlasNet$ python -m pdb train.py --shapenet13
> /home/zhangwenyuan/atlas/AtlasNet/train.py(1)<module>()
-> import sys
(Pdb) b /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py:1156
Breakpoint 1 at /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py:1156
(Pdb) c
Jitting Chamfer 3D
> /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py(1156)_jit_compile()
-> baton = FileBaton(os.path.join(build_directory, 'lock'))
(Pdb) p build_directory
'/home/zhangwenyuan/.cache/torch_extensions/chamfer_3D'
So the lock file is in the directory "/home/zhangwenyuan/.cache/torch_extensions/chamfer_3D". After deleting the lock file there and running the program again, the problem no longer occurs.
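If you would rather not step through pdb, a small script can look for leftover lock files. This is a rough sketch under some assumptions: the extensions root is taken to be ~/.cache/torch_extensions (the location observed above) unless the TORCH_EXTENSIONS_DIR environment variable is set, and the helper names find_lock_files and remove_lock_files are invented for this example. Only remove a lock when you are sure no other process is still compiling the extension.

```python
import os


def find_lock_files(extensions_root):
    """Return the paths of all files named 'lock' under `extensions_root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(extensions_root):
        if "lock" in filenames:
            found.append(os.path.join(dirpath, "lock"))
    return sorted(found)


def remove_lock_files(extensions_root):
    """Delete every lock file under `extensions_root` and return what was removed."""
    removed = find_lock_files(extensions_root)
    for path in removed:
        os.remove(path)
    return removed


if __name__ == "__main__":
    # Assumed default location; TORCH_EXTENSIONS_DIR overrides it when set.
    root = os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
    )
    # Only list by default; call remove_lock_files(root) once no build is running.
    for path in find_lock_files(root):
        print(path)
```

Walking the whole root rather than one fixed subdirectory is a deliberate choice, since the exact layout under the extensions root may differ between PyTorch versions.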