18.5 Using TensorRT to speed up TensorFlow inference (the forward pass)

ming.zhang


ming.zhang · published 2019-01-12 23:48:57

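This post follows on from posts 18.1 to 18.3.

NVIDIA's TensorRT can speed up the forward pass. This post tests it on a MobileNetV2 model trained with TensorFlow. Following the earlier posts, the trained model has already been converted to .pb format; TensorRT is applied here to that .pb file.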

I. Installing TensorRT

For installation instructions, see https://developer.download.nvidia.com/compute/machine-learning/tensorrt/docs/5.0/GA_5.0.2.6/TensorRT-Installation-Guide.pdf or https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html

My machine runs Ubuntu 16.04 with CUDA 9.0, and I downloaded TensorRT 4.0 from https://developer.nvidia.com/nvidia-tensorrt-4x-download (an NVIDIA account is required). Choose a different version if it matches your machine better.

I installed TensorRT from the downloaded .deb package, which is named nv-tensorrt-repo-ubuntu1604-cuda9.0-ga-trt4.0.1.6-20180612_1-1_amd64.deb. Run the following commands in order:

sudo dpkg -i nv-tensorrt-repo-ubuntu1604-cuda9.0-ga-trt4.0.1.6-20180612_1-1_amd64.deb

sudo apt-key add /var/nv-tensorrt-repo-cuda9.0-ga-trt4.0.1.6-20180612/7fa2af80.pub

sudo apt-get update

sudo apt-get install tensorrt

sudo apt-get install python-libnvinfer-dev  # for Python 2.7

# sudo apt-get install python3-libnvinfer-dev  # use this instead for Python 3

sudo apt-get install uff-converter-tf

Finally, run: dpkg -l | grep TensorRT

The installation is correct if the output looks like this:

ii graphsurgeon-tf 4.1.2-1+cuda9.0 amd64 GraphSurgeon for TensorRT package

ii libnvinfer-dev 4.1.2-1+cuda9.0 amd64 TensorRT development libraries and headers

ii libnvinfer-samples 4.1.2-1+cuda9.0 amd64 TensorRT samples and documentation

ii libnvinfer4 4.1.2-1+cuda9.0 amd64 TensorRT runtime libraries

ii tensorrt 4.0.1.6-1+cuda9.0 amd64 Meta package of TensorRT

ii uff-converter-tf 4.1.2-1+cuda9.0 amd64 UFF converter for TensorRT package
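The same check can be scripted. Below is a minimal sketch that parses `dpkg -l` output and reports which packages are missing. The helper names and the `EXPECTED_PACKAGES` tuple are my own, not part of any NVIDIA tool, and the package names are taken from the listing above, so they will differ for other TensorRT versions.

```python
# Package names copied from the dpkg listing above (TensorRT 4.0 on CUDA 9.0).
# This is an assumption: other TensorRT versions use other names (e.g. libnvinfer5).
EXPECTED_PACKAGES = ("libnvinfer4", "libnvinfer-dev", "tensorrt", "uff-converter-tf")


def installed_packages(dpkg_output):
    """Return {package: version} for lines that `dpkg -l` marks as 'ii' (installed)."""
    packages = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            # Drop an architecture suffix such as ':amd64' if dpkg prints one.
            name = fields[1].split(":")[0]
            packages[name] = fields[2]
    return packages


def missing_packages(dpkg_output, expected=EXPECTED_PACKAGES):
    """List the expected packages that do not appear as installed."""
    found = installed_packages(dpkg_output)
    return [name for name in expected if name not in found]
```

To use it, capture the output of `dpkg -l` (for example with `subprocess.check_output(["dpkg", "-l"]).decode()`) and pass the text to `missing_packages`; an empty list means every expected package was found.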

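II. Accelerating with TensorRT

1. Accelerating a .pb with the official example

NVIDIA provides a demo of TensorRT acceleration, which can be downloaded from https://developer.download.nvidia.com/devblogs/tftrt_sample.tar.xz . After extracting it, run ./run_all.sh. The demo is not easy to follow, though, so below I have extracted its key code to apply TensorRT acceleration.

2. Accelerating a .pb with TensorRT

If you are not sure how to load a .pb file, see https://blog.csdn.net/u010397980/article/details/84932538 . As the code shows, the core of the TensorRT step is the trt.create_inference_graph function, which converts the graph loaded from the .pb; its precision_mode argument sets the precision of the converted graph.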

#coding:utf-8
from PIL import Image
import os
import glob
import numpy as np

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
from tensorflow.python.platform import gfile

os.environ["CUDA_VISIBLE_DEVICES"]="0" #selects a specific device


def get_trt_graph(batch_size=128,workspace_size=1<<30):
  # convert the .pb graph into a TensorRT-optimized graph (precision set by precision_mode)
  with gfile.FastGFile(model_name,'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    print("load .pb")
  trt_graph = trt.create_inference_graph(input_graph_def=graph_def, outputs=[output_name],
                                         max_batch_size=batch_size,
                                         max_workspace_size_bytes=workspace_size,
                                         precision_mode=precision_mode)  # Get optimized graph
  print("create trt model done...")
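  # name the saved file after the precision actually used, not a hard-coded FP32
  with gfile.FastGFile("model_tf_{}.pb".format(precision_mode),'wb') as f:
    f.write(trt_graph.SerializeToString())
    print("save TRT {} .pb".format(precision_mode))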
  return trt_graph


def get_tf_graph():
  with gfile.FastGFile(model_name,'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    print("load .pb")
  return graph_def


if __name__ == "__main__":
  model_name = "mobilenetv2_model_tf.pb"
  input_name = "input_1"
  #output_name = "softmax_out"
  output_name = "Logits/Softmax"
  use_tensorrt = True
  precision_mode = "FP32" #"FP16"
  batch_size = 1
  tf_config = tf.ConfigProto()
  tf_config.gpu_options.allow_growth = True
  img_list = glob.glob("/media/xxxxxxx/*.jpg")

  if use_tensorrt:
    print("[INFO] converting pb to FP32pb...")
    graph = get_trt_graph(batch_size)
  else:
    print("[INFO] use pb model")
    graph = get_tf_graph()

  sess = tf.Session(config=tf_config)
  tf.import_graph_def(graph, name='')
  tf_input = sess.graph.get_tensor_by_name(input_name + ':0') #or use: tf_input = tf.get_default_graph().get_tensor_by_name(input_name + ':0')
  tf_output = sess.graph.get_tensor_by_name(output_name + ':0')
  #tf_output = sess.graph.get_tensor_by_name('Logits/Softmax:0')
  # the input tensor is laid out as [batch, height, width, channels]
  height = int(tf_input.shape.as_list()[1])
  width = int(tf_input.shape.as_list()[2])
  print("input: size:", tf_input.shape.as_list())
  import time
  t=[]
  for img_path in img_list[:1000]:
    t1 = time.time()
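    image = Image.open(img_path).convert("RGB")  # force 3 channels so grayscale/CMYK JPEGs do not break the feed
    image = np.array(image.resize((width, height)))  # PIL's resize takes (width, height)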

    output = sess.run(tf_output, feed_dict={tf_input: image[None, ...]})
    #print("cost:", time.time()-t1)
    t.append(float(time.time()-t1))
    scores = output[0]
    #print("output shape:", np.shape(scores))
    index = np.argmax(scores)
    #print("index:{}, predict:{}".format(index, scores[index]))
  if use_tensorrt:
    print("use tensorrt, image num: {}, all time(s): {}, avg time(s): {}".format(len(t), np.sum(t), np.mean(t)))
  else:
    print("not use tensorrt, image num: {}, all time(s): {}, avg time(s): {}".format(len(t), np.sum(t), np.mean(t)))
  sess.close()
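One way to confirm that the conversion actually did something is to count the nodes whose op type is TRTEngineOp, which is the op TF-TRT uses for the subgraphs it replaces with TensorRT engines. The sketch below is not part of the original script; the helper names are hypothetical, and it only assumes the object passed in has a `.node` list whose items carry an `.op` string, as a `tf.GraphDef` does.

```python
from collections import Counter


def op_histogram(graph_def):
    """Count nodes per op type in a GraphDef-like object (anything with a .node list)."""
    return Counter(node.op for node in graph_def.node)


def trt_engine_count(graph_def):
    """Number of nodes that were replaced by a TensorRT engine (op type 'TRTEngineOp')."""
    return op_histogram(graph_def)["TRTEngineOp"]
```

Calling `trt_engine_count(graph)` on the value returned by `get_trt_graph` should give at least 1. If it gives 0, no subgraph was converted and the model still runs as plain TensorFlow, so no speedup should be expected.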

III. Speedup results

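To measure the TensorRT speedup, I timed 1000 images on a 1080 GPU and took the mean time per image:
1. ResNet50 (224x224 input): 11.5 ms per image when running the .pb with plain TensorFlow, 6.8 ms per image with TensorRT.
2. MobileNetV2: 5.5 ms per image with plain TensorFlow, 4 ms per image with TensorRT.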
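The reported times work out to roughly a 1.7x speedup for ResNet50 and 1.4x for MobileNetV2. The sketch below reproduces that arithmetic and also shows one way to summarize a list of per-image timings such as the `t` list in the script. The `warmup` parameter is my own addition, not something the original measurement used: the first few `sess.run` calls are usually slower because of one-time initialization, so dropping them tends to give a steadier figure. Note also that the script's timer covers image loading and resizing as well as inference.

```python
import statistics


def summarize_latencies(seconds, warmup=10):
    """Mean and median latency in milliseconds, ignoring the first `warmup` samples."""
    steady = list(seconds)[warmup:]
    if not steady:
        raise ValueError("need more samples than warmup iterations")
    ms = [s * 1000.0 for s in steady]
    return {"n": len(ms),
            "mean_ms": statistics.mean(ms),
            "median_ms": statistics.median(ms)}


def speedup(baseline_ms, accelerated_ms):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_ms / accelerated_ms


# Per-image latencies reported in this post, in milliseconds.
print("resnet50:    {:.2f}x".format(speedup(11.5, 6.8)))
print("mobilenetv2: {:.2f}x".format(speedup(5.5, 4.0)))
```

In the script, `summarize_latencies(t)` could replace the final `np.sum`/`np.mean` report; the median is included because a single slow image can pull the mean up noticeably.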

Next post: training and fine-tuning with the official slim library: https://blog.csdn.net/u010397980/article/details/89439714
