1. Introduction

In a kernel-based I/O model, all device I/O has to pass through the kernel. Under highly concurrent network packet transmission and reception, the flood of hardware interrupts degrades the kernel's packet-processing capability, and copying data between kernel and user space wastes a large amount of CPU time. As a framework for high-concurrency, high-throughput network development, DPDK therefore needs a way to avoid interrupt storms in the kernel and bulk data copies, and to interact with the hardware directly from user space.

Linux's UIO (Userspace I/O) is exactly such a kernel-bypass mechanism: it maps hardware access into user space.

2. Linux UIO Overview

In Linux, a driver's job is mainly to handle the interrupts raised by the hardware and to access the device's memory regions. The figure below shows how the kernel part of a UIO driver, its user-space part, the UIO framework, and the kernel relate to one another:

[Figure: the UIO driver's kernel part, its user-space part, and their relationship to the UIO framework and the kernel]

In UIO, user space accesses the device's memory regions with read()/mmap(), but a small part of the interrupt handling still lives in the kernel; its main job is to enable/disable the interrupt and increment the interrupt count. For a user-space driver to wait for a device interrupt, it simply blocks in a read() on /dev/uioX; when the device raises an interrupt, the read() returns immediately.
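
This read()/mmap() interface can be exercised with a few lines of user-space C. The sketch below is a minimal illustration, assuming the device is registered as uio0 and its first memory region fits in one page (a real program would read the size from /sys/class/uio/uio0/maps/map0/size):

/* Minimal sketch: map UIO memory region 0 of /dev/uio0 into user space. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    /* Assumption: region 0 is one page; the real size is exported in
     * /sys/class/uio/uio0/maps/map0/size. Region N is selected by
     * passing an mmap offset of N * page_size. */
    size_t size = 4096;
    void *bar = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* Device memory is now directly accessible without a syscall. */
    volatile unsigned int *regs = bar;
    printf("first register word: 0x%x\n", regs[0]);

    munmap(bar, size);
    close(fd);
    return 0;
}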

Kernel-side responsibilities (a minimal registration sketch follows this list):

  • Enable the device
  • Request, allocate, and record the resources the device needs
  • Read and record the device's configuration information
  • Register the UIO device
  • The small interrupt-acknowledgement handler that *must* be implemented in kernel space
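
For concreteness, here is a hedged sketch of that kernel-side part: the handler/irqcontrol callbacks and the uio_info structure handed to uio_register_device(). The my_* names and the IRQ number are hypothetical placeholders; the uio_info fields and uio_register_device() are the actual UIO kernel API. The surrounding module/probe code and error handling are omitted:

/* Kernel-module fragment: the small in-kernel part of a UIO driver. */
#include <linux/interrupt.h>
#include <linux/types.h>
#include <linux/uio_driver.h>

static irqreturn_t my_handler(int irq, struct uio_info *info)
{
    /* Acknowledge/disable the device interrupt here. Returning
     * IRQ_HANDLED makes the UIO core increment the event count and
     * wake up any reader blocked on /dev/uioX. */
    return IRQ_HANDLED;
}

static int my_irqcontrol(struct uio_info *info, s32 irq_on)
{
    /* Invoked when user space write()s a 4-byte int to /dev/uioX:
     * re-enable (1) or disable (0) the device interrupt. */
    return 0;
}

static struct uio_info my_uio_info = {
    .name       = "my_uio",
    .version    = "0.1",
    .irq        = 42,              /* hypothetical: the device IRQ line */
    .handler    = my_handler,
    .irqcontrol = my_irqcontrol,
};

/* In the probe function, the memory regions are recorded and the
 * device is registered, roughly:
 *     my_uio_info.mem[0].addr    = <region physical address>;
 *     my_uio_info.mem[0].size    = <region size>;
 *     my_uio_info.mem[0].memtype = UIO_MEM_PHYS;
 *     uio_register_device(parent_dev, &my_uio_info);
 */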

User-space responsibilities (see the wait-loop sketch after this list):

  • Receive interrupt events (read()/poll() on /dev/uioX)
  • Handle the interrupt (read and write the device's data)
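
Putting the two responsibilities together, the typical user-space wait loop looks like the following sketch (again assuming uio0; writing 1 back to the fd reaches the driver's irqcontrol callback to re-enable the interrupt):

/* Minimal sketch of a user-space UIO interrupt loop on /dev/uio0. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    uint32_t enable = 1;
    for (;;) {
        /* Blocks until the next interrupt; the value read is the
         * total interrupt count maintained by the kernel. */
        uint32_t count;
        if (read(fd, &count, sizeof(count)) != sizeof(count))
            break;
        printf("interrupt #%u\n", count);

        /* Handle the interrupt (read/write device data via the
         * mmap'ed region), then re-enable it for the next round. */
        write(fd, &enable, sizeof(enable));
    }
    close(fd);
    return 0;
}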

3. igb_uio Analysis

The DPDK tests here all use the igb_uio driver. igb_uio is the UIO implementation for Intel igb NIC drivers and consists of three parts: the igb_uio kernel driver, the in-kernel UIO framework, and the user-space UIO side.

3.1 The kernel driver

The igb_uio driver's main job is to register a PCI device driver. When DPDK's dpdk_nic_bind.py tool binds a NIC, this driver probes the device and performs the related configuration. It then registers a UIO device; the probe function records the device's resources, such as the physical addresses and sizes of the PCI BAR regions, and passes this information to user space. The registered UIO device is named igb_uio; its kernel-side interrupt handler is igbuio_pci_irqhandler and its interrupt-control function is igbuio_pci_irqcontrol. The main work of this registration is as follows (a simplified probe sketch follows the list):

  • Initialize the uio_device structure pointer, including the wait queue wait, the interrupt event counter event, and the minor device number minor.
  • Create a UIO device named uioX under /dev, where X is the minor number.
  • Create the maps and portio interfaces under /sys/class/uio/uioX/.
  • Register the interrupt and the interrupt handler uio_interrupt.
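
Combining these steps, a simplified, hedged sketch of an igb_uio-style probe might look as follows. This is an outline under assumptions, not the actual igb_uio source; interrupt-mode selection (INTx/MSI/MSI-X), sysfs attributes, and most error handling are omitted, and the two igbuio_* callbacks are the handlers named above, declared elsewhere:

/* Kernel-module fragment: igb_uio-style PCI probe (simplified sketch). */
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/uio_driver.h>

static int my_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    struct uio_info *info;
    int err;

    info = kzalloc(sizeof(*info), GFP_KERNEL);
    if (!info)
        return -ENOMEM;

    err = pci_enable_device(dev);           /* enable the device */
    if (err)
        goto out_free;

    /* Record BAR 0's physical address and size so that user space
     * can later mmap it through /dev/uioX. */
    info->mem[0].addr    = pci_resource_start(dev, 0);
    info->mem[0].size    = pci_resource_len(dev, 0);
    info->mem[0].memtype = UIO_MEM_PHYS;

    info->name       = "igb_uio";
    info->version    = "0.1";
    info->irq        = dev->irq;
    info->handler    = igbuio_pci_irqhandler;   /* in-kernel ack, as above */
    info->irqcontrol = igbuio_pci_irqcontrol;   /* driven by write() */

    /* Hands off to the UIO framework, which performs the four
     * registration steps listed above. */
    err = uio_register_device(&dev->dev, info);
    if (err)
        goto out_disable;

    pci_set_drvdata(dev, info);
    return 0;

out_disable:
    pci_disable_device(dev);
out_free:
    kfree(info);
    return err;
}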



3.2 Using igb_uio from testpmd

rte_eth_rx_queue_setup

Allocate and set up a receive queue for an Ethernet device.

The function allocates a contiguous block of memory for nb_rx_desc receive descriptors from a memory zone associated with socket_id and initializes each receive descriptor with a network buffer allocated from the memory pool mb_pool.

Parameters
port_id	The port identifier of the Ethernet device.
rx_queue_id	The index of the receive queue to set up. The value must be in the range [0, nb_rx_queue - 1] previously supplied to rte_eth_dev_configure().
nb_rx_desc	The number of receive descriptors to allocate for the receive ring.
socket_id	The socket_id argument is the socket identifier in case of NUMA. The value can be SOCKET_ID_ANY if there is no NUMA constraint for the DMA memory allocated for the receive descriptors of the ring.
rx_conf	The pointer to the configuration data to be used for the receive queue. NULL is allowed, in which case the default RX configuration is used. The rx_conf structure contains an rx_thresh structure with the values of the Prefetch, Host, and Write-Back threshold registers of the receive ring. It also carries the hardware offload features to activate, via the DEV_RX_OFFLOAD_* flags. If an offload set in rx_conf->offloads was not already set in the eth_conf->rxmode.offloads argument to rte_eth_dev_configure(), it is treated as a newly added, per-queue offload and is enabled for this queue only. There is no need to repeat in rx_conf->offloads any bit already enabled at the port level in rte_eth_dev_configure(); an offload enabled at the port level cannot be disabled at the queue level.
mb_pool	The pointer to the memory pool from which to allocate rte_mbuf network memory buffers to populate each descriptor of the receive ring.
Returns
0: Success, receive queue correctly set up.
-EIO: if device is removed.
-EINVAL: The size of network buffers which can be allocated from the memory pool does not fit the various buffer sizes allowed by the device controller.
-ENOMEM: Unable to allocate the receive ring descriptors or to allocate network memory buffers from the memory pool when initializing receive descriptors.
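
A short usage sketch, in the spirit of testpmd's initialization: create an mbuf pool and set up RX queue 0 of a port. The pool and ring sizes are illustrative choices, and rte_eth_dev_configure() is assumed to have been called already:

/* Sketch: set up one RX queue backed by a fresh mbuf pool. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int setup_rx(uint16_t port_id)
{
    /* Illustrative sizes: 8192 mbufs, 256 per-core cache, 1024 descriptors. */
    struct rte_mempool *mb_pool =
        rte_pktmbuf_pool_create("rx_pool", 8192, 256, 0,
                                RTE_MBUF_DEFAULT_BUF_SIZE,
                                rte_eth_dev_socket_id(port_id));
    if (mb_pool == NULL)
        return -1;

    /* NULL rx_conf selects the driver's default RX configuration. */
    return rte_eth_rx_queue_setup(port_id, 0, 1024,
                                  rte_eth_dev_socket_id(port_id),
                                  NULL, mb_pool);
}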

rte_eth_rx_burst

Retrieve a burst of input packets from a receive queue of an Ethernet device. The retrieved packets are stored in rte_mbuf structures whose pointers are supplied in the rx_pkts array.

The rte_eth_rx_burst() function loops, parsing the RX ring of the receive queue, up to nb_pkts packets, and for each completed RX descriptor in the ring, it performs the following operations:

  • Initialize the rte_mbuf data structure associated with the RX descriptor according to the information provided by the NIC in that RX descriptor.
  • Store the rte_mbuf data structure into the next entry of the rx_pkts array.
  • Replenish the RX descriptor with a new rte_mbuf buffer allocated from the memory pool associated with the receive queue at initialization time.

When retrieving an input packet that was scattered by the controller into multiple receive descriptors, the rte_eth_rx_burst() function appends the associated rte_mbuf buffers to the first buffer of the packet.

The rte_eth_rx_burst() function returns the number of packets actually retrieved, which is the number of rte_mbuf data structures effectively supplied into the rx_pkts array. A return value equal to nb_pkts indicates that the RX queue contained at least nb_pkts packets, and this likely signifies that other received packets remain in the input queue. Applications implementing a "retrieve as many received packets as possible" policy can check for this case and keep invoking rte_eth_rx_burst() until it returns a value less than nb_pkts.

This receive method has the following advantages:

  • It allows a run-to-completion network stack engine to retrieve and immediately process received packets in a fast, burst-oriented approach, avoiding the overhead of unnecessary intermediate packet queue/dequeue operations.
  • Conversely, it also allows an asynchronous-oriented processing method to retrieve bursts of received packets and immediately queue them for further parallel processing by another logical core, for instance. Instead of having the driver queue received packets individually, this approach lets the caller of rte_eth_rx_burst() queue a whole burst of retrieved packets at a time, dramatically reducing the per-packet cost of enqueue/dequeue operations.
  • It allows the driver's rte_eth_rx_burst() implementation to take advantage of burst-oriented hardware features (CPU cache, prefetch instructions, and so on) to minimize the number of CPU cycles per packet.

To summarize, this receive API enables many burst-oriented optimizations in both synchronous and asynchronous packet-processing environments, with no overhead in either case.

The rte_eth_rx_burst() function does not provide any error notification, to avoid the corresponding overhead. As a hint, the upper-level application might check the status of the device link once rte_eth_rx_burst() has systematically returned 0 for a given number of tries.

Parameters
port_id	The port identifier of the Ethernet device.
queue_id	The index of the receive queue from which to retrieve input packets. The value must be in the range [0, nb_rx_queue - 1] previously supplied to rte_eth_dev_configure().
rx_pkts	The address of an array of pointers to rte_mbuf structures that must be large enough to store nb_pkts pointers in it.
nb_pkts	The maximum number of packets to retrieve.
Returns
The number of packets actually retrieved, which is the number of pointers to rte_mbuf structures effectively supplied to the rx_pkts array.
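
A polling-loop sketch following the "retrieve as many packets as possible" policy described above; BURST_SIZE and the processing step are illustrative:

/* Sketch: RX polling loop draining up to BURST_SIZE packets per call. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                          pkts, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Process pkts[i] here, then return the mbuf to its pool. */
            rte_pktmbuf_free(pkts[i]);
        }
        /* nb_rx == BURST_SIZE hints that more packets are waiting;
         * repeated nb_rx == 0 may warrant a link-status check. */
    }
}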
