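Taking konrad's dom0 tree ( http://git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/ ) as the base tree, this post analyzes how the netback code has changed over the past two years and the patches behind those changes.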

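The first major change is that netback no longer depends on the xen foreign page feature. As described in my earlier post http://blog.csdn.net/majieyue/article/details/10749283 , once a packet has been transmitted, the pages backing the skb and its frags must be reclaimed. Because these pages all come from the mmap_pages array and need to be reused, xen introduced the foreign page to solve the problem: a foreign page overrides the page destructor, replacing the usual put_page with netif_page_release, which puts the page on the dealloc_ring array so that the transmit softirq can reuse it on its next invocation.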


One major change from xen.git is that the guest transmit path (i.e. what looks like receive to netback) has been significantly reworked to remove the dependency on the out of tree PageForeign page flag (a core kernel patch which enables a per page destructor callback on the final put_page). This page flag was used in order to implement a grant map based transmit path (where guest pages are mapped directly into SKB frags). Instead this version of netback uses grant copy operations into regular memory belonging to the backend domain. Reinstating the grant map functionality is something which I would like to revisit in the future.


So once dom0 dropped foreign pages, gnttab_copy became the only way to move page data between the frontend and the backend.


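The other major change is that netback now uses kthreads to handle packet transmit and receive. The previous mechanism triggered a tasklet directly, but a given tasklet can only ever run serially, so it could not make effective use of multiple cores on SMP. The new mechanism creates one kthread per CPU and binds it to that CPU with kthread_bind; each netback device (xenvif) is served by a fixed kthread. Both the guest receive path (xenvif_start_xmit) and the guest transmit path (xenvif_interrupt) wake that kthread (xen_netbk_kick_thread) to do the work.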


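Let us first look at the changes brought by the kthread mechanism. It introduces struct xen_netbk, which now wraps the various static arrays that existed before. There is a one-to-one mapping between a xen_netbk and a kernel thread.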

struct xen_netbk {
    wait_queue_head_t wq;
    struct task_struct *task;

    struct sk_buff_head rx_queue;
    struct sk_buff_head tx_queue;

    struct timer_list net_timer;

    struct page *mmap_pages[MAX_PENDING_REQS];

    pending_ring_idx_t pending_prod;
    pending_ring_idx_t pending_cons;
    struct list_head net_schedule_list;

    /* Protect the net_schedule_list in netif. */
    spinlock_t net_schedule_list_lock;

    atomic_t netfront_count;

    struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
    struct gnttab_copy tx_copy_ops[MAX_PENDING_REQS];

    u16 pending_ring[MAX_PENDING_REQS];

    /*
     * Given MAX_BUFFER_OFFSET of 4096 the worst case is that each
     * head/fragment page uses 2 copy operations because it
     * straddles two buffers in the frontend.
     */
    struct gnttab_copy grant_copy_op[2*XEN_NETIF_RX_RING_SIZE];
    struct netbk_rx_meta meta[2*XEN_NETIF_RX_RING_SIZE];
};


The pending_ring, pending_tx_info and mmap_pages arrays, along with pending_prod and pending_cons, are now per-xen_netbk state. tx_copy_ops has been added because the Tx direction switched from gnttab_map to gnttab_copy.


The xen_netbk array is allocated and initialized in netback_init:

    xen_netbk_group_nr = num_online_cpus();
    xen_netbk = vmalloc(sizeof(struct xen_netbk) * xen_netbk_group_nr);
    if (!xen_netbk) {
        printk(KERN_ALERT "%s: out of memory\n", __func__);
        return -ENOMEM;
    }
    memset(xen_netbk, 0, sizeof(struct xen_netbk) * xen_netbk_group_nr);

Each xen_netbk is then initialized: xen_netbk->rx_queue, xen_netbk->tx_queue and xen_netbk->wq are set up, the kthread is created and bound to its CPU, and finally the kthread is woken up.

    for (group = 0; group < xen_netbk_group_nr; group++) {
        struct xen_netbk *netbk = &xen_netbk[group];
        skb_queue_head_init(&netbk->rx_queue);
        skb_queue_head_init(&netbk->tx_queue);

        init_timer(&netbk->net_timer);
        netbk->net_timer.data = (unsigned long)netbk;
        netbk->net_timer.function = xen_netbk_alarm;

        netbk->pending_cons = 0;
        netbk->pending_prod = MAX_PENDING_REQS;
        for (i = 0; i < MAX_PENDING_REQS; i++)
            netbk->pending_ring[i] = i;

        init_waitqueue_head(&netbk->wq);
        netbk->task = kthread_create(xen_netbk_kthread,
                         (void *)netbk,
                         "netback/%u", group);

        if (IS_ERR(netbk->task)) {
            printk(KERN_ALERT "kthread_run() fails at netback\n");
            del_timer(&netbk->net_timer);
            rc = PTR_ERR(netbk->task);
            goto failed_init;
        }
        kthread_bind(netbk->task, group);

        INIT_LIST_HEAD(&netbk->net_schedule_list);

        spin_lock_init(&netbk->net_schedule_list_lock);

        atomic_set(&netbk->netfront_count, 0);

        wake_up_process(netbk->task);
    }
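The pending ring set up in the loop above is a free list of slot indices: pending_cons advances when a slot is taken, pending_prod advances when a slot is handed back. The following user-space sketch models that arithmetic. The struct and the ring_* helper names are hypothetical, and MAX_PENDING_REQS = 256 is an assumed value (it only needs to be a power of two for the masking to work); pending_index and nr_pending_reqs follow the names used in the netback code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PENDING_REQS 256 /* assumed value; must be a power of two */

typedef unsigned int pending_ring_idx_t;

/* Hypothetical stand-in for the pending-ring fields of struct xen_netbk. */
struct pending_ring_model {
    pending_ring_idx_t prod;         /* advanced when a slot is released */
    pending_ring_idx_t cons;         /* advanced when a slot is taken    */
    uint16_t ring[MAX_PENDING_REQS]; /* free slot indices                */
};

/* Map a free-running counter onto an array position. */
pending_ring_idx_t pending_index(pending_ring_idx_t i)
{
    return i & (MAX_PENDING_REQS - 1);
}

/* Same initial state as the init loop: every slot is free. */
void ring_init(struct pending_ring_model *r)
{
    unsigned int i;

    r->cons = 0;
    r->prod = MAX_PENDING_REQS;
    for (i = 0; i < MAX_PENDING_REQS; i++)
        r->ring[i] = (uint16_t)i;
}

/* Number of slots currently in flight. */
unsigned int nr_pending_reqs(const struct pending_ring_model *r)
{
    return MAX_PENDING_REQS - r->prod + r->cons;
}

/* Take a free slot index (what the Tx path does per request). */
uint16_t ring_take(struct pending_ring_model *r)
{
    assert(nr_pending_reqs(r) < MAX_PENDING_REQS);
    return r->ring[pending_index(r->cons++)];
}

/* Hand a slot index back once the request has completed. */
void ring_release(struct pending_ring_model *r, uint16_t idx)
{
    assert(nr_pending_reqs(r) > 0);
    r->ring[pending_index(r->prod++)] = idx;
}
```

Starting with pending_prod = MAX_PENDING_REQS and pending_cons = 0 makes nr_pending_reqs evaluate to 0, which is why the init code does not start both counters at 0.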


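The kthread first blocks on netbk->wq, a wait_queue_head_t, and only continues once something kicks it. Its job is simply to call xen_netbk_rx_action and xen_netbk_tx_action, respectively, whenever there are packets to receive or transmit.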

static int xen_netbk_kthread(void *data)
{
    struct xen_netbk *netbk = data;
    while (!kthread_should_stop()) {
        wait_event_interruptible(netbk->wq,
                rx_work_todo(netbk) ||
                tx_work_todo(netbk) ||
                kthread_should_stop());
        cond_resched();

        if (kthread_should_stop())
            break;

        if (rx_work_todo(netbk))
            xen_netbk_rx_action(netbk);

        if (tx_work_todo(netbk))
            xen_netbk_tx_action(netbk);
    }

    return 0;
}


In the Tx case, the netfront Tx interrupt handler xenvif_interrupt calls xen_netbk_schedule_xenvif to kick the kthread. In the Rx case, xenvif_start_xmit calls xen_netbk_queue_tx_skb to kick it. xen_netbk_queue_tx_skb simply appends the skb to netbk->rx_queue and then wakes the kthread that belongs to this netbk.

void xen_netbk_queue_tx_skb(struct xenvif *vif, struct sk_buff *skb)
{
    struct xen_netbk *netbk = vif->netbk;

    skb_queue_tail(&netbk->rx_queue, skb);

    xen_netbk_kick_thread(netbk);
}

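In theory, for a single vif the kthread mechanism should actually perform worse than the tasklet: a tasklet gets the CPU sooner than a kthread does, since a kthread can be scheduled out at any time.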


Next, the impact of removing foreign pages. This only affects the Tx case, because Rx has always used gnttab_copy.

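xen_netbk_tx_build_gops used to map the frontend's pages with gnttab_map; it now uses gnttab_copy to copy the frontend skb data into the backend before sending it on. It also has to call alloc_page again before every copy, which is very expensive.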

With struct xen_netbk in place, the arrays needed for transmit have all moved into it: struct page *mmap_pages[MAX_PENDING_REQS], u16 pending_ring[MAX_PENDING_REQS] with its pending_prod and pending_cons, struct gnttab_copy tx_copy_ops[MAX_PENDING_REQS], and struct pending_tx_info pending_tx_info[MAX_PENDING_REQS].


First, xen_netbk_tx_build_gops. Its job is to fill in xen_netbk->tx_copy_ops so that the subsequent hypercall copies over the skb pages the frontend wants to send.

static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
{
    struct gnttab_copy *gop = netbk->tx_copy_ops, *request_gop;
    struct sk_buff *skb;
    int ret;

    while (((nr_pending_reqs(netbk) + MAX_SKB_FRAGS) < MAX_PENDING_REQS) &&
        !list_empty(&netbk->net_schedule_list)) {
        /*
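          nr_pending_reqs(netbk) is the number of xen_netif_tx_requests (slots) still pending, i.e. not yet fully sent. The check means: if the xen_netbk has enough free slots left for at least one more skb, and net_schedule_list holds a schedulable xenvif, do the work; otherwise idle until the kthread is kicked again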
        */
        struct xenvif *vif;
        struct xen_netif_tx_request txreq;
        struct xen_netif_tx_request txfrags[MAX_SKB_FRAGS];
        struct page *page;
        struct xen_netif_extra_info extras[XEN_NETIF_EXTRA_TYPE_MAX-1];
        u16 pending_idx;
        RING_IDX idx;
        int work_to_do;
        unsigned int data_len;
        pending_ring_idx_t index;
        
        /* Get a netif from the list with work to do. */
        /* Take a schedulable xenvif off net_schedule_list; poll_net_schedule_list increments the xenvif's reference count. If there is no xenvif, go back to the while condition */
        vif = poll_net_schedule_list(netbk);
        if (!vif)
            continue;
        
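        /* Check whether the frontend has packets to send on the vif->tx ring. If not, drop the xenvif reference and go back to the while condition. work_to_do is the number of unconsumed requests (tx.req_prod - tx.req_cons) */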
        RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, work_to_do);
        if (!work_to_do) {
            xenvif_put(vif);
            continue;
        }
        
        /* The xen_netif_tx_requests to send lie between vif->tx.req_cons and vif->tx.req_prod; first copy the request for the skb head into txreq */
        idx = vif->tx.req_cons;
        rmb(); /* Ensure that we see the request before we copy it. */
        memcpy(&txreq, RING_GET_REQUEST(&vif->tx, idx), sizeof(txreq));

        /* Credit-based scheduling. */
        if (txreq.size > atomic64_read(&vif->remaining_credit) &&
            tx_credit_exceeded(vif, txreq.size)) {
            xenvif_put(vif);
            continue;
        }
        atomic64_sub(txreq.size, &vif->remaining_credit);

        work_to_do--;
        vif->tx.req_cons = ++idx;

        /*
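          Once the skb head request is handled, check whether this is GSO (the XEN_NETTXF_extra_info bit in txreq.flags). If it is, the next slot is reserved for the extra info, work_to_do is recomputed as the count left for the frags that follow, and idx points at the tx.req_cons of the first skb frag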
        */
        memset(extras, 0, sizeof(extras));
        if (txreq.flags & XEN_NETTXF_extra_info) {
            work_to_do = xen_netbk_get_extras(vif, extras,
                              work_to_do);
            idx = vif->tx.req_cons;
            if (unlikely(work_to_do < 0)) {
                netbk_tx_err(vif, &txreq, idx);
                continue;
            }
        }

        /* netbk_count_requests copies the xen_netif_tx_requests for the skb frags one by one into the txfrags array and returns the number of frags copied. idx advances by that many slots */
        ret = netbk_count_requests(vif, &txreq, txfrags, work_to_do);
        if (unlikely(ret < 0)) {
            netbk_tx_err(vif, &txreq, idx - ret);
            continue;
        }
        idx += ret;

        if (unlikely(txreq.size < ETH_HLEN)) {
            pr_debug("Bad packet size: %d\n", txreq.size);
            netbk_tx_err(vif, &txreq, idx);
            continue;
        }
        /* No crossing a page as the payload mustn't fragment. */
        /* The skb head data must not be shorter than a layer-2 header and must not cross a 4K page */
        if (unlikely((txreq.offset + txreq.size) > PAGE_SIZE)) {
            pr_debug("txreq.offset: %x, size: %u, end: %lu\n",
                 txreq.offset, txreq.size,
                 (txreq.offset&~PAGE_MASK) + txreq.size);
            netbk_tx_err(vif, &txreq, idx);
            continue;
        }

        /* Use pending_cons to find pending_idx, the index of a free slot taken from pending_ring; the mmap_pages and pending_tx_info entries at that index are free to use. Now start building the skb to send out */
        index = pending_index(netbk->pending_cons);
        pending_idx = netbk->pending_ring[index];

        data_len = (txreq.size > PKT_PROT_LEN &&
                ret < MAX_SKB_FRAGS) ?
            PKT_PROT_LEN : txreq.size;

        skb = alloc_skb(data_len + NET_SKB_PAD + NET_IP_ALIGN,
                GFP_ATOMIC | __GFP_NOWARN);
        if (unlikely(skb == NULL)) {
            pr_debug("Can't allocate a skb in start_xmit.\n");
            netbk_tx_err(vif, &txreq, idx);
            break;
        }

        /* Packets passed to netif_rx() must have some headroom. */
        /* The skb head holds at most the largest possible packet header length */
        skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);

        /* Set the skb GSO fields gso_size and gso_type; gso_segs is left to be computed later */
        if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
            struct xen_netif_extra_info *gso;
            gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];

            if (netbk_set_skb_gso(vif, skb, gso)) {
                kfree_skb(skb);
                netbk_tx_err(vif, &txreq, idx);
                continue;
            }
        }

        /*
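          xen_netbk_alloc_page allocates a fresh page with alloc_page and stores it in netbk->mmap_pages[pending_idx]. It also calls set_page_ext to record the group and idx in the page, so that from a page one can directly find the owning netbk group and the index into netbk->mmap_pages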
        */
        page = xen_netbk_alloc_page(netbk, skb, pending_idx);
        if (!page) {
            kfree_skb(skb);
            netbk_tx_err(vif, &txreq, idx);
            continue;
        }

        netbk->mmap_pages[pending_idx] = page;

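        /* Fill in the gop for pending_idx. It copies the whole first request (txreq.size bytes) into this page; the page is kept as frags[0] only when txreq.size exceeds PKT_PROT_LEN, in which case it holds the bytes beyond the linear area */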
        gop->source.u.ref = txreq.gref;
        gop->source.domid = vif->domid;
        gop->source.offset = txreq.offset;

        gop->dest.u.gmfn = virt_to_mfn(page_address(page));
        gop->dest.domid = DOMID_SELF;
        gop->dest.offset = txreq.offset;

        gop->len = txreq.size;
        gop->flags = GNTCOPY_source_gref;

        gop++;

        /* 
          Fill in pending_tx_info[pending_idx] with txreq and vif. The first 2 bytes of skb->data are used to stash pending_idx, so that the matching pending_tx_info entry can later be found directly from the skb
        */
        memcpy(&netbk->pending_tx_info[pending_idx].req,
               &txreq, sizeof(txreq));
        netbk->pending_tx_info[pending_idx].vif = vif;
        *((u16 *)skb->data) = pending_idx;

        __skb_put(skb, data_len);
        skb_shinfo(skb)->nr_frags = ret;
        /* If the first request is longer than data_len, there is one more frag, frags[0], and the page allocated above is used for it: skb_shinfo(skb)->frags[0].page */
        if (data_len < txreq.size) {
            skb_shinfo(skb)->nr_frags++;
            skb_shinfo(skb)->frags[0].page =
                (void *)(unsigned long)pending_idx;
        } else {
            /* Discriminate from any valid pending_idx value. */
            skb_shinfo(skb)->frags[0].page = (void *)~0UL;
        }

        netbk->pending_cons++;

        /*
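            xen_netbk_get_requests walks all the frags of the skb, skipping frags[0] if frags[0].page is already set.
            For each frag it picks a pending_idx, calls xen_netbk_alloc_page to allocate a new page into mmap_pages[pending_idx], fills in the gop and pending_tx_info, and finally sets frags[i].page = pending_idx.
            This repeats until netbk->tx_copy_ops and netbk->pending_tx_info are filled in for every frag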
        */
        request_gop = xen_netbk_get_requests(netbk, vif,
                             skb, txfrags, gop);
        if (request_gop == NULL) {
            kfree_skb(skb);
            netbk_tx_err(vif, &txreq, idx);
            continue;
        }
        gop = request_gop;

        /* Append the skb to netbk->tx_queue; xen_netbk_tx_submit later dequeues and sends it */
        __skb_queue_tail(&netbk->tx_queue, skb);

        vif->tx.req_cons = idx;
        xen_netbk_check_rx_xenvif(vif);

        if ((gop - netbk->tx_copy_ops) >= ARRAY_SIZE(netbk->tx_copy_ops))
            break;
    }
    return gop - netbk->tx_copy_ops;
}


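To sum up, xen_netbk_tx_build_gops does the following: it builds an skb and appends it to netbk->tx_queue; for every xen_netif_tx_request from the frontend it sets up the matching netbk->mmap_pages, netbk->tx_copy_ops and netbk->pending_tx_info entries; and it increments netbk->pending_cons. The new skb is built with several hacks. frags[0].page may hold an invalid sentinel (~0UL, not NULL) instead of a slot index, depending on whether the first request is longer than PKT_PROT_LEN. Every valid frags[i].page stores a pending_idx value rather than a page pointer, and the first 2 bytes of skb->data also store a pending_idx.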
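The index-stashing tricks summarized above can be sketched in user space as follows. This is only an illustration under stated assumptions: all helper names are hypothetical, a plain byte buffer stands in for skb->data, and a void * stands in for the frags[i].page field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sentinel that can never equal a valid 16-bit pending_idx. */
#define SK_INVALID_PAGE_FIELD ((void *)~(uintptr_t)0)

/* Store pending_idx in the first two bytes of the linear buffer. */
void stash_pending_idx(unsigned char *data, uint16_t idx)
{
    memcpy(data, &idx, sizeof(idx));
}

/* Read it back before the real header bytes overwrite it. */
uint16_t fetch_pending_idx(const unsigned char *data)
{
    uint16_t idx;

    memcpy(&idx, data, sizeof(idx));
    return idx;
}

/* Park a slot index in a pointer-sized field until the real page is known. */
void *idx_to_page_field(uint16_t idx)
{
    return (void *)(uintptr_t)idx;
}

uint16_t page_field_to_idx(void *field)
{
    return (uint16_t)(uintptr_t)field;
}

/* A field holds a slot index unless it carries the sentinel. */
int page_field_is_valid(void *field)
{
    return field != SK_INVALID_PAGE_FIELD;
}
```

Slot index 0 encodes to a NULL pointer in this scheme, which is why NULL cannot serve as the "no slot" marker and an all-ones value is used instead.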


Once the hypercall has been made, xen_netbk_tx_submit sends the packets out:

static void xen_netbk_tx_submit(struct xen_netbk *netbk)
{
    struct gnttab_copy *gop = netbk->tx_copy_ops;
    struct sk_buff *skb;

    while ((skb = __skb_dequeue(&netbk->tx_queue)) != NULL) {
        struct xen_netif_tx_request *txp;
        struct xenvif *vif;
        u16 pending_idx;
        unsigned data_len;

        /* First fetch the pending_idx of the skb header fragment */
        pending_idx = *((u16 *)skb->data);

        vif = netbk->pending_tx_info[pending_idx].vif;
        txp = &netbk->pending_tx_info[pending_idx].req;

        /* Check the remap error code. */
        /* xen_netbk_tx_check_gop checks gop->status */
        if (unlikely(xen_netbk_tx_check_gop(netbk, skb, &gop))) {
            pr_debug("netback grant failed.\n");
            skb_shinfo(skb)->nr_frags = 0;
            kfree_skb(skb);
            continue;
        }

        /* Copy data_len bytes of the skb header into skb->data */
        data_len = skb->len;
        memcpy(skb->data,
               (void *)(idx_to_kaddr(netbk, pending_idx)|txp->offset),
               data_len);

        if (data_len < txp->size) {
            /* Append the packet payload as a fragment. */
            txp->offset += data_len;
            txp->size -= data_len;
            /* txp cannot be released yet, because frags[0].page still needs it */
        } else {
            /* Schedule a response immediately. */
            /* 
              xen_netbk_idx_release frees the resources tied to pending_idx: it sends the xen_netif_tx_response, increments pending_prod so that the pending ring gains a free slot,
              and finally frees the page at netbk->mmap_pages[pending_idx]
            */
            xen_netbk_idx_release(netbk, pending_idx);
        }

        if (txp->flags & XEN_NETTXF_csum_blank)
            skb->ip_summed = CHECKSUM_PARTIAL;
        else if (txp->flags & XEN_NETTXF_data_validated)
            skb->ip_summed = CHECKSUM_UNNECESSARY;

        /* Fill in the skb frags array, turning the pending_idx stored in each page field into the real page, offset and size. xen_netbk_idx_release is called at the end */
        xen_netbk_fill_frags(netbk, skb);

        /*
         * If the initial fragment was < PKT_PROT_LEN then
         * pull through some bytes from the other fragments to
         * increase the linear region to PKT_PROT_LEN bytes.
         */
        
        /* If skb_headlen is shorter than PKT_PROT_LEN, call __pskb_pull_tail so that the skb header fragment holds enough header bytes */
        if (skb_headlen(skb) < PKT_PROT_LEN && skb_is_nonlinear(skb)) {
            int target = min_t(int, skb->len, PKT_PROT_LEN);
            __pskb_pull_tail(skb, target - skb_headlen(skb));
        }

        skb->dev      = vif->dev;
        skb->protocol = eth_type_trans(skb, skb->dev);

        if (checksum_setup(vif, skb)) {
            pr_debug("Can't setup checksum in net_tx_action\n");
            kfree_skb(skb);
            continue;
        }

        vif->dev->stats.rx_bytes += skb->len;
        vif->dev->stats.rx_packets++;

        /* Finally hand the skb to the network stack via xenvif_receive_skb */
        xenvif_receive_skb(vif, skb);
    }
}


The Rx case has changed quite a bit as well:

1) A different way of handling an skb linear area that spans multiple pages; for details see start_new_rx_buffer, xen_netbk_count_skb_slots and netbk_gop_skb in the code

    data = skb->data;
    while (data < skb_tail_pointer(skb)) {
        unsigned int offset = offset_in_page(data);
        unsigned int len = PAGE_SIZE - offset;

        if (data + len > skb_tail_pointer(skb))
            len = skb_tail_pointer(skb) - data;

        netbk_gop_frag_copy(vif, skb, npo,
                    virt_to_page(data), len, offset, &head);
        data += len;
    }

The linear area between skb->data and skb_tail_pointer(skb) may span several pages, so a while loop is used to handle it one page at a time.
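To make the loop concrete, here is a small user-space sketch that counts how many per-page pieces such a loop produces for a given start address and length. It assumes 4 KiB pages; count_linear_chunks and SKETCH_PAGE_SIZE are hypothetical names, and a plain integer address stands in for skb->data.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed 4 KiB pages */

/*
 * Count the per-page pieces needed to cover [addr, addr + len),
 * following the same steps as the while loop over the linear area.
 */
size_t count_linear_chunks(unsigned long addr, size_t len)
{
    unsigned long end = addr + len;
    size_t chunks = 0;

    assert(end >= addr); /* no address wrap-around in this model */

    while (addr < end) {
        /* Equivalent of offset_in_page(data). */
        unsigned long offset = addr & (SKETCH_PAGE_SIZE - 1);
        unsigned long n = SKETCH_PAGE_SIZE - offset;

        /* Clamp the last piece to the end of the data. */
        if (addr + n > end)
            n = end - addr;

        addr += n;
        chunks++;
    }
    return chunks;
}
```

For example, 200 bytes starting 4000 bytes into a page are split into a 96-byte piece and a 104-byte piece, so netbk_gop_frag_copy would be called twice for that linear area.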


2) New data structures

struct skb_cb_overlay is added:

struct skb_cb_overlay {
    int meta_slots_used;
};

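meta_slots_used is the number of meta slots the skb occupies, which can be read as the number of slots it takes in the Rx ring. Each slot corresponds to one page that gnttab_copy fills, and the bookkeeping for this is wrapped in a netrx_pending_operations.

union page_ext is added: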

/* extra field used in struct page */
union page_ext {
    struct {
#if BITS_PER_LONG < 64
#define IDX_WIDTH   8
#define GROUP_WIDTH (BITS_PER_LONG - IDX_WIDTH)
        unsigned int group:GROUP_WIDTH;
        unsigned int idx:IDX_WIDTH;
#else
        unsigned int group, idx;
#endif
    } e;
    void *mapping;
};

page_ext.e is mainly used in the Tx case, while page_ext.mapping is mainly used in the Rx case.


3) PKT_PROT_LEN is changed to the largest possible packet header length

#define PKT_PROT_LEN    (ETH_HLEN + \
             VLAN_HLEN + \
             sizeof(struct iphdr) + MAX_IPOPTLEN + \
             sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
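As a worked example of this definition, the sketch below adds up the terms using the values these constants usually have in Linux headers (an assumption here, since the kernel headers are not included), and mirrors the data_len selection seen earlier in xen_netbk_tx_build_gops. The SK_ names and linear_len are hypothetical, and max_frags is passed in because MAX_SKB_FRAGS differs between kernel versions.

```c
#include <assert.h>

/* Values assumed from common Linux headers; not taken from this tree. */
enum {
    SK_ETH_HLEN             = 14, /* Ethernet header                */
    SK_VLAN_HLEN            = 4,  /* 802.1Q tag                     */
    SK_IPHDR_LEN            = 20, /* sizeof(struct iphdr)           */
    SK_MAX_IPOPTLEN         = 40, /* maximum IPv4 options           */
    SK_TCPHDR_LEN           = 20, /* sizeof(struct tcphdr)          */
    SK_MAX_TCP_OPTION_SPACE = 40, /* maximum TCP options            */
    SK_PKT_PROT_LEN = SK_ETH_HLEN + SK_VLAN_HLEN +
                      SK_IPHDR_LEN + SK_MAX_IPOPTLEN +
                      SK_TCPHDR_LEN + SK_MAX_TCP_OPTION_SPACE
};

/*
 * Size of the linear area for a packet whose first request carries
 * 'size' bytes and which already needs 'nr_frags' frag slots:
 * cap it at PKT_PROT_LEN only if one more frag slot is available.
 */
unsigned int linear_len(unsigned int size, int nr_frags, int max_frags)
{
    assert(nr_frags >= 0 && nr_frags <= max_frags);
    return (size > SK_PKT_PROT_LEN && nr_frags < max_frags) ?
           (unsigned int)SK_PKT_PROT_LEN : size;
}
```

Under these assumed values the total comes to 138 bytes, so a full-size 1500-byte first request keeps 138 bytes in the linear area and leaves the rest in frags[0].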

4) Changes to struct netrx_pending_operations

struct netrx_pending_operations {
    unsigned copy_prod, copy_cons;
    unsigned meta_prod, meta_cons;
    struct gnttab_copy *copy;
    struct netbk_rx_meta *meta;
    int copy_off;
    grant_ref_t copy_gref;
};

copy_prod, copy_cons and struct gnttab_copy *copy all refer to struct gnttab_copy grant_copy_op[2*XEN_NETIF_RX_RING_SIZE] in xen_netbk; copy points at the start of that array.

meta_prod, meta_cons and struct netbk_rx_meta *meta refer to struct netbk_rx_meta meta[2*XEN_NETIF_RX_RING_SIZE] in xen_netbk; meta points at the start of that array.

copy_gref is the grant reference of the destination page, and copy_off is the offset within that page at which to copy.


5) get_next_rx_buffer: using the struct netrx_pending_operations, it takes the next free Rx request, initializes a netbk_rx_meta for it, and returns a pointer to that netbk_rx_meta

static struct netbk_rx_meta *get_next_rx_buffer(struct xenvif *vif,
                        struct netrx_pending_operations *npo)
{
    struct netbk_rx_meta *meta;
    struct xen_netif_rx_request *req;

    req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);

    meta = npo->meta + npo->meta_prod++;
    meta->gso_size = 0;
    meta->size = 0;
    meta->id = req->id;

    npo->copy_off = 0;
    npo->copy_gref = req->gref;

    return meta;
}


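6) start_new_rx_buffer decides whether a new buffer is needed. For the first frag (head = 1) it never starts a new buffer; otherwise it tries to keep a frag within a single buffer. A buffer here means the xen_netif_rx_request that rx->req_cons points to, i.e. the page into which netfront receives the data copied by netback, together with the netbk_rx_meta that meta_prod points to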

/*
 * Returns true if we should start a new receive buffer instead of
 * adding 'size' bytes to a buffer which currently contains 'offset'
 * bytes.
 */
static bool start_new_rx_buffer(int offset, unsigned long size, int head)
{
    /* simple case: we have completely filled the current buffer. */
    if (offset == MAX_BUFFER_OFFSET)
        return true;
    /*
     * complex case: start a fresh buffer if the current frag
     * would overflow the current buffer but only if:
     *     (i)   this frag would fit completely in the next buffer
     * and (ii)  there is already some data in the current buffer
     * and (iii) this is not the head buffer.
     *
     * Where:
     * - (i) stops us splitting a frag into two copies
     *   unless the frag is too large for a single buffer.
     * - (ii) stops us from leaving a buffer pointlessly empty.
     * - (iii) stops us leaving the first buffer
     *   empty. Strictly speaking this is already covered
     *   by (ii) but is explicitly checked because
     *   netfront relies on the first buffer being
     *   non-empty and can crash otherwise.
     *
     * This means we will effectively linearise small
     * frags but do not needlessly split large buffers
     * into multiple copies tend to give large frags their
     * own buffers as before.
     */
    if ((offset + size > MAX_BUFFER_OFFSET) &&
        (size <= MAX_BUFFER_OFFSET) && offset && !head)
        return true;
    return false;
}
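The decision rule is easier to follow with a few concrete inputs. The sketch below is a user-space copy of the function with MAX_BUFFER_OFFSET assumed to be 4096, plus a small table of cases; the table and count_mismatches are hypothetical additions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BUFFER_OFFSET 4096 /* assumed equal to a 4 KiB page */

/* User-space copy of the decision function shown above. */
bool start_new_rx_buffer(int offset, unsigned long size, int head)
{
    if (offset == MAX_BUFFER_OFFSET)
        return true;
    if ((offset + size > MAX_BUFFER_OFFSET) &&
        (size <= MAX_BUFFER_OFFSET) && offset && !head)
        return true;
    return false;
}

struct buffer_case {
    int offset;          /* bytes already in the current buffer */
    unsigned long size;  /* bytes about to be added             */
    int head;            /* 1 while filling the head buffer     */
    bool expect;         /* true = start a fresh buffer         */
};

const struct buffer_case buffer_cases[] = {
    { 4096,  100, 0, true  }, /* current buffer completely full         */
    { 3000, 2000, 0, true  }, /* would overflow, fits in a fresh buffer */
    { 3000, 2000, 1, false }, /* head buffer is never abandoned         */
    {    0, 4096, 0, false }, /* empty buffer: just fill it             */
    {  100,  100, 0, false }, /* fits in the space that is left         */
};

/* Return how many table rows disagree with the function. */
int count_mismatches(void)
{
    int bad = 0;
    size_t i;

    for (i = 0; i < sizeof(buffer_cases) / sizeof(buffer_cases[0]); i++) {
        const struct buffer_case *c = &buffer_cases[i];

        if (start_new_rx_buffer(c->offset, c->size, c->head) != c->expect)
            bad++;
    }
    return bad;
}
```

Because the caller never passes more than one page of data at a time, condition (i) always holds in practice, so the outcome is decided by whether the current buffer already holds data and whether it is the head buffer.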


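Now for two important functions, netbk_gop_frag_copy and netbk_gop_skb. Note the extra gso_prefix bit here: Citrix added it for its Windows PV drivers, and for this analysis it can be treated as always false.

netbk_gop_skb sets up a struct netrx_pending_operations for the skb and returns the number of buffers the skb occupies.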

static int netbk_gop_skb(struct sk_buff *skb,
             struct netrx_pending_operations *npo)
{
    struct xenvif *vif = netdev_priv(skb->dev);
    int nr_frags = skb_shinfo(skb)->nr_frags;
    int i;
    struct xen_netif_rx_request *req;
    struct netbk_rx_meta *meta;
    unsigned char *data;
    int head = 1;
    int old_meta_prod;

    /* Save the old meta_prod index in old_meta_prod first */
    old_meta_prod = npo->meta_prod;

    /* Set up a GSO prefix descriptor, if necessary */
    if (skb_shinfo(skb)->gso_size && vif->gso_prefix) {
        req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
        meta = npo->meta + npo->meta_prod++;
        meta->gso_size = skb_shinfo(skb)->gso_size;
        meta->size = 0;
        meta->id = req->id;
    }
    /* Take the first xen_netif_rx_request */
    req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);

    /* Find the first usable buffer; a buffer consists of a xen_netif_rx_request and a netbk_rx_meta */
    meta = npo->meta + npo->meta_prod++;

    /* For a GSO skb, set meta->gso_size */
    if (!vif->gso_prefix)
        meta->gso_size = skb_shinfo(skb)->gso_size;
    else
        meta->gso_size = 0;

    /* Initialize the meta and npo state */
    meta->size = 0;
    meta->id = req->id;
    npo->copy_off = 0;
    npo->copy_gref = req->gref;

    /*
        First set up the netrx_pending_operations for the skb linear area. head is 1 on the first call to netbk_gop_frag_copy and 0 afterwards.
        If the linear area spans several pages, netbk_gop_frag_copy is called once per page
    */
    data = skb->data;
    while (data < skb_tail_pointer(skb)) {
        unsigned int offset = offset_in_page(data);
        unsigned int len = PAGE_SIZE - offset;

        if (data + len > skb_tail_pointer(skb))
            len = skb_tail_pointer(skb) - data;

        /*
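            netbk_gop_frag_copy sets up the gnttab_copy ops that copy the contents of dom0's skb pages into domU's pages. Since dom0 may use compound pages,
            backend and frontend pages no longer correspond one to one. Backend skb data is packed into the current frontend buffer as far as possible, and the next frontend buffer is only used once that one is full, a small improvement over the earlier approach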
        */
        netbk_gop_frag_copy(vif, skb, npo,
                    virt_to_page(data), len, offset, &head);

        data += len;
    }

    /* Then set up the netrx_pending_operations for each skb frag */
    for (i = 0; i < nr_frags; i++) {
        netbk_gop_frag_copy(vif, skb, npo,
                    skb_shinfo(skb)->frags[i].page,
                    skb_shinfo(skb)->frags[i].size,
                    skb_shinfo(skb)->frags[i].page_offset,
                    &head);
    }

    return npo->meta_prod - old_meta_prod;
}


netbk_gop_frag_copy fills in the struct netrx_pending_operations entries for one skb fragment.

static void netbk_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
                struct netrx_pending_operations *npo,
                struct page *page, unsigned long size,
                unsigned long offset, int *head)
{
    struct gnttab_copy *copy_gop;
    struct netbk_rx_meta *meta;
    /*
     * These variables are used iff get_page_ext returns true,
     * in which case they are guaranteed to be initialized.
     */
    unsigned int uninitialized_var(group), uninitialized_var(idx);
    int foreign = get_page_ext(page, &group, &idx);
    unsigned long bytes;

    /* Data must not cross a page boundary. */
    /* size + offset must not exceed the size of the compound page */
    BUG_ON(size + offset > PAGE_SIZE<<compound_order(page));

    /* The netbk_rx_meta of the buffer being filled */
    meta = npo->meta + npo->meta_prod - 1;

    /* Skip unused frames from start of page */
    page += offset >> PAGE_SHIFT;
    /* offset may be larger than one or more PAGE_SIZE, so locate the specific 4K page and reduce offset to a value below 4K */
    offset &= ~PAGE_MASK;

    while (size > 0) {
        BUG_ON(offset >= PAGE_SIZE);
        BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);

        bytes = PAGE_SIZE - offset;

        /* bytes > size means the data ends within this page, so copying size bytes is enough. Otherwise the next iteration carries on. Whether size crosses a page is measured from offset */
        if (bytes > size)
            bytes = size;

        /* If the bytes to copy would run past the end of the current domU page, a new buffer is needed. get_next_rx_buffer increments rx.req_cons */
        if (start_new_rx_buffer(npo->copy_off, bytes, *head)) {
            /*
             * Netfront requires there to be some data in the head
             * buffer.
             */
            BUG_ON(*head);
            meta = get_next_rx_buffer(vif, npo);
        }

        /* 
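          Measure again from npo->copy_off to see whether the copy would cross a page (MAX_BUFFER_OFFSET is 4K here), and truncate it if so.
          Note that offset is the offset within the dom0 page, whereas npo->copy_off is the offset within the domU page.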
        */
        if (npo->copy_off + bytes > MAX_BUFFER_OFFSET)
            bytes = MAX_BUFFER_OFFSET - npo->copy_off;

        copy_gop = npo->copy + npo->copy_prod++;
        copy_gop->flags = GNTCOPY_dest_gref;
        if (foreign) {
            struct xen_netbk *netbk = &xen_netbk[group];
            struct pending_tx_info *src_pend;

            src_pend = &netbk->pending_tx_info[idx];

            copy_gop->source.domid = src_pend->vif->domid;
            copy_gop->source.u.ref = src_pend->req.gref;
            copy_gop->flags |= GNTCOPY_source_gref;
        } else {
            /* Fill in the gnttab_copy: copy_gop->source.domid is dom0 and copy_gop->source.u.gmfn is the machine frame number of the source page */
            void *vaddr = page_address(page);
            copy_gop->source.domid = DOMID_SELF;
            copy_gop->source.u.gmfn = virt_to_mfn(vaddr);
        }

        /* copy_gop->source.offset is the offset within the dom0 source page */
        copy_gop->source.offset = offset;

        /* copy_gop->dest.domid is the domain id of domU, copy_gop->dest.offset is the offset within the domU page, copy_gop->dest.u.ref is the grant reference of the domU page, and copy_gop->len is the number of bytes to copy */
        copy_gop->dest.domid = vif->domid;
        copy_gop->dest.offset = npo->copy_off;
        copy_gop->dest.u.ref = npo->copy_gref;
        copy_gop->len = bytes;

        /* At this point meta->size grows by bytes, the dom0 and domU page offsets advance by bytes, and the number of bytes left to copy shrinks by bytes */
        npo->copy_off += bytes;
        meta->size += bytes;
        offset += bytes;
        size -= bytes;

        /* Next frame */
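        /* If offset has reached PAGE_SIZE but size is still non-zero, dom0 handed over a compound page. Move the page pointer to the next 4k page and reset offset to 0. If the page turns out not to be PageCompound, this is a bug */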
        if (offset == PAGE_SIZE && size) {
            BUG_ON(!PageCompound(page));
            page++;
            offset = 0;
        }

        /* Leave a gap for the GSO descriptor. */
        /* If head is 1 and gso_size is non-zero, increment rx.req_cons to reserve one xen_netif_rx_request for the GSO info */
        if (*head && skb_shinfo(skb)->gso_size && !vif->gso_prefix)
            vif->rx.req_cons++;
        *head = 0; /* There must be something in this buffer now. */
    }
}


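xen_netbk_count_skb_slots is similar to netbk_gop_frag_copy, except that it does not fill in gnttab_copy, netrx_pending_operations and the other structures. Compared with earlier versions this function has become quite complex. The root cause is that every frag may be a compound page, so the old one-to-one mapping between backend and frontend pages is gone: one large backend page may fill several 4k frontend pages, which adds the extra work of tracking offset and copy_off.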
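The packing behaviour can be modelled in user space by tracking only the counters. The sketch below is a simplified model, not the kernel function: it ignores the extra GSO slot and gso_prefix, assumes 4 KiB pages and buffers, and uses hypothetical names throughout (rx_pack_state, rx_pack_chunk, needs_new_buffer).

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PAGE_SIZE         4096UL /* assumed backend page size    */
#define SK_MAX_BUFFER_OFFSET 4096UL /* assumed frontend buffer size */

struct rx_pack_state {
    unsigned long copy_off; /* bytes already in the current frontend buffer */
    unsigned int buffers;   /* frontend buffers (meta slots) used so far    */
    unsigned int copies;    /* grant-copy operations queued so far          */
    int head;               /* 1 until the first bytes have been placed     */
};

/* Same rule as start_new_rx_buffer, restated for this model. */
bool needs_new_buffer(unsigned long copy_off, unsigned long bytes, int head)
{
    if (copy_off == SK_MAX_BUFFER_OFFSET)
        return true;
    return copy_off + bytes > SK_MAX_BUFFER_OFFSET &&
           bytes <= SK_MAX_BUFFER_OFFSET && copy_off && !head;
}

/* The first buffer is claimed up front, as netbk_gop_skb does. */
void rx_pack_init(struct rx_pack_state *s)
{
    s->copy_off = 0;
    s->buffers = 1;
    s->copies = 0;
    s->head = 1;
}

/* Account for one chunk (linear piece or frag) of 'size' bytes. */
void rx_pack_chunk(struct rx_pack_state *s,
                   unsigned long offset, unsigned long size)
{
    offset &= SK_PAGE_SIZE - 1;

    while (size > 0) {
        unsigned long bytes = SK_PAGE_SIZE - offset;

        if (bytes > size)
            bytes = size;

        if (needs_new_buffer(s->copy_off, bytes, s->head)) {
            s->buffers++;
            s->copy_off = 0;
        }

        /* Never write past the end of the frontend buffer. */
        if (s->copy_off + bytes > SK_MAX_BUFFER_OFFSET)
            bytes = SK_MAX_BUFFER_OFFSET - s->copy_off;

        s->copies++;
        s->copy_off += bytes;
        offset += bytes;
        size -= bytes;

        if (offset == SK_PAGE_SIZE)
            offset = 0; /* next 4k page of a compound page */
        s->head = 0;
    }
}
```

In this model a 100-byte head followed by a full-page frag and a 100-byte frag ends up in three buffers with three copies, while a 200-byte linear area that straddles a page boundary needs two copies into a single buffer.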


Finally, xen_netbk_rx_action:

static void xen_netbk_rx_action(struct xen_netbk *netbk)
{
    struct xenvif *vif = NULL, *tmp;
    s8 status;
    u16 irq, flags;
    struct xen_netif_rx_response *resp;
    struct sk_buff_head rxq;
    struct sk_buff *skb;
    LIST_HEAD(notify);
    int ret;
    int nr_frags;
    int count;
    unsigned long offset;
    struct skb_cb_overlay *sco;

    /* In the freshly initialized netrx_pending_operations, meta_prod, meta_cons, copy_prod and copy_cons are all 0, copy points at the netbk->grant_copy_op array and meta points at the netbk->meta array */
    struct netrx_pending_operations npo = {
        .copy  = netbk->grant_copy_op,
        .meta  = netbk->meta,
    };

    skb_queue_head_init(&rxq);

    count = 0;
    while ((skb = skb_dequeue(&netbk->rx_queue)) != NULL) {
        vif = netdev_priv(skb->dev);
        nr_frags = skb_shinfo(skb)->nr_frags;

        /*
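            sco->meta_slots_used records how many IO ring slots this skb uses,
            i.e. how many xen_netif_rx_requests it consumes and how many xen_netif_rx_responses it produces. The metadata of these slots lives in the npo->meta array and the gnttab_copy ops in the npo->copy array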
        */
        sco = (struct skb_cb_overlay *)skb->cb;
        sco->meta_slots_used = netbk_gop_skb(skb, &npo);

        /* count is the number of frags the skbs occupy (including one for each linear area) */
        count += nr_frags + 1;

        __skb_queue_tail(&rxq, skb);

        /* Filled the batch queue? */
        if (count + MAX_SKB_FRAGS >= XEN_NETIF_RX_RING_SIZE)
            break;
    }

    BUG_ON(npo.meta_prod > ARRAY_SIZE(netbk->meta));

    if (!npo.copy_prod)
        return;

    /*
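        Issue the hypercall to perform the gnttab_copy operations. npo.copy_prod == 0 means there are no pages to copy, so return directly; a value larger than the netbk->grant_copy_op array is a bug.
        Likewise, npo.meta_prod larger than the netbk->meta array is a bug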
    */
    BUG_ON(npo.copy_prod > ARRAY_SIZE(netbk->grant_copy_op));
    ret = HYPERVISOR_grant_table_op(GNTTABOP_copy, &netbk->grant_copy_op,
                    npo.copy_prod);
    BUG_ON(ret != 0);

    while ((skb = __skb_dequeue(&rxq)) != NULL) {
        sco = (struct skb_cb_overlay *)skb->cb;

        vif = netdev_priv(skb->dev);

        if (netbk->meta[npo.meta_cons].gso_size && vif->gso_prefix) {
            resp = RING_GET_RESPONSE(&vif->rx,
                        vif->rx.rsp_prod_pvt++);

            resp->flags = XEN_NETRXF_gso_prefix | XEN_NETRXF_more_data;

            resp->offset = netbk->meta[npo.meta_cons].gso_size;
            resp->id = netbk->meta[npo.meta_cons].id;
            resp->status = sco->meta_slots_used;

            npo.meta_cons++;
            sco->meta_slots_used--;
        }

        vif->dev->stats.tx_bytes += skb->len;
        vif->dev->stats.tx_packets++;

        /* netbk_check_gop mainly consumes npo->copy_cons */
        status = netbk_check_gop(sco->meta_slots_used,
                     vif->domid, &npo);

        /* If meta_slots_used is 1, the skb has no frags and even the linear area did not cross a page; otherwise the XEN_NETRXF_more_data flag must be set */
        if (sco->meta_slots_used == 1)
            flags = 0;
        else
            flags = XEN_NETRXF_more_data;

        if (skb->ip_summed == CHECKSUM_PARTIAL) /* local packet? */
            flags |= XEN_NETRXF_csum_blank | XEN_NETRXF_data_validated;
        else if (skb->ip_summed == CHECKSUM_UNNECESSARY)
            /* remote but checksummed. */
            flags |= XEN_NETRXF_data_validated;
        offset = 0;
        /*
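            make_rx_response is simple: it fills in a few fields of the xen_netif_rx_response and then increments rx.rsp_prod_pvt. Since this is the first frag of the skb, offset starts at 0.
            As seen later, netbk_add_frag_responses also uses offset 0 for every remaining frag of the skb, and all but the last frag carry the XEN_NETRXF_more_data flag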
        */
        resp = make_rx_response(vif, netbk->meta[npo.meta_cons].id,
                    status, offset,
                    netbk->meta[npo.meta_cons].size,
                    flags);

        /* If the skb is GSO, the next xen_netif_rx_response slot carries the GSO info */
        if (netbk->meta[npo.meta_cons].gso_size && !vif->gso_prefix) {
            struct xen_netif_extra_info *gso =
                (struct xen_netif_extra_info *)
                RING_GET_RESPONSE(&vif->rx,
                          vif->rx.rsp_prod_pvt++);

            resp->flags |= XEN_NETRXF_extra_info;

            gso->u.gso.size = netbk->meta[npo.meta_cons].gso_size;
            gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
            gso->u.gso.pad = 0;
            gso->u.gso.features = 0;

            gso->type = XEN_NETIF_EXTRA_TYPE_GSO;
            gso->flags = 0;
        }

        netbk_add_frag_responses(vif, status,
                     netbk->meta + npo.meta_cons + 1,
                     sco->meta_slots_used);

        /* The skbs on netbk->rx_queue may belong to different xenvifs, so those xenvifs are collected on a notify list and the frontends are notified together */
        RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&vif->rx, ret);
        irq = vif->irq;
        if (ret && list_empty(&vif->notify_list))
            list_add_tail(&vif->notify_list, &notify);

        xenvif_notify_tx_completion(vif);

        xenvif_put(vif);
        npo.meta_cons += sco->meta_slots_used;
        dev_kfree_skb(skb);
    }

    /* For every xenvif on the notify list, raise its irq so that the frontend picks up the packets */
    list_for_each_entry_safe(vif, tmp, &notify, notify_list) {
        notify_remote_via_irq(vif->irq);
        list_del_init(&vif->notify_list);
    }

    /* More work to do? */
    /* If rx_queue still holds unprocessed skbs and the net_timer timer is not pending, kick the kthread again to keep going */
    if (!skb_queue_empty(&netbk->rx_queue) &&
            !timer_pending(&netbk->net_timer))
        xen_netbk_kick_thread(netbk);
}
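The flag selection for the first response of each skb in xen_netbk_rx_action can be isolated into a small helper, sketched here in user space. The flag bit positions are assumed from Xen's public netif header rather than taken from this tree, and the SK_ names, enum sk_csum and first_rx_flags are hypothetical.

```c
#include <assert.h>

/* Bit values assumed from Xen's public io/netif.h. */
#define SK_NETRXF_data_validated (1U << 0)
#define SK_NETRXF_csum_blank     (1U << 1)
#define SK_NETRXF_more_data      (1U << 2)

/* Stand-in for the skb->ip_summed states that matter here. */
enum sk_csum {
    SK_CSUM_NONE,
    SK_CSUM_UNNECESSARY, /* remote packet, checksum already verified */
    SK_CSUM_PARTIAL      /* local packet, checksum not yet filled in */
};

/* Flags for the first response of an skb using meta_slots_used slots. */
unsigned int first_rx_flags(int meta_slots_used, enum sk_csum csum)
{
    unsigned int flags;

    assert(meta_slots_used >= 1);

    /* More than one slot means further responses follow. */
    flags = (meta_slots_used == 1) ? 0 : SK_NETRXF_more_data;

    if (csum == SK_CSUM_PARTIAL)
        flags |= SK_NETRXF_csum_blank | SK_NETRXF_data_validated;
    else if (csum == SK_CSUM_UNNECESSARY)
        flags |= SK_NETRXF_data_validated;

    return flags;
}
```

The XEN_NETRXF_extra_info bit is not modelled here; in the listing above it is OR-ed in afterwards when a GSO descriptor slot follows the first response.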


