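sk_buff: Definition and Operations
1. The sk_buff structure
sk_buff, the socket buffer, carries packet data between the layers of the Linux networking subsystem. Nearly every part of the stack touches it, which is why it is often called the "nerve center" of the network code.
On transmit, the kernel's networking code builds an sk_buff holding the data to send and hands it down the stack. Each layer adds its own protocol header to the sk_buff until the packet reaches the network device for transmission.
On receive, the network device takes the data off the physical medium, wraps it in an sk_buff and passes it up the stack. Each layer strips its own protocol header until the payload is delivered to the user.
[Figure: layout of the sk_buff structure]
sk_buff is defined as follows: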
- /**
- * struct sk_buff - socket buffer
- * @next: Next buffer in list
- * @prev: Previous buffer in list
- * @sk: Socket we are owned by
- * @tstamp: Time we arrived
- * @dev: Device we arrived on/are leaving by
- * @transport_header: Transport layer header
- * @network_header: Network layer header
- * @mac_header: Link layer header
- * @_skb_refdst: destination entry (with norefcount bit)
- * @sp: the security path, used for xfrm
- * @cb: Control buffer. Free for use by every layer. Put private vars here
- * @len: Length of actual data
- * @data_len: Data length
- * @mac_len: Length of link layer header
- * @hdr_len: writable header length of cloned skb
- * @csum: Checksum (must include start/offset pair)
- * @csum_start: Offset from skb->head where checksumming should start
- * @csum_offset: Offset from csum_start where checksum should be stored
- * @local_df: allow local fragmentation
- * @cloned: Head may be cloned (check refcnt to be sure)
- * @nohdr: Payload reference only, must not modify header
- * @pkt_type: Packet class
- * @fclone: skbuff clone status
- * @ip_summed: Driver fed us an IP checksum
- * @priority: Packet queueing priority
- * @users: User count - see {datagram,tcp}.c
- * @protocol: Packet protocol from driver
- * @truesize: Buffer size
- * @head: Head of buffer
- * @data: Data head pointer
- * @tail: Tail pointer
- * @end: End pointer
- * @destructor: Destruct function
- * @mark: Generic packet mark
- * @nfct: Associated connection, if any
- * @ipvs_property: skbuff is owned by ipvs
- * @peeked: this packet has been seen already, so stats have been
- * done for it, don't do them again
- * @nf_trace: netfilter packet trace flag
- * @nfctinfo: Relationship of this skb to the connection
- * @nfct_reasm: netfilter conntrack re-assembly pointer
- * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
- * @skb_iif: ifindex of device we arrived on
- * @rxhash: the packet hash computed on receive
- * @queue_mapping: Queue mapping for multiqueue devices
- * @tc_index: Traffic control index
- * @tc_verd: traffic control verdict
- * @ndisc_nodetype: router type (from link layer)
- * @dma_cookie: a cookie to one of several possible DMA operations
- * done by skb DMA functions
- * @secmark: security marking
- * @vlan_tci: vlan tag control information
- */
- struct sk_buff {
- /* These two members must be first. */
- struct sk_buff *next; // list pointers: next and previous buffer in the list
- struct sk_buff *prev;
- ktime_t tstamp; // timestamp of when the packet arrived
- struct sock *sk; // socket that owns this buffer
- struct net_device *dev; // network device the buffer arrived on or is leaving by
- /*
- * This is the control buffer. It is free to use for every
- * layer. Please put your private variables there. If you
- * want to keep them across layers you have to do a skb_clone()
- * first. This is owned by whoever has the skb queued ATM.
- */
- char cb[48] __aligned(8);
- unsigned long _skb_refdst;
- #ifdef CONFIG_XFRM
- struct sec_path *sp;
- #endif
- unsigned int len,
- data_len;
- __u16 mac_len,
- hdr_len;
- union {
- __wsum csum;
- struct {
- __u16 csum_start;
- __u16 csum_offset;
- };
- };
- __u32 priority;
- kmemcheck_bitfield_begin(flags1);
- __u8 local_df:1,
- cloned:1,
- ip_summed:2, // checksum handling state for this packet
- nohdr:1,
- nfctinfo:3;
- __u8 pkt_type:3,
- fclone:2,
- ipvs_property:1,
- peeked:1,
- nf_trace:1;
- kmemcheck_bitfield_end(flags1);
- __be16 protocol;
- void (*destructor)(struct sk_buff *skb);
- #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
- struct nf_conntrack *nfct;
- struct sk_buff *nfct_reasm;
- #endif
- #ifdef CONFIG_BRIDGE_NETFILTER
- struct nf_bridge_info *nf_bridge;
- #endif
- int skb_iif;
- #ifdef CONFIG_NET_SCHED
- __u16 tc_index; /* traffic control index */
- #ifdef CONFIG_NET_CLS_ACT
- __u16 tc_verd; /* traffic control verdict */
- #endif
- #endif
- __u32 rxhash;
- kmemcheck_bitfield_begin(flags2);
- __u16 queue_mapping:16;
- #ifdef CONFIG_IPV6_NDISC_NODETYPE
- __u8 ndisc_nodetype:2,
- deliver_no_wcard:1;
- #else
- __u8 deliver_no_wcard:1;
- #endif
- kmemcheck_bitfield_end(flags2);
- /* 0/14 bit hole */
- #ifdef CONFIG_NET_DMA
- dma_cookie_t dma_cookie;
- #endif
- #ifdef CONFIG_NETWORK_SECMARK
- __u32 secmark;
- #endif
- union {
- __u32 mark;
- __u32 dropcount;
- };
- __u16 vlan_tci;
- sk_buff_data_t transport_header; // transport-layer header
- sk_buff_data_t network_header; // network-layer header
- sk_buff_data_t mac_header; // link-layer header
- /* These elements must be at the end, see alloc_skb() for details. */
- sk_buff_data_t tail;
- sk_buff_data_t end;
- unsigned char *head,
- *data;
- unsigned int truesize;
- atomic_t users;
- };
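1.1 Per-layer protocol headers:
--- transport_header : transport-layer header, i.e. the header that follows the network-layer header, such as TCP, UDP, ICMP or IGMP
--- network_header : network-layer header, such as IP, IPv6 or ARP
--- mac_header : link-layer header.
--- sk_buff_data_t is either an unsigned char pointer or, when NET_SKBUFF_DATA_USES_OFFSET is defined (as on 64-bit builds), an unsigned int offset from skb->head: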
- #ifdef NET_SKBUFF_DATA_USES_OFFSET
- typedef unsigned int sk_buff_data_t;
- #else
- typedef unsigned char *sk_buff_data_t;
- #endif
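To make the two typedefs above concrete, here is a minimal user-space sketch, not kernel code. The names `USE_OFFSET`, `mini_data_t`, `struct mini_skb`, `mini_set_tail` and `mini_tail_pointer` are made up for illustration. The point is that callers go through an accessor, so they see the same address whichever representation is compiled in.

```c
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define USE_OFFSET 1                 /* set to 0 to model the pointer variant */

#if USE_OFFSET
typedef unsigned int mini_data_t;    /* stored as an offset from head */
#else
typedef unsigned char *mini_data_t;  /* stored as an absolute address */
#endif

struct mini_skb {
    unsigned char *head;             /* start of the data buffer */
    mini_data_t tail;                /* end of valid data, in either form */
};

/* Record that the valid data ends 'off' bytes after head. */
static void mini_set_tail(struct mini_skb *skb, size_t off)
{
#if USE_OFFSET
    skb->tail = (mini_data_t)off;
#else
    skb->tail = skb->head + off;
#endif
}

/* Resolve tail to a real address, hiding the representation from callers. */
static unsigned char *mini_tail_pointer(const struct mini_skb *skb)
{
#if USE_OFFSET
    return skb->head + skb->tail;
#else
    return skb->tail;
#endif
}
```

With either setting of `USE_OFFSET`, `mini_tail_pointer()` returns `head + off` after `mini_set_tail(&skb, off)`. Storing offsets keeps these fields at 32 bits on a 64-bit machine.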
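1.2 Data buffer pointers: head, data, tail, end
--- *head : start address of the memory allocated for the packet data buffer. Once the sk_buff and its data buffer have been allocated, this pointer stays fixed.
--- *data : start address of the valid data for the protocol layer that currently owns the buffer.
What counts as valid data differs from layer to layer:
a. Transport layer: user data plus the transport-layer header.
b. Network layer: user data, the transport-layer header and the network-layer header.
c. Data link layer: user data, the transport-layer header, the network-layer header and the link-layer header.
The data pointer therefore moves as ownership of the sk_buff passes from one layer to the next.
--- tail : end address of the valid data for the current layer; the counterpart of data.
--- end : end address of the allocated data buffer; the counterpart of head. Like head, it stays fixed once the sk_buff has been allocated.
[Figure: relationship between head, data, tail and end]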
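The relationship between the four pointers can also be written down as three small computations. The sketch below is a user-space model with made-up names (`struct buf_model`, `model_headroom`, `model_datalen`, `model_tailroom`). It assumes a linear buffer and plain pointers rather than offsets.

```c
#include <stddef.h>

/* Hypothetical user-space model of the four buffer pointers; not kernel code. */
struct buf_model {
    unsigned char *head;  /* start of the allocated buffer (fixed) */
    unsigned char *data;  /* start of the current layer's valid data */
    unsigned char *tail;  /* end of the current layer's valid data */
    unsigned char *end;   /* end of the allocated buffer (fixed) */
};

/* Free space in front of the valid data. */
static size_t model_headroom(const struct buf_model *b)
{
    return (size_t)(b->data - b->head);
}

/* Length of the valid data in a linear buffer. */
static size_t model_datalen(const struct buf_model *b)
{
    return (size_t)(b->tail - b->data);
}

/* Free space behind the valid data. */
static size_t model_tailroom(const struct buf_model *b)
{
    return (size_t)(b->end - b->tail);
}
```

Headroom, data length and tailroom always add up to `end - head`. The kernel provides `skb_headroom()` and `skb_tailroom()` helpers for the first and last of these.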
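1.3 Length fields: len, data_len, truesize
--- len : total length of the valid packet data, protocol headers plus payload, including any data held in fragments.
--- data_len : length of the data held in fragments (the non-linear part) only; it is 0 for a linear buffer.
--- truesize : total memory accounted to the buffer, i.e. the size of the data buffer plus sizeof(struct sk_buff), as set in __alloc_skb().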
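A short sketch of how the three length fields relate, written as a user-space model. `struct len_model` and the `model_*` helpers are illustrative names, not kernel symbols; the kernel's own helpers for the first two computations are `skb_headlen()` and `skb_is_nonlinear()`.

```c
/* Hypothetical user-space model of the length fields; not kernel code. */
struct len_model {
    unsigned int len;       /* all packet bytes: linear area plus fragments */
    unsigned int data_len;  /* bytes held in fragments only */
};

/* Bytes that sit in the linear area between data and tail. */
static unsigned int model_headlen(const struct len_model *m)
{
    return m->len - m->data_len;
}

/* A buffer is non-linear as soon as any bytes live in fragments. */
static int model_is_nonlinear(const struct len_model *m)
{
    return m->data_len != 0;
}

/* truesize as initialised at allocation: data buffer plus the descriptor. */
static unsigned int model_truesize(unsigned int data_buf_size,
                                   unsigned int descriptor_size)
{
    return data_buf_size + descriptor_size;
}
```

For example, a 1500-byte packet with 1000 bytes in fragments has a linear part of 500 bytes.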
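1.4 Packet type
--- pkt_type : class of the packet. The driver is responsible for setting it to one of:
PACKET_HOST --- the packet is addressed to this host.
PACKET_OTHERHOST --- the packet is addressed to some other host.
PACKET_BROADCAST --- broadcast packet.
PACKET_MULTICAST --- multicast packet.
An Ethernet driver does not need to set pkt_type explicitly, because eth_type_trans() does that for it.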
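As a rough illustration of how the four classes are told apart, the sketch below classifies a frame by its destination MAC address. It is modelled on what eth_type_trans() is understood to do, but it is a simplified user-space sketch: the enum and `classify_dest()` are made-up names, and the broadcast address is hard-coded rather than read from the device.

```c
#include <string.h>

/* Hypothetical stand-ins for the kernel's PACKET_* constants. */
enum model_pkt_type {
    MODEL_PACKET_HOST,
    MODEL_PACKET_BROADCAST,
    MODEL_PACKET_MULTICAST,
    MODEL_PACKET_OTHERHOST
};

/* Classify a frame from its destination MAC and the receiving device's MAC. */
static enum model_pkt_type classify_dest(const unsigned char dest[6],
                                         const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    if (dest[0] & 0x01) {                  /* group bit set: not unicast */
        if (memcmp(dest, bcast, 6) == 0)
            return MODEL_PACKET_BROADCAST;
        return MODEL_PACKET_MULTICAST;
    }
    if (memcmp(dest, dev_addr, 6) != 0)    /* unicast, but not our address */
        return MODEL_PACKET_OTHERHOST;
    return MODEL_PACKET_HOST;
}
```

A frame normally only reaches the PACKET_OTHERHOST branch when the interface is in promiscuous mode, since the hardware filters other unicast traffic.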
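2. Socket buffer operations
2.1 Allocating a socket buffer
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
alloc_skb() allocates a socket buffer (the sk_buff descriptor) and a data buffer.
--- size : size of the data buffer
--- priority : memory allocation flags (a gfp_t mask such as GFP_KERNEL or GFP_ATOMIC)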
- static inline struct sk_buff *alloc_skb(unsigned int size,
- gfp_t priority)
- {
- return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
- }
- /**
- * __alloc_skb - allocate a network buffer
- * @size: size to allocate
- * @gfp_mask: allocation mask
- * @fclone: allocate from fclone cache instead of head cache
- * and allocate a cloned (child) skb
- * @node: numa node to allocate memory on
- *
- * Allocate a new &sk_buff. The returned buffer has no headroom and a
- * tail room of size bytes. The object has a reference count of one.
- * The return is the buffer. On a failure the return is %NULL.
- *
- * Buffers may only be allocated from interrupts using a @gfp_mask of
- * %GFP_ATOMIC.
- */
- struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
- {
- struct kmem_cache *cache;
- struct skb_shared_info *shinfo;
- struct sk_buff *skb;
- u8 *data;
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
- /* Get the HEAD */
- skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); // allocate the sk_buff descriptor
- if (!skb)
- goto out;
- prefetchw(skb);
- size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), // allocate the data buffer
- gfp_mask, node);
- if (!data)
- goto nodata;
- prefetchw(data + size);
- /*
- * Only clear those fields we need to clear, not those that we will
- * actually initialise below. Hence, don't put any more fields after
- * the tail pointer in struct sk_buff!
- */
- memset(skb, 0, offsetof(struct sk_buff, tail));
- skb->truesize = size + sizeof(struct sk_buff);
- atomic_set(&skb->users, 1);
- skb->head = data;
- skb->data = data;
- skb_reset_tail_pointer(skb);
- skb->end = skb->tail + size;
- #ifdef NET_SKBUFF_DATA_USES_OFFSET
- skb->mac_header = ~0U;
- #endif
- /* make sure we initialize shinfo sequentially */
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
- if (fclone) {
- struct sk_buff *child = skb + 1;
- atomic_t *fclone_ref = (atomic_t *) (child + 1);
- kmemcheck_annotate_bitfield(child, flags1);
- kmemcheck_annotate_bitfield(child, flags2);
- skb->fclone = SKB_FCLONE_ORIG;
- atomic_set(fclone_ref, 1);
- child->fclone = SKB_FCLONE_UNAVAILABLE;
- }
- out:
- return skb;
- nodata:
- kmem_cache_free(cache, skb);
- skb = NULL;
- goto out;
- }
- EXPORT_SYMBOL(__alloc_skb);
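Stripped of caches, NUMA nodes and the fclone handling, the allocation above boils down to two allocations and a handful of pointer assignments. The following user-space sketch shows just that skeleton. Everything in it is an assumption made for illustration: `malloc()` stands in for the slab allocator, `MODEL_CACHE_BYTES` is an assumed 64-byte stand-in for the kernel's alignment constant, and the trailing skb_shared_info area is left out.

```c
#include <stdlib.h>
#include <stddef.h>

#define MODEL_CACHE_BYTES 64u  /* assumption; stands in for the kernel's cache-line size */
#define MODEL_ALIGN(x) (((x) + (MODEL_CACHE_BYTES - 1)) & ~(MODEL_CACHE_BYTES - 1))

/* Hypothetical user-space model of a freshly allocated buffer; not kernel code. */
struct alloc_model {
    unsigned char *head, *data, *tail, *end;
    unsigned int len, truesize;
    int users;
};

static struct alloc_model *model_alloc(unsigned int size)
{
    struct alloc_model *skb = malloc(sizeof(*skb));  /* the descriptor */
    unsigned char *data;

    if (!skb)
        return NULL;
    size = MODEL_ALIGN(size);                        /* round the data area up */
    data = malloc(size);                             /* the data buffer */
    if (!data) {
        free(skb);                                   /* same cleanup as "nodata:" */
        return NULL;
    }
    skb->head = skb->data = skb->tail = data;        /* empty: no headroom yet */
    skb->end = data + size;
    skb->len = 0;
    skb->truesize = size + (unsigned int)sizeof(*skb);
    skb->users = 1;                                  /* one reference held */
    return skb;
}

static void model_free(struct alloc_model *skb)
{
    if (!skb)
        return;
    free(skb->head);
    free(skb);
}
```

A new buffer therefore starts with `head == data == tail`: zero headroom, zero length, and the whole (rounded-up) size available as tailroom.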
dev_alloc_skb() calls alloc_skb() above with GFP_ATOMIC,
and reserves NET_SKB_PAD bytes of headroom between skb->head and skb->data (a fixed 16 bytes in older kernels).
- /**
- * dev_alloc_skb - allocate an skbuff for receiving
- * @length: length to allocate
- *
- * Allocate a new &sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory. Although this function
- * allocates memory it can be called from an interrupt.
- */
- struct sk_buff *dev_alloc_skb(unsigned int length)
- {
- /*
- * There is more code here than it seems:
- * __dev_alloc_skb is an inline
- */
- return __dev_alloc_skb(length, GFP_ATOMIC);
- }
- EXPORT_SYMBOL(dev_alloc_skb);
- /**
- * __dev_alloc_skb - allocate an skbuff for receiving
- * @length: length to allocate
- * @gfp_mask: get_free_pages mask, passed to alloc_skb
- *
- * Allocate a new &sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory.
- */
- static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
- gfp_t gfp_mask)
- {
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
- if (likely(skb))
- skb_reserve(skb, NET_SKB_PAD);
- return skb;
- }
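The effect of __dev_alloc_skb() on the buffer layout is easy to see in a small model: ask for `length` bytes, get `length + pad` bytes of buffer, and start data and tail `pad` bytes in. This is a user-space sketch; `MODEL_SKB_PAD` is an assumed value of 32 and `model_dev_alloc()` is a made-up name.

```c
#include <stdlib.h>

#define MODEL_SKB_PAD 32u  /* assumption; the kernel's NET_SKB_PAD varies by configuration */

/* Hypothetical user-space model of a receive buffer; not kernel code. */
struct rx_model {
    unsigned char *head, *data, *tail, *end;
};

/* Returns 0 on success, -1 if memory could not be allocated. */
static int model_dev_alloc(struct rx_model *skb, unsigned int length)
{
    unsigned char *buf = malloc(length + MODEL_SKB_PAD);

    if (!buf)
        return -1;
    skb->head = buf;
    skb->end = buf + length + MODEL_SKB_PAD;
    /* Equivalent of skb_reserve(skb, NET_SKB_PAD): both pointers move together. */
    skb->data = skb->tail = buf + MODEL_SKB_PAD;
    return 0;
}
```

After the call the driver still has the full `length` bytes of tailroom it asked for, and the stack has `MODEL_SKB_PAD` bytes of headroom to work with.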
2.2 Freeing a socket buffer
void kfree_skb(struct sk_buff *skb);
- /**
- * kfree_skb - free an sk_buff
- * @skb: buffer to free
- *
- * Drop a reference to the buffer and free it if the usage count has
- * hit zero.
- */
- void kfree_skb(struct sk_buff *skb)
- {
- if (unlikely(!skb))
- return;
- if (likely(atomic_read(&skb->users) == 1))
- smp_rmb();
- else if (likely(!atomic_dec_and_test(&skb->users)))
- return;
- trace_kfree_skb(skb, __builtin_return_address(0));
- __kfree_skb(skb);
- }
- EXPORT_SYMBOL(kfree_skb);
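The key idea in kfree_skb() is that it drops one reference and only frees the buffer when that was the last one. A minimal, non-atomic user-space sketch of that behaviour follows. `struct ref_model`, `model_get()` and `model_kfree()` are made-up names, and the `freed` flag stands in for the call to __kfree_skb().

```c
#include <stddef.h>

/* Hypothetical user-space model of the users counter; not kernel code. */
struct ref_model {
    int users;  /* reference count (atomic_t in the kernel) */
    int freed;  /* set when the model would have called __kfree_skb() */
};

/* Take an extra reference, as a second holder of the buffer would. */
static void model_get(struct ref_model *skb)
{
    skb->users++;
}

/* Drop one reference; release the buffer only when it was the last. */
static void model_kfree(struct ref_model *skb)
{
    if (!skb)
        return;
    if (--skb->users > 0)
        return;          /* someone else still holds the buffer */
    skb->freed = 1;      /* stands in for __kfree_skb(skb) */
}
```

With two holders, the first `model_kfree()` only decrements the counter and the second one actually releases the buffer.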
void dev_kfree_skb(struct sk_buff *skb);
--- dev_kfree_skb() is for use outside interrupt context.
- #define dev_kfree_skb(a) consume_skb(a)
- /**
- * consume_skb - free an skbuff
- * @skb: buffer to free
- *
- * Drop a ref to the buffer and free it if the usage count has hit zero
- * Functions identically to kfree_skb, but kfree_skb assumes that the frame
- * is being dropped after a failure and notes that
- */
- void consume_skb(struct sk_buff *skb)
- {
- if (unlikely(!skb))
- return;
- if (likely(atomic_read(&skb->users) == 1))
- smp_rmb();
- else if (likely(!atomic_dec_and_test(&skb->users)))
- return;
- trace_consume_skb(skb);
- __kfree_skb(skb);
- }
- EXPORT_SYMBOL(consume_skb);
--- dev_kfree_skb_irq() is for use in interrupt context.
- void dev_kfree_skb_irq(struct sk_buff *skb)
- {
- if (atomic_dec_and_test(&skb->users)) {
- struct softnet_data *sd;
- unsigned long flags;
- local_irq_save(flags);
- sd = &__get_cpu_var(softnet_data);
- skb->next = sd->completion_queue;
- sd->completion_queue = skb;
- raise_softirq_irqoff(NET_TX_SOFTIRQ);
- local_irq_restore(flags);
- }
- }
- EXPORT_SYMBOL(dev_kfree_skb_irq);
--- dev_kfree_skb_any() can be used in either interrupt or non-interrupt context.
- void dev_kfree_skb_any(struct sk_buff *skb)
- {
- if (in_irq() || irqs_disabled())
- dev_kfree_skb_irq(skb);
- else
- dev_kfree_skb(skb);
- }
- EXPORT_SYMBOL(dev_kfree_skb_any);
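What makes dev_kfree_skb_irq() safe in interrupt context is that it does not free anything itself: it pushes the buffer onto a per-CPU completion queue and lets the NET_TX softirq do the real work later. The user-space sketch below models only that deferral. `struct q_model`, `struct softnet_model`, `model_free_irq()` and `model_drain()` are made-up names, and interrupt masking and reference counting are left out.

```c
#include <stddef.h>

/* Hypothetical user-space model of deferred freeing; not kernel code. */
struct q_model {
    struct q_model *next;  /* reused as the queue link, as in the kernel */
    int freed;
};

struct softnet_model {
    struct q_model *completion_queue;  /* buffers waiting to be freed */
};

/* "Interrupt" side: cheap, just link the buffer at the front of the queue. */
static void model_free_irq(struct softnet_model *sd, struct q_model *skb)
{
    skb->next = sd->completion_queue;
    sd->completion_queue = skb;
}

/* "Softirq" side: detach the whole list, then release each buffer. */
static int model_drain(struct softnet_model *sd)
{
    struct q_model *skb = sd->completion_queue;
    int n = 0;

    sd->completion_queue = NULL;
    while (skb) {
        struct q_model *next = skb->next;

        skb->next = NULL;
        skb->freed = 1;    /* stands in for __kfree_skb(skb) */
        n++;
        skb = next;
    }
    return n;
}
```

Nothing is released until `model_drain()` runs, which is the property an interrupt handler relies on.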
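2.3 Pointer-moving operations
The operations that move the pointers of a Linux socket buffer are put, push, pull and reserve.
2.3.1 put operation
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
Moves the tail pointer down by len bytes, increases the sk_buff's len, and returns the value skb->tail had before the move, i.e. a pointer to the first byte of the newly added area.
It is used to append data at the tail of the buffer.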
- /**
- * skb_put - add data to a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer. If this would
- * exceed the total buffer size the kernel will panic. A pointer to the
- * first byte of the extra data is returned.
- */
- unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
- {
- unsigned char *tmp = skb_tail_pointer(skb); // tmp = skb->tail
- SKB_LINEAR_ASSERT(skb);
- skb->tail += len;
- skb->len += len;
- if (unlikely(skb->tail > skb->end))
- skb_over_panic(skb, len, __builtin_return_address(0)); // panic: the data ran past the end of the buffer
- return tmp;
- }
- EXPORT_SYMBOL(skb_put);
- static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
- {
- return skb->tail;
- }
__skb_put() differs from skb_put() in that skb_put() checks that the added data still fits in the buffer, while __skb_put() performs no check.
- static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
- {
- unsigned char *tmp = skb_tail_pointer(skb);
- SKB_LINEAR_ASSERT(skb);
- skb->tail += len;
- skb->len += len;
- return tmp;
- }
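The usual calling pattern is to write into the pointer that put returns. Here is a user-space sketch of that pattern, assuming a linear buffer and plain pointers. `struct put_model`, `model_put()` and `model_append()` are made-up names, and `assert()` stands in for skb_over_panic().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; not kernel code. */
struct put_model {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Extend the valid data by n bytes at the tail; return where they start. */
static unsigned char *model_put(struct put_model *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);  /* kernel: skb_over_panic() */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Typical use: reserve the space with put, then copy into it. */
static void model_append(struct put_model *skb, const void *src, unsigned int n)
{
    memcpy(model_put(skb, n), src, n);
}
```

Note that `data` never moves: put only ever grows the valid region towards `end`.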
2.3.2 push operation:
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
skb_push() moves the data pointer up by len bytes so that data can be added at the start of the buffer, and increases the sk_buff's len accordingly.
- /**
- * skb_push - add data to the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer at the buffer
- * start. If this would exceed the total buffer headroom the kernel will
- * panic. A pointer to the first byte of the extra data is returned.
- */
- unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
- {
- skb->data -= len;
- skb->len += len;
- if (unlikely(skb->data<skb->head))
- skb_under_panic(skb, len, __builtin_return_address(0));
- return skb->data;
- }
- EXPORT_SYMBOL(skb_push);
- static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
- {
- skb->data -= len;
- skb->len += len;
- return skb->data;
- }
The push operation opens up space for packet data at the head of the buffer, whereas the put operation opens up space at the tail.
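On the transmit side the two operations work together: the payload is placed with put, and each layer then prepends its header with push. The user-space sketch below walks a UDP packet down the stack that way. All names (`struct tx_model`, `tx_init()`, `tx_put()`, `tx_push()`) are made up, and the header sizes used in the usage note assume no IP options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; not kernel code. */
struct tx_model {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Set up an empty buffer over 'buf' with 'headroom' bytes kept free in front. */
static void tx_init(struct tx_model *skb, unsigned char *buf, size_t size,
                    size_t headroom)
{
    assert(headroom <= size);
    skb->head = buf;
    skb->end = buf + size;
    skb->data = skb->tail = buf + headroom;
    skb->len = 0;
}

/* Grow the valid data at the tail (payload). */
static unsigned char *tx_put(struct tx_model *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Grow the valid data at the front (a protocol header). */
static unsigned char *tx_push(struct tx_model *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);  /* kernel: skb_under_panic() */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

With 42 bytes of headroom, a 10-byte payload followed by `tx_push()` calls of 8 (UDP), 20 (IP) and 14 (Ethernet) bytes uses the headroom up exactly: `data` ends at `head` and `len` is 52.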
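2.3.3 pull operation:
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
skb_pull() moves the data pointer down by len bytes and reduces the skb's len; it is the inverse of skb_push().
It is mainly used when a lower-layer protocol hands a packet up to the layer above, so that data ends up pointing at the upper layer's header.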
- /**
- * skb_pull - remove data from the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to remove
- *
- * This function removes data from the start of a buffer, returning
- * the memory to the headroom. A pointer to the next data in the buffer
- * is returned. Once the data has been pulled future pushes will overwrite
- * the old data.
- */
- unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
- {
- return skb_pull_inline(skb, len);
- }
- EXPORT_SYMBOL(skb_pull);
- static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
- {
- return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
- }
- static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
- {
- skb->len -= len;
- BUG_ON(skb->len < skb->data_len);
- return skb->data += len;
- }
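Two details of the code above are worth modelling: pull refuses a request larger than the remaining data, and on success it returns the new data pointer. A user-space sketch with made-up names (`struct pull_model`, `model_pull()`), assuming a linear buffer:

```c
#include <stddef.h>

/* Hypothetical user-space model; not kernel code. */
struct pull_model {
    unsigned char *data;  /* start of the valid data */
    unsigned int len;     /* bytes of valid data */
};

/* Drop n bytes from the front; NULL (and no change) if fewer than n remain. */
static unsigned char *model_pull(struct pull_model *skb, unsigned int n)
{
    if (n > skb->len)
        return NULL;
    skb->len -= n;
    skb->data += n;
    return skb->data;
}
```

The bytes that were pulled are not erased. They simply become headroom again, which is why a later push will overwrite them.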
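2.3.4 reserve operation:
void skb_reserve(struct sk_buff *skb, int len);
skb_reserve() moves the data and tail pointers down together by len bytes.
It reserves len bytes of headroom at the front of the buffer and may only be used on an empty buffer.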
- /**
- * skb_reserve - adjust headroom
- * @skb: buffer to alter
- * @len: bytes to move
- *
- * Increase the headroom of an empty &sk_buff by reducing the tail
- * room. This is only allowed for an empty buffer.
- */
- static inline void skb_reserve(struct sk_buff *skb, int len)
- {
- skb->data += len;
- skb->tail += len;
- }
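A common use of reserve in Ethernet drivers is to skip 2 bytes so that the IP header, which follows the 14-byte Ethernet header, lands on a 4-byte boundary. The user-space sketch below shows the arithmetic. `struct rsv_model`, `model_reserve()` and `model_ip_offset()` are made-up names, and the model assumes `head` itself is suitably aligned.

```c
#include <stddef.h>

#define MODEL_ETH_HLEN 14u  /* length of an Ethernet header */

/* Hypothetical user-space model; not kernel code. */
struct rsv_model {
    unsigned char *head, *data, *tail, *end;
};

/* Move data and tail together; only meaningful while the buffer is empty. */
static void model_reserve(struct rsv_model *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Offset from head at which the IP header will sit once a frame is copied in. */
static unsigned int model_ip_offset(const struct rsv_model *skb)
{
    return (unsigned int)(skb->data - skb->head) + MODEL_ETH_HLEN;
}
```

Without the reserve the IP header would start at offset 14; with `model_reserve(&skb, 2)` it starts at offset 16.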
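3. Example: receiving a UDP packet
The way Linux receives a UDP packet illustrates how these sk_buff operations are used.
Most of the work is done inside the kernel; a driver only has to handle the data-link-layer part.
Suppose the NIC receives a UDP packet. Linux then handles it as follows:
3.1 When the NIC receives a UDP packet, the driver creates an sk_buff and a data buffer, copies all of the received data into the space that data points to, and points skb->mac_header at data.
At this point the valid data starts with the Ethernet header, i.e. the link-layer header.
Example code:
// allocate a new socket buffer and data buffer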
- skb = dev_alloc_skb(length + 2);
- if(skb == NULL) {
- ... // allocation failed
- return ;
- }
- skb_reserve(skb, 2); // reserve 2 bytes of headroom so that the network-layer header is aligned
- // copy the data received by the hardware into the data buffer
- readwords(ioaddr, RX_FRAME_PORT, skb_put(skb, length), length >> 1);
- if(length & 1){
- skb->data[length - 1] = readword(ioaddr, RX_FRAME_PORT);
- }
[Figure: the sk_buff after the driver has copied in the received frame]
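3.2 The data link layer calls skb_pull() to strip the Ethernet header and passes the packet up to IP at the network layer.
While stripping, data moves down by the length of an Ethernet header, sizeof(struct ethhdr), and len decreases by the same amount.
The valid data now starts with the IP header: skb->network_header points at data, i.e. the IP header, while skb->mac_header still points at the Ethernet header, i.e. the link-layer header.
[Figure: the sk_buff after the Ethernet header has been pulled]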
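3.3 The network layer calls skb_pull() to strip the IP header and passes the packet up to UDP at the transport layer.
While stripping, data moves down by the length of the IP header (sizeof(struct iphdr) when the header carries no options), and len decreases by the same amount.
The valid data now starts with the UDP header, and skb->transport_header points at data, i.e. the UDP header.
skb->network_header still points at the IP header, and skb->mac_header still points at the link-layer header.
[Figure: the sk_buff after the IP header has been pulled]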
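3.4 When the application calls recv(), the data is copied into the application's buffer starting at skb->data + sizeof(struct udphdr).
So the UDP header is never actually stripped from the buffer.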
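Steps 3.1 to 3.4 can be replayed end to end in a small user-space model. This is only a sketch of the pointer movements, not of the real receive path. All names (`struct rx_walk`, `rx_walk_udp()`, the `M_*` constants) are made up, the IP header is assumed to carry no options, and no header field is actually parsed or validated.

```c
#include <stddef.h>
#include <string.h>

#define M_ETH_HLEN 14u  /* Ethernet header */
#define M_IP_HLEN  20u  /* IP header, assuming no options */
#define M_UDP_HLEN 8u   /* UDP header */
#define M_ALIGN    2u   /* headroom reserved by the driver in step 3.1 */

/* Hypothetical user-space model; not kernel code. */
struct rx_walk {
    unsigned char *head, *data, *tail, *end;
    unsigned char *mac_header, *network_header, *transport_header;
    unsigned int len;
};

/* Returns a pointer to the UDP payload inside buf, or NULL on bad sizes. */
static const unsigned char *rx_walk_udp(struct rx_walk *skb,
                                        unsigned char *buf, size_t bufsz,
                                        const unsigned char *frame,
                                        unsigned int frame_len,
                                        unsigned int *payload_len)
{
    if (frame_len < M_ETH_HLEN + M_IP_HLEN + M_UDP_HLEN ||
        bufsz < (size_t)M_ALIGN + frame_len)
        return NULL;

    /* 3.1: driver reserves 2 bytes, then copies the whole frame in (put). */
    skb->head = buf;
    skb->end = buf + bufsz;
    skb->data = skb->tail = buf + M_ALIGN;
    memcpy(skb->tail, frame, frame_len);
    skb->tail += frame_len;
    skb->len = frame_len;
    skb->mac_header = skb->data;

    /* 3.2: link layer pulls the Ethernet header. */
    skb->data += M_ETH_HLEN;
    skb->len -= M_ETH_HLEN;
    skb->network_header = skb->data;

    /* 3.3: network layer pulls the IP header. */
    skb->data += M_IP_HLEN;
    skb->len -= M_IP_HLEN;
    skb->transport_header = skb->data;

    /* 3.4: the copy to user space starts past the UDP header; data stays put. */
    *payload_len = skb->len - M_UDP_HLEN;
    return skb->data + M_UDP_HLEN;
}
```

At the end `data` still equals `transport_header` and `len` still includes the 8-byte UDP header, matching the observation that the UDP header is never pulled.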