再读BUDDY算法
快乐虾
http://blog.csdn.net/lights_joy/
lights@hb165.com
本文适用于
ADI bf561 DSP
uclinux-2008r1-rc8 (移植到vdsp5)
Visual DSP++ 5.0
欢迎转载,但请保留作者信息
buddy算法是用来做内存管理的经典算法,目的是为了解决内存的外碎片。
避免外碎片的方法有两种:
1,利用分页单元把一组非连续的空闲页框映射到非连续的线性地址区间。
2,开发适当的技术来记录现存的空闲连续页框块的情况,以尽量避免为满足对小块的请求而把大块的空闲块进行分割。
内核选择第二种方法。这个算法的时间效率更高,但是由于使用 best-fit 方法的缘故,会产生内存浪费。
buddy算法将所有空闲页框分组为11个块链表,每个块链表的每个块元素分别包含1,2,4,8,16,32,64,128,256,512,1024个连续的页框,每个块的第一个页框的物理地址是该块大小的整数倍。如,大小为16个页框的块,其起始地址是16*2^12(一个页框的大小为4k,16个页框的大小为16*4K,1k=1024=2的10次方,4k=2的12次方)的倍数。
例,假设要请求一个128个页框的块,算法先检查128个页框的链表是否有空闲块,如果没有则查256个页框的链表,有则将256个页框的块分裂两份,一份使用,一份插入128个页框的链表。如果还没有,就查512个页框的链表,有的话就分裂为128,128,256,一个128使用,剩余两个插入对应链表。如果在512还没查到,则返回出错信号。
回收过程相反,内核试图把大小为b的空闲伙伴合并为一个大小为2b的单独快,满足以下条件的两个块称为伙伴:1,两个块具有相同的大小,记做b;2,它们的物理地址是连续的,3,第一个块的第一个页框的物理地址是2*b*2^12的倍数,该算法迭代,如果成功合并所释放的块,会试图合并2b的块来形成更大的块。
1.1.1 查找buddy页
在buddy算法中,每一页都有一个buddy页与之相对应,使用__page_find_buddy函数可以找到指定页面对应的buddy页面:
/*
* Locate the struct page for both the matching buddy in our
* pair (buddy1) and the combined O(n+1) page they form (page).
*
* 1) Any buddy B1 will have an order O twin B2 which satisfies
* the following equation:
* B2 = B1 ^ (1 << O)
* For example, if the starting buddy (buddy2) is #8 its order
* 1 buddy is #10:
* B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
*
* 2) Any buddy B will have an order O+1 parent P which
* satisfies the following equation:
* P = B & ~(1 << O)
*
* Assumption: *_mem_map is contiguous at least up to MAX_ORDER
*/
static inline struct page *
__page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order)
{
unsigned long buddy_idx = page_idx ^ (1 << order);
return page + (buddy_idx - page_idx);
}
由于每一页都可能属于不同order的buddy块,所以此函数同时接受page_idx和order做为参数。
下表给出几个不同的页号在不同的order下的buddy序号。
页号 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | -1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
10 | 1 | -2 | 4 | -8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
100 | 1 | 2 | -4 | 8 | 16 | -32 | -64 | 128 | 256 | 512 | 1024 | 2048 |
1000 | 1 | 2 | 4 | -8 | 16 | -32 | -64 | -128 | -256 | -512 | 1024 | 2048 |
从上表中可以看出,负号表示这个页对应的BUDDY页在此页的前面,正号则表示其BUDDY页在此页的后面。
从上表还可以看出,不论输入的页号是否为2的整数倍,其BUDDY页都与其相差2的整数倍。
由于此函数只是计算BUDDY页的页面差距,它并不判断这两个页面是否符合BUDDY的条件,因此还需要一个独立的函数来判断这两个页面是否真的互为BUDDY。
1.1.2 确认两个页面是否互为buddy
要确认两个指定的页面是否互为buddy,可以使用page_is_buddy函数:
/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is not in a hole &&
* (b) the buddy is in the buddy system &&
* (c) a page and its buddy have the same order &&
* (d) a page and its buddy are in the same zone.
*
* For recording whether a page is in the buddy system, we use PG_buddy.
* Setting, clearing, and testing PG_buddy is serialized by zone->lock.
*
* For recording page's order, we use page_private(page).
*/
static inline int page_is_buddy(struct page *page, struct page *buddy,
int order)
{
if (!pfn_valid_within(page_to_pfn(buddy))) // 恒为false
return 0;
if (page_zone_id(page) != page_zone_id(buddy))
return 0;
if (PageBuddy(buddy) && page_order(buddy) == order) {
BUG_ON(page_count(buddy) != 0);
return 1;
}
return 0;
}
在这里PageBuddy用于判断一个页面是否带有PG_buddy标记。注意:在初始化的时候,所有的页面都只有PG_reserved标记。
#define PageBuddy(page) test_bit(PG_buddy, &(page)->flags)
1.1.3 内存分配
1.1.3.1 数据分配方式
在使用BUDDY分配数据之前,必须先明确数据的分配方式,这是通过GFP_xxx一系列的宏定义来控制的。可以将这些宏取或,再将值传递到内存分配函数中。
/*
* GFP bitmasks..
*
* Zone modifiers (see linux/mmzone.h - low three bits)
*
* Do not put any conditional on these. If necessary modify the definitions
* without the underscores and use the consistently. The definitions here may
* be used in bit comparisons.
*/
#define __GFP_DMA ((__force gfp_t)0x01u)
#define __GFP_HIGHMEM ((__force gfp_t)0x02u)
#define __GFP_DMA32 ((__force gfp_t)0x04u)
/*
* Action modifiers - doesn't change the zoning
*
* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
* _might_ fail. This depends upon the particular VM implementation.
*
* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
* cannot handle allocation failures.
*
* __GFP_NORETRY: The VM implementation must not retry indefinitely.
*/
#define __GFP_WAIT ((__force gfp_t)0x10u) /* Can wait and reschedule? */
#define __GFP_HIGH ((__force gfp_t)0x20u) /* Should access emergency pools? */
#define __GFP_IO ((__force gfp_t)0x40u) /* Can start physical IO? */
#define __GFP_FS ((__force gfp_t)0x80u) /* Can call down to low-level FS? */
#define __GFP_COLD ((__force gfp_t)0x100u) /* Cache-cold page required */
#define __GFP_NOWARN ((__force gfp_t)0x200u) /* Suppress page allocation failure warning */
#define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */
#define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */
#define __GFP_NORETRY ((__force gfp_t)0x1000u)/* Do not retry. Might fail */
#define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| /
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| /
__GFP_NOFAIL|__GFP_NORETRY|__GFP_COMP| /
__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | /
__GFP_HIGHMEM)
#define GFP_THISNODE ((__force gfp_t)0)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
#define GFP_DMA __GFP_DMA
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32
1.1.3.2 __alloc_pages
这个函数是BUDDY算法的核心,它用于分配内存页,其实现位于mm/page_alloc.c中:
/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct zone **z;
struct page *page;
struct reclaim_state reclaim_state;
struct task_struct *p = current;
int do_retry;
int alloc_flags;
int did_some_progress;
might_sleep_if(wait);
if (should_fail_alloc_page(gfp_mask, order)) // 空调用
return NULL;
restart:
z = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (unlikely(*z == NULL)) {
/* Should this ever happen?? */
return NULL;
}
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
/*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
* __GFP_NOWARN set) should not cause reclaim since the subsystem
* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
* using a larger set of nodes after it has established that the
* allowed per node queues are empty and that nodes are
* over allocated.
*/
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
for (z = zonelist->zones; *z; z++)
wakeup_kswapd(*z, order);
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
*
* The caller may dip into page reserves a bit more if the caller
* cannot run direct reclaim, or if the caller has realtime scheduling
* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
alloc_flags = ALLOC_WMARK_MIN;
if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
if (wait)
alloc_flags |= ALLOC_CPUSET;
/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
* coming from realtime tasks go deeper into reserves.
*
* This is the last chance, in general, before the goto nopage.
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
if (page)
goto got_pg;
#ifndef CONFIG_MMU
drop_pagecache();
#endif
/* This allocation should allow future memory freeing. */
rebalance:
if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
&& !in_interrupt()) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
congestion_wait(WRITE, HZ/50);
goto nofail_alloc;
}
}
goto nopage;
}
/* Atomic allocations - we can't balance anything */
if (!wait)
goto nopage;
cond_resched();
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
p->flags |= PF_MEMALLOC;
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);
p->reclaim_state = NULL;
p->flags &= ~PF_MEMALLOC;
cond_resched();
if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, order,
zonelist, alloc_flags);
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
/*
* Go through the zonelist yet one more time, keep
* very high watermark here, this is only to catch
* a parallel oom killing, we must fail if we're still
* under heavy pressure.
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
goto got_pg;
out_of_memory(zonelist, gfp_mask, order);
goto restart;
}
/*
* Don't let big-order allocations loop unless the caller explicitly
* requests that. Wait for some write requests to complete then retry.
*
* In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
* <= 3, but that may not be true in other implementations.
*/
do_retry = 0;
if (!(gfp_mask & __GFP_NORETRY)) {
#ifndef CONFIG_MMU
drop_pagecache();
#endif
if ((order <= CONFIG_BIG_ORDER_ALLOC_NOFAIL_MAGIC) || (gfp_mask & __GFP_REPEAT))
do_retry = 1;
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
}
if (do_retry) {
congestion_wait(WRITE, HZ/50);
goto rebalance;
}
nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
" order:%d, mode:0x%x/n",
p->comm, order, gfp_mask);
dump_stack();
show_mem();
}
got_pg:
return page;
}
这个函数虽然很长,但是实际上在大部分情况下,第一次调用get_page_from_freelist函数即可得到可用内存,根本不会执行后面的代码。
1.1.3.3 get_page_from_freelist
这个函数的定义如下:
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, int alloc_flags)
{
struct zone **z;
struct page *page = NULL;
int classzone_idx = zone_idx(zonelist->zones[0]);
struct zone *zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
z = zonelist->zones;
do {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
zone = *z;
if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
break;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min;
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low;
else
mark = zone->pages_high;
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
!zone_reclaim(zone, gfp_mask, order))
goto this_zone_full;
}
}
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
break;
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
try_next_zone:
if (NUMA_BUILD && !did_zlc_setup) {
/* we do zlc_setup after the first zone is tried */
allowednodes = zlc_setup(zonelist, alloc_flags);
zlc_active = 1;
did_zlc_setup = 1;
}
} while (*(++z) != NULL);
if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
zlc_active = 0;
goto zonelist_scan;
}
return page;
}
和__alloc_pages函数一样,这个函数看着长,但是实际上如果抛开前面的参数检验部分和分配失败的处理部分,这个函数实际调用了buffered_rmqueue这个函数来分配内存页。
1.1.3.4 buffered_rmqueue
此函数代码如下:
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
*/
static struct page *buffered_rmqueue(struct zonelist *zonelist,
struct zone *zone, int order, gfp_t gfp_flags)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
again:
cpu = get_cpu();
if (likely(order == 0)) {
struct per_cpu_pages *pcp;
pcp = &zone_pcp(zone, cpu)->pcp[cold];
local_irq_save(flags);
if (!pcp->count) {
pcp->count = rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list);
if (unlikely(!pcp->count))
goto failed;
}
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
pcp->count--;
} else {
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order);
spin_unlock(&zone->lock);
if (!page)
goto failed;
}
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(zonelist, zone);
local_irq_restore(flags);
put_cpu();
VM_BUG_ON(bad_range(zone, page));
if (prep_new_page(page, order, gfp_flags))
goto again;
return page;
failed:
local_irq_restore(flags);
put_cpu();
return NULL;
}
这个函数的结构很简单,当只需要分配一个页时,即order=0,此时内核将首先在缓存中查找,如果缓存中已经没有内存页了,则从内存中取一些页放在缓存中,再从缓存中分配。对于大于2个页的情况,则直接调用__rmqueue函数进行分配。
1.1.3.5 __rmqueue
此函数的实现为:
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = zone->free_area + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
__mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
expand(zone, page, order, current_order, area);
return page;
}
return NULL;
}
直接从指定order的块链表往上找,直到找到一个合适的块,返回这个块的第一个页面指针。如果这个块的大小大于所要求的order,那么将调用expand将这个块中的其余部分插入到合适的order的链表中去。
1.1.3.6 expand
这个函数用于将一个大的块分拆为几个小块并将之插入到合适的链表中去:
/*
* The order of subdivision here is critical for the IO subsystem.
* Please do not alter this order without good reasons and regression
* testing. Specifically, as large blocks of memory are subdivided,
* the order in which smaller blocks are delivered depends on the order
* they're subdivided in this function. This is the primary factor
* influencing the order in which pages are delivered to the IO
* subsystem according to empirical testing, and this is also justified
* by considering the behavior of a buddy system containing a single
* large block of memory acted on by a series of small allocations.
* This behavior is a critical factor in sglist merging's success.
*
* -- wli
*/
static inline void expand(struct zone *zone, struct page *page,
int low, int high, struct free_area *area)
{
unsigned long size = 1 << high;
while (high > low) {
area--;
high--;
size >>= 1;
VM_BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
area->nr_free++;
set_page_order(&page[size], high);
}
}
1.1.4 内存回收
buddy算法的内存回收由__free_page和free_page两个宏及两个回收函数完成。其定义为:
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)
extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
1.1.4.1 __free_pages与free_pages
这两个函数的实现为:
fastcall void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
if (order == 0)
free_hot_page(page);
else
__free_pages_ok(page, order);
}
}
fastcall void free_pages(unsigned long addr, unsigned int order)
{
if (addr != 0) {
VM_BUG_ON(!virt_addr_valid((void *)addr));
__free_pages(virt_to_page((void *)addr), order);
}
}
从这两个函数的实现可以看出,它们只是接收的参数不同而已,最后的处理方法则是一样的。free_pages函数接收一个绝对地址做为参数,而__free_pages函数则直接接收page这个结构体的指针做为参数。
这其中put_page_testzero函数用于将page->_count减一并判断其是否为0,如果为0则说明此页已经没有用户使用,此时就释放这个页。
从这里还可以看出,当order为0时,此函数试图先将内存页放到所谓的热高速缓存中,否则就将其回收到指定阶数的链表中。
1.1.4.2 __free_pages_ok
这个函数位于mm/page_alloc.c:
static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long flags;
int i;
int reserved = 0;
// 判断是否应该释放这些连续的页,通常reserved为0
for (i = 0 ; i < (1 << order) ; ++i)
reserved += free_pages_check(page + i);
if (reserved)
return;
if (!PageHighMem(page)) // 恒为true
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order); // 空语句
arch_free_page(page, order); // 空语句
kernel_map_pages(page, 1 << order, 0); // 空语句
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order); // 空语句
free_one_page(page_zone(page), page, order);
local_irq_restore(flags);
}
很简单,就是调用free_one_page函数进行释放工作。
1.1.4.3 free_one_page
static void free_one_page(struct zone *zone, struct page *page, int order)
{
spin_lock(&zone->lock);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
__free_one_page(page, zone, order);
spin_unlock(&zone->lock);
}
简单调用__free_one_page函数,继续跟踪。
1.1.4.4 __free_one_page
/*
* Freeing function for a buddy system allocator.
*
* The concept of a buddy system is to maintain direct-mapped table
* (containing bit values) for memory blocks of various "orders".
* The bottom level table contains the map for the smallest allocatable
* units of memory (here, pages), and each level above it describes
* pairs of units from the levels below, hence, "buddies".
* At a high level, all that happens here is marking the table entry
* at the bottom level available, and propagating the changes upward
* as necessary, plus some accounting needed to play nicely with other
* parts of the VM system.
* At each level, we keep a list of pages, which are heads of continuous
* free pages of length of (1 << order) and marked with PG_buddy. Page's
* order is recorded in page_private(page) field.
* So when we are allocating or freeing one, we can derive the state of the
* other. That is, if we allocate a small block, and both were
* free, the remainder of the region must be split into blocks.
* If a block is freed, and its buddy is also free, then this
* triggers coalescing into a block of larger size.
*
* -- wli
*/
static inline void __free_one_page(struct page *page,
struct zone *zone, unsigned int order)
{
unsigned long page_idx;
int order_size = 1 << order;
if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));
__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
struct free_area *area;
struct page *buddy;
buddy = __page_find_buddy(page, page_idx, order);
if (!page_is_buddy(page, buddy, order))
break; /* Move the buddy up one level. */
list_del(&buddy->lru);
area = zone->free_area + order;
area->nr_free--;
rmv_page_order(buddy);
combined_idx = __find_combined_index(page_idx, order);
page = page + (combined_idx - page_idx);
page_idx = combined_idx;
order++;
}
set_page_order(page, order);
list_add(&page->lru, &zone->free_area[order].free_list);
zone->free_area[order].nr_free++;
}
这个注释中已经很清楚地说明了buddy算法的核心,它就是将内存分成不同大小的块,每个块都由数量不等的页组成。最小的块只有一个页,其次是2个页、4个页、8个页…组成的块。最多由1024个页组成。
在上述函数中,while循环尽可能地将页面放在更大的块中,在查找到最大的块后,将这个内存块放到相应的块链表中。
参考资料
uClinux2.6(bf561)中的CPLB( 2008/2/19 )
uclinux2.6(bf561)中的bootmem分析(1):猜测( 2008/5/9 )
uclinux2.6(bf561)中的bootmem分析(2):调用前的参数分析( 2008/5/9 )
uclinux2.6(bf561)中的bootmem分析(3):init_bootmem_node( 2008/5/9 )
uclinux2.6(bf561)中的bootmem分析(4):alloc_bootmem_pages( 2008/5/9 )
uclinux2.6(bf561)内核中的paging_init( 2008/5/12 )
uclinux-2008r1(bf561)内核的icache支持(1):寄存器配置初始化( 2008/5/16 )
uclinux-2008r1(bf561)内核的icache支持(2):icplb_table的生成( 2008/5/16 )
uclinux-2008r1(bf561)内核的icache支持(3):__fill_code_cplbtab( 2008/5/16 )
uclinux-2008r1(bf561)内核的icache支持(4):换页问题( 2008/5/16 )
再读uclinux-2008r1(bf561)内核中的bootmem( 2008/6/3 )
uclinux-2008r1(bf561)内核中与存储管理相关的几个全局变量( 2008/6/4 )
uclinux-2008r1(bf561)内核存储区域初探( 2008/6/4 )
uclinux-2008r1(bf561)内核中的zonelist初始化( 2008/6/5 )
uclinux-2008r1(bf561)内核中内存管理相关的几个结构体( 2008/6/5 )
再读内核存储管理(1):相关的全局变量( 2008/6/17 )
再读内核存储管理(2):相关的数据结构( 2008/6/17 )
再读内核存储管理(3):bootmem分配策略( 2008/6/17 )
再读内核存储管理(4):存储区域管理( 2008/6/17 )
再读内核存储管理(5):buddy算法( 2008/6/17 )
再读内核存储管理(6):高速缓存的应用( 2008/6/17 )
再读内核存储管理(7):icache支持( 2008/6/17 )
再读内核存储管理(8):片内SRAM的使用( 2008/6/17 )
初读SLAB( 2008/6/26 )
三读bootmem( 2008/7/24 )
再读uclinux-2008r1(bf561)内核存储区域管理(1):相关数据结构( 2008/7/25 )
再读uclinux-2008r1(bf561)内核存储区域管理(2):可用页表初始化( 2008/7/25 )
再读uclinux-2008r1(bf561)内核存储区域管理(3):zone初始化( 2008-7-25 )
再读uclinux-2008r1(bf561)内核存储区域管理(4):zonelist初始化( 2008-7-25 )
再读uclinux-2008r1(bf561)内核存储区域管理(5):page初始化( 2008-7-25 )
更多推荐
所有评论(0)