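The slab system calls the buddy allocator to obtain the memory pages it needs and uses them as slabs.

Unlike the slab system, the buddy system mainly serves larger allocation requests (at least one memory page). This article again starts from the implementation of kmalloc and, together with the buddy system's core function __alloc_pages_nodemask, explains how the buddy system allocates page frames.

Because the page-frame allocation flow in the buddy system is quite complex, this article covers only the first allocation attempt, i.e. the path taken when __alloc_pages_nodemask first calls get_page_from_freelist. __alloc_pages_slowpath is left for the next article.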

1. kmalloc_large

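When kmalloc runs and the requested size is larger than KMALLOC_MAX_CACHE_SIZE, it calls kmalloc_large to perform the allocation.

kmalloc_large first calls get_order on the requested size to obtain order, the power-of-two exponent of the number of pages needed to cover the request, and then calls kmalloc_order: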

void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
{
    void *ret;
    struct page *page;

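    /*
     * Add the compound-page flag so the metadata needed when the memory
     * is freed gets recorded. One use of compound pages is transparent
     * huge pages (THP).
     * A compound page combines several contiguous pages into one unit:
     * the first page is the head page and the rest are tail pages.
     * The head page stores the metadata of the compound page.
     */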
    flags |= __GFP_COMP;
    page = alloc_kmem_pages(flags, order);
    ret = page ? page_address(page) : NULL;
    kmemleak_alloc(ret, size, 1, flags);
    return ret;
}
EXPORT_SYMBOL(kmalloc_order);
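
To make the meaning of order concrete, here is a minimal userspace sketch of the size-to-order computation. It is not the kernel's get_order: the name sketch_get_order and the constant SKETCH_PAGE_SHIFT are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical model of get_order(): the smallest order such that
 * (1 << order) pages cover `size` bytes. Returns 0 for size == 0,
 * which is a choice of this sketch only.
 */
unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;
    size_t pages;

    if (size == 0)
        return 0;

    /* Number of pages needed, minus one */
    pages = (size - 1) >> SKETCH_PAGE_SHIFT;

    /* Count how many bits are needed to represent that value */
    while (pages) {
        order++;
        pages >>= 1;
    }
    return order;
}
```

Under these assumptions a 4096-byte request maps to order 0, a 4097-byte request to order 1, and an 8193-byte request to order 2, so a request slightly above a power-of-two boundary consumes twice as many pages.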

1.1. alloc_kmem_pages

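The kernel comment on this function says that alloc_kmem_pages charges the newly allocated pages to the kmem resource counter.

Here charge is the cgroup term: the pages are billed to the cgroup's counter, so it can be read as accounting for an allocation. Correspondingly, uncharge accounts for a release.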

/*
 * alloc_kmem_pages charges newly allocated pages to the kmem resource counter
 * of the current memory cgroup.
 *
 * It should be used when the caller would like to use kmalloc, but since the
 * allocation is large, it has to fall back to the page allocator.
 */
struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;
    struct mem_cgroup *memcg = NULL;

    /* Check whether the memcg allows this new page allocation */
    if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
        return NULL;
    page = alloc_pages(gfp_mask, order);
    memcg_kmem_commit_charge(page, memcg, order);
    return page;
}

1.2. alloc_pages_current

On NUMA systems, alloc_pages ends up calling alloc_pages_current in mm/mempolicy.c to allocate the pages.

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
    struct mempolicy *pol = get_task_policy(current);
    struct page *page;
    unsigned int cpuset_mems_cookie;

    if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
        pol = &default_policy;

retry_cpuset:
    cpuset_mems_cookie = read_mems_allowed_begin();

    /*
     * No reference counting needed for current->mempolicy
     * nor system default_policy
     */
    if (pol->mode == MPOL_INTERLEAVE)
        page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
    else
        page = __alloc_pages_nodemask(gfp, order,
                policy_zonelist(gfp, pol, numa_node_id()),
                policy_nodemask(gfp, pol));

    if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
        goto retry_cpuset;

    return page;
}
EXPORT_SYMBOL(alloc_pages_current);

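alloc_pages_current takes a different path depending on the memory policy bound to the current task.

With MPOL_INTERLEAVE it runs alloc_page_interleave, which first obtains the usable zonelist and finally allocates through __alloc_pages_nodemask; in this case the nodemask argument is NULL.
Otherwise it calls __alloc_pages_nodemask directly.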

2. __alloc_pages_nodemask

__alloc_pages_nodemask is the zoned buddy allocator, the heart of the buddy system, and performs the page allocation.

struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
            struct zonelist *zonelist, nodemask_t *nodemask)
{
    /* Derive the highest usable zone type from the gfp flags of the request */
    enum zone_type high_zoneidx = gfp_zone(gfp_mask);
    struct zone *preferred_zone;
    struct zoneref *preferred_zoneref;
    struct page *page = NULL;

    /* Derive the page migrate type from the gfp flags */
    int migratetype = allocflags_to_migratetype(gfp_mask);
    unsigned int cpuset_mems_cookie;

    /* alloc_flags for the first allocation attempt */
    int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
    int classzone_idx;

    /* Drop gfp flags that are not currently allowed */
    gfp_mask &= gfp_allowed_mask;

    lockdep_trace_alloc(gfp_mask);

    might_sleep_if(gfp_mask & __GFP_WAIT);

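    /*
     * Fault injection: use the fail_page_alloc settings in
     * mm/page_alloc.c to decide whether this request should be
     * failed on purpose
     */
    if (should_fail_alloc_page(gfp_mask, order))
        return NULL;

    /*
     * After initialization, zonelist->_zonerefs holds every usable zone
     * and is terminated by a NULL entry
     */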
    if (unlikely(!zonelist->_zonerefs->zone))
        return NULL;

retry_cpuset:
    cpuset_mems_cookie = read_mems_allowed_begin();

    /* The preferred zone is used for statistics later */
    /*
     * When picking preferred_zone, use the passed-in nodemask if there
     * is one; otherwise use the memory nodes the current task may use
     * under its cpuset restrictions
     */
    preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
                nodemask ? : &cpuset_current_mems_allowed,
                &preferred_zone);
    if (!preferred_zone)
        goto out;
    classzone_idx = zonelist_zone_idx(preferred_zoneref);

#ifdef CONFIG_CMA
    if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
        alloc_flags |= ALLOC_CMA;
#endif
    /* First allocation attempt */
    page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
            zonelist, high_zoneidx, alloc_flags,
            preferred_zone, classzone_idx, migratetype);
    if (unlikely(!page)) {
        /*
         * Runtime PM, block IO and its error handling path
         * can deadlock because I/O on the device might not
         * complete.
         */
        gfp_mask = memalloc_noio_flags(gfp_mask);
        page = __alloc_pages_slowpath(gfp_mask, order,
                zonelist, high_zoneidx, nodemask,
                preferred_zone, classzone_idx, migratetype);
    }

    trace_mm_page_alloc(page, order, gfp_mask, migratetype);

out:
    /*
     * When updating a task's mems_allowed, it is possible to race with
     * parallel threads in such a way that an allocation can fail while
     * the mask is being updated. If a page allocation is about to fail,
     * check if the cpuset changed during allocation and if so, retry.
     */
    if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
        goto retry_cpuset;

    return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

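__alloc_pages_nodemask uses first_zones_zonelist to find the first usable zoneref and, on success, moves on to get_page_from_freelist.
If that allocation fails, it calls memalloc_noio_flags, which clears the I/O-related bits (__GFP_IO and __GFP_FS) from gfp_mask when the current task has PF_MEMALLOC_NOIO set, and then calls __alloc_pages_slowpath to try the allocation again.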

3. get_page_from_freelist

get_page_from_freelist is the first allocation attempt made by __alloc_pages_nodemask:

/*
 * When __alloc_pages_nodemask calls this function, the gfp_mask argument
 * has been changed: __GFP_HARDWALL has been added, and the cpuset check
 * below relies on it
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
        struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
        struct zone *preferred_zone, int classzone_idx, int migratetype)
{
    struct zoneref *z;
    struct page *page = NULL;
    struct zone *zone;
    nodemask_t *allowednodes = NULL;
    int zlc_active = 0;
    int did_zlc_setup = 0;
    bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
                (gfp_mask & __GFP_WRITE);
    int nr_fair_skipped = 0;
    bool zonelist_rescan;

zonelist_scan:
    zonelist_rescan = false;

    /* Scan every zone whose type is at or below high_zoneidx */
    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                        high_zoneidx, nodemask) {
        unsigned long mark;

        /*
         * If the zlcache is active, a zone is worth trying only if it
         * is not marked full in the bitmap and its node is in allowednodes
         */
        if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
            !zlc_zone_worth_trying(zonelist, z, allowednodes))
                continue;

        /*
         * With cpusets enabled, ALLOC_CPUSET says the allocation must
         * respect cpuset restrictions; cpuset_zone_allowed_softwall
         * decides whether the current zone may be used
         */
        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !cpuset_zone_allowed_softwall(zone, gfp_mask))
                continue;

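        /*
         * With ALLOC_FAIR set, stop the scan as soon as a zone is not on
         * the same node as preferred_zone (the fair pass only considers
         * local zones). A local zone whose fair allocation batch is
         * depleted is skipped and counted in nr_fair_skipped
         */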
        if (alloc_flags & ALLOC_FAIR) {
            if (!zone_local(preferred_zone, zone))
                break;
            if (zone_is_fair_depleted(zone)) {
                nr_fair_skipped++;
                continue;
            }
        }

        /* Dirty-page filter: keep each zone's dirty pages within its limit */
        if (consider_zone_dirty && !zone_dirty_ok(zone))
            continue;

        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
        /* watermark is not OK */
        if (!zone_watermark_ok(zone, order, mark,
                       classzone_idx, alloc_flags)) {
            int ret;

            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            /*# don't care about watermark */
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;

            if (IS_ENABLED(CONFIG_NUMA) &&
                    !did_zlc_setup && nr_online_nodes > 1) {

                /*
                 * Set up the zlcache and get the allowed nodes. If at
                 * least 1s has passed since the bitmap was last zapped,
                 * zap it
                 */
                allowednodes = zlc_setup(zonelist, alloc_flags);
                zlc_active = 1;
                did_zlc_setup = 1;
            }

            /*
             * Zone reclaim is disabled, or it is enabled but this zone
             * is too far from preferred_zone: mark the zone full
             */
            if (zone_reclaim_mode == 0 ||
                !zone_allows_reclaim(preferred_zone, zone))
                goto this_zone_full;

            /*
             * As we may have just activated ZLC, check if the first
             * eligible zone has failed zone_reclaim recently.
             */
            if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
                !zlc_zone_worth_trying(zonelist, z, allowednodes))
                continue;

            /* Try to reclaim pages from this zone; ret reports the outcome */
            ret = zone_reclaim(zone, gfp_mask, order);
            switch (ret) {
            case ZONE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case ZONE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                        classzone_idx, alloc_flags))
                    goto try_this_zone;

                /*
                 * Failed to reclaim enough to meet watermark.
                 * Only mark the zone full if checking the min
                 * watermark or if we failed to reclaim just
                 * 1<<order pages or else the page allocator
                 * fastpath will prematurely mark zones full
                 * when the watermark is between the low and
                 * min watermarks.
                 */
                /*
                 * If the min watermark is already in use, or only some
                 * pages were reclaimed and the watermark is still not
                 * met, mark the zone full.
                 * The mark exists only in the zlcache
                 */
                if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
                    ret == ZONE_RECLAIM_SOME)
                    goto this_zone_full;

                continue;
            }
        }

try_this_zone:
        /*
         * Reached when free pages are above the watermark, when
         * ALLOC_NO_WATERMARKS is set, or when reclaim brought the zone
         * back above the watermark
         */
        page = buffered_rmqueue(preferred_zone, zone, order,
                        gfp_mask, migratetype);
        if (page)
            break;
this_zone_full:
        if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
            zlc_mark_zone_full(zonelist, z);
    }

    if (page) {
        page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
        return page;
    }
    if (alloc_flags & ALLOC_FAIR) {
        alloc_flags &= ~ALLOC_FAIR;
        if (nr_fair_skipped) {
            zonelist_rescan = true;
            reset_alloc_batches(preferred_zone);
        }
        if (nr_online_nodes > 1)
            zonelist_rescan = true;
    }

    if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
        /* Disable zlc cache for second zonelist scan */
        zlc_active = 0;
        zonelist_rescan = true;
    }

    if (zonelist_rescan)
        goto zonelist_scan;

    return NULL;
}

On this call __alloc_pages_nodemask passes fairly restrictive alloc_flags, and the function itself tries to honor as many of those constraints as it can.

3.1. cpuset_zone_allowed_softwall

If cpusets are enabled, cpuset_zone_allowed_softwall ends up calling __cpuset_node_allowed_softwall to enforce the cpuset restrictions:

int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
{
    struct cpuset *cs;      /* current cpuset ancestors */
    int allowed;            /* is allocation in zone z allowed? */

    if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
        return 1;
    might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
    if (node_isset(node, current->mems_allowed))
        return 1;
    /*
     * Allow tasks that have access to memory reserves because they have
     * been OOM killed to get memory anywhere.
     */
    if (unlikely(test_thread_flag(TIF_MEMDIE)))
        return 1;
    if (gfp_mask & __GFP_HARDWALL)  
        /* If hardwall request, stop here */
        return 0;

    if (current->flags & PF_EXITING) /* Let dying task have memory */
        return 1;

    /* Not hardwall and node outside mems_allowed: scan up cpusets */
    mutex_lock(&callback_mutex);

    rcu_read_lock();
    cs = nearest_hardwall_ancestor(task_cs(current));
    allowed = node_isset(node, cs->mems_allowed);
    rcu_read_unlock();

    mutex_unlock(&callback_mutex);
    return allowed;
}

The comments in the source describe the function directly:

If we’re in interrupt, yes, we can always allocate. If __GFP_THISNODE is set, yes, we can always allocate. If node is in our task’s mems_allowed, yes. If it’s not a __GFP_HARDWALL request and this node is in the nearest hardwalled cpuset ancestor to this task’s cpuset, yes. If the task has been OOM killed and has access to memory reserves as specified by the TIF_MEMDIE flag, yes.
Otherwise, no.

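Allocation is allowed outright in three cases: in interrupt context, when __GFP_THISNODE is set, or when the node is in mems_allowed.
If the request is not a __GFP_HARDWALL request and the node belongs to the nearest hardwalled ancestor of the current task's cpuset, allocation is allowed.
If the task has been OOM killed, allocation is allowed.
In every other case allocation is refused.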

If __GFP_HARDWALL is set, cpuset_node_allowed_softwall() reduces to cpuset_node_allowed_hardwall(). Otherwise, cpuset_node_allowed_softwall() might sleep, and might allow a node from an enclosing cpuset.
cpuset_node_allowed_hardwall() only handles the simpler case of hardwall cpusets, and never sleeps.

With __GFP_HARDWALL set, the function reduces to cpuset_node_allowed_hardwall; without it, cpuset_node_allowed_softwall may sleep.

The __GFP_THISNODE placement logic is really handled elsewhere, by forcibly using a zonelist starting at a specified node, and by (in get_page_from_freelist()) refusing to consider the zones for any node on the zonelist except the first. By the time any such calls get to this routine, we should just shut up and say ‘yes’.

The __GFP_THISNODE flag has already been handled elsewhere, so this function simply answers yes.

GFP_USER allocations are marked with the __GFP_HARDWALL bit, and do not allow allocations outside the current tasks cpuset unless the task has been OOM killed as is marked TIF_MEMDIE. GFP_KERNEL allocations are not so marked, so can escape to the nearest enclosing hardwalled ancestor cpuset.

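GFP_USER sets __GFP_HARDWALL, so memory may not be allocated from zones outside the current task's cpuset unless TIF_MEMDIE is set. GFP_KERNEL does not carry this flag, so it can allocate from the nearest hardwalled ancestor cpuset.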

The first call here from mm/page_alloc:get_page_from_freelist() has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets, so no allocation on a node outside the cpuset is allowed (unless in interrupt, of course).

When __alloc_pages_nodemask in mm/page_alloc.c calls get_page_from_freelist for the first time, __GFP_HARDWALL is set, so no allocation from a node outside the cpuset is allowed (unless in interrupt context).

The second pass through get_page_from_freelist() doesn’t even call here for GFP_ATOMIC calls. For those calls, the __alloc_pages() variable ‘wait’ is not set, and the bit ALLOC_CPUSET is not set in alloc_flags. That logic and the checks below have the combined affect that:

in_interrupt - any node ok (current task context irrelevant)
GFP_ATOMIC - any node ok
TIF_MEMDIE - any node ok
GFP_KERNEL - any node in enclosing hardwalled cpuset ok
GFP_USER - only nodes in current tasks mems allowed ok.

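When __alloc_pages_nodemask takes the slow path, __alloc_pages_slowpath calls get_page_from_freelist again. For GFP_ATOMIC requests that second pass does not reach this function at all: for such calls the wait variable in __alloc_pages is not set, and ALLOC_CPUSET is not set in alloc_flags.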
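
The order of the checks in __cpuset_node_allowed_softwall is what produces the table above, so it may help to see it in isolation. The following is only a userspace sketch: struct sketch_ctx and sketch_node_allowed are hypothetical names, and each boolean stands in for a piece of kernel state that the real function reads itself.

```c
#include <stdbool.h>

/* Hypothetical inputs; each field replaces one kernel-side test. */
struct sketch_ctx {
    bool in_interrupt;              /* in_interrupt() */
    bool gfp_thisnode;              /* gfp_mask & __GFP_THISNODE */
    bool gfp_hardwall;              /* gfp_mask & __GFP_HARDWALL */
    bool node_in_mems_allowed;      /* node is in current->mems_allowed */
    bool tif_memdie;                /* task was OOM killed */
    bool pf_exiting;                /* task is exiting */
    bool node_in_hardwall_ancestor; /* node is in the nearest hardwalled ancestor */
};

/* Mirrors the order of the checks in the listing above, without locking. */
bool sketch_node_allowed(const struct sketch_ctx *c)
{
    if (c->in_interrupt || c->gfp_thisnode)
        return true;
    if (c->node_in_mems_allowed)
        return true;
    if (c->tif_memdie)
        return true;
    if (c->gfp_hardwall)
        return false;               /* hardwall request stops here */
    if (c->pf_exiting)
        return true;
    return c->node_in_hardwall_ancestor;
}
```

With only gfp_hardwall set the sketch refuses the node, which corresponds to the GFP_USER row; with only node_in_hardwall_ancestor set it accepts the node, which corresponds to the GFP_KERNEL row.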

3.2. zone_watermark_ok

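For each zone in the list, if the zlcache has not marked it full, it is on an allowed memory node, it satisfies the cpuset restrictions, it is not skipped by the ALLOC_FAIR check and it passes the dirty check, zone_watermark_ok is called to decide whether the zone's free pages are above the given watermark.

zone_watermark_ok is implemented by __zone_watermark_ok: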

static bool __zone_watermark_ok(struct zone *z, unsigned int order,
            unsigned long mark, int classzone_idx, int alloc_flags,
            long free_pages)
{
    /* free_pages may go negative - that's OK */
    long min = mark;
    int o;
    long free_cma = 0;

    free_pages -= (1 << order) - 1;
    if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;
    if (alloc_flags & ALLOC_HARDER)
        min -= min / 4;
#ifdef CONFIG_CMA    /* not configured by default on x86 */
    /* If allocation can't use CMA areas don't use free CMA pages */
    if (!(alloc_flags & ALLOC_CMA))
        free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

    /*
     * Return false if the free pages left after this allocation do not
     * exceed the watermark plus the zone's lowmem reserve
     */
    if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
        return false;
    for (o = 0; o < order; o++) {
        /* At the next order, this order's pages become unavailable */
        free_pages -= z->free_area[o].nr_free << o;

        /* Require fewer higher order pages to be free */
        min >>= 1;

        if (free_pages <= min)
            return false;
    }
    return true;
}
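
The per-order loop is easier to follow with numbers. The sketch below is a simplified userspace model of the check, assuming no ALLOC_HIGH, ALLOC_HARDER or CMA adjustment; sketch_watermark_ok, SKETCH_MAX_ORDER and the nr_free array (free blocks per order) are hypothetical stand-ins for the zone fields.

```c
#include <stdbool.h>

#define SKETCH_MAX_ORDER 11

/*
 * Simplified model of __zone_watermark_ok():
 *   nr_free[o]  - number of free blocks of order o
 *   free_pages  - total free pages in the zone
 * The ALLOC_HIGH / ALLOC_HARDER / CMA adjustments are left out.
 */
bool sketch_watermark_ok(unsigned int order, long mark, long lowmem_reserve,
                         long free_pages,
                         const unsigned long nr_free[SKETCH_MAX_ORDER])
{
    long min = mark;
    unsigned int o;

    /* Pretend the request has already been taken out */
    free_pages -= (1L << order) - 1;
    if (free_pages <= min + lowmem_reserve)
        return false;

    for (o = 0; o < order; o++) {
        /* Blocks of order o cannot satisfy a request of a higher order */
        free_pages -= (long)(nr_free[o] << o);

        /* Require fewer free pages at each higher order */
        min >>= 1;

        if (free_pages <= min)
            return false;
    }
    return true;
}
```

Take nr_free = {100, 20, 5} (160 free pages) and mark = 64 with no reserve. An order-2 request passes: 157 > 64, then 57 > 32 after removing the order-0 pages, then 17 > 16 after removing the order-1 pages. An order-3 request fails at the second step, because 153 - 100 - 40 = 13 is not above 16. The zone has enough pages in total, but too few of them sit in large blocks.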

3.3. buffered_rmqueue

Once a usable zone has been found, buffered_rmqueue allocates the requested pages from it:

static inline
struct page *buffered_rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, int migratetype)
{
    unsigned long flags;
    struct page *page;
    bool cold = ((gfp_flags & __GFP_COLD) != 0);

again:
    if (likely(order == 0)) {
        struct per_cpu_pages *pcp;
        struct list_head *list;

        local_irq_save(flags);
        pcp = &this_cpu_ptr(zone->pageset)->pcp;

        /* Get this CPU's page list for the given migratetype */
        list = &pcp->lists[migratetype];

        /*
         * The per-cpu page-frame cache is empty: call rmqueue_bulk to
         * refill it with batch pages from the buddy system
         */
        if (list_empty(list)) {
            pcp->count += rmqueue_bulk(zone, 0,
                    pcp->batch, list,
                    migratetype, cold);
            if (unlikely(list_empty(list)))
                goto failed;
        }

        /* Take a hot page from the head of the list, a cold page from the tail */
        if (cold)
            page = list_entry(list->prev, struct page, lru);
        else
            page = list_entry(list->next, struct page, lru);

        list_del(&page->lru);
        pcp->count--;
    } else {
        if (unlikely(gfp_flags & __GFP_NOFAIL)) {

            /* Warn about nofail requests larger than 2 pages (order > 1) */
            WARN_ON_ONCE(order > 1);
        }
        spin_lock_irqsave(&zone->lock, flags);

        /* Multi-page requests go to the buddy system */
        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);
        if (!page)
            goto failed;

        /* Decrease the zone's free-page count */
        __mod_zone_freepage_state(zone, -(1 << order),
                      get_freepage_migratetype(page));
    }

    /* Subtract from the zone's NR_ALLOC_BATCH counter */
    __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));

    /* Also mark a zone whose NR_ALLOC_BATCH has reached 0 as fair depleted */
    if (zone_page_state(zone, NR_ALLOC_BATCH) == 0 &&
        !zone_is_fair_depleted(zone))
        zone_set_flag(zone, ZONE_FAIR_DEPLETED);

    /* Bump the vm_event counters */
    __count_zone_vm_events(PGALLOC, zone, 1 << order);
    zone_statistics(preferred_zone, zone, gfp_flags);
    local_irq_restore(flags);

    VM_BUG_ON_PAGE(bad_range(zone, page), page);
    /*
     * Initialize the allocated pages according to the allocation flags.
     * The check fails for any of these reasons:
     * 1. _mapcount != 0
     * 2. mapping != NULL
     * 3. _count != 0
     * 4. some flags are not 0
     * 5. the cgroup check fails
     */
    if (prep_new_page(page, order, gfp_flags))
        goto again;
    return page;

failed:
    local_irq_restore(flags);
    return NULL;
}

3.3.1. rmqueue_bulk

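As the function that refills the per-cpu page-frame cache, rmqueue_bulk calls __rmqueue repeatedly to take batch page frames from the buddy system. Depending on the cold argument, it adds each page to the head (cold is false) or the tail (cold is true) of the list passed in, i.e. the page list of the per-cpu page-frame cache.

The function updates the zone's free-page statistics and returns the number of pages actually obtained.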

3.3.2. __rmqueue

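__rmqueue is the "real" buddy-system allocation function. It first tries to allocate the requested pages from the requested migratetype. If that fails and the requested migratetype is not MIGRATE_RESERVE, it calls __rmqueue_fallback. If that fails as well, it sets migratetype to MIGRATE_RESERVE and allocates from the reserved area.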

static struct page *__rmqueue(struct zone *zone, unsigned int order,
                        int migratetype)
{
    struct page *page;

retry_reserve:
    page = __rmqueue_smallest(zone, order, migratetype);

    if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
        page = __rmqueue_fallback(zone, order, migratetype);

        /*
         * Use MIGRATE_RESERVE rather than fail an allocation. goto
         * is used because __rmqueue_smallest is an inline function
         * and we want just one call site
         */
        if (!page) {
            migratetype = MIGRATE_RESERVE;
            goto retry_reserve;
        }
    }

    trace_mm_page_alloc_zone_locked(page, order, migratetype);
    return page;
}

3.3.2.1. __rmqueue_smallest

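__rmqueue_smallest is the first allocation attempt made by __rmqueue.
Starting at the requested order, it walks upward and allocates from the first free_area whose free list for the given migratetype is not empty, i.e. from the smallest block that is large enough.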

static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        if (list_empty(&area->free_list[migratetype]))
            continue;

        page = list_entry(area->free_list[migratetype].next,
                            struct page, lru);
        list_del(&page->lru);
        rmv_page_order(page);
        area->nr_free--;
        expand(zone, page, order, current_order, area, migratetype);
        set_freepage_migratetype(page, migratetype);
        return page;
    }

    return NULL;
}

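If no free block of exactly the requested order is available, __rmqueue_smallest takes a block from the nearest larger order, then calls expand to split that larger block into smaller ones and puts the unused halves on the corresponding free lists.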
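
The splitting done by expand can be illustrated with counters alone. The following is a userspace sketch, not kernel code: sketch_rmqueue_smallest and SKETCH_MAX_ORDER are hypothetical names, a single migratetype is assumed, and nr_free[o] only counts free blocks of order o instead of linking struct page entries.

```c
#define SKETCH_MAX_ORDER 11

/*
 * Model of __rmqueue_smallest() plus expand() on per-order counters.
 * Returns the order of the block that was split, or -1 if no block
 * of at least the requested order is free.
 */
int sketch_rmqueue_smallest(unsigned long nr_free[SKETCH_MAX_ORDER],
                            unsigned int order)
{
    unsigned int cur;

    for (cur = order; cur < SKETCH_MAX_ORDER; cur++) {
        unsigned int high = cur;

        if (nr_free[cur] == 0)
            continue;

        /* Take one block off the list of the order that was found */
        nr_free[high]--;

        /*
         * expand(): halve the block until it has the requested size;
         * each unused half becomes a free block one order lower.
         */
        while (high > order) {
            high--;
            nr_free[high]++;
        }
        return (int)cur;
    }
    return -1;
}
```

Starting from a single free order-3 block, an order-0 request leaves one free block each at orders 2, 1 and 0: seven of the eight pages stay free and one page is handed out.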

3.3.2.2. __rmqueue_fallback

If __rmqueue_smallest fails and migratetype is not the reserve type, __rmqueue_fallback is called.
Unlike __rmqueue_smallest, __rmqueue_fallback starts from the largest order. It looks up the fallbacks table for the migratetypes that may stand in for the requested one, finds a usable free_area, and allocates page frames from it.

static inline struct page *
__rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
{
    struct free_area *area;
    unsigned int current_order;
    struct page *page;
    int migratetype, new_type, i;

    /* Find the largest possible block of pages in the other list */
    for (current_order = MAX_ORDER-1;
                current_order >= order && current_order <= MAX_ORDER-1;
                --current_order) {
        for (i = 0;; i++) {
            migratetype = fallbacks[start_migratetype][i];

            /* MIGRATE_RESERVE handled later if necessary */

            /* MIGRATE_RESERVE is the last element of each row */
            if (migratetype == MIGRATE_RESERVE)
                break;

            area = &(zone->free_area[current_order]);
            if (list_empty(&area->free_list[migratetype]))
                continue;

            page = list_entry(area->free_list[migratetype].next,
                    struct page, lru);
            area->nr_free--;

            /* start_migratetype is the preferred type, migratetype the fallback */
            new_type = try_to_steal_freepages(zone, page,
                              start_migratetype,
                              migratetype);

            /* Remove the page from the freelists */
            list_del(&page->lru);
            rmv_page_order(page);

            /*
             * expand splits the block taken from
             * free_area[current_order].free_list[new_type]
             * into smaller-order blocks and stores them on the
             * corresponding lists
             */
            expand(zone, page, order, current_order, area,
                   new_type);
            /* The freepage_migratetype may differ from pageblock's
             * migratetype depending on the decisions in
             * try_to_steal_freepages. This is OK as long as it does
             * not differ for MIGRATE_CMA type.
             */
            set_freepage_migratetype(page, new_type);

            trace_mm_page_alloc_extfrag(page, order, current_order,
                start_migratetype, migratetype, new_type);

            return page;
        }
    }

    return NULL;
}

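fallbacks is a two-dimensional array defined in mm/page_alloc.c. For each migratetype it lists the other migratetypes that may be used when an allocation of that type fails.

If __rmqueue_fallback fails too, __rmqueue sets migratetype to the reserve type and retries __rmqueue_smallest; since the type is now MIGRATE_RESERVE, __rmqueue_fallback is not called a second time.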
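
The search order of __rmqueue_fallback, largest order first and fallback table second, can be sketched in userspace. Everything named sketch_* or MT_* below is hypothetical, and the table contents are an assumed layout for a build without CMA; the real rows should be checked in mm/page_alloc.c of the kernel version being studied.

```c
#define SKETCH_MAX_ORDER 11

enum sketch_mt { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE, MT_RESERVE, MT_TYPES };

/* Assumed fallback order per requested type; every row ends in MT_RESERVE. */
static const int sketch_fallbacks[MT_TYPES][3] = {
    [MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE,   MT_RESERVE },
    [MT_RECLAIMABLE] = { MT_UNMOVABLE,   MT_MOVABLE,   MT_RESERVE },
    [MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE, MT_RESERVE },
    [MT_RESERVE]     = { MT_RESERVE },
};

/*
 * nr_free[order][type] counts free blocks. Returns the migratetype that
 * would be borrowed from and stores the order of the chosen block in
 * *found_order, or returns -1 if nothing suitable is free.
 */
int sketch_pick_fallback(unsigned long nr_free[][MT_TYPES], unsigned int order,
                         int start_type, unsigned int *found_order)
{
    int cur, i;

    /* Largest blocks first, to keep fragmentation down */
    for (cur = SKETCH_MAX_ORDER - 1; cur >= (int)order; cur--) {
        for (i = 0;; i++) {
            int mt = sketch_fallbacks[start_type][i];

            /* The reserve type is left to the caller's retry */
            if (mt == MT_RESERVE)
                break;
            if (nr_free[cur][mt] == 0)
                continue;

            *found_order = (unsigned int)cur;
            return mt;
        }
    }
    return -1;
}
```

In this model, if an unmovable request finds a free reclaimable block at order 2 and a free movable block at order 5, the movable block wins even though reclaimable comes first in the row, because the order loop is the outer one.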

3.3.2.2.1. try_to_steal_freepages

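Once a usable free_area has been found, __rmqueue_fallback calls try_to_steal_freepages, which decides from the block size and the two migratetypes whether to move free page frames.

The kernel comment on the function says:

When breaking up a large block, move all free page frames to the preferred allocation list. If the fallback is for a reclaimable kernel allocation, be more aggressive about taking ownership of free pages.

On the other hand, never change the migratetype of MIGRATE_CMA pageblocks and never move CMA page frames to other free lists: unmovable pages must not be allocated from MIGRATE_CMA areas.

The pageblock_order variable used in the function is defined in include/linux/pageblock-flags.h. If CONFIG_HUGETLB_PAGE is enabled (the default) and no order value is supplied through the kernel configuration, it is HUGETLB_PAGE_ORDER, i.e. 9; otherwise it is MAX_ORDER - 1, i.e. 10.
The page_group_by_mobility_disabled variable is set in build_all_zonelists: if the system has too few page frames it is set to 1, which turns off "group by mobility".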

/*
 * @start_type is the preferred migratetype, i.e. the type requested
 * @fallback_type is the substitute type obtained from fallbacks
 */
static int try_to_steal_freepages(struct zone *zone, struct page *page,
                  int start_type, int fallback_type)
{
    /* current_order is the order of the block that was found */
    int current_order = page_order(page);

    /*
     * When borrowing from MIGRATE_CMA, we need to release the excess
     * buddy pages to CMA itself. We also ensure the freepage_migratetype
     * is set to CMA so it is returned to the correct freelist in case
     * the page ends up being not actually allocated from the pcp lists.
     */
    if (is_migrate_cma(fallback_type))
        return fallback_type;

    /* Take ownership for orders >= pageblock_order */
    if (current_order >= pageblock_order) {

        /*
         * change_pageblock_range walks the block in units of
         * pageblock_order and sets the migratetype of every
         * pageblock to start_type, the preferred type
         */
        change_pageblock_range(page, current_order, start_type);
        return start_type;
    }

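    /*
     * Page frames are also moved, i.e. ownership is taken more
     * aggressively, in these cases:
     * 1. the current order is at least half of pageblock_order
     * 2. the preferred migratetype is reclaimable
     * 3. "group by mobility" is turned off
     */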
    if (current_order >= pageblock_order / 2 ||
        start_type == MIGRATE_RECLAIMABLE ||
        page_group_by_mobility_disabled) {
        int pages;

        /* Move the free pages of the block to the start_type lists */
        pages = move_freepages_block(zone, page, start_type);

        /* Claim the whole block if over half of it is free */
        if (pages >= (1 << (pageblock_order-1)) ||
                page_group_by_mobility_disabled) {

            /*
             * The migratetype update done in move_freepages may not
             * have covered this page's pageblock, because the range
             * is aligned to pageblock_nr_pages
             */
            set_pageblock_migratetype(page, start_type);
            return start_type;
        }

    }

    return fallback_type;
}

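As the code shows, for blocks whose order is at least pageblock_order only the migratetype of the pageblocks is changed and no free pages are moved at this point. "Taking ownership more aggressively", by contrast, means moving the free page frames onto the lists of the requested migratetype.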
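
The three outcomes of try_to_steal_freepages can be condensed into a small decision function. This is a userspace sketch with hypothetical names (struct sketch_steal_in, sketch_steal); free_pages_in_block stands in for the value that move_freepages_block would return, and no lists are actually touched.

```c
#include <stdbool.h>

struct sketch_steal_in {
    int  current_order;        /* order of the block that was found */
    int  pageblock_order;      /* e.g. 9 */
    bool fallback_is_cma;      /* is_migrate_cma(fallback_type) */
    bool start_is_reclaimable; /* start_type == MIGRATE_RECLAIMABLE */
    bool mobility_disabled;    /* page_group_by_mobility_disabled */
    int  free_pages_in_block;  /* pages that moving the block would report */
};

enum sketch_owner { KEEP_FALLBACK, TAKE_START };

/* Which migratetype the taken block ends up being accounted to. */
enum sketch_owner sketch_steal(const struct sketch_steal_in *in)
{
    /* CMA pageblocks never change type */
    if (in->fallback_is_cma)
        return KEEP_FALLBACK;

    /* Whole pageblocks are simply relabelled */
    if (in->current_order >= in->pageblock_order)
        return TAKE_START;

    if (in->current_order >= in->pageblock_order / 2 ||
        in->start_is_reclaimable || in->mobility_disabled) {
        /* Claim the pageblock only if more than half of it is free */
        if (in->free_pages_in_block >= (1 << (in->pageblock_order - 1)) ||
            in->mobility_disabled)
            return TAKE_START;
    }
    return KEEP_FALLBACK;
}
```

With pageblock_order 9, an order-4 block in a pageblock with 300 free pages is claimed for the requested type (300 >= 256), while the same block in a pageblock with 100 free pages stays with the fallback type. Note that in the kernel the free pages are still moved in the second case; only the pageblock label is left alone.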

3.3.2.2.2. move_freepages

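try_to_steal_freepages calls move_freepages_block to move free page frames. The latter first obtains the struct page * for the first and last physical page frame numbers of the block and validates them (one open question here: I do not know why the start and end pfn are aligned to pageblock_nr_pages), then calls move_freepages to move the page frames to the migratetype given as argument.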

int move_freepages(struct zone *zone,
              struct page *start_page, struct page *end_page,
              int migratetype)
{
    struct page *page;
    unsigned long order;
    int pages_moved = 0;

#ifndef CONFIG_HOLES_IN_ZONE
    /*
     * page_zone is not safe to call in this context when
     * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
     * anyway as we check zone boundaries in move_freepages_block().
     * Remove at a later date when no bug reports exist related to
     * grouping pages by mobility
     */
    BUG_ON(page_zone(start_page) != page_zone(end_page));
#endif

    for (page = start_page; page <= end_page;) {
        /* Make sure we are not inadvertently changing nodes */
        VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);

        /*
         * This function has two callers. One is move_freepages_block
         * in the buddy system, described here, which passes the two
         * checks below; the other is in mm/page_isolation.c and
         * may not pass them?
         */
        if (!pfn_valid_within(page_to_pfn(page))) {
            page++;
            continue;
        }

        if (!PageBuddy(page)) {
            page++;
            continue;
        }

        order = page_order(page);

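        /* Move the block to the list of the same order and the given migratetype */
        list_move(&page->lru,
              &zone->free_area[order].free_list[migratetype]);

        /* Update the migratetype */
        set_freepage_migratetype(page, migratetype);

        /*
         * If some pages failed the two checks above, the total of
         * 1 << order will not equal end_pfn - start_pfn
         */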
        page += 1 << order;
        pages_moved += 1 << order;
    }

    return pages_moved;
}

Note that move_freepages reads the order of every free block it visits, and that order can differ from block to block.
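
How the loop steps through a pageblock with mixed orders can be shown with a small userspace model. struct sketch_page and sketch_move_freepages are hypothetical: buddy marks the first page of a free block (what PageBuddy would report) and order is only meaningful on such a page.

```c
#include <stdbool.h>

/* One entry per page frame in the range being walked. */
struct sketch_page {
    bool         buddy;  /* first page of a free buddy block */
    unsigned int order;  /* order of that free block */
};

/*
 * Model of the walk in move_freepages(): allocated pages are skipped
 * one at a time, free blocks are skipped 1 << order pages at a time.
 * Returns the number of pages that would have been moved.
 */
int sketch_move_freepages(const struct sketch_page *pages, int n)
{
    int i = 0;
    int moved = 0;

    while (i < n) {
        if (!pages[i].buddy) {
            i++;
            continue;
        }
        moved += 1 << pages[i].order;
        i += 1 << pages[i].order;
    }
    return moved;
}
```

For an 8-page range that holds a free order-1 block at index 0, two allocated pages at indexes 2 and 3, and a free order-2 block at index 4, the walk visits indexes 0, 2, 3 and 4 and reports 6 moved pages, fewer than the 8 pages in the range.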

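4. Summary

__alloc_pages_nodemask first sets alloc_flags to ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR and calls get_page_from_freelist to try to obtain free pages.
To keep this path fast, get_page_from_freelist uses the zlcache to speed up the search for free page frames.

get_page_from_freelist can scan the zonelist twice. The first scan applies every restriction, including cpuset, the ALLOC_FAIR flag and the dirty limit, and does not consider remote nodes. The second scan drops the ALLOC_FAIR flag and considers remote nodes.

When the free page frames in a zone are below the watermark, get_page_from_freelist calls zone_reclaim to try to reclaim page frames from that zone. The reclaim flow is covered in the article "mm - buddy_allocator page frame reclaim".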

[Figure: mm-buddy_allocator_page_allocation-1]