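Starting from the key data structures of Linux kernel memory management, and drawing on the comments in the kernel source, this article explains how those data structures are initialized. It uses the x86-64 architecture as the example and assumes a NUMA system with the sparse memory model.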

1. Memory Model

Much of the material in this section comes from online sources; I could not find corresponding content under Documentation.

A memory model describes how physical memory is laid out as seen from the CPU, and which scheme the Linux kernel uses to manage that physical memory.

Linux supports three memory models: flat memory, discontiguous memory, and sparse memory.

This article assumes that all CPUs share a single physical address space.

1.1. Flat Memory Model

If, from the point of view of any CPU in the system, the physical address space is contiguous and has no holes, the system uses the flat memory model.

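In this case there is only one node descriptor pg_data_t, and a page frame number (PFN) and an entry in the struct page *mem_map array can be converted into each other with a single offset. Because mem_map is placed in the direct-mapped region of memory, the operating system does not need to build additional page tables for it.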

// include/asm-generic/memory_model.h
#define __pfn_to_page(pfn)    (mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page)    ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
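
The two macros above amount to plain array indexing. The following user-space sketch illustrates the idea; the struct page stand-in, the ARCH_PFN_OFFSET value, and the array size are made-up assumptions, not kernel values.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real one is much larger. */
struct page { unsigned long flags; };

#define ARCH_PFN_OFFSET 16UL   /* hypothetical first valid PFN */
#define NR_PAGES        64UL   /* hypothetical amount of memory, in pages */

/* One descriptor per physical page frame, stored contiguously. */
static struct page mem_map[NR_PAGES];

/* PFN -> descriptor: index mem_map by the PFN relative to the first frame. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
    assert(pfn >= ARCH_PFN_OFFSET && pfn < ARCH_PFN_OFFSET + NR_PAGES);
    return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* Descriptor -> PFN: the array index plus the same offset. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

With these toy values, PFN 16 maps to mem_map[0], and converting in both directions returns the original PFN.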

1.2. Discontiguous Memory Model

If the physical address space contains holes, the memory model is the discontiguous memory model.

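In this case there are multiple node descriptors pg_data_t, and the struct page descriptors for the physical memory managed by each node are stored in the node_mem_map member of its pg_data_t (analogous to mem_map in the flat model). Converting a PFN to a struct page first derives the node ID from the PFN, then locates the corresponding pg_data_t, after which the struct page array is indexed just as in the flat model.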

// include/asm-generic/memory_model.h
#define __pfn_to_page(pfn)          \
({  unsigned long __pfn = (pfn);        \
    unsigned long __nid = arch_pfn_to_nid(__pfn);  \
    NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
})

#define __page_to_pfn(pg)                       \
({  const struct page *__pg = (pg);                 \
    struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
    (unsigned long)(__pg - __pgdat->node_mem_map) +         \
     __pgdat->node_start_pfn;                   \
})
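
As a rough illustration of the per-node lookup, here is a user-space sketch with two nodes separated by a hole. The node layout, the toy_ function names, and the linear search standing in for arch_pfn_to_nid are all assumptions made for the example; the kernel reads the node ID of a page from its flags rather than taking it as a parameter.

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };

/* Simplified per-node descriptor, modelled on pg_data_t. */
struct node_data {
    struct page  *node_mem_map;    /* descriptors for this node's frames */
    unsigned long node_start_pfn;  /* first PFN owned by the node */
    unsigned long node_pages;      /* number of frames in the node */
};

static struct page node0_map[8], node1_map[8];

/* Two nodes separated by a hole: PFNs 0-7 and 32-39 (made-up layout). */
static struct node_data nodes[] = {
    { node0_map,  0, 8 },
    { node1_map, 32, 8 },
};
#define NR_NODES (sizeof(nodes) / sizeof(nodes[0]))

/* Find the node that owns a PFN; returns -1 if the PFN falls in a hole. */
static int toy_pfn_to_nid(unsigned long pfn)
{
    for (size_t nid = 0; nid < NR_NODES; nid++) {
        const struct node_data *n = &nodes[nid];
        if (pfn >= n->node_start_pfn &&
            pfn < n->node_start_pfn + n->node_pages)
            return (int)nid;
    }
    return -1;
}

/* PFN -> descriptor: pick the node, then index its private mem_map. */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
    int nid = toy_pfn_to_nid(pfn);

    if (nid < 0)
        return NULL;
    return nodes[nid].node_mem_map + (pfn - nodes[nid].node_start_pfn);
}

/* Descriptor -> PFN: index within the node plus the node's first PFN. */
static unsigned long toy_page_to_pfn(const struct page *pg, int nid)
{
    return (unsigned long)(pg - nodes[nid].node_mem_map) +
           nodes[nid].node_start_pfn;
}
```

A PFN inside the hole (for example 20) has no owning node, so the lookup returns NULL instead of a descriptor.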

1.2.1. __pfn_to_page

On x86, pfn_to_nid is defined only for 32-bit systems (the flat memory model is available only when a 32-bit kernel is selected):

static inline int pfn_to_nid(unsigned long pfn)
{
#ifdef CONFIG_NUMA
    return((int) physnode_map[(pfn) / PAGES_PER_SECTION]);
#else
    return 0;
#endif
}

This article focuses on 64-bit systems, so the 32-bit details are not covered further.

1.3. Sparse Memory Model

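The sparse model addresses the problem that memory hotplug can leave the mem_map inside a node discontiguous. It divides the address space into sections (128 MB each on x86_64 NUMA), and every section can be hot-plugged independently.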

The whole physical address space is described by an array of struct mem_section * pointers. Each pointer refers to one page, that page holds a number of struct mem_section objects, and each object describes one section.

Within a section, memory addresses are contiguous. The struct page array (mem_map) is therefore organized per section.
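
The section arithmetic can be made concrete with a few shifts. This sketch assumes 4 KB pages (PAGE_SHIFT = 12) and the 128 MB section size mentioned above (SECTION_SIZE_BITS = 27); the helper names are modelled on kernel macros but are written here as ordinary functions for illustration.

```c
#include <assert.h>

#define PAGE_SHIFT         12   /* assumed: 4 KB pages */
#define SECTION_SIZE_BITS  27   /* 128 MB sections, as on x86-64 */
#define PFN_SECTION_SHIFT  (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION  (1UL << PFN_SECTION_SHIFT)

/* Index of the section that contains a given PFN. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

/* First PFN of a given section. */
static unsigned long section_nr_to_pfn(unsigned long nr)
{
    return nr << PFN_SECTION_SHIFT;
}

/* Position of a PFN inside its section. */
static unsigned long pfn_offset_in_section(unsigned long pfn)
{
    return pfn & (PAGES_PER_SECTION - 1);
}
```

Under these assumptions each section covers 32768 page frames, so PFN 32770 lies in section 1 at offset 2.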

#define __page_to_pfn(pg)                   \
({  const struct page *__pg = (pg);             \
    int __sec = page_to_section(__pg);          \
    (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
})

#define __pfn_to_page(pfn)              \
({  unsigned long __pfn = (pfn);            \
    struct mem_section *__sec = __pfn_to_section(__pfn);    \
    __section_mem_map_addr(__sec) + __pfn;      \
})

If CONFIG_SPARSEMEM_VMEMMAP is enabled (the default), converting between a PFN and a struct page * becomes trivial:

/* memmap is virtually contiguous.  */
#define __pfn_to_page(pfn)  (vmemmap + (pfn))
#define __page_to_pfn(page) (unsigned long)((page) - vmemmap)

1.3.1. __page_to_pfn

If CONFIG_SPARSEMEM is enabled but CONFIG_SPARSEMEM_VMEMMAP is not, SECTION_IN_PAGE_FLAGS is defined, meaning that the section number is stored in the flags field of struct page; page_to_section is implemented on top of that.

__nr_to_section is also simple to implement, but it requires knowing how the mem_section variable is defined:

#ifdef CONFIG_SPARSEMEM_EXTREME
struct mem_section *mem_section[NR_SECTION_ROOTS]
    ____cacheline_internodealigned_in_smp;
#else
struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
    ____cacheline_internodealigned_in_smp;
#endif
EXPORT_SYMBOL(mem_section);

Given a section index, a two-dimensional array access yields the corresponding section structure:

static inline struct mem_section *__nr_to_section(unsigned long nr)
{
    if (!mem_section[SECTION_NR_TO_ROOT(nr)])
        return NULL;
    return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}
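
To see the two-level lookup in isolation, here is a user-space sketch of the CONFIG_SPARSEMEM_EXTREME layout, where roots are allocated only on demand. The tiny sizes, the one-field struct mem_section, and the toy_ helper names are assumptions for illustration; calloc stands in for the kernel's early allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct mem_section { unsigned long section_mem_map; };

/* Toy sizes; the kernel derives these from PAGE_SIZE and MAX_PHYSMEM_BITS. */
#define SECTIONS_PER_ROOT  4UL
#define NR_SECTION_ROOTS   8UL
#define SECTION_ROOT_MASK  (SECTIONS_PER_ROOT - 1)
#define SECTION_NR_TO_ROOT(nr) ((nr) / SECTIONS_PER_ROOT)

/* First level: one pointer per root, NULL until the root is allocated. */
static struct mem_section *mem_section[NR_SECTION_ROOTS];

/* Allocate the root that covers section nr if it does not exist yet. */
static int toy_section_root_alloc(unsigned long nr)
{
    unsigned long root = SECTION_NR_TO_ROOT(nr);

    if (root >= NR_SECTION_ROOTS)
        return -1;
    if (!mem_section[root]) {
        mem_section[root] = calloc(SECTIONS_PER_ROOT,
                                   sizeof(struct mem_section));
        if (!mem_section[root])
            return -1;
    }
    return 0;
}

/* Second level: index into the root with the low bits of nr. */
static struct mem_section *toy_nr_to_section(unsigned long nr)
{
    unsigned long root = SECTION_NR_TO_ROOT(nr);

    if (root >= NR_SECTION_ROOTS || !mem_section[root])
        return NULL;
    return &mem_section[root][nr & SECTION_ROOT_MASK];
}
```

Allocating the root for section 9 also makes section 8 reachable, because both fall into root 2, while section 4 (root 1) still returns NULL.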

Once the (encoded) start address of the section's struct page array has been read from the section, subtracting it from the struct page * being looked up yields the PFN.

The kernel source documents __page_to_pfn with the following comment:

section's mem_map is encoded to reflect its start_pfn.

section[i].section_mem_map == mem_map's address - start_pfn;
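
The encoding is what lets the sparse macros add or subtract the raw PFN without first computing an offset inside the section. A minimal user-space sketch follows; it does the arithmetic on uintptr_t values to stay within defined behaviour, and the function names, the toy PAGES_PER_SECTION, and the one-field struct page are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

struct page { unsigned long flags; };

#define PAGES_PER_SECTION 8UL  /* toy value; far larger in a real kernel */

/* The struct page array of one section. */
static struct page section_map[PAGES_PER_SECTION];

/* Encode: store "mem_map - start_pfn" (in units of struct page). */
static uintptr_t encode_mem_map(struct page *mem_map, unsigned long start_pfn)
{
    return (uintptr_t)mem_map - start_pfn * sizeof(struct page);
}

/* PFN -> descriptor: encoded base plus the absolute PFN. */
static struct page *encoded_pfn_to_page(uintptr_t encoded, unsigned long pfn)
{
    return (struct page *)(encoded + pfn * sizeof(struct page));
}

/* Descriptor -> PFN: distance from the encoded base, in descriptors. */
static unsigned long encoded_page_to_pfn(uintptr_t encoded,
                                         const struct page *pg)
{
    return (unsigned long)(((uintptr_t)pg - encoded) / sizeof(struct page));
}
```

For a section that starts at PFN 24, the encoded base points 24 descriptors before section_map, so adding PFN 24 lands exactly on section_map[0].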

1.3.2. __pfn_to_page

Likewise, the PFN gives the index of the section that contains it; __nr_to_section then returns the section, and the struct page * start address stored in the section gives the struct page * for that PFN.

1.3.3. VMEMMAP

With CONFIG_SPARSEMEM_VMEMMAP enabled, all struct page objects live in one contiguous virtual address range that starts at VMEMMAP_START. On x86 this is defined in arch/x86/include/asm/pgtable_64_types.h as 0xffffea0000000000UL.
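
Because the vmemmap is one virtually contiguous array, the virtual address of a descriptor is a linear function of the PFN. The sketch below only computes addresses as integers; the 64-byte descriptor size is an assumption (the real sizeof(struct page) depends on the kernel configuration), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VMEMMAP_START     0xffffea0000000000ULL
#define STRUCT_PAGE_BYTES 64ULL   /* assumed sizeof(struct page) */

/* Virtual address of the descriptor for a PFN: vmemmap + pfn. */
static uint64_t vmemmap_page_addr(uint64_t pfn)
{
    return VMEMMAP_START + pfn * STRUCT_PAGE_BYTES;
}

/* Inverse: descriptor address back to its PFN. */
static uint64_t vmemmap_addr_to_pfn(uint64_t addr)
{
    return (addr - VMEMMAP_START) / STRUCT_PAGE_BYTES;
}
```

Under the 64-byte assumption, the descriptor for the page frame at physical address 4 GB (PFN 0x100000) sits 64 MB into the vmemmap region.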

Summary

The kernel's memory model exists to describe physical memory and to convert between physical pages and struct page *.

2. mem_section

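On a NUMA system using the sparse memory model, all physical memory is divided into memory sections, each described by a mem_section. The variable struct mem_section *mem_section[NR_SECTION_ROOTS], defined in mm/sparse.c (CONFIG_SPARSEMEM_EXTREME is enabled by default on x86), covers every memory section in the system.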

By definition, mem_section is an array of NR_SECTION_ROOTS pointers. With NR_SECTION_ROOTS = 2K and 8-byte pointers on x86-64, mem_section occupies 8 B * 2K = 16 KB.
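
The size of the root array can be checked with a little arithmetic. The sketch assumes 4 KB pages, MAX_PHYSMEM_BITS = 46, and a 16-byte struct mem_section; these are assumptions chosen to match the NR_SECTION_ROOTS = 2K figure above and the 512K sections mentioned in the sparse_init comment later on, and they vary with kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES    4096UL
#define MAX_PHYSMEM_BITS   46     /* assumed: x86-64, 4-level paging */
#define SECTION_SIZE_BITS  27     /* 128 MB sections */
#define MEM_SECTION_BYTES  16UL   /* assumed sizeof(struct mem_section) */

/* Number of sections needed to cover the whole physical address space. */
#define NR_MEM_SECTIONS    (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))
/* One root is one page full of struct mem_section objects. */
#define SECTIONS_PER_ROOT  (PAGE_SIZE_BYTES / MEM_SECTION_BYTES)
#define NR_SECTION_ROOTS   (NR_MEM_SECTIONS / SECTIONS_PER_ROOT)

/* Bytes used by the static array of root pointers. */
static unsigned long root_array_bytes(size_t pointer_size)
{
    return NR_SECTION_ROOTS * pointer_size;
}
```

With 8-byte pointers this gives 2048 roots and a 16 KB root array; only the roots that cover present memory are allocated later.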

The array is statically allocated, whether or not the corresponding memory sections exist.

mem_section is initialized by sparse_init, which is reached through the following call path:

start_kernel ( init/main.c )
- setup_arch ( arch/x86/kernel/setup.c )
  - x86_init.paging.pagetable_init = native_pagetable_init ( arch/x86/kernel/x86_init.c )
    - paging_init ( arch/x86/mm/init_64.c )
      - sparse_init ( mm/sparse.c )
      - zone_sizes_init ( arch/x86/mm/init.c ): initializes the max_zone_pfns array, which holds the highest PFN of each zone
        - free_area_init_nodes(max_zone_pfns) ( mm/page_alloc.c )
          - free_area_init_node ( mm/page_alloc.c )

First, consider paging_init in arch/x86/mm/init_64.c:

void __init paging_init(void)
{
    sparse_memory_present_with_active_regions(MAX_NUMNODES);
    sparse_init();

    /*
     * clear the default setting with node 0
     * note: don't use nodes_clear here, that is really clearing when
     *   numa support is not compiled in, and later node_set_state
     *   will not set it back.
     */  
    node_clear_state(0, N_MEMORY);
    if (N_MEMORY != N_NORMAL_MEMORY)
        node_clear_state(0, N_NORMAL_MEMORY);

    zone_sizes_init();
}

paging_init first calls sparse_memory_present_with_active_regions, which uses memory_present to record the page frames of every memory node in mem_section and allocates each mem_section root with a size of SECTIONS_PER_ROOT * sizeof(struct mem_section) (in the CONFIG_SPARSEMEM_EXTREME case). It then calls sparse_init to set the section_mem_map members again, and finally initializes the memory zones through zone_sizes_init.

Note that memory_present not only adds a node's page frames to mem_section, it also sets the section_mem_map member of each mem_section to (node ID << SECTION_NID_SHIFT | SECTION_MARKED_PRESENT).
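
The early encoding of section_mem_map can be sketched as simple bit packing. The shift value of 2 and the flag in bit 0 are assumptions about the kernel version discussed here, and the helper names are hypothetical.

```c
#include <assert.h>

#define SECTION_MARKED_PRESENT (1UL << 0)  /* assumed: bit 0 marks presence */
#define SECTION_NID_SHIFT      2           /* assumed: low bits hold flags */

/* Value stored in section_mem_map before the mem_map is allocated. */
static unsigned long early_encode_nid(int nid)
{
    return ((unsigned long)nid << SECTION_NID_SHIFT) | SECTION_MARKED_PRESENT;
}

/* Recover the node ID during early boot. */
static int early_decode_nid(unsigned long section_mem_map)
{
    return (int)(section_mem_map >> SECTION_NID_SHIFT);
}

/* Test the present marker. */
static int section_is_present(unsigned long section_mem_map)
{
    return (section_mem_map & SECTION_MARKED_PRESENT) != 0;
}
```

Packing the node ID into the same field is safe only because the field does not yet hold a mem_map address at this stage of boot.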

2.1. sparse_init

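sparse_init mainly sets the section_mem_map member of each mem_section, replacing the value stored by sparse_memory_present_with_active_regions with the mem_map address encoded against the section's start PFN, so that the pfn_to_page and page_to_pfn conversions introduced in Part 1 work correctly.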

void __init sparse_init(void)
{
    unsigned long pnum;
    struct page *map;
    unsigned long *usemap;
    unsigned long **usemap_map;
    int size;
/* defined by default */
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
    int size2;
    struct page **map_map;
#endif

    /* see include/linux/mmzone.h 'struct mem_section' definition */
    BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));

    /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
    set_pageblock_order();

    /*
     * map is using big page (aka 2M in x86 64 bit)
     * usemap is less one page (aka 24 bytes)
     * so alloc 2M (with 2M align) and 24 bytes in turn will
     * make next 2M slip to one more 2M later.
     * then in big system, the memory will have a lot of holes...
     * here try to allocate 2M pages continuously.
     *
     * powerpc need to call sparse_init_one_section right after each
     * sparse_early_mem_map_alloc, so allocate usemap_map at first.
     */
    /*
     * size = 8B * 512K = 4MB
     * space for one pointer per section
     */
    size = sizeof(unsigned long *) * NR_MEM_SECTIONS;

    /*
     * The kernel comment above says that allocating usemap and map
     * alternately would leave many holes in memory.
     * memblock_virt_alloc allocates the space from memblock.
     */
    usemap_map = memblock_virt_alloc(size, 0);
    if (!usemap_map)
        panic("can not allocate usemap_map\n");

    /* sparse_early_usemaps_alloc_node, from the given ... */
    alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
                            (void *)usemap_map);

#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
    size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
    map_map = memblock_virt_alloc(size2, 0);
    if (!map_map)
        panic("can not allocate map_map\n");
    alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
                            (void *)map_map);
#endif

    for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
        if (!present_section_nr(pnum))
            continue;

        usemap = usemap_map[pnum];
        if (!usemap)
            continue;

#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
        map = map_map[pnum];
#else
        map = sparse_early_mem_map_alloc(pnum);
#endif
        if (!map)
            continue;

        sparse_init_one_section(__nr_to_section(pnum), pnum, map,
                                usemap);
    }

    vmemmap_populate_print_last();

#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
    memblock_free_early(__pa(map_map), size2);
#endif
    memblock_free_early(__pa(usemap_map), size);
}

Most of the work in sparse_init is driven by alloc_usemap_and_memmap, which walks the mem_section array; the actual allocation is done by its alloc_func argument:

static void __init alloc_usemap_and_memmap(void (*alloc_func)
                    (void *, unsigned long, unsigned long,
                    unsigned long, int), void *data)
{
    unsigned long pnum;
    unsigned long map_count;
    int nodeid_begin = 0;
    unsigned long pnum_begin = 0;

    /* Walk the mem_section array to find the first section marked present */
    for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
        struct mem_section *ms;

        /* skip sections not marked present */
        if (!present_section_nr(pnum))
            continue;
        /*
         * Found a present section; use its index to get
         * the corresponding section pointer.
         */
        ms = __nr_to_section(pnum);
        nodeid_begin = sparse_early_nid(ms);
        pnum_begin = pnum;
        break;
    }
    map_count = 1;
    /*
     * Starting from that present section, call alloc_func once
     * for each run of sections that belong to the same node.
     */
    for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
        struct mem_section *ms;
        int nodeid;

        // skip sections that are not present
        if (!present_section_nr(pnum))
            continue;
        ms = __nr_to_section(pnum);
        nodeid = sparse_early_nid(ms);

        // same node as the starting section: increment the map count
        if (nodeid == nodeid_begin) {
            map_count++;
            continue;
        }
        /* ok, we need to take cake of from pnum_begin to pnum - 1*/
        alloc_func(data, pnum_begin, pnum,
                        map_count, nodeid_begin);
        /* new start, update count etc*/
        nodeid_begin = nodeid;
        pnum_begin = pnum;
        map_count = 1;
    }
    /* ok, last chunk */
    alloc_func(data, pnum_begin, NR_MEM_SECTIONS,
                        map_count, nodeid_begin);
}

sparse_init calls alloc_usemap_and_memmap twice. The alloc_func arguments are sparse_early_usemaps_alloc_node and sparse_early_mem_maps_alloc_node, and the data arguments are usemap_map, which holds unsigned long * entries, and map_map, which holds struct page * entries, respectively.
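
The grouping logic of alloc_usemap_and_memmap is easier to follow on a toy input. In this user-space sketch a small array gives the node ID of each section (-1 meaning not present), and a callback records every chunk it is handed; the array contents, struct chunk, and the function names are assumptions for illustration. Like the kernel loop, it assumes at least one section is present.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TOY_SECTIONS 8

/* Node ID of each section; -1 marks a section that is not present. */
static const int section_nid[NR_TOY_SECTIONS] = { -1, 0, 0, -1, 0, 1, 1, -1 };

struct chunk { unsigned long begin, end, count; int nid; };

static struct chunk chunks[NR_TOY_SECTIONS];
static size_t nr_chunks;

/* Callback standing in for alloc_func: just remember the arguments. */
static void record_chunk(unsigned long begin, unsigned long end,
                         unsigned long count, int nid)
{
    chunks[nr_chunks++] = (struct chunk){ begin, end, count, nid };
}

/* Call fn once per run of present sections that share a node. */
static void for_each_node_chunk(void (*fn)(unsigned long, unsigned long,
                                           unsigned long, int))
{
    unsigned long pnum, begin = 0, count;
    int nid_begin = 0;

    /* Find the first present section. */
    for (pnum = 0; pnum < NR_TOY_SECTIONS; pnum++) {
        if (section_nid[pnum] < 0)
            continue;
        nid_begin = section_nid[pnum];
        begin = pnum;
        break;
    }
    count = 1;
    for (pnum = begin + 1; pnum < NR_TOY_SECTIONS; pnum++) {
        if (section_nid[pnum] < 0)
            continue;
        if (section_nid[pnum] == nid_begin) {
            count++;
            continue;
        }
        /* Node changed: hand over [begin, pnum) and start a new run. */
        fn(begin, pnum, count, nid_begin);
        nid_begin = section_nid[pnum];
        begin = pnum;
        count = 1;
    }
    /* Last run extends to the end of the array. */
    fn(begin, NR_TOY_SECTIONS, count, nid_begin);
}
```

For the toy input, calling for_each_node_chunk(record_chunk) records two chunks: sections [1, 5) with three present sections on node 0, and sections [5, 8) with two present sections on node 1.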

2.1.1. sparse_early_usemaps_alloc_node

When the allocation function is sparse_early_usemaps_alloc_node, the data argument is usemap_map, an array of NR_MEM_SECTIONS unsigned long * entries.

static void __init sparse_early_usemaps_alloc_node(void *data,
                 unsigned long pnum_begin,
                 unsigned long pnum_end,
                 unsigned long usemap_count, int nodeid)
{
    void *usemap;
    unsigned long pnum;
    unsigned long **usemap_map = (unsigned long **)data;
    int size = usemap_size();
    /*
     * Allocate the required space from the memblock of the given node;
     * usemap_count is the number of present sections.
     */
    usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid),
                              size * usemap_count);
    if (!usemap) {
        printk(KERN_WARNING "%s: allocation failed\n", __func__);
        return;
    }

    for (pnum = pnum_begin; pnum < pnum_end; pnum++) {

        // skip sections that are not present
        if (!present_section_nr(pnum))
            continue;

        // point this section's usemap_map entry at its slice of usemap
        usemap_map[pnum] = usemap;
        usemap += size;
        check_usemap_section_nr(nodeid, usemap_map[pnum]);
    }
}
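
The pattern used here, one allocation for the whole node that is then carved into equal slices for each present section, can be sketched in user space as follows. malloc stands in for the memblock allocator, and the present[] array, the function name, and the sizes are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_TOY_SECTIONS 6

/* 1 = section present, 0 = hole (made-up layout). */
static const int present[NR_TOY_SECTIONS] = { 1, 0, 1, 1, 0, 1 };

/*
 * Allocate size * count bytes in one go and give every present section
 * in [begin, end) its own slice. Returns the base of the allocation,
 * or NULL on failure, in which case slot[] is left untouched.
 */
static char *slice_per_section(char **slot, unsigned long begin,
                               unsigned long end, unsigned long count,
                               size_t size)
{
    char *base = malloc(size * count);
    char *cur = base;

    if (!base)
        return NULL;
    for (unsigned long pnum = begin; pnum < end; pnum++) {
        if (!present[pnum])
            continue;          /* holes get no slice */
        slot[pnum] = cur;
        cur += size;
    }
    return base;
}
```

Allocating once per node keeps the slices adjacent, which is the point of the "allocate together" comment in sparse_init: alternating small and large allocations would fragment early memory.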

2.1.2. sparse_early_mem_maps_alloc_node

When the allocation function is sparse_early_mem_maps_alloc_node, the data argument is map_map, an array of NR_MEM_SECTIONS struct page * entries.

static void __init sparse_early_mem_maps_alloc_node(void *data,
                 unsigned long pnum_begin,
                 unsigned long pnum_end,
                 unsigned long map_count, int nodeid)
{
    struct page **map_map = (struct page **)data;
    sparse_mem_maps_populate_node(map_map, pnum_begin, pnum_end,
                     map_count, nodeid);
}

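One more configuration option is relevant to SPARSEMEM: CONFIG_SPARSEMEM_VMEMMAP, mentioned earlier. When it is enabled, all struct page objects in the system are placed in a contiguous virtual address range. Accordingly, sparse_mem_maps_populate_node has two definitions.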

2.1.2.1. non-vmemmap

In the non-vmemmap case, sparse_mem_maps_populate_node is defined in mm/sparse.c:

void __init sparse_mem_maps_populate_node(struct page **map_map,
                      unsigned long pnum_begin,
                      unsigned long pnum_end,
                      unsigned long map_count, int nodeid)
{
    void *map;
    unsigned long pnum;

    // size: memory needed to describe the page frames of one section
    unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;

    // on x86, alloc_remap returns NULL
    map = alloc_remap(nodeid, size * map_count);
    if (map) {
        for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
            if (!present_section_nr(pnum))
                continue;
            map_map[pnum] = map;
            map += size;
        }
        return;
    }

    size = PAGE_ALIGN(size);
    /*
     * Allocate from memblock enough memory to describe the page
     * frames of all sections in this node.
     */
    map = memblock_virt_alloc_try_nid(size * map_count,
                      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
                      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
    if (map) {
        for (pnum = pnum_begin; pnum < pnum_end; pnum++) {

            // skip sections that are not present
            if (!present_section_nr(pnum))
                continue;

            /*
             * Point the map_map entry at the struct page array for
             * this section, i.e. the address of the struct page for
             * the first page in the section.
             */
            map_map[pnum] = map;
            map += size;
        }
        return;
    }

    /* fallback */
    // the fallback retries the allocation one section at a time
    for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
        struct mem_section *ms;

        if (!present_section_nr(pnum))
            continue;
        map_map[pnum] = sparse_mem_map_populate(pnum, nodeid);
        if (map_map[pnum])
            continue;
        ms = __nr_to_section(pnum);
        printk(KERN_ERR "%s: sparsemem memory map backing failed "
            "some memory will not be available.\n", __func__);
        ms->section_mem_map = 0;
    }
}

As shown, in the non-vmemmap case each call to sparse_mem_maps_populate_node simply allocates the required space from memblock, so the memory allocated across calls is quite likely not contiguous.

2.1.2.2. vmemmap

With vmemmap configured, sparse_mem_maps_populate_node is defined in mm/sparse-vmemmap.c:

void __init sparse_mem_maps_populate_node(struct page **map_map,
                      unsigned long pnum_begin,
                      unsigned long pnum_end,
                      unsigned long map_count, int nodeid)
{
    unsigned long pnum;
    unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
    void *vmemmap_buf_start;

    /* PMD_SIZE = 2MB; align to PMD_SIZE. Open question: why align
       to 2MB? */

    size = ALIGN(size, PMD_SIZE);

    // allocate the required space from the memblock of the given node
    vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size * map_count,
             PMD_SIZE, __pa(MAX_DMA_ADDRESS));

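    /*
     * Allocation succeeded: record the start and end addresses.
     * vmemmap_buf supplies the memory needed while the page tables
     * are built below; whatever is left unused is freed once the
     * page tables are complete.
     */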
    if (vmemmap_buf_start) {
        vmemmap_buf = vmemmap_buf_start;
        vmemmap_buf_end = vmemmap_buf_start + size * map_count;
    }

    for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
        struct mem_section *ms;

        // skip sections that are not present
        if (!present_section_nr(pnum))
            continue;

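        /*
         * Build page tables for the struct page of every page frame
         * in the section. This uses the pfn_to_page macro defined in
         * Part 1, so the struct page * for every pfn in the section
         * lands in a contiguous virtual address range; the
         * struct page * of the first pfn in the section is returned
         * and stored in map_map.
         */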
        map_map[pnum] = sparse_mem_map_populate(pnum, nodeid);
        if (map_map[pnum])
            continue;
        ms = __nr_to_section(pnum);
        printk(KERN_ERR "%s: sparsemem memory map backing failed "
            "some memory will not be available.\n", __func__);
        ms->section_mem_map = 0;
    }

    // free the unused part of the vmemmap buffer
    if (vmemmap_buf_start) {
        /* need to free left buf */
        memblock_free_early(__pa(vmemmap_buf),
                    vmemmap_buf_end - vmemmap_buf);
        vmemmap_buf = NULL;
        vmemmap_buf_end = NULL;
    }
}

At this point the memory sections are fully initialized, and pfn_to_page and page_to_pfn can convert between PFNs and struct page *.

3. Summary

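All memory allocated while initializing mem_section comes from memblock, the allocator the kernel uses early in boot, before slab and the other memory allocators have been initialized.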

mem_section is the bridge for the struct page * and PFN conversions that memory management performs constantly.
