Category: LINUX

2014-02-11 14:30:10


The build_all_zonelists() function splits up memory according to the zone types ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Zones are linear separations of physical memory that are used mainly to address hardware limitations; suffice it to say that this is the function where these memory zones are built. After the zones are built, pages are stored in page frames that fall within zones.
The call to build_all_zonelists() introduces numnodes and NODE_DATA. The global variable numnodes holds the number of nodes (or partitions) of physical memory; the partitions are determined according to CPU access time. Note that, at this point, the page tables have already been fully set up.
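
The zone types themselves come from the zone_type enumeration in <linux/mmzone.h>. A simplified sketch for a 32-bit NUMA machine of this kernel generation (the exact members depend on Kconfig options such as CONFIG_ZONE_DMA and CONFIG_HIGHMEM):

/* Simplified from <linux/mmzone.h>; members vary with Kconfig. */
enum zone_type {
    ZONE_DMA,        /* low memory reachable by legacy DMA devices */
    ZONE_NORMAL,     /* memory the kernel maps permanently */
    ZONE_HIGHMEM,    /* memory beyond the kernel's direct mapping */
    ZONE_MOVABLE,    /* pseudo-zone for movable allocations */
    __MAX_NR_ZONES
};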

The task of the function is to establish a ranking order between the zones of the node currently being
processed and the zones of the other nodes in the system; memory is then allocated according to this
order. This is important when no memory is free in the desired zone of the desired node.


Let us look at an example in which the kernel wants to allocate high memory. It first attempts to find a
free segment of suitable size in the highmem area of the current node. If it fails, it looks at the regular
memory area of that node. If this also fails, it tries to perform the allocation in the DMA zone of the node.
If it cannot find a free area in any of the three local zones, it looks at other nodes. In this case, the
alternative node should be as close as possible to the primary node to minimize the performance loss
caused by accessing non-local memory.
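
To make this concrete, here is a hypothetical _zonerefs layout for node 0 of a two-node system whose closest neighbor is node 1 (the actual contents depend on the ordering policy and the node distances):

/*
 * Hypothetical zonelist for node 0 on a two-node system:
 *
 *   _zonerefs[0] = HIGHMEM, node 0   <- cheapest, tried first
 *   _zonerefs[1] = NORMAL,  node 0
 *   _zonerefs[2] = DMA,     node 0
 *   _zonerefs[3] = HIGHMEM, node 1   <- closest alternative node
 *   _zonerefs[4] = NORMAL,  node 1
 *   _zonerefs[5] = DMA,     node 1
 *   _zonerefs[6] = NULL              <- list terminator
 */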


The kernel defines a memory hierarchy and first tries to allocate "cheap" memory. If this fails, it gradually
tries to allocate memory that is "more costly" in terms of access and capacity.


The kernel also defines a ranking order among the alternative nodes, as seen by the current memory
node. This helps determine an alternative node when all zones of the current node are full.

The kernel uses an array of zonelist elements in pg_data_t to represent the described hierarchy as a data structure.
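
The underlying structures, abbreviated from <linux/mmzone.h> of this kernel generation:

struct zoneref {
    struct zone *zone;    /* pointer to the actual zone */
    int zone_idx;         /* cached zone_idx(zone) */
};

struct zonelist {
    struct zonelist_cache *zlcache_ptr;    /* NULL or &zlcache (NUMA) */
    struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
#ifdef CONFIG_NUMA
    struct zonelist_cache zlcache;         /* non-NULL on NUMA */
#endif
};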



/*
 * Called with zonelists_mutex held always
 * unless system_state == SYSTEM_BOOTING.
 */

void build_all_zonelists(void *data)
{
    set_zonelist_order();

    if (system_state == SYSTEM_BOOTING) {
        __build_all_zonelists(NULL);
        mminit_verify_zonelist();
        cpuset_init_current_mems_allowed();
    } else {
        /* we have to stop all cpus to guarantee there is no user
         of zonelist */

        stop_machine(__build_all_zonelists, data, NULL);
        /* cpuset refresh routine should be here */
    }
    vm_total_pages = nr_free_pagecache_pages();
    /*
     * Disable grouping by mobility if the number of pages in the
     * system is too low to allow the mechanism to work. It would be
     * more accurate, but expensive to check per-zone. This check is
     * made on memory-hotadd so a system can start with mobility
     * disabled and enable it later
     */

    if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
        page_group_by_mobility_disabled = 1;
    else
        page_group_by_mobility_disabled = 0;

    printk("Built %i zonelists in %s order, mobility grouping %s. "
        "Total pages: %ld\n",
            nr_online_nodes,
            zonelist_order_name[current_zonelist_order],
            page_group_by_mobility_disabled ? "off" : "on",
            vm_total_pages);
#ifdef CONFIG_NUMA
    printk("Policy zone: %s\n", zone_names[policy_zone]);
#endif
}
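
At boot, this function is reached from start_kernel() in init/main.c while system_state is still SYSTEM_BOOTING, so the first branch above is taken (trimmed excerpt):

/* init/main.c (trimmed): the boot-time call site */
asmlinkage void __init start_kernel(void)
{
    /* ... */
    build_all_zonelists(NULL);    /* system_state == SYSTEM_BOOTING */
    /* ... */
}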





/*
 * Global mutex to protect against size modification of zonelists
 * as well as to serialize pageset setup for the new populated zone.
 */

DEFINE_MUTEX(zonelists_mutex);

/* return values int ....just for stop_machine() */
static __init_refok int __build_all_zonelists(void *data)
{
    int nid;
    int cpu;

#ifdef CONFIG_NUMA
    /*
     * node_load is a global variable defined in mm/page_alloc.c:
     *
     *   #define MAX_NODE_LOAD (nr_online_nodes)
     *   static int node_load[MAX_NUMNODES];
     */
    memset(node_load, 0, sizeof(node_load));
#endif
    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        build_zonelists(pgdat);
        build_zonelist_cache(pgdat);
    }

#ifdef CONFIG_MEMORY_HOTPLUG
    /* Setup real pagesets for the new zone */
    if (data) {
        struct zone *zone = data;
        setup_zone_pageset(zone);
    }
#endif

    /*
     * Initialize the boot_pagesets that are going to be used
     * for bootstrapping processors. The real pagesets for
     * each zone will be allocated later when the per cpu
     * allocator is available.
     *
     * boot_pagesets are used also for bootstrapping offline
     * cpus if the system is already booted because the pagesets
     * are needed to initialize allocators on a specific cpu too.
     * F.e. the percpu allocator needs the page allocator which
     * needs the percpu allocator in order to allocate its pagesets
     * (a chicken-egg dilemma).
     */

    for_each_possible_cpu(cpu) {
        setup_pageset(&per_cpu(boot_pageset, cpu), 0);

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
        /*
         * We now know the "local memory node" for each node--
         * i.e., the node of the first zone in the generic zonelist.
         * Set up numa_mem percpu variable for on-line cpus. During
         * boot, only the boot cpu should be on-line; we'll init the
         * secondary cpus' numa_mem as they come on-line. During
         * node/memory hotplug, we'll fixup all on-line cpus.
         */

        if (cpu_online(cpu))
            set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
#endif
    }

    return 0;
}
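
The boot_pageset referenced above is a statically allocated per-CPU variable in mm/page_alloc.c:

/*
 * mm/page_alloc.c: static pagesets that carry the boot CPU (and,
 * later, onlining CPUs) until the real per-zone pagesets are
 * allocated, as the comment above explains.
 */
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);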




build_zonelists() operates on the zonelists embedded in struct pg_data_t. Why is this operation needed?
Because on a NUMA system each processor has its own node; when that node runs out of memory, the allocator can fall back to the zones of other nodes, in the order these lists prescribe.
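
Once built, every allocation walks these lists. A minimal sketch of a consumer, using the for_each_zone_zonelist() iterator from <linux/mmzone.h> (the gfp_mask and nid variables here are hypothetical context):

    struct zoneref *z;
    struct zone *zone;
    /* gfp_zone() maps allocation flags to the highest usable zone */
    enum zone_type high_zoneidx = gfp_zone(gfp_mask);
    struct zonelist *zonelist = &NODE_DATA(nid)->node_zonelists[0];

    /* visits zones in exactly the fallback order that
       build_zonelists() established for this node */
    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
        /* try to satisfy the allocation from 'zone' ... */
    }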

static void build_zonelists(pg_data_t *pgdat)
{
    int j, node, load;
    enum zone_type i;
    nodemask_t used_mask;
    int local_node, prev_node;
    struct zonelist *zonelist;
    int order = current_zonelist_order;

    /* initialize zonelists */
    for (i = 0; i < MAX_ZONELISTS; i++) {
        zonelist = pgdat->node_zonelists + i;
        zonelist->_zonerefs[0].zone = NULL;
        zonelist->_zonerefs[0].zone_idx = 0;
    }

    /* NUMA-aware ordering of nodes */
    local_node = pgdat->node_id;
    load = nr_online_nodes;
    prev_node = local_node;
    nodes_clear(used_mask);

    memset(node_order, 0, sizeof(node_order));
    j = 0;

    while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
        int distance = node_distance(local_node, node);

        /*
         * If another node is sufficiently far away then it is better
         * to reclaim pages in a zone before going off node.
         */

        if (distance > RECLAIM_DISTANCE)
            zone_reclaim_mode = 1;

        /*
         * We don't want to pressure a particular node.
         * So adding penalty to the first node in same
         * distance group to make it round-robin.
         */

        if (distance != node_distance(local_node, prev_node))
            node_load[node] = load;

        prev_node = node;
        load--;
        if (order == ZONELIST_ORDER_NODE)
            build_zonelists_in_node_order(pgdat, node);
        else
            node_order[j++] = node;    /* remember order */
    }

    if (order == ZONELIST_ORDER_ZONE) {
        /* calculate node order -- i.e., DMA last! */
        build_zonelists_in_zone_order(pgdat, j);
    }

    build_thisnode_zonelists(pgdat);
}
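
Of the helpers used above, build_zonelists_in_node_order() simply appends all zones of the given node behind whatever the list already contains; as it appears in kernels of this generation:

static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
{
    int j;
    struct zonelist *zonelist;

    zonelist = &pgdat->node_zonelists[0];
    /* skip to the current end of the list */
    for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
        ;
    /* append the node's zones, highest zone first */
    j = build_zonelists_node(NODE_DATA(node), zonelist, j,
                             MAX_NR_ZONES - 1);
    zonelist->_zonerefs[j].zone = NULL;
    zonelist->_zonerefs[j].zone_idx = 0;
}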



/* Construct the zonelist performance cache - see further mmzone.h */
static void build_zonelist_cache(pg_data_t *pgdat)
{
    struct zonelist *zonelist;
    struct zonelist_cache *zlc;
    struct zoneref *z;

    zonelist = &pgdat->node_zonelists[0];
    zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
    bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
    for (z = zonelist->_zonerefs; z->zone; z++)
        zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
}
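
For reference, the cache being filled here is (abbreviated from <linux/mmzone.h>):

struct zonelist_cache {
    unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];      /* zone->node */
    DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);  /* zone full? */
    unsigned long last_full_zap;    /* when last zap'd (jiffies) */
};

fullzones marks zones that were recently found to be full so the allocator can skip them cheaply, and z_to_n lets it reject zones on disallowed nodes without dereferencing struct zone.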

