Category: LINUX

2008-11-05 01:59:36

HugeTLB - Large Page Support in the Linux Kernel

By

Abstract

This article is meant to be a primer to the HugeTLB feature of the Linux kernel, which enables one to use virtual memory pages of large sizes. First, we will go through an introduction of large page support in the kernel, then we will see how to enable large pages and how to use large pages from the application. Finally, we will look into the internals of the large page support in the Linux kernel.

We will be using terms such as "huge pages", "large pages", "HugeTLB", etc. interchangeably in this article. This article covers large page support for x86 based architecture, although most of it is directly applicable to other architectures.

Introduction

From a memory management perspective, physical memory is divided into "frames" and virtual memory is divided into "pages". The memory management unit (MMU) translates virtual memory addresses to physical memory addresses. The information about which virtual page maps to which physical frame is kept in a data structure called the "page table". Page table lookups are costly. To avoid the performance hit of this lookup, most architectures maintain a fast lookup cache called the Translation Lookaside Buffer (TLB), which holds virtual-to-physical address mappings. Any virtual address that requires translation is first looked up in the TLB. When a valid translation is not present in the TLB, it is called a "TLB miss". On a TLB miss, the memory management unit has to walk the page tables to get the translation. This adds cost, so it is important to reduce the number of TLB misses.

On normal configurations of x86 based machines, the page size is 4 KB, but the hardware also supports larger pages. For example, 32-bit x86 machines (Pentium and later) support 2 MB and 4 MB pages. Other architectures such as IA64 support multiple page sizes. In the past Linux did not support large pages, but with the advent of the HugeTLB feature in the Linux kernel, applications can now benefit from them. Using large pages reduces TLB misses, because when the page size is large a single TLB entry spans a larger memory area. Applications with heavy memory demands, such as databases and HPC applications, can potentially benefit from this.
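To make the "single TLB entry spans a larger memory area" point concrete, here is a minimal sketch that computes how much memory a TLB can cover without a miss (often called its "reach"). The function name and the entry count used in the note below are illustrative assumptions, not figures for any particular processor.

```c
#include <assert.h>

/* TLB reach: the amount of memory that can be translated without a TLB
 * miss, assuming every TLB entry maps exactly one page of page_size bytes.
 * Hypothetical helper for illustration only. */
static unsigned long long tlb_reach_bytes(unsigned int entries,
                                          unsigned long long page_size)
{
    assert(page_size > 0);
    return (unsigned long long)entries * page_size;
}
```

With a hypothetical 64-entry TLB, tlb_reach_bytes(64, 4096) gives 256 KB, while tlb_reach_bytes(64, 4 * 1024 * 1024) gives 256 MB, which is why a workload with a large working set can see far fewer misses with 4 MB pages.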

Enabling Large Page Support

Support for large pages can be included in the Linux kernel by choosing CONFIG_HUGETLB_PAGE and CONFIG_HUGETLBFS during kernel configuration. On a machine which has HugeTLB enabled in the kernel, information about huge pages can be seen in /proc/meminfo. The following is an example taken from an AMD Sempron laptop running kernel 2.6.20.7 with HugeTLB enabled. The information about large pages is contained in the entries starting with the string "Huge".

#cat /proc/meminfo | grep Huge
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB

We have to tell the kernel how many large pages to reserve. This is done by echoing the number of large pages to the nr_hugepages entry under /proc/sys/vm. In the following example, we reserve a maximum of 4 large pages:

#echo 4 > /proc/sys/vm/nr_hugepages

Now the kernel will have allocated the necessary large pages (depending on the availability of memory). We can once again see the /proc/meminfo and confirm that the kernel has indeed allocated the large pages.

#cat /proc/meminfo | grep Huge
HugePages_Total: 4
HugePages_Free: 4
HugePages_Rsvd: 0
Hugepagesize: 4096 kB
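The same counters can be read from a program instead of with grep. The following is a minimal sketch, assuming the field names shown in the output above; the struct and function names are hypothetical. It takes an already opened stream so that it can be pointed at /proc/meminfo or at any file with the same layout.

```c
#include <stdio.h>

/* Hypothetical container for the "Huge*" fields of /proc/meminfo. */
struct hugepage_info {
    long total;      /* HugePages_Total */
    long free_pages; /* HugePages_Free */
    long rsvd;       /* HugePages_Rsvd */
    long size_kb;    /* Hugepagesize, in kB */
};

/* Parse the huge page lines from a stream formatted like /proc/meminfo.
 * Fields that are not found stay at -1. Returns the number of fields found. */
static int read_hugepage_info(FILE *fp, struct hugepage_info *info)
{
    char line[256];
    int found = 0;

    info->total = info->free_pages = info->rsvd = info->size_kb = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Each sscanf only succeeds on the line carrying its own label. */
        if (sscanf(line, "HugePages_Total: %ld", &info->total) == 1 ||
            sscanf(line, "HugePages_Free: %ld", &info->free_pages) == 1 ||
            sscanf(line, "HugePages_Rsvd: %ld", &info->rsvd) == 1 ||
            sscanf(line, "Hugepagesize: %ld kB", &info->size_kb) == 1)
            found++;
    }
    return found;
}
```

A caller would open /proc/meminfo with fopen, pass the stream to read_hugepage_info, and multiply total by size_kb to get the size of the pool in kB.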

We can also reserve HugeTLB pages at boot time by passing the "hugepages=" parameter on the kernel command line, or use 'sysctl' to set the number of large pages.
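The echo shown earlier can also be done from a program. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the path is passed in by the caller (on a real system it would be /proc/sys/vm/nr_hugepages, and the caller would need root privileges). It reads the value back after writing, because the kernel may reserve fewer pages than requested when memory is short.

```c
#include <stdio.h>

/* Ask for `count` huge pages by writing to a nr_hugepages-style file, then
 * read the value back. Returns the count the file reports afterwards, or
 * -1 on any error. Hypothetical helper for illustration only. */
static long request_hugepages(const char *path, long count)
{
    FILE *fp = fopen(path, "w");
    long granted = -1;

    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%ld\n", count) < 0) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)   /* buffered write errors surface here */
        return -1;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &granted) != 1)
        granted = -1;
    fclose(fp);
    return granted;
}
```

Comparing the return value with the requested count tells the caller whether the whole reservation succeeded.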

How to Use Large Pages?

An application can make use of large pages in two ways. One is by using a special shared memory region and the other is by mmaping files from the hugetlb filesystem. If we want a private HugeTLB mapping, mmaping files from the hugetlb filesystem is the recommended technique. In this article we will concentrate on large page support via shared memory. We will see here how an application can use an array which is mapped into large pages.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MB_1 (1024*1024)
#define MB_8 (8*MB_1)

char *a;      /* start of the HugeTLB-backed array */
int shmid1;   /* id of the shared memory segment */

/* Create an 8 MB HugeTLB shared memory segment and attach it. */
void init_hugetlb_seg(void)
{
    shmid1 = shmget(2, MB_8, SHM_HUGETLB
                    | IPC_CREAT | SHM_R
                    | SHM_W);
    if ( shmid1 < 0 ) {
        perror("shmget");
        exit(1);
    }
    printf("HugeTLB shmid: 0x%x\n", shmid1);
    a = shmat(shmid1, 0, 0);
    if (a == (char *)-1) {
        perror("Shared memory attach failure");
        shmctl(shmid1, IPC_RMID, NULL);
        exit(2);
    }
}

/* Touch every byte, which faults in the large pages. */
void wr_to_array(void)
{
    int i;
    for( i=0 ; i<MB_8 ; i++) {
        a[i] = 'A';
    }
}

/* Verify that every byte written above can be read back. */
void rd_from_array(void)
{
    int i, count = 0;
    for( i=0 ; i<MB_8 ; i++)
        if (a[i] == 'A') count++;
    if (count==i)
        printf("HugeTLB read success :-)\n");
    else
        printf("HugeTLB read failed :-(\n");
}

int main(int argc, char *argv[])
{
    init_hugetlb_seg();
    printf("HugeTLB memory segment initialized !\n");
    printf("Press any key to write to memory area\n");
    getchar();
    wr_to_array();
    printf("Press any key to rd from memory area\n");
    getchar();
    rd_from_array();
    shmctl(shmid1, IPC_RMID, NULL);   /* mark the segment for removal */
    return 0;
}

The above program is just like any other program which uses shared memory. First, we initialize the shared memory segment with an additional flag SHM_HUGETLB for getting large page-based shared memory. Then we attach the shared memory segment to the program. Following this, we write to the shared memory area in the function call 'wr_to_array'. And finally we verify whether the data has been written properly by reading back the data in the function 'rd_from_array'.
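For comparison, the other approach mentioned earlier, mmaping a file from the hugetlb filesystem, might look roughly like the sketch below. It assumes a hugetlbfs is already mounted somewhere; the helper names are hypothetical, and the caller supplies both the file path and the huge page size (which could be taken from the Hugepagesize line in /proc/meminfo).

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a multiple of the huge page size, since hugetlbfs
 * mappings are sized in whole huge pages. hpage_size must be a power
 * of two. Hypothetical helper. */
static size_t round_up_hugepage(size_t len, size_t hpage_size)
{
    return (len + hpage_size - 1) & ~(hpage_size - 1);
}

/* Map at least `len` bytes backed by a file on a mounted hugetlbfs.
 * Returns MAP_FAILED on error. Hypothetical helper. */
static void *map_hugetlbfs_file(const char *path, size_t len,
                                size_t hpage_size)
{
    void *addr;
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return MAP_FAILED;
    addr = mmap(NULL, round_up_hugepage(len, hpage_size),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping remains valid after the descriptor is closed */
    return addr;
}
```

If hugetlbfs were mounted at a location such as /mnt/huge (an assumed mount point), a call like map_hugetlbfs_file("/mnt/huge/array", 8 * 1024 * 1024, 4 * 1024 * 1024) would play the role of shmget plus shmat in the program above. Using MAP_PRIVATE instead of MAP_SHARED would give the private mapping mentioned earlier.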

Example program execution - using large pages

Now let us compile the program and run it.

#cc hugetlb-array.c -o hugetlb-array -Wall
#./hugetlb-array
HugeTLB shmid: 0x40000
HugeTLB memory segment initialized !
Press any key to write to memory area

At this point, if we check the status of the HugeTLB pages in /proc/meminfo, it will show that 2 pages, i.e. 8 MB of memory, are reserved. All the large pages will still be shown as free, as we have not yet started using the memory area.

#cat /proc/meminfo | grep Huge
HugePages_Total: 4
HugePages_Free: 4
HugePages_Rsvd: 2
Hugepagesize: 4096 kB

Press a key at the program prompt, which causes the program to write to the allocated HugeTLB memory. The memory segment which was allocated is now actually in use, and this moves the 2 large pages to the allocated state. We can see this in /proc/meminfo, where HugePages_Free now shows only 2.

#cat /proc/meminfo | grep Huge
HugePages_Total: 4
HugePages_Free: 2
HugePages_Rsvd: 0
Hugepagesize: 4096 kB

The following message will now appear:

Press any key to rd from memory area

Finally when we press a key at the program input, the program will check whether the data which was written is indeed present in the HugeTLB area. If everything goes fine we will get a hugetlb smiley.

HugeTLB read success :-)

Internals of large page support

Inside the Linux kernel, large page support is implemented in two parts. The first part consists of a global pool of large pages which are allocated and kept reserved for providing large page support to applications. The global pool is built by allocating physically contiguous pages (of large page size) using the normal kernel memory allocation APIs. The second part consists of the kernel itself allocating large pages from this pool to applications that request them.

We will first see how the large pages are initialized and how the global pool is filled. Then we will see how an application can use shared memory to leverage the large pages and how the physical pages actually get allocated by means of a page fault. We will not perform a line-by-line code walkthrough; instead we will go through the main parts of the code relevant to large pages.

Large Page initialization

In the Linux kernel source code (in the file mm/hugetlb.c), the function "hugetlb_init" allocates multiple physically contiguous pages of normal page size to form clusters of pages which can be used as large pages. The number of large pages allocated this way depends on the value of the "max_huge_pages" variable, which can be passed as a kernel command line option using the 'hugepages' parameter. The large page size depends on the macro HUGETLB_PAGE_ORDER, which in turn depends on the HPAGE_SHIFT macro. For example, HPAGE_SHIFT is assigned the value 22 on x86 when PAE is not enabled, which means that each large page is 2^22 bytes, i.e. 4 MB. Note that the large page size depends on the architecture and the page sizes it supports.
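The arithmetic behind these macros can be sketched in user space as follows. This is an illustration, not kernel code: the function names are hypothetical, and it assumes that the order is the difference between the large page shift and the normal page shift (12 for the 4 KB pages discussed earlier).

```c
#include <assert.h>

/* Size in bytes of a page whose shift is `shift` (e.g. 22 -> 4 MB). */
static unsigned long page_size_from_shift(unsigned int shift)
{
    assert(shift < 8 * sizeof(unsigned long));
    return 1UL << shift;
}

/* Allocation order of a large page: the large page is made of
 * 2^order physically contiguous normal pages. */
static unsigned int large_page_order(unsigned int hpage_shift,
                                     unsigned int page_shift)
{
    assert(hpage_shift >= page_shift);
    return hpage_shift - page_shift;
}
```

With a large page shift of 22 and a normal page shift of 12, the order is 10, so one 4 MB large page is built from 1024 contiguous 4 KB pages. A shift of 21 would correspond to the 2 MB page size mentioned earlier, with an order of 9.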

The pages allocated as mentioned previously are enqueued into "hugepage_freelists" for the respective node, where the page is allocated from, by the function 'enqueue_huge_page'. Each memory node (in case of NUMA) will have one hugepage_freelists. When the large pages are allocated dynamically as in the example (by echoing the value to proc) or by other dynamic methods, a similar sequence of events occurs, as explained during the static allocation of large pages.
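The idea of one free list per memory node can be modelled in a few lines of user-space C. This is a simplified sketch under stated assumptions, not the kernel's data structures: the type and function names are hypothetical, MAX_NODES is an arbitrary value, and the fallback to other nodes when the preferred node is empty is an assumed policy chosen for the example.

```c
#include <stddef.h>

#define MAX_NODES 4   /* arbitrary node count for this model */

struct huge_page {
    int node;                  /* memory node the page was allocated from */
    struct huge_page *next;    /* next free page on the same node */
};

struct hugepage_pool {
    struct huge_page *freelists[MAX_NODES];  /* one free list per node */
    unsigned long free_pages;                /* pages on all lists combined */
};

/* Put a page on the free list of the node it belongs to. */
static void pool_enqueue(struct hugepage_pool *pool, struct huge_page *page)
{
    page->next = pool->freelists[page->node];
    pool->freelists[page->node] = page;
    pool->free_pages++;
}

/* Take a free page, preferring `node` and falling back to the other
 * nodes in order. Returns NULL when the pool is empty. */
static struct huge_page *pool_dequeue(struct hugepage_pool *pool, int node)
{
    for (int i = 0; i < MAX_NODES; i++) {
        int n = (node + i) % MAX_NODES;
        struct huge_page *page = pool->freelists[n];

        if (page != NULL) {
            pool->freelists[n] = page->next;
            page->next = NULL;
            pool->free_pages--;
            return page;
        }
    }
    return NULL;
}
```

In this model, filling the pool at boot or through nr_hugepages corresponds to repeated pool_enqueue calls, and the free_pages counter plays the role of the HugePages_Free figure.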

In order to use a shared memory area, we first have to create it. This, as we have seen before, is done by the 'shmget' system call. This system call invokes the kernel function 'sys_shmget', which in turn calls 'newseg'. In 'newseg' a check is made to see whether the user has asked for the creation of a HugeTLB shared memory area. If the user has specified the large page flag SHM_HUGETLB, then the file operations corresponding to this file structure are set to 'hugetlbfs_file_operations'. The large pages get reserved by the function 'hugetlb_reserve_pages', which increments the reserved pages count, resv_huge_pages, which shows up as 'HugePages_Rsvd' in /proc/meminfo.
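The way the counters moved in the example run (Rsvd going to 2 after shmget, then Free dropping to 2 and Rsvd returning to 0 after the write) can be reproduced with a small user-space model. The struct and function names here are hypothetical, and the rule that a reservation succeeds only when enough unpromised free pages remain is an assumption made for the sketch rather than a statement about the kernel's exact checks.

```c
/* Hypothetical model of the three counters shown in /proc/meminfo. */
struct hugepage_counters {
    long total;       /* HugePages_Total */
    long free_pages;  /* HugePages_Free */
    long rsvd;        /* HugePages_Rsvd */
};

/* Promise n pages to a segment without allocating them yet.
 * Returns 0 on success, -1 if not enough unreserved free pages remain. */
static int model_reserve(struct hugepage_counters *c, long n)
{
    if (n < 0 || c->free_pages - c->rsvd < n)
        return -1;
    c->rsvd += n;
    return 0;
}

/* First touch of a reserved page: one free page is consumed and one
 * reservation is used up. Returns 0 on success, -1 otherwise. */
static int model_fault_reserved(struct hugepage_counters *c)
{
    if (c->rsvd <= 0 || c->free_pages <= 0)
        return -1;
    c->free_pages--;
    c->rsvd--;
    return 0;
}
```

Starting from total = 4, free_pages = 4, rsvd = 0, a call to model_reserve for 2 pages followed by two calls to model_fault_reserved walks through the same three states that the example run printed.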

When the system call 'sys_shmat' is made, address alignment check and other sanity checks are done by using 'hugetlb_get_unmapped_area' function.

Large page fault and physical page allocation

When a page fault occurs, the "vma" which corresponds to the faulting address is found. A vma which corresponds to a hugetlb shared memory region has the 'VM_HUGETLB' flag set in 'vma->vm_flags', which is detected by calling 'is_vm_hugetlb_page'. When a hugetlb vma is found, the 'hugetlb_fault' function is called. This function sets up the large page flag in the page directory entry and then allocates a huge page, using copy-on-write logic, from the global pool of large pages initialized previously. The large page size itself is communicated to the hardware by setting the _PAGE_PSE flag in the pgd entry (bit 7, counting from bit 0, on x86 without PAE).
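The bit manipulation described here is small enough to show directly. The sketch below works on a plain 32-bit integer standing in for a page directory entry; the constant and function names are hypothetical and only the bit position (bit 7) is taken from the text above.

```c
#include <stdint.h>

#define LARGE_PAGE_BIT 7u   /* position of the page size flag, per the text */

/* Return the entry with the large page flag set. */
static uint32_t entry_set_large(uint32_t entry)
{
    return entry | (UINT32_C(1) << LARGE_PAGE_BIT);
}

/* Return 1 if the entry is marked as mapping a large page, else 0. */
static int entry_is_large(uint32_t entry)
{
    return (int)((entry >> LARGE_PAGE_BIT) & 1u);
}
```

Setting bit 7 corresponds to OR-ing the entry with 0x80, so an entry of 0 becomes 0x80 and entry_is_large then reports 1 for it.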

Where to go from here?

Detailed documentation with advanced examples can be found in the file Documentation/vm/hugetlbpage.txt which comes with Linux kernel source code.

The HugeTLB feature inside the kernel is not application transparent, in the sense that we need to explicitly make modifications (i.e. have to insert code which uses shared memory or HugeTLB fs) to the application to make use of large pages. For folks who are interested in application transparent implementations of large page support, an internet search for "Transparent superpages" will get you to Web sites containing details of such implementations.

Links

  1. Improving enterprise database performance on Linux:
  2. TLB wikipedia entry:
  3. HugeTLB kernel documentation link from kernel source online: http://lxr.linux.no/source/Documentation/vm/hugetlbpage.txt

Conclusion

We have seen how the Linux kernel provides applications with the ability to use large pages. We went through methods to enable and use large pages. After that we skimmed through the internals of the HugeTLB implementation inside the kernel.

Acknowledgements

I would like to extend my sincere thanks to Kenneth Chen for giving me better insights into HugeTLB code, for answering my questions with patience and for the review of an initial draft of this article. I would also like to thank Pramode Sir, Badri, Malay, Shijesta and Chitkala for review and feedback.

