
Category: LINUX

2011-10-07 21:31:34


1 Basic concepts

 

What is tmem?

From the perspective of an operating system, tmem is fast pseudo-RAM of indeterminate and varying size that is useful primarily when real RAM is in short supply and is accessible only via a somewhat quirky copy-based interface.

 

More formally, Transcendent Memory is both: (a) a collection of idle physical memory in a system and (b) an API for providing indirect access to that memory. A tmem host (such as a hypervisor in a virtualized system) maintains and manages one or more tmem pools of physical memory. One or more tmem clients (such as a guest OS in a virtualized system) can access this memory only indirectly via a well-defined tmem API which imposes a carefully-crafted set of rules and restrictions. Through proper use of the tmem API, a tmem client may utilize a tmem pool as an extension to its memory, thus reducing disk I/O and improving performance.

 

Tmem can be thought of as an extra layer of cache.

From the perspective of the Linux kernel, tmem can be thought of as somewhere between a somewhat slow memory device and a very fast disk device.

 

WHERE DOES THE MEMORY FOR TRANSCENDENT MEMORY COME FROM?

Idle memory, for example: (1) in a native OS, the clean page cache; (2) in a VM environment, reclaimed hypervisor fallow memory plus wasted guest memory reclaimed from the guests (e.g. via self-ballooning).

 

Tmem frontends

Collectively these sources of suitable data for tmem can be referred to as "frontends" for tmem. For the Linux kernel, the current implementations are cleancache and frontswap.

 

Tmem backends

There are multiple implementations of tmem which store data using different methods. We can refer to these data stores as "backends" for tmem. The current implementations are "Xen tmem" and zcache.

 

Drawbacks of tmem

(1)    There is some overhead, which may result in a small negative performance impact on some workloads.

(2)    OS change -- paravirtualization -- is required, though the changes are surprisingly small and non-invasive.

(3)    And tmem doesn't help at all if all of main memory is truly in use (i.e. the sum of the working set of all active virtual machines exceeds the size of physical memory). But we believe the benefits of tmem will greatly outweigh these costs.

(4)    Ballooning+tmem quickly fragments all Xen RAM

2 How the kernel talks to transcendent memory

In some cases the tmem interface is completely internal to the kernel and is thus an "API"; in other cases it defines the boundary between two independent software components (e.g. Xen and a guest Linux kernel) and so is properly called an "ABI".

 

Pool

Tmem organizes related chunks of data in a pool; within a pool, the kernel chooses a unique "handle" to represent the equivalent of an address for the chunk of data. When the kernel requests the creation of a pool, it specifies certain attributes to be described below. If pool creation is successful, tmem provides a "pool id". Handles are unique within pools, not across pools, and consist of a 192-bit "object id" and a 32-bit "index." The rough equivalent of an object is a "file" and the index is the rough equivalent of a page offset into the file.
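To make the addressing scheme above concrete, here is a minimal C sketch of what a handle looks like conceptually. The struct and field names (tmem_oid, tmem_handle) are hypothetical, chosen only for illustration; they are not the kernel's actual definitions.

#include <stdint.h>

/* Hypothetical model of a tmem handle: within a pool, a chunk of
 * data is addressed by a 192-bit object id (the rough equivalent
 * of a file) plus a 32-bit index (the rough equivalent of a page
 * offset within that file). */
struct tmem_oid {
    uint64_t oid[3];        /* 192-bit object id */
};

struct tmem_handle {
    int32_t pool_id;        /* returned by tmem when the pool is created */
    struct tmem_oid oid;    /* unique per object within the pool */
    uint32_t index;         /* page offset within the object */
};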

 

Get and put

The two basic operations of tmem are "put" and "get". If the kernel wishes to save a chunk of data in tmem, it uses the "put" operation, providing a pool id, a handle, and the location of the data; if the put returns success, tmem has copied the data. If the kernel wishes to retrieve data, it uses the "get" operation and provides the pool id, the handle, and a location for tmem to place the data; if the get succeeds, on return, the data will be present at the specified location.
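A minimal sketch of what the two operations look like from the caller's side follows. The names and signatures (tmem_put, tmem_get, save_page, PAGE_SIZE of 4096) are assumptions for illustration; the real interface differs per frontend and backend. The key points it models are that both calls are synchronous and copy-based, and that a put may be refused.

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096   /* assumed page size for this sketch */

/* Returns 0 on success (tmem has copied the page), nonzero if tmem
 * rejected the put (the caller must then fall back, e.g. to disk). */
int tmem_put(int32_t pool_id, const uint64_t oid[3], uint32_t index,
             const void *page /* PAGE_SIZE bytes */);

/* Returns 0 on success (the page data is now at *page), nonzero if
 * tmem no longer has the page (possible for ephemeral pools). */
int tmem_get(int32_t pool_id, const uint64_t oid[3], uint32_t index,
             void *page /* PAGE_SIZE bytes */);

/* Example: try to save a page, remembering whether tmem took it. */
static int save_page(int32_t pool, const uint64_t oid[3], uint32_t idx,
                     const void *data, int *in_tmem)
{
    *in_tmem = (tmem_put(pool, oid, idx, data) == 0);
    return *in_tmem ? 0 : -1;   /* caller falls back to disk on -1 */
}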

 

Ephemeral and persistent

There are two basic pool types: ephemeral and persistent. Pages successfully put to an ephemeral pool may or may not be present later when the kernel uses a subsequent get with a matching handle. Pages successfully put to a persistent pool are guaranteed to be present for a subsequent get. (Additionally, a pool may be "shared" or "private".)
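The pool attributes can be pictured as flags supplied at pool-creation time. The flag names and the tmem_new_pool signature below are made up for illustration, not the actual interface.

#include <stdint.h>

/* Hypothetical pool-creation flags.  Ephemeral pools may silently
 * drop pages; persistent pools guarantee a later get will succeed.
 * Independently, a pool may be private to one client or shared. */
#define TMEM_POOL_PERSISTENT  0x1   /* absent => ephemeral */
#define TMEM_POOL_SHARED      0x2   /* absent => private */

/* Returns a pool id (>= 0) on success, or a negative error code.
 * For shared pools a uuid (modeled here as 128 bits) identifies the
 * pool across clients, e.g. guests sharing a cluster filesystem. */
int32_t tmem_new_pool(uint32_t flags, const uint64_t uuid[2]);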

 

3: Transcendent memory frontends: frontswap and cleancache

cleancache handles (clean) mapped pages that would otherwise be reclaimed by the kernel; frontswap handles (dirty) anonymous pages that would otherwise be swapped out by the kernel. When a successful cleancache_get happens, a disk read has been avoided. When a successful frontswap_put (or get) happens, a swap device write (or read) has been avoided.

Frontswap

The following explains it very clearly:

Frontswap (originally named preswap) essentially provides a layer in the swap subsystem between the swap cache and the disk.

 

With frontswap, whenever a page needs to be swapped out the swap subsystem asks tmem if it is willing to take the page of data. If tmem rejects it, the swap subsystem writes the page, as normal, to the swap device. If tmem accepts it, the swap subsystem can request the page of data back at any time and it is guaranteed to be retrievable from tmem. And, later, if the swap subsystem is certain the data is no longer valid (e.g. if the owning process has exited), it can flush the page of data from tmem. Note that tmem can reject any or every frontswap "put".
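The decision flow in that paragraph can be summarized in a short sketch. The helpers frontswap_store, swap_device_write, and swap_out_page are hypothetical stand-ins for the real swap-subsystem code, used only to show the ordering.

#include <stdbool.h>

struct page;                                  /* opaque for this sketch */
bool frontswap_store(struct page *page);      /* tmem may accept or reject */
void swap_device_write(struct page *page);    /* normal path: write to disk */

/* Sketch of the swap-out path with frontswap enabled: offer the
 * page to tmem first; only if tmem rejects it does the page go to
 * the physical swap device. */
void swap_out_page(struct page *page)
{
    if (frontswap_store(page))
        return;                 /* tmem took it; guaranteed retrievable */
    swap_device_write(page);    /* fall back to the swap device */
}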

The frontswap patchset is non-invasive and does not impact the behavior of the swap subsystem at all when frontswap is disabled.

A few implementation notes: Frontswap requires one bit of metadata per page of enabled swap. (The Linux swap subsystem until recently required 16 bits, and now requires eight bits, of metadata per page, so frontswap increases this by 12.5%.) This bit-per-page records whether the page is in tmem or on the physical swap device. Since, at any time, some pages may be in frontswap and some on the physical device, the swap subsystem "swapoff" code also requires some modification. And because in-use tmem is more valuable than swap device space, some additional modifications are provided by frontswap so that a "partial swapoff" can be performed. And, of course, hooks are added at the read-page and write-page routines to divert data into tmem, and a hook is added to flush the data when it is no longer needed. All told, the patch parts that affect core kernel components add up to less than 100 lines.
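The bit-per-page bookkeeping can be modeled with an ordinary bitmap, one bit per swap slot, recording whether that slot's data currently lives in tmem or on the swap device. This is an illustrative userspace model (frontswap_map and friends are invented names), not the kernel's actual swap metadata code.

#include <stddef.h>
#include <stdlib.h>

/* One bit per page of enabled swap: set => the slot's data is in
 * frontswap (tmem), clear => it is on the physical swap device. */
struct frontswap_map {
    unsigned long *bits;
    size_t nr_slots;
};

static struct frontswap_map *frontswap_map_alloc(size_t nr_slots)
{
    struct frontswap_map *m = malloc(sizeof(*m));
    size_t bits_per_word = 8 * sizeof(unsigned long);
    size_t nwords = (nr_slots + bits_per_word - 1) / bits_per_word;

    m->bits = calloc(nwords, sizeof(unsigned long));
    m->nr_slots = nr_slots;
    return m;
}

static void frontswap_set(struct frontswap_map *m, size_t slot)
{
    m->bits[slot / (8 * sizeof(unsigned long))] |=
        1UL << (slot % (8 * sizeof(unsigned long)));
}

static int frontswap_test(const struct frontswap_map *m, size_t slot)
{
    return (m->bits[slot / (8 * sizeof(unsigned long))] >>
            (slot % (8 * sizeof(unsigned long)))) & 1;
}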

 

In the VM case

Frontswap essentially acts as an emergency memory safety valve, again using memory that is owned and managed by the hypervisor. Instead of swapping to disk, which as we saw can be extremely slow, you are swapping instead to much faster hypervisor memory.

 

Cleancache

 

Cleancache makes use of reclaimed pages

Cleancache allows tmem to be used to store clean page cache pages resulting in fewer refaults. When the kernel reclaims a page, rather than discard the data, it places the data into tmem, tagged as "ephemeral", which means that the page of data may be discarded if tmem chooses. Later, if the kernel determines it needs that page of data after all, it asks tmem to give it back. If tmem has retained the page, it gives it back; if tmem hasn't retained the page, the kernel proceeds with the refault, fetching the data from the disk as usual.
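Both sides of that flow can be sketched as follows. The helpers cleancache_store, cleancache_load, and read_page_from_disk are hypothetical names used only to show where tmem sits relative to reclaim and refault.

#include <stdbool.h>

struct page;                                   /* opaque for this sketch */
void cleancache_store(struct page *page);      /* ephemeral put; may be dropped */
bool cleancache_load(struct page *page);       /* true if tmem still had it */
void read_page_from_disk(struct page *page);   /* normal refault path */

/* Reclaim side: the data is clean, so it can simply be handed to
 * tmem before the page frame is reused. */
void reclaim_clean_page(struct page *page)
{
    cleancache_store(page);     /* tmem may keep or silently discard it */
}

/* Refault side: ask tmem first; fall back to a real disk read. */
void refault_page(struct page *page)
{
    if (cleancache_load(page))
        return;                 /* disk read avoided */
    read_page_from_disk(page);
}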

To function properly, cleancache "hooks" are placed where pages are reclaimed and where the refault occurs. The kernel is also responsible for ensuring coherency between the page cache, disk, and tmem, so hooks are also present wherever the kernel might invalidate the data. Since cleancache affects the kernel's VFS layer, and since not all filesystems use all VFS features, a filesystem must "opt in" to use cleancache when it is mounted.
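The set of hook points a backend has to cover can be summarized as an ops table. The cleancache_hooks struct below is an approximation for illustration; the real kernel ops table and its member names differ across versions.

#include <stdint.h>

struct page;   /* opaque for this sketch */

/* Hypothetical sketch of the hook points a cleancache backend covers.
 * The invalidate hooks keep tmem coherent with the page cache and
 * disk; init_fs is the per-mount "opt in". */
struct cleancache_hooks {
    int  (*init_fs)(size_t pagesize);                    /* called at mount */
    void (*put_page)(int pool, uint64_t oid[3],
                     uint32_t index, struct page *p);    /* on reclaim */
    int  (*get_page)(int pool, uint64_t oid[3],
                     uint32_t index, struct page *p);    /* on refault */
    void (*invalidate_page)(int pool, uint64_t oid[3],
                            uint32_t index);             /* page overwritten */
    void (*invalidate_inode)(int pool, uint64_t oid[3]); /* file truncated/deleted */
    void (*invalidate_fs)(int pool);                     /* filesystem unmounted */
};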

 

In the VM case

From the kernel's point of view, it is "put"ting clean pages that it would otherwise have to evict into what is effectively a second-chance cache that resides in "special" RAM owned and managed by the hypervisor.

4: Transcendent memory backends

Currently only one backend may be configured, though in the future some form of layering may be possible. Note that a tmem backend must perform its functions fully synchronously: it must not sleep, and the scheduler may not be called. When a "put" completes, the kernel's page of data has been copied; a successful "get" may not complete until the page of data has been copied to the kernel's data page.

Zcache

zcache combines an in-kernel implementation of tmem with in-kernel compression code to reduce the space requirements for data provided through a tmem frontend.

Zcache uses the in-kernel lzo1x routines to compress/decompress the data contained in the pages. Space for persistent pages is obtained through a shim to xvmalloc, a memory allocator in the zram staging driver designed to store compressed pages. Space for ephemeral pages is obtained through standard kernel get_free_page() calls, then pairs of compressed ephemeral pages are matched using an algorithm called "compression buddies". This algorithm ensures that physical page frames containing two compressed ephemeral pages can easily be reclaimed when necessary; zcache provides a standard "shrinker" routine so those whole page frames can be reclaimed when required by the kernel using the existing kernel shrinker mechanism.

When the compression ratio is poor, a zcache "put" operation may fail.
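A rough way to picture the "compression buddies" pairing and the rejection of poorly compressing pages: only accept an ephemeral page if its compressed form is small enough to share a physical page frame with a buddy. This is a deliberately simplified model (zcache_accept_ephemeral is an invented name), not zcache's actual allocator or policy.

#include <stddef.h>
#include <stdbool.h>

#define PAGE_SIZE 4096   /* assumed page size for this sketch */

/* Simplified acceptance check: require the compressed page to fit in
 * half a frame so that two such pages can always be paired into one
 * physical page frame, keeping that frame easy to reclaim as a unit.
 * Pages that compress poorly are rejected, and the kernel simply
 * drops the clean page as it would have anyway. */
static bool zcache_accept_ephemeral(size_t compressed_len)
{
    return compressed_len <= PAGE_SIZE / 2;
}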

RAMster

RAMster is still under development but a proof-of-concept exists today. RAMster assumes that we have a cluster-like set of systems with some high-speed communication layer, or "exofabric", connecting them. The collected RAM of all the systems in the "collective" is the shared RAM resource used by tmem. Each cluster node acts as both a tmem client and a tmem server, and decides how much of its RAM to provide to the collective. Thus RAMster is a "peer-to-peer" implementation of tmem.

Interestingly, RAMster-POC demonstrates a useful dimension of tmem: Once pages have been placed in tmem, the data can be transformed in various ways as long as the pages can be reconstituted when required. When pages are put to RAMster-POC, they are first compressed and cached locally using a zcache-like tmem backend. As local memory constraints increase, an asynchronous process attempts to "remotify" pages to another cluster node; if one node rejects the attempt, another node can be used as long as the local node tracks where the remote data resides.

 

While this multi-level mechanism in RAMster works nicely for puts, there is currently no counterpart for gets. When a tmem frontend requests a persistent get, the data must be fetched immediately and synchronously; the thread requiring the data must busy-wait for the data to arrive and the scheduler must not be called. As a result current RAMster-POC is best suited for many-core processors, where it is unusual for all cores to be simultaneously active.

Transcendent memory for Xen

Tmem was originally conceived for Xen and so the Xen implementation is the most mature. The tmem backend in Xen utilizes spare hypervisor memory to store data, supports a large number of guests, and optionally implements both compression and deduplication (both within a guest and across guests) to maximize the volume of data that can be stored. The tmem frontends are converted to Xen hypercalls using a shim. Individual guests may be equipped with "self-ballooning" and "frontswap-self-shrinking" (both in Linux 3.1) to optimize their interaction with Xen tmem. Xen tmem also supports shared ephemeral pools, so that guests co-located on a physical server that share a cluster filesystem need only keep one copy of a cleancache page in tmem.
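Conceptually, the shim packs each frontend call into a single hypercall argument and traps into Xen, which copies the page to or from its spare memory. The structure layout, command encoding, and function names below are illustrative assumptions, not the actual Xen ABI.

#include <stdint.h>

/* Rough sketch of a guest-side shim translating a cleancache or
 * frontswap call into one Xen tmem hypercall. */
struct tmem_hypercall_arg {
    uint32_t cmd;         /* e.g. new-pool / put-page / get-page / flush */
    int32_t  pool_id;
    uint64_t oid[3];      /* 192-bit object id */
    uint32_t index;
    uint64_t guest_pfn;   /* page to copy from (put) or into (get) */
};

/* Assumed to trap into the hypervisor, which performs the copy and
 * returns a status code. */
long xen_tmem_hypercall(struct tmem_hypercall_arg *arg);

static long xen_tmem_put_page(int32_t pool, const uint64_t oid[3],
                              uint32_t index, uint64_t guest_pfn)
{
    struct tmem_hypercall_arg arg = {
        .cmd = 1,   /* hypothetical "put page" command value */
        .pool_id = pool,
        .oid = { oid[0], oid[1], oid[2] },
        .index = index,
        .guest_pfn = guest_pfn,
    };
    return xen_tmem_hypercall(&arg);
}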

 
