Category: LINUX
2013-08-04 17:10:50
Q: During a buffered write via sys_write(), why is each page first converted into a buffer page?
A: File systems perform disk I/O in units of blocks (buffer_head). Once a page has been converted into a buffer page, every disk block backing the page has a corresponding buffer_head, which records mapping information such as whether the block is mapped to disk and its on-disk block number. With this information, writeback of a dirty page can submit the dirty data directly to the block layer; otherwise, writeback would first have to call get_block() to obtain the mapping, adding extra disk I/O. A page that is not converted into a buffer page can still be written back correctly, and skipping the conversion saves the memory the buffer_heads would occupy: with ext2's NOBH mount option, or for pages set up by mmap(), a page whose backing blocks are contiguous on disk can be written back without the conversion.
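For reference, a minimal sketch of the mapping information a buffer_head carries and the writeback shortcut it enables (2.6-era interfaces; the helper itself is illustrative and glosses over the locking and state transitions real writeback performs):
#include <linux/buffer_head.h>
/* Each buffer_head of a buffer page records its disk mapping:
 *   bh->b_bdev    - the backing block device
 *   bh->b_blocknr - the block number on that device
 *   bh->b_state   - BH_Mapped, BH_Uptodate, BH_Dirty, BH_New, ...
 * so writeback can submit dirty blocks without calling get_block(). */
static void writeback_page_buffers(struct page *page)
{
	struct buffer_head *bh, *head;
	bh = head = page_buffers(page);
	do {
		if (buffer_mapped(bh) && buffer_dirty(bh))
			submit_bh(WRITE, bh); /* mapping already known */
		bh = bh->b_this_page;
	} while (bh != head);
}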
Q: write_begin() calls get_block() to allocate disk blocks from the file system. If a later step fails, where are the blocks allocated for this page released?
A: If __block_prepare_write() fails, vmtruncate() is called to release the data blocks lying beyond the file size; blocks that were allocated inside a pre-existing hole are left alone.
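A condensed sketch of that error path as it appears in block_write_begin() of 2.6-era kernels (details vary slightly across versions):
status = __block_prepare_write(inode, page, start, start + len, get_block);
if (unlikely(status)) {
	ClearPageUptodate(page);
	unlock_page(page);
	page_cache_release(page);
	/*
	 * prepare_write() may have instantiated a few blocks
	 * outside i_size: trim these off again.  Blocks that
	 * fell into a pre-existing hole are not touched.
	 */
	if (pos + len > inode->i_size)
		vmtruncate(inode, inode->i_size);
}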
Q: Where can the write() system call block during a buffered write?
A:
1. write_begin() calls the file system's get_block() to allocate disk blocks for the buffers; the allocation may need to read metadata from disk, such as group descriptors, block bitmaps, and the file's indirect blocks, which blocks.
2. A partial write into a block must first read the block's data from disk to avoid destroying it, which blocks.
3. A finished write() increases the system's dirty page count; when the dirty-page ratio approaches /proc/sys/vm/dirty_ratio, sys_write() calls balance_dirty_pages_ratelimited_nr() to write back some dirty pages, which blocks. (A sketch of the threshold computation follows this list.)
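A simplified sketch of how that threshold was derived in mm/page-writeback.c of that era (vm_dirty_ratio is the kernel variable behind /proc/sys/vm/dirty_ratio; the real get_dirty_limits() also handles the background ratio and per-task adjustments):
/* Simplified from get_dirty_limits(), mm/page-writeback.c (2.6.x) */
unsigned long dirty_thresh =
	(vm_dirty_ratio * determine_dirtyable_memory()) / 100;
/* balance_dirty_pages() starts writing back dirty pages (and may
 * block the writer) once
 *     global_page_state(NR_FILE_DIRTY) +
 *     global_page_state(NR_UNSTABLE_NFS) +
 *     global_page_state(NR_WRITEBACK)  >  dirty_thresh
 */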
Q: If sys_write() allocated disk blocks for the file (extending it or filling a hole), what state should the newly allocated blocks be in once the system call returns?
A: buffer_uptodate(bh) == 1 && buffer_dirty(bh) == 1
sys_write() runs into one of the following cases (a hypothetical sanity-check helper follows them):
1. Everything goes well: mark the written bhs UPTODATE and DIRTY.
__block_commit_write()
    for (each bh touched by sys_write)
        set_buffer_uptodate(bh);
        mark_buffer_dirty(bh);
2. The user-space buffer is invalid and fewer bytes than expected reach the kernel buffer: the bhs that did not receive their expected user content are zeroed, then likewise marked UPTODATE and DIRTY.
block_write_end()
    page_zero_new_buffers()
        for (buffer_new(bh) && bh missing its expected user content)
            set_buffer_uptodate(bh);
            clear_buffer_new(bh);
            mark_buffer_dirty(bh);
3. get_block() hits an I/O error while allocating disk blocks for the file: zero the bhs of the current page that were just mapped, then likewise mark them UPTODATE and DIRTY.
__block_prepare_write()
    if (get_block() failed || reading a bh failed)
        page_zero_new_buffers()
            for (buffer_new(bh) && bh touched by sys_write)
                set_buffer_uptodate(bh);
                clear_buffer_new(bh);
                mark_buffer_dirty(bh);
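The invariant stated in the answer can be spelled out as a hypothetical helper (check_new_bh_state() is not a kernel function; it merely asserts the expected end state of a buffer whose disk block this write just allocated):
/* Hypothetical helper: assert the expected end state of a buffer
 * whose disk block was freshly allocated by this write.  Uptodate
 * plus dirty guarantees that writeback pushes either the user's
 * data or zeroes to the new block -- never its stale old bytes. */
static void check_new_bh_state(struct buffer_head *bh)
{
	BUG_ON(!buffer_mapped(bh));   /* a disk block was assigned */
	BUG_ON(buffer_new(bh));       /* BH_New was cleared again */
	BUG_ON(!buffer_uptodate(bh)); /* holds user data or zeroes */
	BUG_ON(!buffer_dirty(bh));    /* queued for writeback */
}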
Q: sys_write() to file foo succeeds, and the system loses power within 30 seconds. After power-up, can reading foo return unexpected data?
A: Yes. Within 30 seconds of a successful sys_write() to foo, the file's metadata may already have been written back to disk while some dirty data pages have not. Reading foo back then returns, for those blocks, whatever old data the disk previously held there, i.e. data that is neither what the user wrote nor zeroes.
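When that window is unacceptable, the standard remedy is an explicit flush before relying on the data; a minimal user-space sketch using fsync() (file name and error handling are illustrative):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "important data";
	int fd = open("foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
		return 1;
	/* Push data and metadata to stable storage now; without this,
	 * a crash inside the writeback window can expose stale block
	 * contents as described above. */
	if (fsync(fd) < 0) {
		perror("fsync");
		return 1;
	}
	return close(fd) < 0;
}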
Call path:
sys_write()
vfs_write()
do_sync_write()
generic_file_aio_write()
__generic_file_aio_write_nolock()
generic_file_buffered_write()
generic_perform_write()
for(each page in the file range that write() touches){
a_ops->write_begin()
//allocate the page and insert it into the file's radix tree; if the file is extended, call the file system's get_block() to allocate disk blocks
block_write_begin()
if (page == NULL) {
//allocate the page, lock it, and add it to the page cache
page = __grab_cache_page(mapping, index);
}
//call get_block(create=1) to allocate a disk block for each block (buffer_head) of the page
__block_prepare_write()
iov_iter_copy_from_user_atomic()//copy user-space data into the page cache
//1. mark the written buffer_heads uptodate and dirty
//2. if the write covers the whole page, mark the page uptodate
//3. update the inode size from the number of bytes actually copied
//4. unlock the page
a_ops->write_end()
generic_write_end()
//this write raises the system's dirty page count; if the threshold is exceeded, write back some dirty pages
balance_dirty_pages_ratelimited(mapping);
}
Key functions:
//allocate pages, populate the page cache, map the page to disk blocks, copy user data into the page, and mark the page dirty
static ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
/*
* Copies from kernel address space cannot fail (NFSD is a big user).
*/
if (segment_eq(get_fs(), KERNEL_DS))
flags |= AOP_FLAG_UNINTERRUPTIBLE;
do {
struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
//for the first and last page, bytes may be smaller than PAGE_CACHE_SIZE
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_count(i));
again:
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*
* Not only is this an optimisation, but it is also required
* to check that the address is actually valid, when atomic
* usercopies are used, below.
*/
//validate the user buffer by faulting in (reading) one byte of it; only the current iovec of the iov_iter is checked
//this must run before a_ops->write_begin(), because write_begin() locks the page
//and reading user memory can fault; if the page is mmap()ed and the source and
//destination are the same page, the fault handler would run:
//do_page_fault()->do_no_page()->filemap_nopage()->lock_page()
//so the same thread would take the page lock twice: deadlock
if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
//allocate the page and insert it into the file's radix tree; if the file is extended, call get_block() to allocate disk blocks
//after write_begin() succeeds, PageLocked(page) == 1
//implementations: ext2_write_begin(), blkdev_write_begin()
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);//ext2_write_begin,blkdev_write_begin
if (unlikely(status))
break;
//when iov_iter_copy_from_user_atomic() copies from user space and bytes spans
//several iovecs, copying the iovecs after the first may still fault, but the faulting page is then certainly not the destination page
//that fault path would be: do_page_fault()->down_read(&mm->mmap_sem)->lock_page(source page)
//if page faults were not disabled here, a kernel path such as sys_munmap() running concurrently:
//down_write(&mm->mmap_sem)->make_pages_present()->get_user_pages()->filemap_nopage()->lock_page()
//could produce an ABBA deadlock
//reference:
pagefault_disable();
//bytes may span several iovecs, but iov_iter_fault_in_readable() only checks the
//current one, so copying a later iovec can stop early at an invalid user address,
//in which case copied < bytes
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
pagefault_enable();
flush_dcache_page(page);
//1. mark the written buffer_heads uptodate and dirty
//2. if the write covers the whole page, mark the page uptodate
//3. update the inode size from the number of bytes actually copied
//4. unlock the page
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);//generic_write_end,blkdev_write_end
if (unlikely(status < 0))
break;
copied = status;
cond_resched();
iov_iter_advance(i, copied);
//copied == 0 must be handled, otherwise this loops forever
if (unlikely(copied == 0)) {
/*
* If we were unable to copy any data at all, we must
* fall back to a single segment length write.
*
* If we didn't fallback here, we could livelock
* because not all segments in the iov can be copied at
* once without a pagefault.
*/
//bytes spilled into a later iovec whose address is bad, so the copy must be retried
//the retry copies only the current iovec, whose user buffer is known to be readable
//when the loop reaches the faulty iovec itself, it breaks out naturally and returns -EFAULT
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
}
pos += copied;
written += copied;
//this write raises the system's dirty page count; write back some dirty pages if the threshold is exceeded
balance_dirty_pages_ratelimited(mapping);
} while (iov_iter_count(i));
return written ? written : status;
}
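The copied < bytes fallback above can be triggered from user space with a writev() whose second iovec points at an invalid address; a minimal sketch (addresses, sizes, and the file name are illustrative):
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	char good[8] = "GOODDATA";
	struct iovec iov[2] = {
		{ .iov_base = good,         .iov_len = sizeof(good) },
		{ .iov_base = (void *)0x10, .iov_len = 8 }, /* bogus */
	};
	int fd = open("foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	ssize_t n = writev(fd, iov, 2);
	/* Expected on a buffered file: n == 8.  The atomic usercopy
	 * stops after the 8 bytes of segment 0, and the next
	 * iteration's iov_iter_fault_in_readable() on the bogus
	 * segment fails, so the partial count (not -EFAULT) is
	 * returned. */
	printf("writev returned %zd\n", n);
	return close(fd);
}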
//1. if the page is not yet a buffer page, convert it into one
//2. call get_block(create=1) to allocate a disk block for each buffer_head of the page
static int __block_prepare_write(struct inode *inode, struct page *page,
unsigned from, unsigned to, get_block_t *get_block)
{
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
BUG_ON(!PageLocked(page));
BUG_ON(from > PAGE_CACHE_SIZE);//from and to are offsets within the page
BUG_ON(to > PAGE_CACHE_SIZE);
BUG_ON(from > to);
blocksize = 1 << inode->i_blkbits;
//convert the page into a buffer page
if (!page_has_buffers(page))
create_empty_buffers(page, blocksize, 0);
head = page_buffers(page);
bbits = inode->i_blkbits;
block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
for(bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
//[block_start, block_end) delimits the current block within the page
if (block_end <= from || block_start >= to) {//[from, to) does not intersect the current block
if (PageUptodate(page)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
}
continue;
}
if (buffer_new(bh))
clear_buffer_new(bh);
if (!buffer_mapped(bh)) {
WARN_ON(bh->b_size != blocksize);
//ask the file system to allocate a disk block for the file block at index 'block', blocksize bytes long
err = get_block(inode, block, bh, 1);
if (err)
break;
//the file system allocated a disk block for this bh: the file grew, or a hole is being filled
if (buffer_new(bh)) {
//resolve the aliasing problem of a stale (possibly still dirty) buffer for this block cached against the block device
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
//the page's contents are valid, so the whole block can be written at will
if (PageUptodate(page)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);//mapped + uptodate => mark it dirty
continue;
}
//only part of this newly mapped block is being written; zero the parts outside [from, to)
if (block_end > to || block_start < from)
zero_user_segments(page,
to, block_end,
block_start, from);
continue;
}
}
if (PageUptodate(page)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
continue;
}
//the current block is already mapped on disk but is only partially written; read the old data in first so the unwritten bytes are not lost
if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
!buffer_unwritten(bh) &&
(block_start < from || block_end > to)) {
ll_rw_block(READ, 1, &bh);
*wait_bh++=bh;
}
}
/*
* If we issued read requests - let them complete.
*/
while(wait_bh > wait) {
//wait for the buffer_head reads to finish; at most two buffer_heads are read
wait_on_buffer(*--wait_bh);
if (!buffer_uptodate(*wait_bh))
err = -EIO;
}
//on an I/O error, the freshly mapped buffer_heads must have their contents zeroed, otherwise:
//1. a later sys_write() partially writing the same buffer_head would not enter the
//   !buffer_mapped(bh) branch and hence never zero_user_segments(), so
//   uninitialized data would be written to disk
//2. a sys_read() hitting the same buffer_head would return uninitialized data
if (unlikely(err))
page_zero_new_buffers(page, from, to);
return err;
}
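For context, a sketch of the get_block contract honoured by implementations such as ext2_get_block() (the typedef is from <linux/fs.h>; the summary of effects is illustrative, not actual ext2 code):
/* get_block_t, as declared in <linux/fs.h>: map file-relative block
 * 'iblock' of 'inode' onto a disk block, allocating one if create=1. */
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			  struct buffer_head *bh_result, int create);
/* On success the callback has, in essence, done:
 *
 *     map_bh(bh_result, inode->i_sb, disk_block_nr);
 *         -- sets BH_Mapped plus bh->b_bdev and bh->b_blocknr
 *     if (the block was freshly allocated)
 *         set_buffer_new(bh_result);
 *         -- BH_New tells __block_prepare_write() that the block
 *            holds no valid data yet and must be zero-filled or
 *            fully overwritten before anyone may read it
 */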
/*
* If a page has any new buffers, zero them out here, and mark them uptodate
* and dirty so they'll be written out (in order to prevent uninitialised
* block data from leaking). And clear the new bit.
*/
//on failure, zero the contents of the freshly mapped blocks
void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
{
unsigned int block_start, block_end;
struct buffer_head *head, *bh;
BUG_ON(!PageLocked(page));
if (!page_has_buffers(page))
return;
bh = head = page_buffers(page);
block_start = 0;
do {
block_end = block_start + bh->b_size;
if (buffer_new(bh)) {
if (block_end > from && block_start < to) {//[from, to) intersects the current block
if (!PageUptodate(page)) {
unsigned start, size;
start = max(from, block_start);
size = min(to, block_end) - start;
zero_user(page, start, size);//zero the intersecting part
//__block_prepare_write() already zeroed the non-intersecting parts via
//zero_user_segments(), so the whole buffer is now zeroed and may be marked UPTODATE
set_buffer_uptodate(bh);
}
//buffer_uptodate(bh) == 1
//the buffer was mapped to disk only just now and an error has occurred; mark_buffer_dirty()
//makes sure the zeroes reach the disk block, so a later read of the file cannot see uninitialized data
clear_buffer_new(bh);
mark_buffer_dirty(bh);
}
}
block_start = block_end;
bh = bh->b_this_page;
} while (bh != head);
}