Category: LINUX
2013-11-11 09:57:52
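I already had a rough understanding of the SCSI layer and the VFS layer, and wanted to learn about the generic block layer: the page cache, the elevator algorithms and so on. I still have not found the time to study it properly, and it is fairly complicated material. So I skimmed the book below and browsed the source code briefly. The book really is a classic. I had flipped through all of it once before, but I tend to read quickly and roughly, so little of it stuck. Going back to it now, much of it is actually explained very clearly. Later, in the kernel source section of the chinaunix forum, I also read posts from other users that list the detailed call chains for reading and writing files. I still traced through them myself with those posts as a reference, mainly because I wanted to understand how the different data structures are tied together.
Reading notes
This book really is a classic, and these chapters are all related to one another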
Understanding the Linux Kernel, 3rd Edition
Chapter 14. Block Device Drivers
Chapter 15. The Page Cache
Chapter 16. Accessing Files
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=Documentation/block/biodoc.txt
https://www.kernel.org/doc/htmldocs/kernel-api/blkdev.html
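Later I found these two posts online, written by experts, which list the source-level call chains in considerable detail; I used them as a guide to the flow while reading the code myself.
----------------------Related data structures---------------------------
file, inode // files are tied to the VFS layer through these. The inode appears to hold a pointer to the block device.
block_device vfs operation // created by the disk driver
address_space // manages the page cache; lookups go through a radix tree
->address_space_operations // defined by each filesystem
file -> iovec->kiocb // generic structures that describe a user-space read or write in terms of offset and pos.
address_space_operations // the operations a filesystem registers to actually read and write pages; filesystems also lean on many block-layer helper functions to do the work.
page, buffer_head // describe how the page cache's buffers map onto each page. Used by the page cache layer.
bio // built from buffer heads; several can be merged into one request. Roughly the conversion structure between the page cache layer's buffer heads and disk requests.
gendisk // created by the disk driver when it discovers a storage device; represents one disk. Each gendisk has a corresponding request queue.
request queue
request // the generic block layer's disk request, built from bios
scsi command // used by the SCSI layer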
--------------------------
When the VFS inode is initialized, its address_space_operations are assigned
static const struct address_space_operations ext4_ordered_aops = {
.readpage = ext4_readpage,
.readpages = ext4_readpages,
.writepage = ext4_writepage, /* write a dirty page to disk */
.write_begin = ext4_write_begin, /* reached from generic_perform_write; prepares the page cache page that will be operated on */
.write_end = ext4_ordered_write_end,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
.direct_IO = ext4_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
----------------------------
block device -> inode
-> struct hd_struct * hd_part partition
-> struct gendisk* hd_disk disk
All block devices in the system are kept on the global list all_bdevs
Driver
.4.2.1. Defining a custom driver descriptor
.4.2.2. Initializing the custom descriptor (register_blkdev function)
.4.2.3. Initializing the gendisk descriptor
.4.2.4. Initializing the table of block device methods
.4.2.5. Allocating and initializing a request queue (blk_init_queue function)
.4.2.6. Setting up the interrupt handler (request_irq function)
.4.2.7. Registering the disk (add_disk function: registers the sysfs kobject, scans the partitions, and initializes the gendisk's partition array)
default block device file operations
read generic_file_read( )
write blkdev_file_write( )
aio_read generic_file_aio_read( )
aio_write blkdev_file_aio_write( )
----------------------------
Disk discovered -> alloc_disk( )->
gendisk-> request queue
-> struct hd_struct
bio_alloc
bio-> struct block_device * bi_bdev
Request Queue ->struct request * last_merge
->elevator_t * elevator
-> request_fn_proc * request_fn
->spinlock_t * queue_lock
->unsigned short max_hw_sectors
->unsigned short max_phys_segments
-----------------------
Page Cache
If the owner of a page in the page cache is a file, the address_space object is embedded in the
i_data field of a VFS inode object. The i_mapping field of the inode always points to the
address_space object of the owner of the pages containing the inode's data. The host field of the
address_space object points to the inode object in which the descriptor is embedded.
Thus, if a page belongs to a file that is stored in an Ext3 filesystem , the owner of the page is the
inode of the file and the corresponding address_space object is stored in the i_data field of the
VFS inode object. The i_mapping field of the inode points to the i_data field of the same inode,
and the host field of the address_space object points to the same inode.
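The three links described above (the embedded i_data, the i_mapping pointer, and the host back-pointer) can be sketched in user space. This is only a toy model: the toy_ structures are hypothetical stand-ins that keep the field names from the text and nothing else of the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;                        /* forward declaration */

/* Simplified stand-in for struct address_space. */
struct toy_address_space {
    struct toy_inode *host;              /* owner of the cached pages */
    unsigned long nrpages;               /* number of cached pages */
};

/* Simplified stand-in for a VFS inode. */
struct toy_inode {
    unsigned long i_ino;
    struct toy_address_space *i_mapping; /* for a regular file: &i_data */
    struct toy_address_space i_data;     /* embedded address_space */
};

/* Wire the three links the way the text describes for a regular file. */
static void toy_inode_init(struct toy_inode *inode, unsigned long ino)
{
    inode->i_ino = ino;
    inode->i_data.host = inode;
    inode->i_data.nrpages = 0;
    inode->i_mapping = &inode->i_data;
}

/* Recover the owning inode from an address_space. */
static struct toy_inode *toy_mapping_host(struct toy_address_space *mapping)
{
    assert(mapping != NULL);
    return mapping->host;
}
```

After toy_inode_init(), following i_mapping and then host from an inode leads back to the same inode, which is the circular relationship the excerpt spells out for the Ext3 case.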
The methods of the address_space object
writepage Write operation (from the page to the owner's disk image)
sync_page Start the I/O data transfer of already scheduled operations on owner's pages
set_page_dirty Set an owner's page as dirty
The most important methods are readpage, writepage, prepare_write, and commit_write.
find_get_page( )
add_to_page_cache
remove_from_page_cache
read_cache_page
The buffer_head structure records where a buffer lives within a page
buffer_head -> b_state flags: dirty, async, and so on
-> b_page the page that holds this block
-> b_size size
-> b_data position of the block within the page
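To make the page/buffer relationship concrete, here is a rough user-space sketch that describes every block of one page with a buffer head. It assumes a 4 KiB page, and toy_buffer_head is a hypothetical simplification: in particular its b_page is a plain byte pointer, whereas the kernel field points to a page descriptor.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Simplified stand-in for struct buffer_head (field names follow the notes). */
struct toy_buffer_head {
    unsigned long b_state;   /* flag bits such as "dirty" */
    unsigned char *b_page;   /* start of the page holding this block */
    size_t b_size;           /* block size in bytes */
    unsigned char *b_data;   /* where the block's data sits inside the page */
};

/*
 * Describe every block of one page with a buffer head.
 * Returns the number of buffer heads filled in, or 0 on bad input
 * (block size that does not divide the page, or too few slots in bh).
 */
static size_t toy_create_buffers(unsigned char *page, size_t block_size,
                                 struct toy_buffer_head *bh, size_t max_bh)
{
    size_t i, n;

    if (block_size == 0 || TOY_PAGE_SIZE % block_size != 0)
        return 0;
    n = TOY_PAGE_SIZE / block_size;
    if (n > max_bh)
        return 0;
    for (i = 0; i < n; i++) {
        bh[i].b_state = 0;
        bh[i].b_page = page;
        bh[i].b_size = block_size;
        bh[i].b_data = page + i * block_size;
    }
    return n;
}
```

With 1024-byte blocks a page is covered by four buffer heads, and the fourth one's data starts 3072 bytes into the page.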
alloc_buffer_head( ) and free_buffer_head( )
.2.4. Allocating Block Device Buffer Pages
.2.7. Submitting Buffer Heads to the Generic Block Layer
.2.7.1. The submit_bh( ) function
.3. Writing Dirty Pages to Disk
.3.1. The pdflush Kernel Threads
Earlier
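sync() lets a process flush all dirty pages to disk
fsync() flushes the dirty pages of one particular open file to disk
fdatasync() same as fsync(), but does not flush the file's inode block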
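From user space the per-file calls are used roughly as follows. This is a minimal sketch using the standard POSIX calls; the helper name write_and_flush and any path passed to it are made up for illustration, and a short write is simply treated as a failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg to path and force it to disk.
 * data_only != 0 uses fdatasync(), otherwise fsync().
 * Returns 0 on success, -1 on any failure.
 */
static int write_and_flush(const char *path, const char *msg, int data_only)
{
    size_t len = strlen(msg);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int rc;

    if (fd < 0)
        return -1;
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    /* Both calls block until the file's dirty data has reached the device. */
    rc = data_only ? fdatasync(fd) : fsync(fd);
    if (close(fd) != 0)
        return -1;
    return rc == 0 ? 0 : -1;
}
```

Without the flush, write() returning only means the data is in the page cache; the write-back described in this chapter happens later.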
The service routine sys_sync( ) of the sync( ) system call invokes a series of auxiliary functions:
wakeup_bdflush(0);
sync_inodes(0);
sync_supers( );
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);
================================
Chapter 16. Accessing Files
Reading a file is page-based: the kernel always transfers whole pages of data at once. If a
process issues a read( ) system call to get a few bytes, and that data is not already in RAM, the
kernel allocates a new page frame, fills the page with the suitable portion of the file, adds the
page to the page cache, and finally copies the requested bytes into the process address space.
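Since reads are page-based, a byte range first has to be mapped onto page indices. The arithmetic is sketched below under the assumption of 4 KiB pages; toy_span and its struct are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_page_span {
    unsigned long first_index;  /* index of the first page touched */
    unsigned long last_index;   /* index of the last page touched */
    unsigned long offset;       /* byte offset inside the first page */
};

/* Map the byte range [pos, pos + count) onto page indices; count must be > 0. */
static struct toy_page_span toy_span(unsigned long long pos, size_t count)
{
    struct toy_page_span s;

    assert(count > 0);
    s.first_index = (unsigned long)(pos >> TOY_PAGE_SHIFT);
    s.last_index = (unsigned long)((pos + count - 1) >> TOY_PAGE_SHIFT);
    s.offset = (unsigned long)(pos & (TOY_PAGE_SIZE - 1));
    return s;
}
```

A 10-byte read at position 5000 touches only page 1 (at offset 904), while the same read at position 4090 straddles pages 0 and 1, so two whole pages have to be present in the cache.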
For most filesystems, reading a page of data from a file is just a matter of finding what blocks on
disk contain the requested data. Once this is done, the kernel fills the pages by submitting the
proper I/O operations to the generic block layer. In practice, the read method of all disk-based
filesystems is implemented by a common function named generic_file_read( ).
Write operations on disk-based files are slightly more complicated to handle, because the file size
could increase, and therefore the kernel might allocate some physical blocks on the disk. Of
course, how this is precisely done depends on the filesystem type. However, many disk-based
filesystems implement their write methods by means of a common function named
generic_file_write( ). Examples of such filesystems are Ext2, System V /Coherent /Xenix , and
MINIX . On the other hand, several other filesystems, such as journaling and network filesystems
, implement the write method by means of custom functions.
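generic_file_read -> do_generic_file_read function
filp: the file object being operated on
Look up the address_space, here filp->f_mapping, and find the page in the page cache that holds the data
page_cache_readahead calls readpage to actually read the file
// the page really has to be read from disk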
int ext3_readpage(struct file *file, struct page *page)
{
return mpage_readpage(page, ext3_get_block);
}
// read directly from the block device
int blkdev_readpage(struct file * file, struct page * page)
{
return block_read_full_page(page, blkdev_get_block);
}
generic_file_write
file object, buffer converted into an iovec
find the inode
init_sync_kiocb sets up the kiocb
__generic_file_aio_write_nolock
ext4_write_begin
grab_cache_page_write_begin
__block_write_begin
ext4_generic_write_end
block_write_end
__block_commit_write
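The write_begin / copy / write_end sequence in the call list above is a per-page loop. The following user-space toy shows only that loop shape: the toy_ names are hypothetical, the "page cache" is a fixed array of four 4 KiB pages, and none of the locking, block mapping or journaling of the real path is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NR_PAGES  4u

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int dirty;
};

struct toy_mapping {
    struct toy_page pages[TOY_NR_PAGES];   /* stands in for the page cache */
};

/* Plays the role of write_begin: find the page that backs this index. */
static struct toy_page *toy_write_begin(struct toy_mapping *m, size_t index)
{
    return index < TOY_NR_PAGES ? &m->pages[index] : NULL;
}

/* Plays the role of write_end: the copied bytes make the page dirty. */
static void toy_write_end(struct toy_page *page)
{
    page->dirty = 1;
}

/* Page-by-page copy loop; returns the number of bytes written. */
static size_t toy_perform_write(struct toy_mapping *m, const void *buf,
                                size_t count, size_t pos)
{
    const unsigned char *src = buf;
    size_t written = 0;

    while (written < count) {
        size_t index = pos / TOY_PAGE_SIZE;
        size_t offset = pos % TOY_PAGE_SIZE;
        size_t bytes = TOY_PAGE_SIZE - offset;  /* room left in this page */
        struct toy_page *page;

        if (bytes > count - written)
            bytes = count - written;
        page = toy_write_begin(m, index);
        if (page == NULL)
            break;                              /* ran past the toy cache */
        memcpy(page->data + offset, src + written, bytes);
        toy_write_end(page);
        pos += bytes;
        written += bytes;
    }
    return written;
}
```

Note that nothing here touches a disk: the write only dirties pages, and the dirty pages are pushed out later by the write-back path.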
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the drivers request_fn. The driver makes
use of block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass block and directly
invoke underlying driver entry points passing in a specially constructed
request structure.
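The hand-off described above (a make_request step builds requests on a queue, the driver pulls them off) can be modelled in a few lines. This is a toy under stated assumptions: the toy_ names are invented, the queue is a fixed array, only a back merge against the most recently queued request is attempted, and there is no elevator sorting at all.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQUESTS 16u

struct toy_bio {
    unsigned long sector;      /* first 512-byte sector */
    unsigned long nr_sectors;  /* length in sectors */
};

struct toy_request {
    unsigned long sector;
    unsigned long nr_sectors;
};

struct toy_queue {
    struct toy_request rq[TOY_MAX_REQUESTS];
    size_t head;               /* next request the driver will take */
    size_t tail;               /* one past the last queued request */
};

/*
 * Queue a bio. If it starts where the most recently queued request ends,
 * grow that request (a back merge); otherwise start a new request.
 * Returns 1 if merged, 0 if a new request was added, -1 if the queue is full.
 */
static int toy_make_request(struct toy_queue *q, const struct toy_bio *bio)
{
    if (q->tail > q->head) {
        struct toy_request *last = &q->rq[q->tail - 1];

        if (last->sector + last->nr_sectors == bio->sector) {
            last->nr_sectors += bio->nr_sectors;
            return 1;
        }
    }
    if (q->tail == TOY_MAX_REQUESTS)
        return -1;
    q->rq[q->tail].sector = bio->sector;
    q->rq[q->tail].nr_sectors = bio->nr_sectors;
    q->tail++;
    return 0;
}

/* Driver side: take the next request off the queue, or NULL if it is empty. */
static struct toy_request *toy_next_request(struct toy_queue *q)
{
    return q->head < q->tail ? &q->rq[q->head++] : NULL;
}
```

Two bios covering sectors 0-7 and 8-15 end up as a single 16-sector request, which is the point of merging: the driver issues one larger transfer instead of two small ones.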
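Code reading
Only the relevant function names are listed here, to make them easy to look up later. For the detailed version with brief comments, see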
http://gmd20.blog.163.com/blog/static/16843923201291541739663/
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
const struct file_operations ext4_file_operations = {
    .llseek = ext4_llseek,
    .read = do_sync_read,
    .write = do_sync_write,
    .aio_read = generic_file_aio_read,
    .aio_write = ext4_file_write, // the asynchronous path maps to this
    .unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = ext4_compat_ioctl,
#endif
    .mmap = ext4_file_mmap,
    .open = ext4_file_open,
    .release = ext4_release_file,
    .fsync = ext4_sync_file,
    .splice_read = generic_file_splice_read,
    .splice_write = generic_file_splice_write,
    .fallocate = ext4_fallocate,
};
static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
    unsigned long nr_segs, loff_t pos)
ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
    unsigned long nr_segs, loff_t *ppos)
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
    unsigned long nr_segs, loff_t pos, loff_t *ppos,
    size_t count, ssize_t written)
static ssize_t generic_perform_write(struct file *file,
    struct iov_iter *i, loff_t pos)
static int ext4_write_begin(struct file *file, struct address_space *mapping,
    loff_t pos, unsigned len, unsigned flags,
    struct page **pagep, void **fsdata)
struct page *grab_cache_page_write_begin(struct address_space *mapping,
    pgoff_t index, unsigned flags)
int __block_write_begin(struct page *page, loff_t pos, unsigned len,
static int _ext4_get_block(struct inode *inode, sector_t iblock,
    struct buffer_head *bh, int flags)
void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
int submit_bh(int rw, struct buffer_head *bh)
void submit_bio(int rw, struct bio *bio)
void generic_make_request(struct bio *bio)
struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
void blk_queue_bio(struct request_queue *q, struct bio *bio)
int block_write_end(struct file *file, struct address_space *mapping,
static void __set_page_dirty(struct page *page,
int bdi_writeback_thread(void *data)
void add_disk(struct gendisk *disk)
->bdi_register ->kthread_run(bdi_forker_thread,->kthread_create(bdi_writeback_thread
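When a disk driver adds a gendisk to the system, add_disk starts a bdi_writeback_thread kernel thread for that gendisk. According to the book, though, the number of these threads can grow automatically at times.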
/*
* Handle writeback of dirty data for the device backed by this bdi. Also
* wakes up periodically and does kupdated style flushing.
*/
int bdi_writeback_thread(void *data)
->loops calling pages_written = wb_do_writeback(wb, 0);
/*
* Retrieve work items and do the writeback they describe
*/
long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
->wb_check_old_data_flush()
->wb_writeback() -> queue_io () __writeback_inodes_wb()
->writeback_sb_inodes
->__writeback_single_inode
->do_writepages
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
int ret;
if (wbc->nr_to_write <= 0)
return 0;
if (mapping->a_ops->writepages) /* this path ends up calling ext4_writepage */
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
return ret;
}
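The dispatch in do_writepages (use the filesystem's batch method if it has one, otherwise fall back to a generic per-page walk) can be tried out in user space. The sketch below is a hypothetical model: the toy_ structures keep only the fields needed for the dispatch, and the "page cache" is a small array of dirty flags.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_PAGES 8u

struct toy_mapping;

struct toy_wbc {
    long nr_to_write;          /* budget: how many pages may still be written */
};

struct toy_aops {
    int (*writepage)(struct toy_mapping *m, size_t index);
    int (*writepages)(struct toy_mapping *m, struct toy_wbc *wbc); /* optional */
};

struct toy_mapping {
    const struct toy_aops *a_ops;
    int dirty[TOY_NR_PAGES];
    int written;               /* pages written so far, for inspection */
};

/* Fallback: walk the dirty pages and call writepage on each one. */
static int toy_generic_writepages(struct toy_mapping *m, struct toy_wbc *wbc)
{
    size_t i;

    for (i = 0; i < TOY_NR_PAGES && wbc->nr_to_write > 0; i++) {
        int ret;

        if (!m->dirty[i])
            continue;
        ret = m->a_ops->writepage(m, i);
        if (ret)
            return ret;
        wbc->nr_to_write--;
    }
    return 0;
}

/* Same shape as the kernel function quoted above. */
static int toy_do_writepages(struct toy_mapping *m, struct toy_wbc *wbc)
{
    if (wbc->nr_to_write <= 0)
        return 0;
    if (m->a_ops->writepages)
        return m->a_ops->writepages(m, wbc);
    return toy_generic_writepages(m, wbc);
}

/* Example per-page method: clean the page and count it. */
static int toy_writepage(struct toy_mapping *m, size_t index)
{
    m->dirty[index] = 0;
    m->written++;
    return 0;
}
```

An operations table that sets only writepage, as ext4_ordered_aops above does, takes the fallback branch, and the nr_to_write budget bounds how many pages one call may write.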
--------------------------------------
ext4_writepage
__block_write_begin
block_commit_write
block_write_full_page
block_write_full_page_endio
__block_write_full_page
submit_bh /// from here the buffer_head is submitted and then turned into a bio; the code for that is listed earlier.
mpage_da_submit_io
submit_bh
If you have the means, you can attach a kprobe to submit_bh and print a backtrace of the call stack each time it is called.
-------------------------------------------------------
blk_run_queue() ->
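scsi_request_fn() -> scsi_dispatch_cmd -> calls the queuecommand function of the SCSI host registered by the low-level SCSI driver.
The low-level SCSI driver then obtains the SCSI request and cmd inside its own queuecommand function and processes them.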