Category: LINUX
2015-07-30 16:03:49
"If you get the chance to shake hands with the central leaders this time, could you please not wash it off, so that when you come back and shake hands with them, it will be as if the leaders themselves had shaken their hands." On October 17, 2007, Huang Jinlian (黄金莲), principal of the Sanming Special Education School in Fujian and a delegate to the 17th Party Congress, relayed this wish from her students.
The online mob poured ridicule and scorn on this episode, but I think that is quite unnecessary. In fact the students' idea, naive as it looks, carries a rather profound concept: the idea of memory mapping in Linux. In Linux this situation comes up all the time: here is a buffer in user space, there is a buffer in kernel space; one belongs to the application, the other to the device driver. They originally have nothing to do with each other. They are forever mentioned in the same breath yet forever brushing past each other, like a bird in the sky and a fish in the water: perhaps they could fall in love, but where would they build their nest?
The cure for this is mapping: two seemingly unconnected worlds become related once a mapping ties them together. But why connect the one to the other in the first place? If I cast the user buffer as the students in the story and the kernel buffer as principal Huang, you will quickly see that the students want to shake hands with principal Huang not because she has any particular star quality, but because she shook hands with the central leaders. So who plays the central leaders here? Think for a second: what is a device driver for? Driving a device, of course, so the real protagonist is not the device driver but the device itself. The application is willing to map its user buffer to a kernel buffer precisely because the kernel buffer is connected to the device. Shaking hands with the kernel buffer is therefore as good as shaking hands with the device.
Let us take two functions from the Block layer as examples: blk_rq_map_user and blk_rq_map_kern, both from block/ll_rw_blk.c. Back when we analysed the sd module and talked about ioctl, what we ended up calling was sg_io(), and sg_io() needs to call blk_rq_map_user, so we look at that function first.
2394 /**
2395 * blk_rq_map_user - map user data to a request, for REQ_BLOCK_PC usage
2396 * @q: request queue where request should be inserted
2397 * @rq: request structure to fill
2398 * @ubuf: the user buffer
2399 * @len: length of user data
2400 *
2401 * Description:
2402 * Data will be mapped directly for zero copy io, if possible. Otherwise
2403 * a kernel bounce buffer is used.
2404 *
2405 * A matching blk_rq_unmap_user() must be issued at the end of io, while
2406 * still in process context.
2407 *
2408 * Note: The mapped bio may need to be bounced through blk_queue_bounce()
2409 * before being submitted to the device, as pages mapped may be out of
2410 * reach. It's the callers responsibility to make sure this happens. The
2411 * original bio must be passed back in to blk_rq_unmap_user() for proper
2412 * unmapping.
2413 */
2414 int blk_rq_map_user(request_queue_t *q, struct request *rq, void __user *ubuf,
2415 unsigned long len)
2416 {
2417 unsigned long bytes_read = 0;
2418 struct bio *bio = NULL;
2419 int ret;
2420
2421 if (len > (q->max_hw_sectors << 9))
2422 return -EINVAL;
2423 if (!len || !ubuf)
2424 return -EINVAL;
2425
2426 while (bytes_read != len) {
2427 unsigned long map_len, end, start;
2428
2429 map_len = min_t(unsigned long, len - bytes_read, BIO_MAX_SIZE);
2430 end = ((unsigned long)ubuf + map_len + PAGE_SIZE - 1)
2431 >> PAGE_SHIFT;
2432 start = (unsigned long)ubuf >> PAGE_SHIFT;
2433
2434 /*
2435 * A bad offset could cause us to require BIO_MAX_PAGES + 1
2436 * pages. If this happens we just lower the requested
2437 * mapping len by a page so that we can fit
2438 */
2439 if (end - start > BIO_MAX_PAGES)
2440 map_len -= PAGE_SIZE;
2441
2442 ret = __blk_rq_map_user(q, rq, ubuf, map_len);
2443 if (ret < 0)
2444 goto unmap_rq;
2445 if (!bio)
2446 bio = rq->bio;
2447 bytes_read += ret;
2448 ubuf += ret;
2449 }
2450
2451 rq->buffer = rq->data = NULL;
2452 return 0;
2453 unmap_rq:
2454 blk_rq_unmap_user(bio);
2455 return ret;
2456 }
The parameter ubuf of this function is none other than the user buffer, or user-space buffer, handed down from user space, and len is the length of that buffer.
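Before we dig into the internals, here is a minimal sketch of how a caller such as sg_io() is expected to use this pair of functions. This is illustration only, not kernel code: my_queue, my_rq, user_buf and user_len are made-up names, and the execution step is only hinted at.

        struct bio *bio;
        int ret;

        /* map the user buffer into my_rq; on success rq->bio points at the bio */
        ret = blk_rq_map_user(my_queue, my_rq, user_buf, user_len);
        if (ret)
                return ret;

        bio = my_rq->bio;       /* remember it, blk_rq_unmap_user() wants the original bio */

        /* ... issue the request, e.g. with blk_execute_rq(my_queue, NULL, my_rq, 0) ... */

        ret = blk_rq_unmap_user(bio);   /* must still be in process context */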
Perhaps we should have introduced struct bio long ago. It is without question one of the most fundamental, most central, most dashing, most stylish and coolest structures in the Generic Block Layer: it represents a block-device I/O operation in flight. Every classic Linux book describes this structure in loving detail, but as children of the post-80s generation we do not have to follow the crowd or drift with the tide; we want our own personality. So we will not say too much about it here, only that it comes from include/linux/bio.h:
68 /*
69 * main unit of I/O for the block layer and lower layers (ie drivers and
70 * stacking drivers)
71 */
72 struct bio {
73 sector_t bi_sector; /* device address in 512 byte
74 sectors */
75 struct bio *bi_next; /* request queue link */
76 struct block_device *bi_bdev;
77 unsigned long bi_flags; /* status, command, etc */
78 unsigned long bi_rw; /* bottom bits READ/WRITE,
79 * top bits priority
80 */
81
82 unsigned short bi_vcnt; /* how many bio_vec's */
83 unsigned short bi_idx; /* current index into bvl_vec */
84
85 /* Number of segments in this BIO after
86 * physical address coalescing is performed.
87 */
88 unsigned short bi_phys_segments;
89
90 /* Number of segments after physical and DMA remapping
91 * hardware coalescing is performed.
92 */
93 unsigned short bi_hw_segments;
94
95 unsigned int bi_size; /* residual I/O count */
96
97 /*
98 * To keep track of the max hw size, we account for the
99 * sizes of the first and last virtually mergeable segments
100 * in this bio
101 */
102 unsigned int bi_hw_front_size;
103 unsigned int bi_hw_back_size;
104
105 unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
106
107 struct bio_vec *bi_io_vec; /* the actual vec list */
108
109 bio_end_io_t *bi_end_io;
110 atomic_t bi_cnt; /* pin count */
111
112 void *bi_private;
113
114 bio_destructor_t *bi_destructor; /* destructor */
115 };
It does not exist in isolation, either: it is tied to the request. struct request has a member struct bio *bio, which stands for the bios of that request, since one request can cover several I/O operations. The main job of blk_rq_map_user is to build the mapping between the user buffer and the bio, and the concrete work is done by __blk_rq_map_user.
2341 static int __blk_rq_map_user(request_queue_t *q, struct request *rq,
2342 void __user *ubuf, unsigned int len)
2343 {
2344 unsigned long uaddr;
2345 struct bio *bio, *orig_bio;
2346 int reading, ret;
2347
2348 reading = rq_data_dir(rq) == READ;
2349
2350 /*
2351 * if alignment requirement is satisfied, map in user pages for
2352 * direct dma. else, set up kernel bounce buffers
2353 */
2354 uaddr = (unsigned long) ubuf;
2355 if (!(uaddr & queue_dma_alignment(q)) && !(len & queue_dma_alignment(q)))
2356 bio = bio_map_user(q, NULL, uaddr, len, reading);
2357 else
2358 bio = bio_copy_user(q, uaddr, len, reading);
2359
2360 if (IS_ERR(bio))
2361 return PTR_ERR(bio);
2362
2363 orig_bio = bio;
2364 blk_queue_bounce(q, &bio);
2365
2366 /*
2367 * We link the bounce buffer in and could have to traverse it
2368 * later so we have to get a ref to prevent it from being freed
2369 */
2370 bio_get(bio);
2371
2372 if (!rq->bio)
2373 blk_rq_bio_prep(q, rq, bio);
2374 else if (!ll_back_merge_fn(q, rq, bio)) {
2375 ret = -EINVAL;
2376 goto unmap_bio;
2377 } else {
2378 rq->biotail->bi_next = bio;
2379 rq->biotail = bio;
2380
2381 rq->data_len += bio->bi_size;
2382 }
2383
2384 return bio->bi_size;
2385
2386 unmap_bio:
2387 /* if it was boucned we must call the end io function */
2388 bio_endio(bio, bio->bi_size, 0);
2389 __blk_rq_unmap_user(orig_bio);
2390 bio_put(bio);
2391 return ret;
2392 }
But so far bio is only an airy pointer, all show and no substance. Who allocates memory for it? Let us dig deeper; the next function to watch is bio_map_user(). uaddr is the virtual address of ubuf; if it satisfies the DMA alignment requirement of the queue, bio_map_user() is called. (Otherwise bio_copy_user() is called to set up a so-called bounce buffer, which we will not go into.) The function comes from fs/bio.c:
713 /**
714 * bio_map_user - map user address into bio
715 * @q: the request_queue_t for the bio
716 * @bdev: destination block device
717 * @uaddr: start of user address
718 * @len: length in bytes
719 * @write_to_vm: bool indicating writing to pages or not
720 *
721 * Map the user space address into a bio suitable for io to a block
722 * device. Returns an error pointer in case of error.
723 */
724 struct bio *bio_map_user(request_queue_t *q, struct block_device *bdev,
725 unsigned long uaddr, unsigned int len, int write_to_vm)
726 {
727 struct sg_iovec iov;
728
729 iov.iov_base = (void __user *)uaddr;
730 iov.iov_len = len;
731
732 return bio_map_user_iov(q, bdev, &iov, 1, write_to_vm);
733 }
At this point struct sg_iovec should look familiar. Think back: we met it when discussing ioctl in sd, where it describes one element of a scatter-gather array. iovec means io vector, that is, an I/O vector: a structure made of nothing but a base address and a length.
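For reference, the structure itself is tiny; it lives in include/scsi/sg.h and is essentially nothing more than a base address plus a length:

        struct sg_iovec {
                void __user *iov_base;  /* starting user-space address */
                size_t iov_len;         /* length in bytes */
        };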
The comment explains each parameter clearly and also states the purpose of the function, so it is not hard to see that it returns a bio pointer describing one I/O operation. The real work, however, is done by bio_map_user_iov(), so on we go to bio_map_user_iov(), also from fs/bio.c:
735 /**
736 * bio_map_user_iov - map user sg_iovec table into bio
737 * @q: the request_queue_t for the bio
738 * @bdev: destination block device
739 * @iov: the iovec.
740 * @iov_count: number of elements in the iovec
741 * @write_to_vm: bool indicating writing to pages or not
742 *
743 * Map the user space address into a bio suitable for io to a block
744 * device. Returns an error pointer in case of error.
745 */
746 struct bio *bio_map_user_iov(request_queue_t *q, struct block_device *bdev,
747 struct sg_iovec *iov, int iov_count,
748 int write_to_vm)
749 {
750 struct bio *bio;
751
752 bio = __bio_map_user_iov(q, bdev, iov, iov_count, write_to_vm);
753
754 if (IS_ERR(bio))
755 return bio;
756
757 /*
758 * subtle -- if __bio_map_user() ended up bouncing a bio,
759 * it would normally disappear when its bi_end_io is run.
760 * however, we need it for the unmap, so grab an extra
761 * reference to it
762 */
763 bio_get(bio);
764
765 return bio;
766 }
This is still not the end of the road; we press on into __bio_map_user_iov().
603 static struct bio *__bio_map_user_iov(request_queue_t *q,
604 struct block_device *bdev,
605 struct sg_iovec *iov, int iov_count,
606 int write_to_vm)
607 {
608 int i, j;
609 int nr_pages = 0;
610 struct page **pages;
611 struct bio *bio;
612 int cur_page = 0;
613 int ret, offset;
614
615 for (i = 0; i < iov_count; i++) {
616 unsigned long uaddr = (unsigned long)iov[i].iov_base;
617 unsigned long len = iov[i].iov_len;
618 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
619 unsigned long start = uaddr >> PAGE_SHIFT;
620
621 nr_pages += end - start;
622 /*
623 * buffer must be aligned to at least hardsector size for now
624 */
625 if (uaddr & queue_dma_alignment(q))
626 return ERR_PTR(-EINVAL);
627 }
628
629 if (!nr_pages)
630 return ERR_PTR(-EINVAL);
631
632 bio = bio_alloc(GFP_KERNEL, nr_pages);
633 if (!bio)
634 return ERR_PTR(-ENOMEM);
635
636 ret = -ENOMEM;
637 pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
638 if (!pages)
639 goto out;
640
641 for (i = 0; i < iov_count; i++) {
642 unsigned long uaddr = (unsigned long)iov[i].iov_base;
643 unsigned long len = iov[i].iov_len;
644 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
645 unsigned long start = uaddr >> PAGE_SHIFT;
646 const int local_nr_pages = end - start;
647 const int page_limit = cur_page + local_nr_pages;
648
649 down_read(&current->mm->mmap_sem);
650 ret = get_user_pages(current, current->mm, uaddr,
651 local_nr_pages,
652 write_to_vm, 0, &pages[cur_page], NULL);
653 up_read(&current->mm->mmap_sem);
654
655 if (ret < local_nr_pages) {
656 ret = -EFAULT;
657 goto out_unmap;
658 }
659
660 offset = uaddr & ~PAGE_MASK;
661 for (j = cur_page; j < page_limit; j++) {
662 unsigned int bytes = PAGE_SIZE - offset;
663
664 if (len <= 0)
665 break;
666
667 if (bytes > len)
668 bytes = len;
669
670 /*
671 * sorry...
672 */
673 if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <
674 bytes)
675 break;
676
677 len -= bytes;
678 offset = 0;
679 }
680
681 cur_page = j;
682 /*
683 * release the pages we didn't map into the bio, if any
684 */
685 while (j < page_limit)
686 page_cache_release(pages[j++]);
687 }
688
689 kfree(pages);
690
691 /*
692 * set data direction, and check if mapped pages need bouncing
693 */
694 if (!write_to_vm)
695 bio->bi_rw |= (1 << BIO_RW);
696
697 bio->bi_bdev = bdev;
698 bio->bi_flags |= (1 << BIO_USER_MAPPED);
699 return bio;
700
701 out_unmap:
702 for (i = 0; i < nr_pages; i++) {
703 if(!pages[i])
704 break;
705 page_cache_release(pages[i]);
706 }
707 out:
708 kfree(pages);
709 bio_put(bio);
710 return ERR_PTR(ret);
711 }
Line 632, bio_alloc(). See that? This, very clearly, is where the memory is allocated; from this moment on bio stands on its own feet.
We could stop digging here, but 阿信 tells us that reading code is no fun unless you see it through to the end. So we keep going, into bio_alloc, from fs/bio.c:
187 struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
188 {
189 struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
190
191 if (bio)
192 bio->bi_destructor = bio_fs_destructor;
193
194 return bio;
195 }
All it really does is call bio_alloc_bioset(), from the same file:
147 /**
148 * bio_alloc_bioset - allocate a bio for I/O
149 * @gfp_mask: the GFP_ mask given to the slab allocator
150 * @nr_iovecs: number of iovecs to pre-allocate
151 * @bs: the bio_set to allocate from
152 *
153 * Description:
154 * bio_alloc_bioset will first try its own mempool to satisfy the allocation.
155 * If %__GFP_WAIT is set then we will block on the internal pool waiting
156 * for a &struct bio to become free.
157 *
158 * allocate bio and iovecs from the memory pools specified by the
159 * bio_set structure.
160 **/
161 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
162 {
163 struct bio *bio = mempool_alloc(bs->bio_pool, gfp_mask);
164
165 if (likely(bio)) {
166 struct bio_vec *bvl = NULL;
167
168 bio_init(bio);
169 if (likely(nr_iovecs)) {
170 unsigned long idx = 0; /* shut up gcc */
171
172 bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
173 if (unlikely(!bvl)) {
174 mempool_free(bio, bs->bio_pool);
175 bio = NULL;
176 goto out;
177 }
178 bio->bi_flags |= idx << BIO_POOL_OFFSET;
179 bio->bi_max_vecs = bvec_slabs[idx].nr_vecs;
180 }
181 bio->bi_io_vec = bvl;
182 }
183 out:
184 return bio;
185 }
By now it is basically clear what is going on. mempool_alloc tells us unmistakably that memory has been allocated for the bio, and right after that bio_init() initializes it. We will skip the remaining details; the one thing worth tracking is nr_iovecs, which has been passed all the way down: __bio_map_user_iov() hands nr_pages to bio_alloc(), and lines 615 to 627 compute nr_pages in a for loop that runs iov_count times, each iteration adding the difference between end and start. Clearly, the final nr_pages is the number of pages covered by the iov array, where iov is the third parameter of __bio_map_user_iov. Equally clearly, iov_count is the number of elements in the iov array, and since bio_map_user passes 1 for it when calling bio_map_user_iov, iov_count is simply 1. None of that matters much, though; what matters is that we now have a bio. We leave bio_alloc and return to __bio_map_user_iov to keep walking. Line 637 allocates another thing, pages, a double pointer; you can already sense that this is going to be an array of pointers.
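To make the page-counting arithmetic concrete, here is a small made-up example (assuming PAGE_SIZE is 4096, i.e. PAGE_SHIFT is 12):

        /*
         * Suppose iov_count == 1, iov_base == 0x08049f00, iov_len == 8192 (8 KB).
         *
         *   start = 0x08049f00 >> 12                   = 0x08049
         *   end   = (0x08049f00 + 8192 + 4095) >> 12   = 0x0804c
         *   nr_pages += end - start                    = 3
         *
         * The buffer holds only two pages worth of data, but because it begins
         * 0xf00 bytes into a page it straddles three pages, which is exactly
         * what the rounding in line 618 accounts for.
         */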
Right after the pages allocation comes another for loop, and get_user_pages is what obtains the page descriptors. This line is the soul of the whole affair: from this moment on, the user-space buffer and kernel space are joined in matrimony. Let us look at the data structures that make this happen.
The most important members of a bio are bi_io_vec and bi_vcnt. bi_io_vec is a pointer to struct bio_vec, which is defined in include/linux/bio.h:
54 /*
55 * was unsigned short, but we might as well be ready for > 64kB I/O pages
56 */
57 struct bio_vec {
58 struct page *bv_page;
59 unsigned int bv_len;
60 unsigned int bv_offset;
61 };
bi_io_vec actually represents an array of struct bio_vec, and bi_vcnt is the number of elements in that array. The bv_page member of each bio_vec points to one of the mapped pages, and the mapping is established precisely by that great get_user_pages() function we just saw: it is what ties these pages to the user-space buffer, while bio_add_pc_page() makes each bv_page point to its page. The reason the pages have to be tied to the user-space buffer at all is that the block layer only recognizes bios, not user buffers: its functions all operate on bios, they could not care less about user space, they mind their own bios, and all they know is that every request carries its bio.
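To see what this mapping buys the kernel side, here is a minimal, hypothetical sketch (inspect_bio is an invented function, not block-layer code) that walks a bio's vector table and peeks at each mapped user page through a temporary kernel mapping:

        #include <linux/kernel.h>
        #include <linux/bio.h>
        #include <linux/highmem.h>

        static void inspect_bio(struct bio *bio)
        {
                int i;

                for (i = 0; i < bio->bi_vcnt; i++) {
                        struct bio_vec *bvec = &bio->bi_io_vec[i];
                        /* bv_page is one of the pinned user pages */
                        char *kaddr = kmap(bvec->bv_page);

                        /* the useful data starts bv_offset bytes into the page
                         * and runs for bv_len bytes */
                        printk(KERN_DEBUG "vec %d: offset %u, len %u, first byte %02x\n",
                               i, bvec->bv_offset, bvec->bv_len,
                               (unsigned char)kaddr[bvec->bv_offset]);

                        kunmap(bvec->bv_page);
                }
        }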
As for get_user_pages, its prototype is in include/linux/mm.h:
795 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
796 int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
Here start and len together describe the user-space buffer (len is counted in pages, so len == 3 means three pages). The purpose of the function is to map this user-space buffer into kernel space; pages and vmas are its outputs. pages is a double pointer, in other words an array of pointers, holding a bunch of page pointers, and those pages are exactly where this user-space buffer lives. The return value is the number of pages actually mapped. As for vmas, we can ignore it: at least here we pass NULL, so it plays no role.
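As a stand-alone illustration of that prototype, here is a hypothetical helper (pin_user_buffer is an invented name) that pins an arbitrary user buffer the same way __bio_map_user_iov() does, using the 2.6-era signature quoted above:

        #include <linux/mm.h>
        #include <linux/sched.h>
        #include <linux/pagemap.h>
        #include <linux/slab.h>

        static int pin_user_buffer(unsigned long uaddr, unsigned long count,
                                   int write, struct page ***pagesp, int *nr_pinned)
        {
                unsigned long start = uaddr >> PAGE_SHIFT;
                unsigned long end = (uaddr + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
                int nr_pages = end - start;     /* get_user_pages counts in pages */
                struct page **pages;
                int ret;

                pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
                if (!pages)
                        return -ENOMEM;

                down_read(&current->mm->mmap_sem);
                ret = get_user_pages(current, current->mm, uaddr, nr_pages,
                                     write, 0, pages, NULL);
                up_read(&current->mm->mmap_sem);

                if (ret < nr_pages) {
                        /* drop whatever did get pinned, then bail out */
                        while (ret > 0)
                                page_cache_release(pages[--ret]);
                        kfree(pages);
                        return -EFAULT;
                }

                *pagesp = pages;
                *nr_pinned = nr_pages;
                return 0;
        }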
Let us gossip a little more about get_user_pages. Just as behind every successful man there stands a woman (or several), think of Zhang Bin, Zhao Zhongxiang or Li Jindou, behind every Linux process there stands a page table. When a process is created it sets up its own page tables in its address space. On x86 a page table has 1024 entries, each of which can describe one page, and whether that page actually sits in physical memory is another question entirely. We may as well think of those 1024 entries as 1024 pointers, each 32 bits wide; one of those bits is called the Present bit, and if it is 1 the page is present in physical memory, if it is 0 it is not.
So what does this have to do with get_user_pages? Its parameters start and len describe linear addresses. On x86 a linear address has 32 bits, split into three fields: bits 31-22 are the Directory, that is, the index into the Page Directory; bits 21-12 are the Table, the index into the Page Table; and bits 11-0 are the Offset. Given a virtual address, in other words a linear address, you have thereby been given its slot in the Page Directory and its slot in the Page Table, which is to say you have been given a Page. If that page is in physical memory, fine; but what if it is not? That is when get_user_pages() shows its true heroism: it allocates a Page Frame and sets up the page-table entry accordingly. From then on this stretch of virtual addresses has someone backing it up, namely a physical address, so the application can access it and the device driver can access it too. Only the device driver does not touch those addresses directly: as we said, the Block layer recognizes nothing but bios, not pages and not virtual addresses, which is why the next function, bio_add_pc_page(), exists, whose job is to hook the pages up to the bio.
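Before we move on to bio_add_pc_page, a small aside to make the 32-bit split above concrete; this is a made-up illustration of classic non-PAE x86 paging, not code from the kernel:

        /* classic 32-bit x86, 4 KB pages, no PAE */
        #define PGDIR_IDX(addr)   (((addr) >> 22) & 0x3ff)   /* bits 31-22 */
        #define PGTBL_IDX(addr)   (((addr) >> 12) & 0x3ff)   /* bits 21-12 */
        #define PAGE_OFF(addr)    ((addr) & 0xfff)           /* bits 11-0  */

        /*
         * Example: addr = 0x08049f00
         *   PGDIR_IDX(addr) = 0x020   -> entry 32 of the Page Directory
         *   PGTBL_IDX(addr) = 0x049   -> entry 73 of that Page Table
         *   PAGE_OFF(addr)  = 0xf00   -> byte 3840 inside the page
         */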
Let us look at bio_add_pc_page, which comes from fs/bio.c:
414 /**
415 * bio_add_pc_page - attempt to add page to bio
416 * @q: the target queue
417 * @bio: destination bio
418 * @page: page to add
419 * @len: vec entry length
420 * @offset: vec entry offset
421 *
422 * Attempt to add a page to the bio_vec maplist. This can fail for a
423 * number of reasons, such as the bio being full or target block
424 * device limitations. The target block device must allow bio's
425 * device limitations. The target block device must allow bio's
426 * page to an empty bio. This should only be used by REQ_PC bios.
427 */
428 int bio_add_pc_page(request_queue_t *q, struct bio *bio, struct page *page,
429 unsigned int len, unsigned int offset)
430 {
431 return __bio_add_page(q, bio, page, len, offset, q->max_hw_sectors);
432 }
__bio_add_page comes from the same file.
318 static int __bio_add_page(request_queue_t *q, struct bio *bio, struct page
319 *page, unsigned int len, unsigned int offset,
320 unsigned short max_sectors)
321 {
322 int retried_segments = 0;
323 struct bio_vec *bvec;
324
325 /*
326 * cloned bio must not modify vec list
327 */
328 if (unlikely(bio_flagged(bio, BIO_CLONED)))
329 return 0;
330
331 if (((bio->bi_size + len) >> 9) > max_sectors)
332 return 0;
333
334 /*
335 * For filesystems with a blocksize smaller than the pagesize
336 * we will often be called with the same page as last time and
337 * a consecutive offset. Optimize this special case.
338 */
339 if (bio->bi_vcnt > 0) {
340 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
341
342 if (page == prev->bv_page &&
343 offset == prev->bv_offset + prev->bv_len) {
344 prev->bv_len += len;
345 if (q->merge_bvec_fn &&
346 q->merge_bvec_fn(q, bio, prev) < len) {
347 prev->bv_len -= len;
348 return 0;
349 }
350
351 goto done;
352 }
353 }
354
355 if (bio->bi_vcnt >= bio->bi_max_vecs)
356 return 0;
357
358 /*
359 * we might lose a segment or two here, but rather that than
360 * make this too complex.
361 */
362
363 while (bio->bi_phys_segments >= q->max_phys_segments
364 || bio->bi_hw_segments >= q->max_hw_segments
365 || BIOVEC_VIRT_OVERSIZE(bio->bi_size)) {
366
367 if (retried_segments)
368 return 0;
369
370 retried_segments = 1;
371 blk_recount_segments(q, bio);
372 }
373
374 /*
375 * setup the new entry, we might clear it again later if we
376 * cannot add the page
377 */
378 bvec = &bio->bi_io_vec[bio->bi_vcnt];
379 bvec->bv_page = page;
380 bvec->bv_len = len;
381 bvec->bv_offset = offset;
382
383 /*
384 * if queue has other restrictions (eg varying max sector size
385 * depending on offset), it can specify a merge_bvec_fn in the
386 * queue to get further control
387 */
388 if (q->merge_bvec_fn) {
389 /*
390 * merge_bvec_fn() returns number of bytes it can accept
391 * at this offset
392 */
393 if (q->merge_bvec_fn(q, bio, bvec) < len) {
394 bvec->bv_page = NULL;
395 bvec->bv_len = 0;
396 bvec->bv_offset = 0;
397 return 0;
398 }
399 }
400
401 /* If we may be able to merge these biovecs, force a recount */
402 if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec) ||
403 BIOVEC_VIRT_MERGEABLE(bvec-1, bvec)))
404 bio->bi_flags &= ~(1 << BIO_SEG_VALID);
405
406 bio->bi_vcnt++;
407 bio->bi_phys_segments++;
408 bio->bi_hw_segments++;
409 done:
410 bio->bi_size += len;
411 return len;
412 }
A lot of things in the Block layer exist to serve RAID, such as this merge_bvec_fn function pointer: an ordinary hard-disk driver has no such wretched pointer, or rather the pointer points at thin air. The nice part is that without this function __bio_add_page becomes quite simple, which makes us happy. The most meaningful code in the function is lines 378 to 381, which fill in bvec, and lines 406 to 410, which update the bio. A friendly reminder about the assignment at line 410: bio->bi_size is simply the accumulation of len, and if you trace it carefully you will find that, after all the detours, this bio->bi_size adds up to exactly the length of the buffer handed down from user space. Back in __bio_map_user_iov(), the for loop from line 661 to line 679 is what adds every one of those pages into the bio's bi_io_vec table, so that each bv_page has something to point to.
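Putting bio_alloc() and bio_add_pc_page() together, here is a stripped-down, hypothetical sketch of what that loop does by hand, assuming the pages have already been pinned (build_bio_from_pages is an invented name and error handling is trimmed):

        #include <linux/bio.h>
        #include <linux/blkdev.h>
        #include <linux/err.h>

        static struct bio *build_bio_from_pages(request_queue_t *q,
                                                struct page **pages, int nr_pages,
                                                unsigned int offset, unsigned int len)
        {
                struct bio *bio;
                int i;

                bio = bio_alloc(GFP_KERNEL, nr_pages);
                if (!bio)
                        return ERR_PTR(-ENOMEM);

                for (i = 0; i < nr_pages && len; i++) {
                        unsigned int bytes = min_t(unsigned int, PAGE_SIZE - offset, len);

                        /* bio_add_pc_page() returns the number of bytes it accepted */
                        if (bio_add_pc_page(q, bio, pages[i], bytes, offset) < bytes)
                                break;

                        len -= bytes;
                        offset = 0;     /* only the first page starts at a non-zero offset */
                }

                return bio;
        }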
Then, at line 699, __bio_map_user_iov() returns, and what it returns is the bio. Right after that, bio_map_user_iov() and bio_map_user() return in turn, each handing back this same bio, and so we find ourselves back in __blk_rq_map_user().
But as we just saw, although we now have a bio, and the bio is on intimate terms with the pages and on intimate terms with the user buffer, is that enough? Obviously the bio still has to be tied to the request; a bio that has not joined a request is not a useful bio. A request and its bios are chained together like this:
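Here is a tiny hypothetical sketch of that relationship in code form (rq_total_bytes is an invented helper, shown only to make the chaining explicit):

        #include <linux/blkdev.h>

        static unsigned int rq_total_bytes(struct request *rq)
        {
                struct bio *bio;
                unsigned int total = 0;

                /* rq->bio is the head of the chain, rq->biotail the last element,
                 * and the bios are strung together through bi_next */
                for (bio = rq->bio; bio; bio = bio->bi_next)
                        total += bio->bi_size;

                return total;
        }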
The function that does this tying-in is blk_rq_bio_prep(), called at line 2373; it comes from block/ll_rw_blk.c:
3669 void blk_rq_bio_prep(request_queue_t *q, struct request *rq, struct bio *bio)
3670 {
3671 /* first two bits are identical in rq->cmd_flags and bio->bi_rw */
3672 rq->cmd_flags |= (bio->bi_rw & 3);
3673
3674 rq->nr_phys_segments = bio_phys_segments(q, bio);
3675 rq->nr_hw_segments = bio_hw_segments(q, bio);
3676 rq->current_nr_sectors = bio_cur_sectors(bio);
3677 rq->hard_cur_sectors = rq->current_nr_sectors;
3678 rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
3679 rq->buffer = bio_data(bio);
3680 rq->data_len = bio->bi_size;
3681
3682 rq->bio = rq->biotail = bio;
3683 }
With this, the bio is formally married into the rq.
Back in __blk_rq_map_user(), it is time to return as well: line 2384 returns bio->bi_size, which, as we just said, is the length of the user buffer passed down from user space.
And back in blk_rq_map_user(), the function is about to finish too; on success it returns 0. With that, this grand mapping project is complete. The netizen "贱男村村长" raised a question, though: when do these bios actually get used? They did not seem to come up when we talked about SCSI commands. Well, back when we talked about SCSI commands there was a function called scsi_setup_blk_pc_cmnd, and at line 1104 it checks whether req->bio is NULL; if it is not, the request is handled accordingly: a function called scsi_init_io() is invoked, which builds a scatter-gather array to match the bi_io_vec vector inside this bio.
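To close the loop, here is a rough, hypothetical sketch of what "building a scatter-gather array to match bi_io_vec" amounts to, assuming the pre-2.6.24 struct scatterlist layout with page/offset/length fields; the real work is done by scsi_init_io() via the block layer's blk_rq_map_sg(), which also merges adjacent segments, something this toy version does not do:

        #include <linux/bio.h>
        #include <linux/scatterlist.h>

        static int fill_sg_from_bio(struct bio *bio, struct scatterlist *sg, int max_segs)
        {
                struct bio_vec *bvec;
                int i, nsegs = 0;

                /* one scatterlist entry per bio_vec, no merging */
                bio_for_each_segment(bvec, bio, i) {
                        if (nsegs >= max_segs)
                                return -EINVAL;
                        sg[nsegs].page = bvec->bv_page;
                        sg[nsegs].offset = bvec->bv_offset;
                        sg[nsegs].length = bvec->bv_len;
                        nsegs++;
                }

                return nsegs;
        }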