
Category: Cloud Computing

2014-12-01 10:29:59

Symptom:

In a freshly deployed environment (openstack + qemu + glusterfs), starting a virtual machine immediately hits an I/O error.


Analysis

The qemu and libvirt logs show no useful errors.

The glusterfs brick log, however, contains the following:


[2014-11-28 09:03:57.156373] E [posix.c:2135:posix_writev] 0-test-posix: write failed: offset 0, Invalid argument
[2014-11-28 09:03:57.156421] I [server-rpc-fops.c:1439:server_writev_cbk] 0-test-server: 21: WRITEV 0 (dd0085c9-9844-44c7-9d39-3b9ec0ca65b1) ==> (Invalid argument)
[2014-11-28 09:04:34.098004] E [posix.c:2135:posix_writev] 0-test-posix: write failed: offset 0, Invalid argument
[2014-11-28 09:04:34.098046] I [server-rpc-fops.c:1439:server_writev_cbk] 0-test-server: 30: WRITEV 0 (dd0085c9-9844-44c7-9d39-3b9ec0ca65b1) ==> (Invalid argument)
As we know, for data safety and live-migration correctness, openstack/qemu defaults the VM disk cache policy to "cache=none", which means the disk image file is opened with O_DIRECT. The relevant code:


qemu/block.c

int bdrv_parse_cache_flags(const char *mode, int *flags)
{
    *flags &= ~BDRV_O_CACHE_MASK;

    if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
        *flags |= BDRV_O_NOCACHE | BDRV_O_CACHE_WB;
    } else if (!strcmp(mode, "directsync")) {
        *flags |= BDRV_O_NOCACHE;
    } else if (!strcmp(mode, "writeback")) {
        *flags |= BDRV_O_CACHE_WB;
    } else if (!strcmp(mode, "unsafe")) {
        *flags |= BDRV_O_CACHE_WB;
        *flags |= BDRV_O_NO_FLUSH;
    } else if (!strcmp(mode, "writethrough")) {
        /* this is the default */
    } else {
        return -1;
    }

    return 0;
}


qemu/block/raw-posix.c

static void raw_parse_flags(int bdrv_flags, int *open_flags)
{
    assert(open_flags != NULL);

    *open_flags |= O_BINARY;
    *open_flags &= ~O_ACCMODE;
    if (bdrv_flags & BDRV_O_RDWR) {
        *open_flags |= O_RDWR;
    } else {
        *open_flags |= O_RDONLY;
    }

    /* Use O_DSYNC for write-through caching, no flags for write-back caching,
     * and O_DIRECT for no caching. */
    if ((bdrv_flags & BDRV_O_NOCACHE)) {
        *open_flags |= O_DIRECT;
    }
}
A small test program to simulate this write:


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        int fd, ret;
        int flags = 0;
        char buf[512];

        memset(buf, 'x', 512);
        flags = O_CREAT | O_RDWR | O_DIRECT;
        printf("open flags %d\n", flags);
        fd = open("/datas/local//testfile", flags, 0644);
        if (fd < 0) {
                perror("open");
                exit(-1);
        }
        /* O_DIRECT normally also requires the user buffer to be aligned
         * to the sector size; the commented-out mmap below yields a
         * page-aligned buffer for that case. */
        //void *ptr = mmap(0, 512, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        //memcpy(ptr, buf, 512);
        ret = write(fd, buf, 512);
        printf("write ret = %d\n", ret);
        ret = close(fd);
        printf("close ret = %d\n", ret);

        exit(0);
}
The test program fails with the same I/O error. Turning to glusterfs's handling of O_DIRECT, one option stands out: "network.remote-dio".


glusterfs doc (from the commit that added the virt profile):

group-virt: Change profile to include remote-dio and exclude posix-aio.

remote-dio enables filtering O_DIRECT in the client xlator. This has been
found to be useful for improving performance when there are multiple VMs
talking to an image store.

Aggregated throughput results for a single thread iozone run from multiple VMs
and a single host can be seen below:

-------------------------------------------------
 No. of VMs | remote-dio on | remote-dio off |
-------------------------------------------------
     2      |   400 MB/s   |    202 MB/s    |
     4      |   650 MB/s   |    410 MB/s    |
-------------------------------------------------

posix-aio has not been found to improve performance consistently with VM image
workload. Hence not including that in the default virt profile.

Change-Id: I592f68b95a955036f1a985352d2f4950ced1deef
BUG: 907301
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Reviewed-on: http://review.gluster.org/4460
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
network.remote-dio defaults to disabled. After running "gluster volume set test-vol network.remote-dio on", both the test program and the virtual machines work normally. Let's see what this option actually does:


glusterd-volume-set.c

        { .key = "network.remote-dio",
          .voltype = "protocol/client",
          .option = "filter-O_DIRECT",
          .op_version = 2,
          .flags = OPT_FLAG_CLIENT_OPT
        },


client.c

        { .key = {"filter-O_DIRECT"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "disable",
          .description = "If enabled, in open() and creat() calls, O_DIRECT "
          "flag will be filtered at the client protocol level so server will "
          "still continue to cache the file. This works similar to NFS's "
          "behavior of O_DIRECT",
        },
        { .key = {NULL} },


int32_t
client_open (call_frame_t *frame, xlator_t *this, loc_t *loc,
             int32_t flags, fd_t *fd, dict_t *xdata)
{
        int ret = -1;
        clnt_conf_t *conf = NULL;
        rpc_clnt_procedure_t *proc = NULL;
        clnt_args_t args = {0,};

        conf = this->private;
        if (!conf || !conf->fops)
                goto out;

        args.loc = loc;
        args.fd = fd;
        args.xdata = xdata;

        if (!conf->filter_o_direct)
                args.flags = flags;
        else
                args.flags = (flags & ~O_DIRECT);


io-cache.c

                /* If O_DIRECT open, we disable caching on it */
                if ((local->flags & O_DIRECT)){
                        /* O_DIRECT is only for one fd, not the inode
                         * as a whole
                         */
                        fd_ctx_set (fd, this, 1);
                }


posix.c

int32_t
posix_open (call_frame_t *frame, xlator_t *this,
            loc_t *loc, int32_t flags, fd_t *fd, dict_t *xdata)
{
        int32_t op_ret = -1;
        int32_t op_errno = 0;
        char *real_path = NULL;
        int32_t _fd = -1;
        struct posix_fd *pfd = NULL;
        struct posix_private *priv = NULL;
        struct iatt stbuf = {0, };

        DECLARE_OLD_FS_ID_VAR;

        VALIDATE_OR_GOTO (frame, out);
        VALIDATE_OR_GOTO (this, out);
        VALIDATE_OR_GOTO (this->private, out);
        VALIDATE_OR_GOTO (loc, out);
        VALIDATE_OR_GOTO (fd, out);

        priv = this->private;
        VALIDATE_OR_GOTO (priv, out);

        MAKE_INODE_HANDLE (real_path, this, loc, &stbuf);

        op_ret = -1;
        SET_FS_ID (frame->root->uid, frame->root->gid);

        if (priv->o_direct)
                flags |= O_DIRECT;

        _fd = open (real_path, flags, 0);
In short, when network.remote-dio is enabled, the client xlator handles O_DIRECT itself (it still takes effect in io-cache and read-ahead) and does not pass the flag down to the filesystem backing the brick.

There is a related thread on the qemu-devel list: https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg00446.html


Q: What is the effect of O_DIRECT on the client exactly?
A: To avoid caching in the io-cache module, disable read-ahead etc (if those translators are loaded). The behavior in write-behind is tunable. You could either disable write-behind entirely (which will happen once libgfapi supports 0-copy/RDMA) or perform a sliding-window like size-limited write-behind (defaults to 1MB).
We can now infer that the filesystem backing the glusterfs bricks does not support these O_DIRECT writes, and running the test program there confirms it. Upon asking around, it turned out this deployment had switched the bricks to XFS, and a known bug report about XFS's O_DIRECT support matches the symptom.

The problem is now clear: the XFS bricks were created with a 4KB sector size (the mkfs.xfs default on 4K-sector devices), and O_DIRECT requests must be aligned to the sector size, so qemu's 512-byte direct writes fail with EINVAL. When a glusterfs brick sits on XFS, watch how the upper layers use O_DIRECT.


Solutions

Pick any one of the following.

1. Set the disk cache mode explicitly in nova.conf:


[libvirt]
disk_cachemodes = "file=writethrough"

2. Enable network.remote-dio on the glusterfs volume:


gluster volume set volume-name network.remote-dio on

3. Specify a 512-byte sector size when creating the XFS filesystem:


e.g.:
mkfs.xfs -f -s 512 /dev/mapper/image-glance


