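It has been a long time since I last handled a production issue. Yesterday I dealt with a KVM virtual machine live migration problem during a platform server cutover, so here is a record of it.
Troubleshooting process:
1. Live migration of the VM failed, so I checked the OpenStack logs: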
Warning: option deprecated, use lost_tick_policy property of kvm-pit instead.
char device redirected to /dev/pts/17 (label charserial1)
Unknown savevm section or instance '0000:06.0/virtio-blk' 0
load of migration failed
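2. On the destination host, I checked the per-VM log that libvirt keeps in its log directory. The destination QEMU had read virtio-blk device state that it could not recognize.
3. Because the migration failed, the VM was still running on the source host, so I queried its device information by sending QEMU monitor commands through libvirt.
In the block device records, the virtio-disk1 device was no longer present.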
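The check in step 3 boils down to comparing two views of the same VM: the device IDs QEMU still exposes on the PCI side and the IDs still known to the block layer. The following is a minimal sketch of that comparison, not a tool used in the original investigation. The function name `find_orphan_frontends` and the idea of feeding it two already extracted, already normalized ID lists are assumptions made for illustration; how the lists are obtained from the monitor output is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report every ID that appears in the PCI-side list
 * but has no counterpart in the block-layer list. Such an entry is a
 * frontend whose backend has already been removed.
 *
 * Returns the number of orphans found; the first max_orphans of them are
 * stored in orphans[].
 */
size_t find_orphan_frontends(const char *const *pci_ids, size_t n_pci,
                             const char *const *block_ids, size_t n_block,
                             const char **orphans, size_t max_orphans)
{
    size_t count = 0;

    assert(pci_ids != NULL || n_pci == 0);
    assert(block_ids != NULL || n_block == 0);

    for (size_t i = 0; i < n_pci; i++) {
        int found = 0;

        for (size_t j = 0; j < n_block; j++) {
            if (strcmp(pci_ids[i], block_ids[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            /* Keep counting even when the output array is full. */
            if (count < max_orphans) {
                orphans[count] = pci_ids[i];
            }
            count++;
        }
    }
    return count;
}
```

Called with a PCI-side list containing virtio-disk0 and virtio-disk1 and a block-layer list containing only virtio-disk0, it reports virtio-disk1 as the single orphan, which is the situation found on the source host.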
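Problem analysis:
At this point the problem is clear: at the libvirt level and in QEMU's block-layer records, the virtio-disk1 device has already been removed, but it still exists in QEMU's PCI records. That would explain the migration failure: the source still sends state for a virtio-blk device that the destination QEMU, started without that disk, cannot match.
Typically, this situation arises here: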
blockdev.c

int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
{
    const char *id = qdict_get_str(qdict, "id");
    BlockDriverState *bs;

    bs = bdrv_find(id);
    if (!bs) {
        qerror_report(QERR_DEVICE_NOT_FOUND, id);
        return -1;
    }
    if (bdrv_in_use(bs)) {
        qerror_report(QERR_DEVICE_IN_USE, id);
        return -1;
    }

    /* quiesce block driver; prevent further io */
    bdrv_drain_all();
    bdrv_flush(bs);
    bdrv_close(bs);

    /* if we have a device attached to this BlockDriverState
     * then we need to make the drive anonymous until the device
     * can be removed. If this is a drive with no device backing
     * then we can just get rid of the block driver state right here.
     */
    if (bdrv_get_attached_dev(bs)) {
        bdrv_make_anon(bs);

        /* Further I/O must not pause the guest */
        bdrv_set_on_error(bs, BLOCKDEV_ON_ERROR_REPORT,
                          BLOCKDEV_ON_ERROR_REPORT);
    } else {
        drive_uninit(drive_get_by_blockdev(bs));
    }

    return 0;
}
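In my view, the cause is that libvirt removes the device's frontend and backend resources back to back. It should instead wait asynchronously for the callback signalling that the frontend device has been removed, and only then release the backend resources (I will write up an analysis of this separately later). Below is the flow in which QEMU handles the eject request from the guest OS once the frontend device has been unplugged inside the guest; as the backtrace shows, the request arrives as an I/O write that ends up in acpi_pcihp_eject_slot: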
#0 release_drive (obj=0x7ffffffd41d8, name=0x7ffff8bcb5b0 "drive", opaque=0x7ffff82cc3e0) at hw/core/qdev-properties-system.c:85
#1 0x00007ffff7df8828 in object_property_del_all (obj=0x7ffffffd41d8) at qom/object.c:367
#2 0x00007ffff7df8a96 in object_finalize (data=0x7ffffffd41d8) at qom/object.c:422
#3 0x00007ffff7df95f1 in object_unref (obj=0x7ffffffd41d8) at qom/object.c:729
#4 0x00007ffff7df89ac in object_unparent (obj=0x7ffffffd41d8) at qom/object.c:402
#5 0x00007ffff7ce84d1 in bus_unparent (obj=0x7ffffffd4160) at hw/core/qdev.c:548
#6 0x00007ffff7df896b in object_unparent (obj=0x7ffffffd4160) at qom/object.c:396
#7 0x00007ffff7ce9c1d in device_unparent (obj=0x7ffffffd3800) at hw/core/qdev.c:1010
#8 0x00007ffff7df896b in object_unparent (obj=0x7ffffffd3800) at qom/object.c:396
#9 0x00007ffff7cb0190 in acpi_pcihp_eject_slot (s=0x7ffff8bf7e18, bsel=0, slots=32) at hw/acpi/pcihp.c:139
#10 0x00007ffff7cb087f in pci_write (opaque=0x7ffff8bf7e18, addr=8, data=32, size=4) at hw/acpi/pcihp.c:277
#11 0x00007ffff7b4115c in memory_region_write_accessor (mr=0x7ffff8bf8a28, addr=8, value=0x7fffe98969f8, size=4, shift=0, mask=4294967295) at /usr/local/src/qemu-2.1.2/memory.c:444
#12 0x00007ffff7b412a9 in access_with_adjusted_size (addr=8, value=0x7fffe98969f8, size=4, access_size_min=1, access_size_max=4, access=0x7ffff7b410ba <memory_region_write_accessor>, mr=0x7ffff8bf8a28) at /usr/local/src/qemu-2.1.2/memory.c:481
#13 0x00007ffff7b444d7 in memory_region_dispatch_write (mr=0x7ffff8bf8a28, addr=8, data=32, size=4) at /usr/local/src/qemu-2.1.2/memory.c:1138
#14 0x00007ffff7b48020 in io_mem_write (mr=0x7ffff8bf8a28, addr=8, val=32, size=4) at /usr/local/src/qemu-2.1.2/memory.c:1976
#15 0x00007ffff7aef749 in address_space_rw (as=0x7ffff833dc00, addr=44552, buf=0x7ffff7a3e000 " ", len=4, is_write=true) at /usr/local/src/qemu-2.1.2/exec.c:2077
#16 0x00007ffff7b3d5e5 in kvm_handle_io (port=44552, data=0x7ffff7a3e000, direction=1, size=4, count=1) at /usr/local/src/qemu-2.1.2/kvm-all.c:1597
#17 0x00007ffff7b3db69 in kvm_cpu_exec (cpu=0x7ffff8b08490) at /usr/local/src/qemu-2.1.2/kvm-all.c:1734
#18 0x00007ffff7b234bc in qemu_kvm_cpu_thread_fn (arg=0x7ffff8b08490) at /usr/local/src/qemu-2.1.2/cpus.c:874
#19 0x00007ffff0d2b9d1 in start_thread () from /lib64/libpthread.so.0
#20 0x00007ffff0a78b5d in clone () from /lib64/libc.so.6
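If the disk file is pulled out directly, without waiting for this eject to arrive, the data inconsistency described above is the result. The guest had most likely already run into errors; the customer simply had not reported anything.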
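The ordering argued for here can be illustrated with a toy model. This is only a sketch under stated assumptions: `Disk`, `request_unplug`, `guest_eject`, `release_backend` and `is_inconsistent` are hypothetical names invented for this example and are not libvirt or QEMU APIs. The point it shows is that the backend should be released from the completion callback of the frontend removal, not immediately after the unplug request is sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Disk Disk;
typedef void (*UnplugDone)(Disk *disk);

/* Toy model of one hot-pluggable disk (hypothetical, for illustration). */
struct Disk {
    bool frontend_present;   /* guest-visible PCI device still exists */
    bool backend_open;       /* host-side image file is still open */
    UnplugDone on_unplugged; /* runs once the guest has ejected the device */
};

/* Step 1: ask the guest to let go of the device. Nothing is torn down yet. */
void request_unplug(Disk *disk, UnplugDone done)
{
    assert(disk != NULL);
    disk->on_unplugged = done;
}

/* Step 2: the guest performs the eject. Only now does the frontend
 * disappear, and only now is the completion callback invoked. */
void guest_eject(Disk *disk)
{
    assert(disk != NULL);
    disk->frontend_present = false;
    if (disk->on_unplugged != NULL) {
        UnplugDone done = disk->on_unplugged;
        disk->on_unplugged = NULL;
        done(disk);
    }
}

/* Step 3: release the backend. Safe only after the frontend is gone. */
void release_backend(Disk *disk)
{
    assert(disk != NULL);
    disk->backend_open = false;
}

/* The broken state from this incident: frontend alive, backend gone. */
bool is_inconsistent(const Disk *disk)
{
    assert(disk != NULL);
    return disk->frontend_present && !disk->backend_open;
}
```

Passing `release_backend` as the callback to `request_unplug` keeps the disk consistent at every step, because the backend is closed only when `guest_eject` runs. Calling `release_backend` right after `request_unplug`, before the guest has ejected the device, leaves `is_inconsistent` returning true, which mirrors the state seen on the source host.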