The client's /etc/fstab entry lists both SFS server NIDs, but only one SFS server actually has the OST mounted.
When the client tries to access the OST through the node that does not have the OST mounted, it will log these errors/messages.
These messages simply mean a client tried to access an OST (LUN) that is not mounted on that server.
Most of the time the OST (LUN) is mounted on the other SFS (heartbeat) node.
These messages are normal when the SFS servers are configured for high availability (e.g. with heartbeat).
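For reference, such a client mount entry lists both server NIDs colon-separated, so the client can fail over between the two nodes. A minimal sketch, assuming hypothetical NIDs and mount point (the fsname ggfs is the one used elsewhere in these notes):

```
# /etc/fstab — hypothetical failover pair; replace NIDs and paths with real ones
20.3.100.1@tcp:20.3.100.2@tcp:/ggfs  /mnt/ggfs  lustre  defaults,_netdev  0 0
```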
3. RPC Debug messages
req@ffff81034d168850 x1448410413532362/t0(0) o101->34a4f130-6594-ba08-18a5-15126d38e40b@20.11.0.1@tcp:0/0 lens 552/2096 e 2 to 0 dl 1381608355 ref 2 fl Interpret:/0/0 rc 0/0
1) req@: emitted by the DEBUG_REQ macro from the ptlrpc_request structure used for the RPC
2) ffff81034d168850: the memory address of that request structure, denoted by req@
3) x1448410413532362/t0(0): XID and Transaction Number(transno)
4) o101: opcode; o400 is the obd_ping request and o101 is the LDLM enqueue request
5) 34a4f130-6594-ba08-18a5-15126d38e40b@20.11.0.1@tcp:0/0: the export or import target UUID, its NID, and the portals for the request and reply buffers
6) lens: the request and reply buffer lengths
7) e: the number of early replies sent under adaptive timeouts
8) to: timeout; logically zero or one depending on whether the request timed out
9) dl: deadline time
10) ref: reference count
11) fl: flags; indicates whether the request was resent, interrupted, complete, high priority, etc.
12) rc: the request/reply flags and the request/reply status. The status is typically an errno, but higher numbers refer to Lustre-specific uses.
*) The transno, opcode, and reply status are the most useful entries to parse while examining the logs.
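Since the transno, opcode, and status are the most useful fields, a quick sed one-liner can pull them out of a captured line. This is a hypothetical helper, assuming the exact field layout of the sample message above:

```shell
# Sample DEBUG_REQ line from the logs above
line='req@ffff81034d168850 x1448410413532362/t0(0) o101->34a4f130-6594-ba08-18a5-15126d38e40b@20.11.0.1@tcp:0/0 lens 552/2096 e 2 to 0 dl 1381608355 ref 2 fl Interpret:/0/0 rc 0/0'
# Extract the XID, transno, opcode, and rc (request/reply status) fields
parsed=$(printf '%s\n' "$line" |
  sed -n 's/.* x\([0-9]*\)\/t\([0-9]*\)([0-9-]*) \(o[0-9]*\).* rc \([0-9-]*\/[0-9-]*\).*/xid=\1 transno=\2 opcode=\3 rc=\4/p')
printf '%s\n' "$parsed"   # xid=1448410413532362 transno=0 opcode=o101 rc=0/0
```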
4. LDLM Debug messages
The LDLM_ERROR macro is used whenever a server evicts a client, so it is quite common.
The macro uses the eye catcher '###', which makes these messages easy to find.
(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 20.3.21.19@tcp ns: mdt-ffff8104927e0000 lock: ffff81032b2f0240/0xbe90459d960b0a8a lrc: 3/0,0 mode: PR/PR res: 8589947646/1303 bits 0x3 rrc: 352 type: IBT flags: 0x4000020 remote: 0xe9142e8925e85238 expref: 26 pid: 27489 timeout: 8287289131
1) ns: the namespace, which is essentially the lock domain for the storage target.
2) mode: granted and requested lock modes. The modes are exclusive (EX), protective write (PW), protective read (PR), concurrent write (CW), concurrent read (CR), and null (NL).
3) res: inode and generation numbers for the resource on the ldiskfs backing store.
4) type: lock type: extent (EXT), inode bits (IBT), or flock (FLK)
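Because every eviction message carries the '###' eye catcher, grepping for it and extracting the client NID gives a quick per-client eviction count. A sketch; the log excerpt is written to a temp file here, but against a live system you would grep /var/log/messages instead:

```shell
# Write the sample eviction message to a stand-in for /var/log/messages
cat > /tmp/messages.sample <<'EOF'
(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 20.3.21.19@tcp ns: mdt-ffff8104927e0000 lock: ffff81032b2f0240/0xbe90459d960b0a8a lrc: 3/0,0 mode: PR/PR res: 8589947646/1303 bits 0x3 rrc: 352 type: IBT flags: 0x4000020 remote: 0xe9142e8925e85238 expref: 26 pid: 27489 timeout: 8287289131
EOF
# Count evictions per client NID
counts=$(grep -F '###' /tmp/messages.sample |
  sed -n 's/.*evicting client at \([^ ]*\).*/\1/p' | sort | uniq -c)
echo "$counts"
```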
5. mdt_handler.c:913:mdt_getattr_name_lock() Parent doesn't exist!
lustre/lustre/mdt/mdt_handler.c
This is a normal unlink/delete race; the debug level of this message should be lowered.
6. Readonly (disk errors)
1) Before running fsck, first deactivate this OST on the MDS node:
lctl --device ggfs-OST0014-osc-MDT0000 deactivate
Then unmount the corresponding mount point.
2) Check the device with e2fsck from the Lustre tools, in two steps. First run e2fsck -fn /dev/sdc (read-only check). If it reports only a few errors, run e2fsck -fp /dev/sdc; if it reports many errors,
back up the data on the raw device first, then run e2fsck -fp /dev/sdc.
3) After the backup and repair, reboot the machine and reactivate the corresponding device on the MDS (a small number of errors can be resolved by simply rebooting the machine):
lctl --device ggfs-OST0014-osc-MDT0000 activate
7. Lustre: Service thread pid 8823 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later
The server thread was blocked and the watchdog timer expired. Possible causes include waiting on some resource, a deadlock, or blocked RPC communication.
8. Checking which OSTs each I/O node has mounted
gg2425:~ # lctl get_param osc.*.ost_conn_uuid
osc.ggfs-OST0000-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.16@tcp
osc.ggfs-OST0001-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.16@tcp
osc.ggfs-OST0002-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.16@tcp
osc.ggfs-OST0003-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.16@tcp
osc.ggfs-OST0004-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
osc.ggfs-OST0005-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
osc.ggfs-OST0006-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
osc.ggfs-OST0007-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
osc.ggfs-OST0008-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.18@tcp
osc.ggfs-OST0009-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.18@tcp
osc.ggfs-OST000a-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.18@tcp
osc.ggfs-OST000b-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.4@tcp
osc.ggfs-OST000c-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.4@tcp
osc.ggfs-OST000d-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.19@tcp
osc.ggfs-OST000e-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.20@tcp
osc.ggfs-OST000f-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.20@tcp
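To summarize how many OSTs each OSS serves, the listing can be tallied with awk. A sketch; the input here is a trimmed heredoc copy of the output above, but on a real client you would pipe `lctl get_param osc.*.ost_conn_uuid` straight in:

```shell
# Tally OST count per server NID from ost_conn_uuid lines
tally=$(awk -F= '{n[$2]++} END {for (nid in n) print n[nid], nid}' <<'EOF' | sort
osc.ggfs-OST0000-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.16@tcp
osc.ggfs-OST0004-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
osc.ggfs-OST0005-osc-ffff8806262d3400.ost_conn_uuid=20.3.100.17@tcp
EOF
)
echo "$tally"
```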