Ceph userspace code: C++ and Java
metadata: file names and inodes
anchor table: in the MDS, allows an inode to be located in the directory hierarchy by inode number;
used to help implement features such as hard links
failure recovery: replay -> resolve -> reconnect -> rejoin
client/
MetaRequest.h
struct MetaRequest, a metadata operation
struct ceph_mds_request_head head, include/ceph_fs.h,
head.op, CEPH_MDS_OP_*, standard UNIX file I/O operations such as open/create,
src/include/ceph_fs.h
Client.h
class Client,
send_request(),
messenger->send_message() performs the final transmission of the msg
Client::choose_target_mds(MetaRequest *req), picks the MDS for req:
from the inode / dentry inode / dir inode carried by req,
get the cap from the inode; with no cap, pick a random MDS.
cap means capability, i.e. whether the operation is permitted; see include/ceph_fs.h, CEPH_CAP_*
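The selection policy above can be sketched as follows; the `Cap`/`Inode` structs and the `choose_target_mds` signature are simplified stand-ins of my own, not the real client types:

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>

// Hypothetical types standing in for Ceph's Inode/Cap; not the real
// client structures.
struct Cap   { int mds; };                 // capability issued by one MDS
struct Inode { std::optional<Cap> cap; };  // cap held for this inode, if any

// Minimal sketch of the policy: route to the MDS that issued the cap,
// otherwise fall back to a random MDS.
int choose_target_mds(const Inode *in, int num_mds) {
    if (in && in->cap)
        return in->cap->mds;        // this MDS already knows the inode
    return std::rand() % num_mds;   // no cap: any MDS can handle it
}
```

Routing to the cap-issuing MDS keeps related metadata traffic on the server that already holds the state; the random fallback only matters for cold inodes.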
crush/
CrushWrapper.h,
class CrushWrapper, the overall wrapper around the CRUSH algorithm
include/
libcephfs.h
UNIX file I/O style interface, ceph_*(struct ceph_mount_info *cmount, ...),
e.g. mount, create
libcephfs is packaged as a library and called from Java via JNI; see java/native/libcephfs_jni.cc
./libcephfs.cc, the implementation,
get_cwd(), obtained by having the client build a MetaRequest(CEPH_MDS_OP_LOOKUPPARENT)
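The idea behind get_cwd() can be sketched with a toy in-memory tree standing in for the MDS round-trips; each LOOKUPPARENT reply would supply the (parent, name) pair for one level. The `Dir` struct and names are illustrative, not the real client code:

```cpp
#include <cassert>
#include <string>

// Toy stand-in for one level of the directory tree; a real
// LOOKUPPARENT request would fetch (parent, name) from the MDS.
struct Dir { const Dir *parent; std::string name; };

// Walk parents up to the root, prepending each dentry name.
std::string get_cwd(const Dir *d) {
    std::string path;
    for (; d && d->parent; d = d->parent)  // the root has no parent
        path = "/" + d->name + path;
    return path.empty() ? "/" : path;
}
```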
include/rbd/
librbd.h, rbd – manage rados block device (RBD) images
rbd is a utility for manipulating rados block device (RBD) images, used by the Linux rbd driver and the rbd storage driver for Qemu/KVM. RBD images are simple block devices that are striped over objects and stored in a RADOS object store. The size of the objects the image is striped over must be a power of two.
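The power-of-two constraint quoted above can be checked and exploited with simple bit tricks; this is an illustrative sketch, not librbd's actual code:

```cpp
#include <cassert>
#include <cstdint>

// A power of two has exactly one bit set, so x & (x - 1) clears it.
bool is_power_of_two(uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// With a power-of-two object size (order = log2(size)), finding the
// object that holds a given image offset is a plain shift, which is
// the point of the constraint.
uint64_t object_for_offset(uint64_t off, unsigned order) {
    return off >> order;
}
```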
librbd/
snap is implemented at the rbd level via ImageCtx; ImageCtx handles the layout across multiple OSDs
internal.cc,
create(), librbd
SnapInfo.h,
class SnapInfo
librados/ Ceph's internal communication protocol
per the paper, a RADOS cluster means a large set of OSDs plus a small number of managing monitors
block dev | obj storage |  ceph fs
 librbd   |   librgw    | libcephfs
   ceph cluster protocol (librados)
      osd | mds | monitor
mds/ uses the hierarchy tree for load balancing, distributing metadata accesses across different MDS servers
MDS.h, the core class MDS, which references the other MDS classes
MDCache.h, maintains the MDS cache, LRU-based
struct discovery_info_t ???
implements the directory hierarchy, splitting and merging of dirs, etc.
create_empty_hierarchy(C_Gather *)
create_mydir_hierarchy(C_Gather *)
CDentry.h, CDir.h, CInode.h, C = Cache
MDCache::send_dir_updates() =>
mds->send_message_mds(MDirUpdate)
class MDirUpdate : public Message, messages/MDirUpdate.h
handle_cache_expire =>
dir->remove_replica()
inode/dentry_remove_replica()
MDSMap.h,
struct mds_info_t, the info record
CEPH_MDS_STATE_*, include/ceph_fs.h, MDS state transitions
// see the detailed comments in the class
in: the cluster the mds belongs to
up: the map between mds and cluster
failed, stopped
map mds_info, maps mds_gid to mds_info
Server.h, the MDS server itself
snap.h,
struct SnapInfo, generic snap descriptor
SnapServer.h
class SnapServer, map, pending
messages/
*.h, the various subclasses of class Message
include/ceph_fs.h, CEPH_MSG_*
mon/ monitor
msg/
Messenger.h, class Messenger, used by both MDS and OSD; the message-passing mechanism, base class
SimpleMessenger.h, class SimpleMessenger : public Messenger,
sends and receives msgs using Pipes and queues
* Lock ordering:
*
* SimpleMessenger::lock
* Pipe::pipe_lock
* DispatchQueue::lock
* IncomingQueue::lock
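Why a fixed lock order like the one above matters: if every thread acquires the mutexes in the same order, no cycle of waiters can form, so the classic deadlock of two threads locking the same pair in opposite orders is impossible. A minimal sketch, with mutex names standing in for the classes listed above:

```cpp
#include <mutex>

// Stand-ins for SimpleMessenger::lock, Pipe::pipe_lock and
// DispatchQueue::lock; the point is only the acquisition order.
std::mutex messenger_lock, pipe_lock, queue_lock;
int delivered = 0;

void deliver() {
    // Always outermost-to-innermost; std::scoped_lock acquires all
    // three deadlock-free, and taking them one by one is equally safe
    // as long as the order never varies between threads.
    std::scoped_lock all(messenger_lock, pipe_lock, queue_lock);
    ++delivered;  // ... move a message from a Pipe to the queue ...
}
```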
Filer
osd/
object storage
SNIA, Object-based_Storage-OSD.pdf
data is distributed across the OSDs via the CRUSH algorithm
osd_types.h,
class ObjectExtent, base class, {oid, no, offset, len}
OSD.h,
class OSD,
OSDMap.h,
class OSDMap, maps PGs to the OSDs
OSDMap::pg_to_osds() =>
OSDMap::_pg_to_osds() =>
crush->do_rule() =>
do_rule(), crush/CrushWrapper.h =>
crush/mapper.c, crush_do_rule()
PG.h, Replica Placement Group
Objects =(build pgid)=> PGs =(CRUSH placement)=> OSDs
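A toy version of this two-step placement: step 1 (object name to pgid) really is a stable hash mod pg_num, while step 2 is replaced here by a trivial deterministic pick; in Ceph the real step 2 runs the CRUSH rule (crush_do_rule) over the cluster map:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Step 1: object name -> pgid (stable hash within one process).
uint32_t object_to_pg(const std::string &oid, uint32_t pg_num) {
    return static_cast<uint32_t>(std::hash<std::string>{}(oid)) % pg_num;
}

// Step 2: pgid -> acting set of OSDs. NOT CRUSH -- a placeholder pick
// that only demonstrates the shape of the mapping (first entry is the
// primary, entries are distinct when num_osd >= replicas).
std::vector<int> pg_to_osds(uint32_t pgid, int num_osd, int replicas) {
    std::vector<int> acting;
    for (int i = 0; i < replicas; ++i)
        acting.push_back((pgid + i) % num_osd);
    return acting;
}
```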
class PG,
struct IndexedLog, adds in-memory index of the log, by oid.
class OndiskLog, some info about how we store the log on disk.
pg states, PG_STATE_*, src/osd/osd_types.h
SCRUB, struct Scrubber,
flushes chunks? thread-based, maintaining and ordering multiple operations
void PG::scrub(ThreadPool::TPHandle &handle), takes a thread-pool handle as parameter
/*
* when holding pg and sched_scrub_lock, then the states are:
* scheduling:
* scrubber.reserved = true
* scrubber.reserved_peers includes whoami
* osd->scrub_pending++
* scheduling, replica declined:
* scrubber.reserved = true
* scrubber.reserved_peers includes -1
* osd->scrub_pending++
* pending:
* scrubber.reserved = true
* scrubber.reserved_peers.size() == acting.size();
* pg on scrub_wq
* osd->scrub_pending++
* scrubbing:
* scrubber.reserved = false;
* scrubber.reserved_peers empty
* osd->scrubber.active++
*/
PG::sched_scrub(),
osdc/
Filer.h, operations on files: r/w/zero/trunc/purge/probe
write() =>
Striper::file_to_extents(), computes the inode-to-stripe layout,
objecter->sg_write() =>
Objecter::sg_write_trunc()
Objecter.h,
struct ObjectOperation, the basic operation interface on an object,
implements standard fops, xattr, trivialmap (tmap), objectmap (omap)
the core class for reaching the OSDs; the Objecter* in class Client points to it
Objecter.cc
Objecter::create_pool_snap()
sg: scatter/gather
Objecter::sg_write_trunc() =>
write_trunc(), executed for each obj in extents,
builds an OSDOp with CEPH_OSD_FLAG_WRITE,
Objecter::op_submit(Op*) =>
Objecter::_op_submit(Op*) =>
} else if (op->session) {
send_op(op)
} else {
maybe_request_map()
}
Objecter::send_op(Op*)
for op->session, builds a class MOSDOp,
messenger->send_message(MOSDOp, op->session->con)
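The branch in _op_submit() quoted above reduces to: with a live OSD session the op is sent immediately; without one, the client asks for a newer OSDMap and the op waits until a session can be opened. A minimal model, with stand-in types of my own:

```cpp
#include <string>

// Stand-in types; only the branch logic is modeled.
struct Session { int osd; };
struct Op      { Session *session; };

// Returns which path the op takes, mirroring the quoted if/else.
std::string op_submit(const Op *op) {
    if (op->session)
        return "send_op";            // messenger delivers an MOSDOp
    return "maybe_request_map";      // subscribe for an updated OSDMap
}
```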
Striper.cc,
Striper::file_to_extents(), fans a single file out onto multiple stripes,
using ObjectExtent to track offset, len, etc.
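The striping arithmetic behind this can be sketched as a RAID-0-like decomposition of a file offset into (object number, offset within object), given stripe unit, stripe count, and object size; struct and variable names here are illustrative, not the real ObjectExtent layout:

```cpp
#include <cstdint>

struct Extent { uint64_t objectno, offset; };

// su = stripe unit (bytes), sc = stripe count (objects per stripe),
// os = object size (a multiple of su).
Extent file_offset_to_extent(uint64_t off, uint64_t su,
                             uint64_t sc, uint64_t os) {
    uint64_t blockno     = off / su;        // stripe-unit index in file
    uint64_t stripeno    = blockno / sc;    // which stripe
    uint64_t stripepos   = blockno % sc;    // object column in stripe
    uint64_t su_per_obj  = os / su;         // stripe units per object
    uint64_t objectsetno = stripeno / su_per_obj;
    uint64_t objectno    = objectsetno * sc + stripepos;
    uint64_t objoff      = (stripeno % su_per_obj) * su + off % su;
    return {objectno, objoff};
}
```

For example with su=4, sc=2, os=8: file offsets 0..3 land in object 0, 4..7 in object 1, and 8..11 wrap back into object 0 at offset 4.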
rbd_fuse/
rbd-fuse.c, implements FUSE,
struct fuse_operations rbdfs_oper,
implemented by calling the interfaces in librbd/librbd.cc