2015-09-14 14:03:09

Original post: Code reading: ceph usr. Author: rachine2

    Ceph userspace, C++ and Java


metadata: file names and inodes

anchor table: in the MDS, lets an inode be located within the directory hierarchy by its inode number;
used to help implement features such as hard links

failure recovery: replay -> resolve -> reconnect -> rejoin
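
The recovery order above can be read as a fixed, linear progression. A minimal sketch of that reading follows; the enum, the function, and the terminal `Done` marker are hypothetical names for illustration and are not the CEPH_MDS_STATE_* constants from include/ceph_fs.h.

```cpp
#include <cassert>

// Hypothetical names for the four recovery phases listed above.
// Done is an illustrative terminal marker, not something the notes name.
enum class RecoveryPhase { Replay, Resolve, Reconnect, Rejoin, Done };

// Return the phase that follows `p` in the fixed recovery order.
inline RecoveryPhase next_phase(RecoveryPhase p) {
    switch (p) {
        case RecoveryPhase::Replay:    return RecoveryPhase::Resolve;
        case RecoveryPhase::Resolve:   return RecoveryPhase::Reconnect;
        case RecoveryPhase::Reconnect: return RecoveryPhase::Rejoin;
        case RecoveryPhase::Rejoin:    return RecoveryPhase::Done;
        case RecoveryPhase::Done:      return RecoveryPhase::Done;
    }
    assert(false && "unknown phase");
    return RecoveryPhase::Done;
}

// Count the steps needed to get from `p` to Done.
inline int steps_until_done(RecoveryPhase p) {
    int steps = 0;
    while (p != RecoveryPhase::Done) {
        p = next_phase(p);
        ++steps;
    }
    return steps;
}
```

Starting from Replay, four calls to next_phase() walk through every phase once.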


client/
MetaRequest.h
struct MetaRequest, operates on metadata
struct ceph_mds_request_head head, include/ceph_fs.h,
head.op, CEPH_MDS_OP_*, the standard UNIX file I/O operations such as open/create,
src/include/ceph_fs.h

Client.h
class Client,
send_request(),
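messenger->send_message() performs the actual transmission of the msg
Client::choose_target_mds(MetaRequest *req), chooses the MDS for the given req:
finds the MDS from the inode / dentry inode / dir inode contained in req,
looks up the cap from the inode; if there is no cap, picks a random MDS,
cap means capability, i.e. whether the corresponding operation is permitted; see include/ceph_fs.h, CEPH_CAP_*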
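
The selection rule described for choose_target_mds() (use the MDS behind the inode's cap, otherwise fall back to a random MDS) might be sketched as below. This is only an illustration under simplifying assumptions: `InodeSketch`, `RequestSketch`, and `choose_target_mds_sketch` are hypothetical stand-ins, not the real Client, Inode, or MetaRequest types, and the real function considers more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <random>
#include <vector>

// Hypothetical, simplified stand-in for an inode known to the client.
struct InodeSketch {
    std::optional<int> cap_mds;  // rank of the MDS that issued a cap, if any
};

// Hypothetical, simplified stand-in for a metadata request.
struct RequestSketch {
    const InodeSketch* inode = nullptr;  // inode the request refers to, may be null
};

// Prefer the MDS holding a cap on the request's inode; otherwise pick
// a random MDS from the active set.
template <class Rng>
int choose_target_mds_sketch(const RequestSketch& req,
                             const std::vector<int>& active_mds,
                             Rng& rng) {
    assert(!active_mds.empty());
    if (req.inode != nullptr && req.inode->cap_mds.has_value()) {
        return *req.inode->cap_mds;
    }
    std::uniform_int_distribution<std::size_t> pick(0, active_mds.size() - 1);
    return active_mds[pick(rng)];
}
```

Passing the random engine in as a parameter keeps the fallback path reproducible in tests.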


crush/
CrushWrapper.h,
class CrushWrapper, the overall wrapper around the CRUSH algorithm

include/
libcephfs.h
UNIX file I/O interface, ceph_*(struct ceph_mount_info *cmount, ...),
e.g. mount, create, etc.
libcephfs is packaged as a library and called from Java through JNI; see java/native/libcephfs_jni.cc


./libcephfs.cc, the implementation,
get_cwd(), obtained by the client building a MetaRequest(CEPH_MDS_OP_LOOKUPPARENT)



include/rbd/
librbd.h, rbd – manage rados block device (RBD) images

rbd is a utility for manipulating rados block device (RBD) images, used by the Linux rbd driver and the rbd storage driver for Qemu/KVM. RBD images are simple block devices that are striped over objects and stored in a RADOS object store. The size of the objects the image is striped over must be a power of two.
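
Because the object size is a power of two, finding the object behind an image offset reduces to a shift and a mask. A minimal sketch, assuming the simplest layout in which objects are laid end to end; `order` (log2 of the object size), `ObjectPos`, and `locate` are hypothetical names for illustration, not librbd API.

```cpp
#include <cassert>
#include <cstdint>

// Position of a byte inside the striped image (hypothetical helper type).
struct ObjectPos {
    std::uint64_t object_no;         // which backing object
    std::uint64_t offset_in_object;  // byte offset inside that object
};

// True when v is a non-zero power of two.
inline bool is_power_of_two(std::uint64_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

// Map an image byte offset to (object number, offset inside the object),
// where the object size is 2^order bytes.
inline ObjectPos locate(std::uint64_t image_offset, unsigned order) {
    assert(order < 64);
    const std::uint64_t object_size = std::uint64_t{1} << order;
    return ObjectPos{image_offset >> order, image_offset & (object_size - 1)};
}
```

With order 22 (4 MiB objects), an offset of 4 MiB + 10 lands at byte 10 of object 1.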

librbd/
At the rbd level, snapshots are implemented through ImageCtx; ImageCtx holds the layout across multiple OSDs

internal.cc,
create(), librbd

SnapInfo.h,
class SnapInfo



librados/ Ceph's internal communication protocol


Per the paper, a RADOS cluster means a large number of OSDs plus a small number of monitors responsible for management


block dev | obj storage | ceph fs
librbd    | librgw      | libcephfs
   ceph cluster protocol (librados)
osd       | mds         | monitor




mds/ uses a hierarchy tree for load balancing, distributing metadata accesses across different MDS servers

MDS.h, core class MDS, which references the other MDS classes

MDCache.h, maintains the MDS cache, LRU-based
struct discovery_info_t ???
implements the directory hierarchy, splitting and merging of dirs, etc.
create_empty_hierarchy(C_Gather *)
create_mydir_hierarchy(C_Gather *)
CDentry.h, CDir.h, CInode.h, C = Cache
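
Since the notes describe the MDS cache as LRU-based, a generic LRU cache may help picture the eviction policy. This is a textbook sketch (linked list plus hash index), not MDCache's actual data structure; `LruSketch` and its string-to-int payload are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Generic LRU cache: the list front is the most recently used entry.
class LruSketch {
    using Entry = std::pair<std::string, int>;
    using List = std::list<Entry>;

public:
    explicit LruSketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Insert or update a key and mark it most recently used.
    void put(const std::string& key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() == capacity_) {
            // Evict the least recently used entry (list back).
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, value);
        index_[key] = items_.begin();
    }

    // Look up a key; a hit also marks it most recently used.
    std::optional<int> get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        items_.splice(items_.begin(), items_, it->second);
        return it->second->second;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t capacity_;
    List items_;
    std::unordered_map<std::string, List::iterator> index_;
};
```

std::list::splice moves a node without invalidating iterators, which is why the index can keep storing list iterators.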

MDCache::send_dir_updates() =>
mds->send_message_mds(MDirUpdate)
class MDirUpdate : public Message, messages/MDirUpdate.h

handle_cache_expire =>
dir->remove_replica()
inode/dentry_remove_replica()

MDSMap.h,
struct mds_info_t, per-MDS info
CEPH_MDS_STATE_*, include/ceph_fs.h, MDS state transitions
// see the detailed comments in the class
in: the cluster the MDSs belong to
up: map between MDS and cluster
failed, stopped

map mds_info, maintains the mapping from mds_gid to mds_info

Server.h, the MDS server itself

snap.h,
struct SnapInfo, generic snap descriptor

SnapServer.h
class SnapServer, map, pending

messages/
*.h, the various classes derived from class Message
include/ceph_fs.h, CEPH_MSG_*



mon/ monitor

msg/
Messenger.h, class Messenger, used in both the MDS and the OSD; message-passing mechanism, base class

SimpleMessenger.h, class SimpleMessenger : public Messenger,
sends and receives msgs using Pipe and queues
* Lock ordering:
*
* SimpleMessenger::lock
* Pipe::pipe_lock
* DispatchQueue::lock
* IncomingQueue::lock
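
A lock ordering like the one above is a convention: any thread that needs more than one of the locks takes them outermost first, so two threads can never wait on each other in opposite order. A minimal two-level sketch with std::mutex; `MessengerSketch`, `PipeSketch`, and `send_via_pipe` are hypothetical stand-ins, not the SimpleMessenger classes.

```cpp
#include <mutex>

// Hypothetical stand-in for a Pipe with its own lock.
struct PipeSketch {
    std::mutex pipe_lock;  // inner lock (plays the role of Pipe::pipe_lock)
    int sent = 0;          // number of messages pushed through this pipe
};

// Hypothetical stand-in for the messenger that owns the pipe.
struct MessengerSketch {
    std::mutex lock;       // outer lock (plays the role of SimpleMessenger::lock)
    PipeSketch pipe;
};

// Take the outer lock first, then the inner one, matching the order
// listed above; both are released in reverse order on scope exit.
inline int send_via_pipe(MessengerSketch& m) {
    std::lock_guard<std::mutex> outer(m.lock);
    std::lock_guard<std::mutex> inner(m.pipe.pipe_lock);
    return ++m.pipe.sent;
}
```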


Filer

osd/
object storage
SNIA, Object-based_Storage-OSD.pdf

The CRUSH algorithm distributes data across the different OSDs

osd_types.h,
class ObjectExtent, base class, {oid, no, offset, len}

OSD.h,
class OSD,

OSDMap.h,
class OSDMap, maps PGs to the individual OSDs
OSDMap::pg_to_osds() =>
OSDMap::_pg_to_osds() =>
crush->do_rule() =>
do_rule(), crush/CrushWrapper.h =>
crush/mapper.c, crush_do_rule()
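
The two-step placement (object to PG, then PG to OSDs) can be illustrated with a toy version. Everything here is a stand-in: std::hash replaces Ceph's own object hash, and a plain "first OSD plus the next ones" rule replaces crush_do_rule(), so this sketch shows only the shape of the mapping, not CRUSH itself. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Step 1: object name -> PG id. std::hash is a stand-in for the real hash.
inline std::uint32_t object_to_pg(const std::string& oid, std::uint32_t pg_num) {
    assert(pg_num > 0);
    return static_cast<std::uint32_t>(std::hash<std::string>{}(oid) % pg_num);
}

// Step 2: PG id -> ordered list of distinct OSDs.
// NOT CRUSH: a deterministic placeholder that starts at pg % num_osds
// and takes the following OSDs in ring order.
inline std::vector<int> pg_to_osds_sketch(std::uint32_t pg, int num_osds, int replicas) {
    assert(num_osds > 0 && replicas > 0 && replicas <= num_osds);
    std::vector<int> osds;
    const int first = static_cast<int>(pg % static_cast<std::uint32_t>(num_osds));
    for (int r = 0; r < replicas; ++r) {
        osds.push_back((first + r) % num_osds);
    }
    return osds;
}
```

The property worth noticing is that both steps are pure functions of their inputs, so any client holding the same map computes the same placement without asking a central server.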

PG.h, Replica Placement Group
Objects =(build pgid)=> PGs =(distributed by the CRUSH algorithm)=> OSDs
class PG,
struct IndexedLog, adds in-memory index of the log, by oid.
class OndiskLog, some info about how we store the log on disk.
pg states, PG_STATE_*, src/osd/osd_types.h
SCRUB, struct Scrubber,
scrubs (consistency-checks) chunk by chunk? thread-based, maintains/orders multiple operations
void PG::scrub(ThreadPool::TPHandle &handle), takes a thread pool handle as its parameter

/*
* when holding pg and sched_scrub_lock, then the states are:
* scheduling:
* scrubber.reserved = true
* scrubber.reserved_peers includes whoami
* osd->scrub_pending++
* scheduling, replica declined:
* scrubber.reserved = true
* scrubber.reserved_peers includes -1
* osd->scrub_pending++
* pending:
* scrubber.reserved = true
* scrubber.reserved_peers.size() == acting.size();
* pg on scrub_wq
* osd->scrub_pending++
* scrubbing:
* scrubber.reserved = false;
* scrubber.reserved_peers empty
* osd->scrubber.active++
*/
PG::sched_scrub(),
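
The state table in the comment above can be turned into a small classifier, which makes the distinguishing conditions explicit. This is a sketch under assumptions: `ScrubberSketch`, the `active` flag, and the `Idle` state are hypothetical additions for illustration, and the "pg on scrub_wq" and osd counters from the comment are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// States named in the comment, plus a hypothetical Idle for "nothing going on".
enum class ScrubPhase { Idle, Scheduling, ReplicaDeclined, Pending, Scrubbing };

// Simplified stand-in for the scrubber bookkeeping.
struct ScrubberSketch {
    bool reserved = false;
    std::set<int> reserved_peers;  // -1 marks a replica that declined
    bool active = false;           // hypothetical flag: scrub in progress
};

// Classify according to the conditions listed in the comment.
inline ScrubPhase classify(const ScrubberSketch& s, std::size_t acting_size) {
    assert(acting_size > 0);
    if (!s.reserved) {
        return s.active ? ScrubPhase::Scrubbing : ScrubPhase::Idle;
    }
    if (s.reserved_peers.count(-1) > 0) {
        return ScrubPhase::ReplicaDeclined;
    }
    if (s.reserved_peers.size() == acting_size) {
        return ScrubPhase::Pending;
    }
    return ScrubPhase::Scheduling;
}
```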



osdc/

Filer.h, file operations: r/w/zero/trunc/purge/probe
write() =>
Striper::file_to_extents(), obtains the layout from inode to stripes,
objecter->sg_write() =>
Objecter::sg_write_trunc()

Objecter.h,
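struct ObjectOperation, the basic interface for operations on an object,
implements the standard fops, xattr, trivialmap (tmap), objectmap (omap)
class Objecter is the core class for talking to the OSDs; the Objecter* in class Client is what points to the OSDs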

Objecter.cc
Objecter::create_pool_snap()
sg: scatter/gather


Objecter::sg_write_trunc() =>
write_trunc(), executed for each obj in extents,
builds an OSDOp, CEPH_OSD_FLAG_WRITE,
Objecter::op_submit(Op*) =>

Objecter::_op_submit(Op*) =>


} else if (op->session) {
send_op(op)
} else {
maybe_request_map()
}


    


Objecter::send_op(Op*)
for op->session, builds a class MOSDOp,
messenger->send_message(MOSDOp, op->session->con)



Striper.cc,
Striper::file_to_extents(), maps a single file onto multiple stripes,
uses ObjectExtent to hold offset, len, and related info
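
To make the file-to-extents mapping concrete, here is a sketch of round-robin striping with a stripe unit, a stripe count, and an object size. It is written from the general striping scheme rather than copied from Striper.cc: `LayoutSketch`, `ExtentSketch`, and `file_to_extents_sketch` are hypothetical names, and unlike the real function it emits one extent per stripe-unit piece instead of grouping pieces by object.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout parameters, all in bytes except stripe_count.
struct LayoutSketch {
    std::uint64_t stripe_unit;
    std::uint64_t stripe_count;
    std::uint64_t object_size;
};

// One contiguous piece inside one object (compare {oid, no, offset, len}).
struct ExtentSketch {
    std::uint64_t object_no;
    std::uint64_t offset;
    std::uint64_t length;
};

inline std::vector<ExtentSketch> file_to_extents_sketch(const LayoutSketch& l,
                                                        std::uint64_t off,
                                                        std::uint64_t len) {
    assert(l.stripe_unit > 0 && l.stripe_count > 0);
    assert(l.object_size >= l.stripe_unit && l.object_size % l.stripe_unit == 0);
    const std::uint64_t su = l.stripe_unit;
    const std::uint64_t sc = l.stripe_count;
    const std::uint64_t stripes_per_object = l.object_size / su;

    std::vector<ExtentSketch> out;
    std::uint64_t cur = off;
    std::uint64_t left = len;
    while (left > 0) {
        const std::uint64_t blockno = cur / su;          // stripe-unit index in the file
        const std::uint64_t stripeno = blockno / sc;     // which stripe (row)
        const std::uint64_t stripepos = blockno % sc;    // which column in that stripe
        const std::uint64_t objectsetno = stripeno / stripes_per_object;
        const std::uint64_t objectno = objectsetno * sc + stripepos;
        const std::uint64_t block_start = (stripeno % stripes_per_object) * su;
        const std::uint64_t block_off = cur % su;
        const std::uint64_t piece = std::min(left, su - block_off);
        out.push_back(ExtentSketch{objectno, block_start + block_off, piece});
        cur += piece;
        left -= piece;
    }
    return out;
}
```

With a stripe unit of 4, a stripe count of 2, and an object size of 8, the first 16 bytes of a file alternate between objects 0 and 1, and byte 16 starts object 2.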



rbd_fuse/
rbd-fuse.c, implements FUSE,
struct fuse_operations rbdfs_oper,

implemented by calling the interfaces in librbd/librbd.cc


