Category: LINUX

2025-01-01 18:28:43

NCCL Source Code Analysis 3: Topology Generation

by lvyilong316
The previous two parts discussed how NCCL initializes the bootstrap network and uses it to build a communication ring across all processes. After that step every process holds the listening addresses of all other processes, which means any process can establish a connection to any other one. Below we analyze how the initTransportsRank function takes the communicator initialization further.

initTransportsRank is a very long function (600+ lines), and essentially all of NCCL's core initialization flow goes through it, so we will analyze it in several installments.

Here we walk through initTransportsRank in order. The very first part is topology generation, which is organized around two AllGather operations.


static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* parent, uint64_t timers[TIMERS_INIT_COUNT]) {
  // We use 2 AllGathers
  // 1. { peerInfo, comm, compCap}
  // 2. { nChannels, graphInfo, topoRanks }
  ncclResult_t ret = ncclSuccess;
  int rank = comm->rank;
  int nranks = comm->nRanks;
  int nNodes = 1;
  cpu_set_t affinitySave;
  struct ncclTopoGraph* ringGraph = &comm->graphs[NCCL_ALGO_RING];
  struct ncclTopoGraph* treeGraph = &comm->graphs[NCCL_ALGO_TREE];
  struct ncclTopoGraph* collNetChainGraph = &comm->graphs[NCCL_ALGO_COLLNET_CHAIN];
  struct ncclTopoGraph* collNetDirectGraph = &comm->graphs[NCCL_ALGO_COLLNET_DIRECT];
  struct ncclTopoGraph* nvlsGraph = &comm->graphs[NCCL_ALGO_NVLS];
  struct ncclTopoGraph* graphs[] = { treeGraph, ringGraph, collNetDirectGraph, collNetChainGraph, nvlsGraph, nvlsGraph };

  ...
  // AllGather1 - begin
  NCCLCHECKGOTO(ncclCalloc(&comm->peerInfo, nranks+1), ret, fail); // Extra rank to represent CollNet root
  NCCLCHECKGOTO(fillInfo(comm, comm->peerInfo+rank, comm->commHash), ret, fail);
  NCCLCHECKGOTO(bootstrapAllGather(comm->bootstrap, comm->peerInfo, sizeof(struct ncclPeerInfo)), ret, fail);
  // Use the host hashes to count how many machines there are, and detect whether different processes share the same GPU
  comm->cuMemSupport = 1;
  for (int i = 0; i < nranks; i++) {
    if (comm->peerInfo[i].hostHash != comm->peerInfo[rank].hostHash) nNodes++;
    if (!comm->peerInfo[i].cuMemSupport) comm->cuMemSupport = 0;
    if ((i != rank) && (comm->peerInfo[i].hostHash == comm->peerInfo[rank].hostHash) && (comm->peerInfo[i].busId == comm->peerInfo[rank].busId)) {
      WARN("Duplicate GPU detected : rank %d and rank %d both on CUDA device %lx", rank, i, comm->peerInfo[rank].busId);
      ret = ncclInvalidUsage;
      goto fail;
    }
  }
  // AllGather1 - end

First, fillInfo fills in this rank's ncclPeerInfo structure. Then bootstrapAllGather gathers the ncclPeerInfo of all processes (ranks) and copies them into comm->peerInfo, so every process now knows the ncclPeerInfo of every other process. ncclPeerInfo carries per-process information such as the GPU managed by the process, the host it runs on, its rank number and its process id, as shown in the code below. With this information every rank can tell which ranks live on the same machine and whether several processes are using the same GPU.


struct ncclPeerInfo {
  int rank;
  int cudaDev;
  int nvmlDev;
  int gdrSupport;
  uint64_t hostHash;
  uint64_t pidHash;
  dev_t shmDev;
  int64_t busId;
  struct ncclComm* comm;
  int cudaCompCap;
  // MNNVL support
  nvmlGpuFabricInfoV_t fabricInfo;
  int cuMemSupport;
};

The body of fillInfo is not expanded here, but one detail is worth noting for the host information (hostHash): if the NCCL_HOSTID environment variable is set, its value is used as the host identifier; otherwise the hostname returned by gethostname is concatenated with the value read from /proc/sys/kernel/random/boot_id. A hash of this string is computed, and the communicator's hash is added on top. Using boot_id here further reduces the chance of hash collisions between hosts.
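As a rough sketch of that logic (not the actual NCCL implementation; the helper name getHostHashSketch and the FNV-style hash below are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Simple string hash (illustrative; NCCL uses its own hashing helper).
static uint64_t strHash(const char* s, uint64_t h) {
  while (*s) h = h * 0x100000001b3ULL ^ (unsigned char)*s++;
  return h;
}

// Sketch of how a host hash could be derived: NCCL_HOSTID if set,
// otherwise hostname + boot_id, then hashed and offset by commHash.
static uint64_t getHostHashSketch(uint64_t commHash) {
  char buf[1024] = {0};
  const char* hostId = getenv("NCCL_HOSTID");
  if (hostId) {
    strncpy(buf, hostId, sizeof(buf)-1);
  } else {
    gethostname(buf, sizeof(buf)-1);
    FILE* f = fopen("/proc/sys/kernel/random/boot_id", "r");
    if (f) {
      size_t len = strlen(buf);
      if (fgets(buf+len, sizeof(buf)-len, f) == NULL) buf[len] = '\0';
      fclose(f);
    }
  }
  return strHash(buf, 0xcbf29ce484222325ULL) + commHash;
}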

The process identifier is handled similarly: it is built from getpid plus the pid namespace read from /proc/self/ns/pid, a hash is computed from it, and the communicator's hash is again added, as shown below.


info->pidHash=getPidHash()+commHash;

One extra remark: at the system level, different pid namespaces may contain the same numeric PID. Combining the pid returned by getpid with the namespace information from /proc/self/ns/pid uniquely identifies a process.
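A minimal sketch of that idea (again illustrative; the helper name and the hashing scheme are assumptions, not NCCL's exact code):

#include <stdint.h>
#include <unistd.h>

// Combine the pid with its pid-namespace identity so that processes in
// different namespaces with the same numeric pid still hash differently.
static uint64_t getPidHashSketch(void) {
  char buf[256];
  // readlink on /proc/self/ns/pid yields something like "pid:[4026531836]"
  ssize_t n = readlink("/proc/self/ns/pid", buf, sizeof(buf)-1);
  if (n < 0) n = 0;
  buf[n] = '\0';
  uint64_t h = (uint64_t)getpid();
  for (ssize_t i = 0; i < n; i++) h = h * 0x100000001b3ULL ^ (unsigned char)buf[i];
  return h;
}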

In addition, fillInfo determines whether GDR (GPUDirect RDMA) is supported between the current GPU and the NICs. With a recent enough CUDA version the attribute can be queried directly; otherwise NCCL probes it: it allocates GPU memory and tries to register it with each NIC over a temporary connection to confirm whether GDR works.


ncclResult_t ncclGpuGdrSupport(struct ncclComm* comm, int* gdrSupport) {
  constexpr int GPU_BUF_SIZE = 2*1024*1024;
#if CUDART_VERSION >= 11030
  // In CUDA 11.3 and later we can now query the cudaDevAttrGPUDirectRDMASupported attribute
  int driverVersion;
  CUDACHECK(cudaDriverGetVersion(&driverVersion));
  if (driverVersion >= 11030) {
    int cudaDev, attr = 0;
    CUDACHECK(cudaGetDevice(&cudaDev));
    CUDACHECK(cudaDeviceGetAttribute(&attr, cudaDevAttrGPUDirectRDMASupported, cudaDev));
    *gdrSupport = attr;
    return ncclSuccess;
  }
#endif
  static int gdrSupportMatrix[32] = {
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 };
  if (gdrSupportMatrix[comm->cudaDev] == -1) {
    int netDevs;
    NCCLCHECK(comm->ncclNet->devices(&netDevs));
    gdrSupportMatrix[comm->cudaDev] = 0;
    for (int dev=0; dev<netDevs; dev++) {
      // Find a net device which is GDR-capable
      ncclNetProperties_t props;
      NCCLCHECK(comm->ncclNet->getProperties(dev, &props));
      if ((props.ptrSupport & NCCL_PTR_CUDA) == 0) continue;

      // Allocate memory on the GPU and try to register it on the NIC.
      void *lComm = NULL, *sComm = NULL, *rComm = NULL;
      ncclNetHandle_t handle;
      char* gpuPtr = NULL;
      void* mHandle = NULL;
      ncclResult_t ret;
      ncclDebugNoWarn = NCCL_NET;
      NCCLCHECKGOTO(comm->ncclNet->listen(dev, &handle, &lComm), ret, cleanup1);

      bool connected;
      connected = false;
      while (!connected) {

        // If we're aborting now, skip to cleanup
        if (__atomic_load_n(comm->abortFlag, __ATOMIC_ACQUIRE)) {
          goto cleanup2;
        }

        if (sComm == NULL)
          NCCLCHECKGOTO(comm->ncclNet->connect(dev, &handle, &sComm, NULL), ret, cleanup2);

        if (rComm == NULL)
          NCCLCHECKGOTO(comm->ncclNet->accept(lComm, &rComm, NULL), ret, cleanup2);

        connected = (rComm != NULL) && (sComm != NULL);
      }

      NCCLCHECKGOTO(ncclCudaMalloc(&gpuPtr, GPU_BUF_SIZE), ret, cleanup2);
      if (comm->ncclNet->regMr(sComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle) == ncclSuccess) {
        NCCLCHECK(comm->ncclNet->deregMr(sComm, mHandle));
        NCCLCHECK(comm->ncclNet->regMr(rComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle));
        NCCLCHECK(comm->ncclNet->deregMr(rComm, mHandle));
        gdrSupportMatrix[comm->cudaDev] = 1;
      }
      ncclDebugNoWarn = 0;
      NCCLCHECK(ncclCudaFree(gpuPtr));
      // ... (cleanup labels and closing braces elided in this excerpt)
  *gdrSupport = gdrSupportMatrix[comm->cudaDev];
  return ncclSuccess;
}

As we saw earlier: ncclNet_t* ncclNets[3] = { nullptr, &ncclNetIb, &ncclNetSocket };

In the normal case ncclNet is ncclNetIb. As the code shows, with an older CUDA version NCCL iterates over every IB device and calls ncclNet->getProperties, i.e. ncclIbGetProperties:


ncclResult_t ncclIbGetProperties(int dev, ncclNetProperties_t* props) {
  struct ncclIbMergedDev* mergedDev = ncclIbMergedDevs+dev;
  props->name = mergedDev->devName;
  props->speed = mergedDev->speed;

  // Take the rest of the properties from an arbitrary sub-device (should be the same)
  struct ncclIbDev* ibDev = ncclIbDevs + mergedDev->devs[0];
  props->pciPath = ibDev->pciPath;
  props->guid = ibDev->guid;
  props->ptrSupport = NCCL_PTR_HOST;
  if (ncclIbGdrSupport() == ncclSuccess) {
    props->ptrSupport |= NCCL_PTR_CUDA; // GDR support via nv_peermem
  }
  props->regIsGlobal = 1;
  if (ncclIbDmaBufSupport(dev) == ncclSuccess) {
    props->ptrSupport |= NCCL_PTR_DMABUF; // GDR support via DMA-BUF
  }
  props->latency = 0; // Not set
  props->port = ibDev->portNum + ibDev->realPort;
  props->maxComms = ibDev->maxQp;
  props->maxRecvs = NCCL_NET_IB_MAX_RECVS;
  props->netDeviceType = NCCL_NET_DEVICE_HOST;
  props->netDeviceVersion = NCCL_NET_DEVICE_INVALID_VERSION;
  return ncclSuccess;
}

This mainly collects the NIC name, PCIe path, guid and other properties, then calls ncclIbGdrSupport to check whether GDR is supported:


// Detect whether GDR can work on a given NIC with the current CUDA device
// Returns :
// ncclSuccess : GDR works
// ncclSystemError : no module or module loaded but not supported by GPU
ncclResult_t ncclIbGdrSupport() {
  static int moduleLoaded = -1;
  if (moduleLoaded == -1) {
    // Check for the nv_peer_mem module being loaded
    moduleLoaded = ((access("/sys/kernel/mm/memory_peers/nv_mem/version", F_OK) == -1) &&
                    // Also support the new nv_mem_nc module
                    (access("/sys/kernel/mm/memory_peers/nv_mem_nc/version", F_OK) == -1)) ? 0 : 1;
  }
  if (moduleLoaded == 0) return ncclSystemError;
  return ncclSuccess;
}

This function simply checks, via the sysfs paths above, whether nv_peermem (NVIDIA's peer-memory kernel driver) is loaded; if it is, GDR is supported. Back in ncclGpuGdrSupport, NCCL then tries to register GPU memory with the NIC; if the registration succeeds, gdrSupport is set to 1. This probe actually sets up an RDMA connection, which will be covered separately later, so we skip the details here.

Back in initTransportsRank, the host hashes are used to determine how many machines participate, and NCCL also checks whether several different processes use the same GPU. NCCL does not support that mode, so it reports an error in that case. This concludes the first AllGather.

Next comes the MNNVL handling. Since common network configurations do not support Multi Node NVLink (MNNVL, i.e. NVLink between nodes), we ignore this feature in the rest of the discussion.

Also, because we assume one process per GPU, no two ranks belong to the same process, so we skip the part that computes intra-process ranks.

Next, ncclTopoGetSystem detects the topology and builds the system graph, ncclTopoComputePaths computes the paths between GPUs and NICs, unreachable GPUs and unused NICs are trimmed, and the paths are recomputed. After that the search is initialized, the communicator's CPU architecture is determined, and the final topology is printed. ncclTopoGetSystem is analyzed in detail below.


  // Topo detection / System graph creation
  NCCLCHECKGOTO(ncclTopoGetSystem(comm, &comm->topo), ret, fail);
  // Compute paths between GPUs and NICs
  NCCLCHECKGOTO(ncclTopoComputePaths(comm->topo, comm), ret, fail);
  // Remove inaccessible GPUs and unused NICs
  NCCLCHECKGOTO(ncclTopoTrimSystem(comm->topo, comm), ret, fail);
  // Recompute paths after trimming
  NCCLCHECKGOTO(ncclTopoComputePaths(comm->topo, comm), ret, fail);
  // Init search
  NCCLCHECKGOTO(ncclTopoSearchInit(comm->topo), ret, fail);
  // Decide on comm's CPU architecture.
  NCCLCHECKGOTO(ncclTopoComputeCommCPU(comm), ret, fail);
  // Print final topology
  NCCLCHECKGOTO(ncclTopoPrint(comm->topo), ret, fail);
  timers[TIMER_INIT_TOPO] = clockNano() - timers[TIMER_INIT_TOPO];

Next, ncclTopoGetCpuAffinity obtains the CPU affinity of the GPU and applies it, which gives better performance. Concretely, for the GPU managed by the current process it finds the CPU (NUMA node) with the shortest path to that GPU.


  // Set Affinity to a CPU local the our GPU, so that all memory we allocate
  // on the host is local.
  NCCLCHECKGOTO(ncclTopoGetCpuAffinity(comm->topo, comm->rank, &comm->cpuAffinity), ret, fail);
  if (CPU_COUNT(&comm->cpuAffinity)) {
    sched_getaffinity(0, sizeof(cpu_set_t), &affinitySave);
    sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
  }

It first reads the current process's CPU affinity; by default the process may be scheduled onto any CPU. It then reads the GPU's affinity, i.e. the set of CPUs local to that GPU. For example, GPU0 might be local to CPUs 0,2,4,6,8,10. The affinity that is actually applied is the intersection of the GPU-local CPUs and the process's current affinity: with the default process affinity (all CPUs), GPU0's effective set is 0,2,4,6,8,10; if the process affinity had been restricted to 2,6,7,8, the intersection 2,6,8 would be applied instead.

The NCCL_IGNORE_CPU_AFFINITY environment variable makes NCCL ignore the process's pre-existing CPU affinity and set the affinity purely from the GPU's locality.
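The following small standalone program sketches that intersection logic; the CPU sets are the made-up example values above, not anything read from a real topology:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  cpu_set_t gpuAffinity, procAffinity, result;
  CPU_ZERO(&gpuAffinity);
  CPU_ZERO(&procAffinity);

  // CPUs local to the GPU, e.g. 0,2,4,6,8,10 (made-up example).
  for (int c = 0; c <= 10; c += 2) CPU_SET(c, &gpuAffinity);
  // Current process affinity, e.g. restricted to 2,6,7,8.
  CPU_SET(2, &procAffinity); CPU_SET(6, &procAffinity);
  CPU_SET(7, &procAffinity); CPU_SET(8, &procAffinity);

  // Intersection: the affinity that would actually be applied (2,6,8).
  CPU_AND(&result, &gpuAffinity, &procAffinity);
  if (CPU_COUNT(&result)) sched_setaffinity(0, sizeof(cpu_set_t), &result);

  for (int c = 0; c < CPU_SETSIZE; c++)
    if (CPU_ISSET(c, &result)) printf("CPU %d\n", c);
  return 0;
}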

Now let us look at the key topology-construction function: ncclTopoGetSystem.

ncclTopoGetSystem

ncclTopoGetSystem builds the topology of the PCI devices in the system. It works in two steps: first the whole PCI tree is represented as XML, and then the XML is converted into ncclTopoNode structures. In the XML representation each ncclXmlNode corresponds to one node of the PCI tree; its definition is sketched below.
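(The sketch below is paraphrased from NCCL's xml.h; treat the field names and size constants as approximate rather than authoritative.)

// One XML node: a tag name, a set of key/value attributes, and child links.
struct ncclXmlNode {
  char name[MAX_STR_LEN+1];            // tag name, e.g. "system", "cpu", "pci", "gpu", "net"
  struct {
    char key[MAX_STR_LEN+1];
    char value[MAX_STR_LEN+1];
  } attrs[MAX_ATTR_COUNT+1];           // attributes such as busid, rank, gdr
  int nAttrs;
  int type;
  struct ncclXmlNode* parent;          // parent in the PCI tree
  struct ncclXmlNode* subs[MAX_SUBS];  // children
  int nSubs;
};

struct ncclXml {
  int maxIndex, maxNodes;
  struct ncclXmlNode nodes[1];         // flat storage for all nodes
};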

First, ncclTopoGetSystem tries to load a topology file, either from the path given by the NCCL_TOPO_FILE environment variable or from the default location:


ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** system) {
  struct ncclXml* xml;
  NCCLCHECK(xmlAlloc(&xml, NCCL_TOPO_XML_MAX_NODES));
  const char* xmlTopoFile = ncclGetEnv("NCCL_TOPO_FILE");
  if (xmlTopoFile) {
    INFO(NCCL_ENV, "NCCL_TOPO_FILE set by environment to %s", xmlTopoFile);
    NCCLCHECK(ncclTopoGetXmlFromFile(xmlTopoFile, xml, 1));
  } else {
    // Try default XML topology location
    NCCLCHECK(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0));
  }

Below is an example of what such a topology file looks like.
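(The snippet is purely illustrative: bus IDs, speeds and attribute values are made up, but the system/cpu/pci/gpu/nic/net nesting matches the structure that NCCL's XML code builds and parses.)

<system version="1">
  <cpu numaid="0" affinity="0000ffff" arch="x86_64" vendor="GenuineIntel">
    <pci busid="0000:1a:00.0" class="0x060400" link_speed="16.0 GT/s PCIe" link_width="16">
      <gpu dev="0" sm="80" rank="0" gdr="1">
        <nvlink target="0000:1b:00.0" count="12"/>
      </gpu>
    </pci>
    <pci busid="0000:3d:00.0" class="0x020000" link_speed="16.0 GT/s PCIe" link_width="16">
      <nic>
        <net name="mlx5_0" dev="0" speed="100000" port="1" gdr="1" maxconn="131072"/>
      </nic>
    </pci>
  </cpu>
</system>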

When neither a user-specified nor a default topology file exists, ncclTopoGetSystem detects the system topology automatically and builds the XML itself. It first creates the root node named system and sets its version attribute, then calls ncclTopoRefreshBcmP2pLinks to refresh the PCIe switch topology, which is done by reading /sys/kernel/pci_switch_link/refresh_switch_toplogy (on typical systems this file does not exist).


  if (xml->maxIndex == 0) {
    // Create top tag
    struct ncclXmlNode* top;
    NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));
    NCCLCHECK(xmlSetAttrInt(top, "version", NCCL_TOPO_XML_VERSION));
  }

  NCCLCHECK(ncclTopoRefreshBcmP2pLinks());

Next, ncclTopoGetSystem detects the GPU managed by this process (only this rank's GPU; the XML of the different ranks is merged later), creates the corresponding XML node and sets its attributes. Using comm->peerInfo[comm->rank].busId it also walks sysfs to obtain the path from the GPU up to the CPU, converts that path into the XML tree, i.e. records the chain from the GPU's PCI hierarchy up to the CPU root complex, and adds the CPU node.


  // Detect only the GPU managed by this process. We'll get any others through XML fusion.
  char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
  NCCLCHECK(int64ToBusId(comm->peerInfo[comm->rank].busId, busId));
  struct ncclXmlNode* node;
  NCCLCHECK(ncclTopoFillGpu(xml, busId, &node));
  if (node) {
    NCCLCHECK(xmlSetAttrInt(node, "keep", 1));
    NCCLCHECK(xmlSetAttrInt(node, "rank", comm->rank));
    NCCLCHECK(xmlInitAttrInt(node, "gdr", comm->peerInfo[comm->rank].gdrSupport));
  }

ncclTopoGetSystem then probes the network devices (both net and collnet nodes) and inserts them into the XML. How are these devices discovered? Through the devices functions of the ncclNets and ncclCollNets plugins registered earlier:


comm->ncclCollNet->devices(ndev)
comm->ncclNet->devices(&netDevCount)

For example, for the ncclNetSocket plugin the devices function is ncclNetSocketDevices, shown below; it simply returns ncclNetIfs, which was filled in when the plugin was initialized.


ncclResult_t ncclNetSocketDevices(int* ndev) {
  *ndev = ncclNetIfs;
  return ncclSuccess;
}

The device properties are then obtained through the plugin's getProperties function and turned into XML nodes, as shown below:


  if (netDevCount == 0) {
    NCCLCHECK(comm->ncclNet->devices(&netDevCount));
  }
  for (int n=0; n<netDevCount; n++) {
    ncclNetProperties_t props;
    NCCLCHECK(comm->ncclNet->getProperties(n, &props));
    comm->netDeviceType = props.netDeviceType;
    struct ncclXmlNode* netNode;
    NCCLCHECK(ncclTopoFillNet(xml, props.pciPath, props.name, &netNode));
    NCCLCHECK(xmlSetAttrInt(netNode, "keep", 1));
    NCCLCHECK(xmlSetAttrInt(netNode, "dev", n));
    NCCLCHECK(xmlInitAttrInt(netNode, "speed", props.speed));
    NCCLCHECK(xmlInitAttrInt(netNode, "port", props.port));
    NCCLCHECK(xmlInitAttrFloat(netNode, "latency", props.latency));
    NCCLCHECK(xmlInitAttrUint64(netNode, "guid", props.guid));
    NCCLCHECK(xmlInitAttrInt(netNode, "maxconn", props.maxComms));
    bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (comm->dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
    INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", comm->ncclNet->name, gdrSupport ? "Enabled" : "Disabled", n, props.name);
    NCCLCHECK(xmlInitAttrInt(netNode, "gdr", gdrSupport));
  }

Next, ncclTopoTrimXml prunes the XML branches that are not used:


  // Remove XML branches which don't have a node with keep="1" (typically when importing a topology)
  NCCLCHECK(ncclTopoTrimXml(xml));

Then ncclTopoGetSystem computes how many ranks run on this node and exchanges the XML of all local processes with bootstrapIntraNodeAllGather:


  } else {
    // Intra-node fusion. Much of the comm is not initialized yet at this point so we need to do our own calculations.
    NCCLCHECK(ncclCalloc(&localRanks, comm->nRanks));
    for (int i = 0; i < comm->nRanks; i++) {
      if (comm->peerInfo[i].hostHash == comm->peerInfo[comm->rank].hostHash) {
        if (i == comm->rank)
          localRank = nLocalRanks;
        localRanks[nLocalRanks++] = i;
      }
    }
  }
  char* mem;
  NCCLCHECK(ncclCalloc(&mem, nLocalRanks * xmlMemSize(NCCL_TOPO_XML_MAX_NODES)));
  struct ncclXml* rankXml = (struct ncclXml*)(mem+xmlMemSize(NCCL_TOPO_XML_MAX_NODES)*localRank);
  memcpy(rankXml, xml, xmlMemSize(NCCL_TOPO_XML_MAX_NODES));
  NCCLCHECK(ncclTopoConvertXml(rankXml, (uintptr_t)xml->nodes, 1));
  NCCLCHECK(bootstrapIntraNodeAllGather(comm->bootstrap, localRanks, localRank, nLocalRanks, mem, xmlMemSize(NCCL_TOPO_XML_MAX_NODES)));

After the exchange, ncclTopoFuseXml merges the XML of all ranks on this node:


  for (int i = 0; i < nLocalRanks; i++) {
    struct ncclXml* peerXml = (struct ncclXml*)(mem+xmlMemSize(NCCL_TOPO_XML_MAX_NODES)*i);
    NCCLCHECK(ncclTopoConvertXml(peerXml, (uintptr_t)peerXml->nodes, 0));
    NCCLCHECK(ncclTopoFuseXml(xml, peerXml));
  }

If the NCCL_TOPO_DUMP_FILE environment variable is set, the resulting XML is dumped to that file:


  xmlTopoFile = ncclGetEnv("NCCL_TOPO_DUMP_FILE");
  if (xmlTopoFile && comm->rank == ncclParamTopoDumpFileRank()) {
    INFO(NCCL_ENV, "NCCL_TOPO_DUMP_FILE set by environment to %s", xmlTopoFile);
    NCCLCHECK(ncclTopoDumpXmlToFile(xmlTopoFile, xml));
  }

Finally, ncclTopoGetSystemFromXml parses the XML and converts it into the internal topology structure:


NCCLCHECK(ncclTopoGetSystemFromXml(xml, system, comm->peerInfo[comm->rank].hostHash));

Let us now look at the system (ncclTopoSystem) data structure. It contains a systemId numbered from 0 and the nodes of the different types (cpu, net, pci, nvlink, and so on). nHosts is the number of hosts; different hosts are distinguished by the host_hash attribute of their cpu nodes, and cpus within the same host share the same host_hash.

Each ncclTopoNodeSet holds all nodes of one type (cpu, net, pci, nvlink, ...), i.e. the ncclTopoNode entries of that type together with their links (ncclTopoLink). ncclTopoGetSystemFromXml adds the CPU, NVLink, C2C and PCI information into these structures one after another, building the system-level topology.

ncclTopoComputePaths (path generation)

With the topology constructed as described in the previous section, NCCL now holds a graph in which, for example, GPUs are connected by NVLink. But this is only basic connectivity information; there is no path information yet: for GPU1 to talk to GPU2 we need to know how to get there and whether it is reachable at all. Computing the paths between the nodes of the graph is the job of ncclTopoComputePaths. Abstractly the problem is: given an undirected graph with a weight on every edge and a query (u, v), find a path from u to v that maximizes the minimum edge weight along the path. This is the classic widest-path (maximum-bottleneck) problem; it can be solved with a maximum spanning tree plus LCA, or, when u is fixed, with an SPFA/BFS-style relaxation in which only the relaxation rule is changed.
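As a standalone illustration of that relaxation idea (generic widest-path code over a small adjacency matrix, not NCCL's implementation):

#include <float.h>

#define MAX_NODES 16

// bestBw[v] ends up holding the best achievable bottleneck bandwidth from src to v.
// Relaxation rule: crossing edge (u,v) with bandwidth bw[u][v] yields
// min(bestBw[u], bw[u][v]); keep it if it beats the current bestBw[v].
static void widestPaths(float bw[MAX_NODES][MAX_NODES], int n, int src, float bestBw[MAX_NODES]) {
  int queue[MAX_NODES];
  int inQueue[MAX_NODES] = {0};
  int head = 0, tail = 0, count = 0;
  for (int v = 0; v < n; v++) bestBw[v] = 0.0f;
  bestBw[src] = FLT_MAX;                      // the source reaches itself with "infinite" bandwidth
  queue[tail] = src; tail = (tail + 1) % MAX_NODES; count++; inQueue[src] = 1;
  while (count > 0) {
    int u = queue[head]; head = (head + 1) % MAX_NODES; count--; inQueue[u] = 0;
    for (int v = 0; v < n; v++) {
      if (bw[u][v] <= 0.0f) continue;         // no edge between u and v
      float cand = bestBw[u] < bw[u][v] ? bestBw[u] : bw[u][v];
      if (cand > bestBw[v]) {                 // found a wider bottleneck: relax and revisit v
        bestBw[v] = cand;
        if (!inQueue[v]) { queue[tail] = v; tail = (tail + 1) % MAX_NODES; count++; inQueue[v] = 1; }
      }
    }
  }
}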

To make the later logic easier to follow, let us first introduce the relevant data structures.

1.  ncclTopoLink

ncclTopoLink represents an edge. The type field distinguishes the kind of edge, e.g. NVLink or PCI; bw is the bandwidth; remNode is the node at the other end of the edge.


struct ncclTopoLink {
  int type;
  float bw;
  struct ncclTopoNode* remNode;
};

The link types take the following values:


// We want link types and path types to match as much as possible
#define LINK_LOC 0 // a node's link to itself
#define LINK_NVL 1 // NVLink
// Skipping 2 for PATH_NVB
#define LINK_PCI 3 // PCIe
// Skipping 4 for PATH_PXB
// Skipping 5 for PATH_PXN
// Skipping 6 for PATH_PHB
#define LINK_SYS 7 // QPI (inter-socket interconnect)
#define LINK_NET 8 // network

PATH_LOC is a node to itself; PATH_NVL means every edge on the path is NVLink; PATH_PIX means the path crosses at most one PCIe switch; PATH_PXB means it crosses several PCIe switches but does not go through a CPU; PATH_PHB means it goes through a CPU; PATH_SYS is a path between different NUMA nodes.

2.  ncclTopoLinkList

ncclTopoLinkList represents a path. A path consists of several edges (links): list is the set of links making up the path, count is the number of links, and bw is the path bandwidth, defined as the minimum bandwidth among its links.


struct ncclTopoLinkList {
  struct ncclTopoLink* list[NCCL_TOPO_MAX_HOPS];
  int count;
  float bw;
  int type;
};

A path also has a type. Since one path may consist of links of different kinds, path types are richer than link types:


#define PATH_LOC 0 // Local (myself): a node to itself
#define PATH_NVL 1 // Connection traversing NVLink (every edge on the path is NVLink)
#define PATH_NVB 2 // Connection through NVLink using an intermediate GPU
#define PATH_PIX 3 // Connection traversing at most a single PCIe bridge
// Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
#define PATH_PXB 4
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 5
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 6
// Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
#define PATH_SYS 7
#define PATH_NET 8 // Connection through the network
#define PATH_DIS 9 // Disconnected

3.  ncclTopoNode

ncclTopoNode represents a node. type is the node type (GPU, PCI, NIC, ...), and depending on it a different member of the union is used. nlinks is the number of edges of this node, links stores those edges, and paths stores the paths from this node to other nodes; for example node1->paths[type][id] is node1's path to the id-th node of that type.


struct ncclTopoNode {
  int type;
  int64_t id;
  // Type specific data
  union {
    struct {
      int dev; // NVML dev number
      int rank;
      int cudaCompCap;
      int gdrSupport;
    }gpu;
    struct {
      int dev; // Plugin dev number
      uint64_t asic;
      int port;
      float bw;
      float latency;
      int gdrSupport;
      int collSupport;
      int maxChannels;
    }net;
    struct {
      int arch;
      int vendor;
      int model;
      cpu_set_t affinity;
    }cpu;
    struct {
      uint64_t device;
    }pci;
  };
  int nlinks;
  struct ncclTopoLink links[NCCL_TOPO_MAX_LINKS];
  // Pre-computed paths to GPUs and NICs
  struct ncclTopoLinkList* paths[NCCL_TOPO_NODE_TYPES];
  // Used during search
  uint64_t used;
};

The node types are listed below; note that CPU actually stands for a NUMA node:


#define NCCL_TOPO_NODE_TYPES 7
#define GPU 0
#define PCI 1
#define NVS 2
#define CPU 3 // Actually NUMA domains
#define NIC 4
#define NET 5

4.  ncclTopoNodeSet

ncclTopoNodeSet is the set of all nodes of a given type, e.g. all GPU, PCI or NIC nodes; count is the number of nodes of that type.


struct ncclTopoNodeSet {
  int count;
  struct ncclTopoNode nodes[NCCL_TOPO_MAX_NODES];
};

5.  ncclTopoSystem

Finally, ncclTopoSystem represents the whole topology graph:


struct ncclTopoSystem {
  int systemId;
  uint64_t hostHashes[NCCL_TOPO_MAX_NODES];
  int nHosts;
  struct ncclTopoNodeSet nodes[NCCL_TOPO_NODE_TYPES];
  float maxBw;
  float totalBw;
};

To summarize the relationships: ncclTopoSystem contains one ncclTopoNodeSet per node type; each ncclTopoNodeSet contains the ncclTopoNode entries of that type; and each ncclTopoNode holds its ncclTopoLink edges plus the precomputed ncclTopoLinkList paths to the other nodes.

Now for the actual path-computation logic.

ncclTopoComputePaths is the core path-computation function. It first calls ncclTopoRemovePathType to clear any existing path information so everything can be recomputed from scratch, and then calls ncclTopoSetPaths to set the paths from all other nodes to a given node.


ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm* comm) {
  // Precompute paths between GPUs/NICs.

  // Remove everything in case we're re-computing
  for (int t=0; t<NCCL_TOPO_NODE_TYPES; t++) ncclTopoRemovePathType(system, t);

  // Set direct paths to CPUs. We need them in many cases.
  for (int c=0; c<system->nodes[CPU].count; c++) {
    NCCLCHECK(ncclTopoSetPaths(system->nodes[CPU].nodes+c, system));
  }

  // Set direct paths to GPUs.
  for (int g=0; g<system->nodes[GPU].count; g++) {
    NCCLCHECK(ncclTopoSetPaths(system->nodes[GPU].nodes+g, system));
  }

  // Set direct paths to NICs.
  for (int n=0; n<system->nodes[NET].count; n++) {
    NCCLCHECK(ncclTopoSetPaths(system->nodes[NET].nodes+n, system));
  }

  // Set direct paths to NVSwitches.
  for (int n=0; n<system->nodes[NVS].count; n++) {
    NCCLCHECK(ncclTopoSetPaths(system->nodes[NVS].nodes+n, system));
  }

ncclTopoSetPaths computes the paths from every other node to baseNode. It is first invoked for all CPU nodes, i.e. the paths from every node to each CPU node are computed.


// Compute the paths from all other nodes to baseNode
static ncclResult_t ncclTopoSetPaths(struct ncclTopoNode* baseNode, struct ncclTopoSystem* system) {
  if (baseNode->paths[baseNode->type] == NULL) {
    NCCLCHECK(ncclCalloc(baseNode->paths+baseNode->type, system->nodes[baseNode->type].count));
  }

  // breadth-first search to set all paths to that node in the system
  // nodeList and nextNodeList act as the BFS queue; baseNode is enqueued first
  struct ncclTopoNodeList nodeList;
  struct ncclTopoNodeList nextNodeList;
  nodeList.count = 1; nodeList.list[0] = baseNode;
  nextNodeList.count = 0;
  struct ncclTopoLinkList* basePath;
  // getPath returns node's path to the id-th node of type t:
  // getPath(struct ncclTopoSystem* system, struct ncclTopoNode* node, int t, int64_t id, struct ncclTopoLinkList** path)
  NCCLCHECK(getPath(system, baseNode, baseNode->type, baseNode->id, &basePath));
  // This is baseNode's path to itself, so count is 0, bandwidth is LOC_BW and type is PATH_LOC
  basePath->count = 0;
  basePath->bw = LOC_BW;
  basePath->type = PATH_LOC;

  while (nodeList.count) { // take the nodes of nodeList one by one
    nextNodeList.count = 0;
    for (int n=0; n<nodeList.count; n++) {
      struct ncclTopoNode* node = nodeList.list[n];
      struct ncclTopoLinkList* path;
      // get node's path to baseNode
      NCCLCHECK(getPath(system, node, baseNode->type, baseNode->id, &path));
      // Use node to relax its neighbours: for each link of node, take the remote node remNode and its
      // current path remPath to baseNode, then compare the old remPath with the candidate path+link.
      // The candidate's bandwidth is min(path->bw, link->bw); if it is larger than remPath->bw,
      // remPath is replaced by path+link.
      for (int l=0; l<node->nlinks; l++) {
        struct ncclTopoLink* link = node->links+l;
        struct ncclTopoNode* remNode = link->remNode;
        if (remNode->paths[baseNode->type] == NULL) {
          NCCLCHECK(ncclCalloc(remNode->paths+baseNode->type, system->nodes[baseNode->type].count));
          for (int i=0; i<system->nodes[baseNode->type].count; i++) remNode->paths[baseNode->type][i].type = PATH_DIS;
        }
        struct ncclTopoLinkList* remPath;
        NCCLCHECK(getPath(system, remNode, baseNode->type, baseNode->id, &remPath));
        float bw = std::min(path->bw, link->bw);

        // allow routing through a GPU only as 1 hop
        if (node != baseNode && node->type == GPU &&
            (ncclParamNvbDisable() || link->type != LINK_NVL || remNode->type != GPU || path->count > 1)) continue;

        if ((remPath->bw == 0 || remPath->count > path->count) && remPath->bw < bw) {
          // Find reverse link
          for (int l=0; l<remNode->nlinks; l++) {
            if (remNode->links[l].remNode == node && remNode->links[l].type == link->type) {
              remPath->list[0] = remNode->links+l;
              break;
            }
          }
          if (remPath->list[0] == NULL) {
            WARN("Failed to find reverse path from remNode %d/%lx nlinks %d to node %d/%lx",
                 remNode->type, remNode->id, remNode->nlinks, node->type, node->id);
            return ncclInternalError;
          }
          // Copy the rest of the path
          for (int i=0; i<path->count; i++) remPath->list[i+1] = path->list[i];
          remPath->count = path->count + 1;
          remPath->bw = bw;

          // Start with path type = link type. PATH and LINK types are supposed to match.
          // Don't consider LINK_NET as we only care about the NIC->GPU path.
          int type = link->type == LINK_NET ? LINK_LOC : link->type;
          // Differentiate between one and multiple PCI switches
          if (node->type == PCI && remNode->type == PCI) type = PATH_PXB;
          // Consider a path going through the CPU as PATH_PHB
          if (link->type == LINK_PCI && (node->type == CPU || link->remNode->type == CPU)) type = PATH_PHB;
          // Set 1 hop NVLink as NVB
          if (node->type == GPU && path->type == PATH_NVL && type == PATH_NVL && remPath->count > 1) type = PATH_NVB;

          remPath->type = std::max(path->type, type);

          // Add to the list for the next iteration if not already in the list
          // nextNodeList holds the neighbours of the nodes processed in this round
          int i;
          for (i=0; i<nextNodeList.count; i++) if (nextNodeList.list[i] == remNode) break;
          if (i == nextNodeList.count) nextNodeList.list[nextNodeList.count++] = remNode;
        }
      }
    }
    // once nodeList has been processed, continue with nextNodeList
    memcpy(&nodeList, &nextNodeList, sizeof(nodeList));
  }
  return ncclSuccess;
}

ncclTopoSetPaths runs a breadth-first search starting from baseNode; the details are in the code comments above. After a path is updated its type (remPath->type) must be recomputed. A convenient trick here is that link types and path types are numerically aligned, e.g. LINK_PCI equals PATH_PIX, and from this we can see how the various path types are derived:

First the current link, taken as a one-hop path, gets a type initialized to the link type: a LINK_PCI edge, for instance, starts as PATH_PIX. If both endpoints of the hop are PCI switches the type becomes PATH_PXB; if one end of the link is a CPU it is upgraded to PATH_PHB. Finally the maximum with the existing path type is taken: remPath->type = std::max(path->type, type).

Back in ncclTopoComputePaths, after these calls to ncclTopoSetPaths the paths between the four node types CPU, GPU, NIC and NVSwitch have all been created. NCCL then checks whether each pair of GPUs can use P2P. If not, a CPU is inserted as a relay: the path between the two GPUs becomes the path from the source GPU to that CPU concatenated with the path from the CPU to the destination GPU.

By default P2P between GPUs is assumed to be available; it is treated as unavailable in the following cases:

1)   NCCL_P2P_DISABLE is set, or the GPU pair does not satisfy the NCCL_P2P_LEVEL setting;

2)   on ARM CPUs, and on Intel and ZHAOXIN CPUs, the p2p level is lowered to PXB;

3)   based on the p2p level and the link between the two GPUs, NCCL decides whether P2P can be used.


  // Update path for GPUs when we don't want to / can't use GPU Direct P2P
  for (int g=0; g<system->nodes[GPU].count; g++) {
    for (int p=0; p<system->nodes[GPU].count; p++) {
      int p2p;
      NCCLCHECK(ncclTopoCheckP2p(system, system->nodes[GPU].nodes[p].id, system->nodes[GPU].nodes[g].id, &p2p, NULL, NULL));
      if (p2p == 0) {
        // Divert all traffic through the CPU
        int cpu;
        NCCLCHECK(getLocalCpu(system, g, &cpu));
        NCCLCHECK(addInterStep(system, CPU, cpu, GPU, p, GPU, g));
      }
    }

Next NCCL checks whether the current GPU can communicate with each other GPU through shared memory (SHM), because in a Docker environment two containers cannot talk over SHM if /dev/shm is mounted differently. If neither P2P nor SHM works, the peer's path is marked as inaccessible (its type is set to PATH_NET in the code below) and the corresponding node will be trimmed later.


    if (comm == NULL) continue;
    // Remove GPUs we can't (or don't want to) communicate with through P2P or SHM
    struct ncclPeerInfo* dstInfo = comm->peerInfo+system->nodes[GPU].nodes[g].gpu.rank;
    for (int p=0; p<system->nodes[GPU].count; p++) {
      if (p == g) continue;
      struct ncclPeerInfo* srcInfo = comm->peerInfo+system->nodes[GPU].nodes[p].gpu.rank;
      int p2p;
      NCCLCHECK(ncclTransports[TRANSPORT_P2P]->canConnect(&p2p, system, NULL, srcInfo, dstInfo));
      if (p2p == 0) {
        int shm;
        NCCLCHECK(ncclTransports[TRANSPORT_SHM]->canConnect(&shm, system, NULL, srcInfo, dstInfo));
        if (shm == 0) {
          // Mark this peer as inaccessible. We'll trim it later.
          system->nodes[GPU].nodes[p].paths[GPU][g].type = PATH_NET;
        }
      }
    }

Finally, just as for the GPUs, ncclTopoSetPaths has computed the paths for all NICs; NCCL then iterates over every NIC and every GPU, checking whether PXN can be used and whether GDR is supported:


  // Update paths for NICs (no GPU Direct, PXN, ...)
  for (int n=0; n<system->nodes[NET].count; n++) {
    struct ncclTopoNode* netNode = system->nodes[NET].nodes+n;

    for (int g=0; g<system->nodes[GPU].count; g++) {
      // Check whether we can access the NIC through another NVLink-connected GPU (PXN)
      struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
      if (ncclPxnDisable(comm) != 1) {
        int localGpuIndex;
        NCCLCHECK(ncclTopoGetLocalGpu(system, netNode->id, &localGpuIndex));
        if (localGpuIndex != g && localGpuIndex != -1) {
          // PXN = PCI + NVLink.
          struct ncclTopoNode* peerNode = system->nodes[GPU].nodes+localGpuIndex;
          // Only use PXN for NIC n if remote GPU p ...
          if (peerNode->paths[NET][n].type <= PATH_PXB && // Is connected to the NIC through PCI
              peerNode->paths[GPU][g].type <= PATH_NVL && // Is connected to us through NVLink
              NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) && // Is on the same node as us
              (peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || // Has either higher BW to that NIC
               gpu->paths[NET][n].type > PATH_PXB)) // or avoids going through a CPU
          // We can use that GPU as relay to communicate with that NIC.
          // Only enabling it in the GPU->NIC direction for now to favor
          // receiving locally and sending remotely (consistent with net.cc)
          NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
        }
      }
      if (gpu->paths[NET][n].type < PATH_PHB) {
        // Update path when we dont want to / can't use GPU Direct RDMA.
        int gdr;
        NCCLCHECK(ncclTopoCheckGdr(system, system->nodes[GPU].nodes[g].id, netNode->id, 0, &gdr));
        if (gdr == 0) {
          // We cannot use GPU Direct RDMA, divert all traffic through the CPU local to the GPU
          int localCpu;
          NCCLCHECK(getLocalCpu(system, g, &localCpu));
          NCCLCHECK(addInterStep(system, CPU, localCpu, NET, n, GPU, g));
          NCCLCHECK(addInterStep(system, CPU, localCpu, GPU, g, NET, n));
        }
      }
    }
  }
  return ncclSuccess;

When ncclTopoCheckGdr decides whether GDR can be used, besides the gdr support detected earlier it also requires the distance between the GPU and the NIC to be no greater than netGdrLevel. netGdrLevel defaults to PATH_PXB and can be overridden by the user; the reason for the PXB default is given in the official GPUDirect RDMA documentation:

Even though the only theoretical requirement for GPUDirect RDMA to work between a third-party device and an NVIDIA GPU is that they share the same root complex, there exist bugs (mostly in chipsets) causing it to perform badly, or not work at all in certain setups.

 

We can distinguish between three situations, depending on what is on the path between the GPU and the third-party device:

 

PCIe switches only

single CPU/IOH

CPU/IOH <-> QPI/HT <-> CPU/IOH

The first situation, where there are only PCIe switches on the path, is optimal and yields the best performance. The second one, where a single CPU/IOH is involved, works, but yields worse performance ( especially peer-to-peer read bandwidth has been shown to be severely limited on some processor architectures ). Finally, the third situation, where the path traverses a QPI/HT link, may be extremely performance-limited or even not work reliably.

In short, performance is best when the path crosses only PCIe switches, worse when it goes through the CPU, and very poor or even unreliable when it crosses the QPI/HT link between NUMA nodes.
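A condensed sketch of the resulting decision in ncclTopoCheckGdr (heavily simplified and paraphrased; the real function also handles the read/write distinction, PXN, and the NCCL_NET_GDR_LEVEL override, so treat this as pseudocode-level):

// Decide whether GPU g should use GPUDirect RDMA with NIC n (simplified sketch).
static ncclResult_t checkGdrSketch(struct ncclTopoSystem* system, int g, int n, int* useGdr) {
  *useGdr = 0;
  struct ncclTopoNode* gpu = system->nodes[GPU].nodes + g;
  struct ncclTopoNode* net = system->nodes[NET].nodes + n;

  // Both the NIC and the GPU must have GDR support detected earlier.
  if (net->net.gdrSupport == 0 || gpu->gpu.gdrSupport == 0) return ncclSuccess;

  // Distance check: only enable GDR if the GPU<->NIC path is "close enough".
  int netGdrLevel = PATH_PXB;              // default; can be overridden by the user
  if (gpu->paths[NET][n].type > netGdrLevel) return ncclSuccess;

  *useGdr = 1;
  return ncclSuccess;
}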

This completes ncclTopoComputePaths. Next, ncclTopoTrimSystem removes unreachable GPU nodes and unused NICs from the graph. The idea is to merge the GPU nodes into connected domains with a union-find-like pass; myDomain is the domain of the GPU owned by the current rank, and every GPU that does not belong to myDomain is removed from the graph. Finally, if the number of ranks in the communicator equals the number of GPUs left in the graph, no NIC is needed, so the NICs are removed as well; a rough sketch of this logic is given below. With the trimmed graph, ncclTopoComputePaths is run once more to obtain the final paths between all nodes.
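(The sketch is simplified and paraphrased from ncclTopoTrimSystem; ncclTopoRemoveNode is assumed to behave as in the NCCL sources, and everything else is illustrative.)

// Group GPUs into connectivity domains and drop everything outside ours (simplified sketch).
static ncclResult_t trimSystemSketch(struct ncclTopoSystem* system, struct ncclComm* comm) {
  int ngpus = system->nodes[GPU].count;
  int domains[NCCL_TOPO_MAX_NODES];
  int64_t ids[NCCL_TOPO_MAX_NODES];
  int myDomain = 0;

  for (int g = 0; g < ngpus; g++) {
    struct ncclTopoNode* gpu = system->nodes[GPU].nodes + g;
    domains[g] = g;
    ids[g] = gpu->id;
    // Merge with any earlier GPU we can reach without going through the network.
    for (int p = 0; p < g; p++) {
      if (gpu->paths[GPU][p].type < PATH_NET) {
        domains[g] = domains[g] < domains[p] ? domains[g] : domains[p];
      }
    }
    if (gpu->gpu.rank == comm->rank) myDomain = domains[g];
  }

  // Remove GPUs that ended up in a different domain than ours.
  for (int i = 0; i < ngpus; i++) {
    if (domains[i] == myDomain) continue;
    for (int g = 0; g < system->nodes[GPU].count; g++) {
      if (system->nodes[GPU].nodes[g].id == ids[i]) {
        NCCLCHECK(ncclTopoRemoveNode(system, GPU, g));
        break;
      }
    }
  }

  // If every rank's GPU is still in the graph, the NICs are not needed at all.
  if (system->nodes[GPU].count == comm->nRanks) {
    for (int n = system->nodes[NET].count - 1; n >= 0; n--) {
      NCCLCHECK(ncclTopoRemoveNode(system, NET, n));
    }
  }
  return ncclSuccess;
}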
