DAOS Internals (Part 1): Overview

The purpose of this document is to describe the internal code structure and major algorithms used by DAOS. It assumes prior knowledge of the DAOS storage model and acronyms. This document contains the following sections:

DAOS Components

As illustrated in the diagram below, a DAOS installation involves several components that can be either colocated or distributed. The DAOS software-defined storage (SDS) framework relies on two different communication channels: an out-of-band TCP/IP network for management and a high-performance fabric for data access. In practice, the same network can be used for both management and data access. IP over fabric can also be used as the management network.

DAOS System

A DAOS server is a multi-tenant daemon running on a Linux instance (i.e. physical node, VM or container) and managing the locally-attached SCM and NVM storage allocated to DAOS. It listens to a management port, addressed by an IP address and a TCP port number, plus one or more fabric endpoints, addressed by network URIs. The DAOS server is configured through a YAML file (/etc/daos/daos_server.yml, or a different path provided on the command line). Starting and stopping the DAOS server can be integrated with different daemon management or orchestration frameworks (e.g. a systemd script, a Kubernetes service or even via a parallel launcher like pdsh or srun).


A DAOS system is identified by a system name and consists of a set of DAOS servers connected to the same fabric. Two different systems comprise two disjoint sets of servers and do not coordinate with each other. DAOS pools cannot span across multiple systems. 


Internally, a DAOS server is composed of multiple daemon processes. The first one to be started is the control plane (binary named daos_server), which is responsible for parsing the configuration file, provisioning storage and eventually starting and monitoring one or more instances of the data plane (binary named daos_engine). The control plane is written in Go and implements the DAOS management API over the gRPC framework, which provides a secured out-of-band channel to administer a DAOS system. The number of data plane instances to be started by each server, as well as the storage, CPU and fabric interface affinity, can be configured through the daos_server.yml YAML configuration file.

The data plane is a multi-threaded process written in C that runs the DAOS storage engine. It processes incoming metadata and I/O requests through the CART communication middleware and accesses local NVM storage via the PMDK (for storage-class memory, aka SCM) and SPDK (for NVMe SSDs) libraries. The data plane relies on Argobots for event-based parallel processing and exports multiple targets that can be independently addressed via the fabric. Each data plane instance is assigned a unique rank inside a DAOS system.
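The relationship between engines, ranks and targets can be pictured with a small Python sketch. This is only an illustrative model: the `Engine` class, the `assign_ranks` helper, the choice to number ranks from 0, and the (rank, target index) addressing pair are all assumptions made for the example and are not taken from the DAOS source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """Hypothetical model of one data plane instance (daos_engine)."""
    rank: int          # unique within the DAOS system
    num_targets: int   # number of targets exported by this engine

    def target_addresses(self):
        # Model each target as a (rank, target index) pair (an assumption).
        return [(self.rank, idx) for idx in range(self.num_targets)]


def assign_ranks(targets_per_engine):
    """Give each engine the next unused rank, starting from 0 (assumed)."""
    return [Engine(rank, n) for rank, n in enumerate(targets_per_engine)]


# Three engines exporting 4, 4 and 8 targets respectively.
engines = assign_ranks([4, 4, 8])
all_targets = [addr for e in engines for addr in e.target_addresses()]
```

Because every rank is unique, every (rank, target index) pair in `all_targets` is unique as well, which is what lets each target be addressed independently.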

The control plane and data plane processes communicate locally through Unix Domain Sockets and a custom lightweight protocol called dRPC. 


For further reading:

Client APIs, Tools and I/O Middleware

Applications, users and administrators can interact with a DAOS system through two different client APIs.


The DAOS management Go package allows a DAOS system to be administered from any node that can communicate with the DAOS servers through the out-of-band management channel. This API is reserved for DAOS system administrators, who are authenticated through a specific certificate. The DAOS management API is intended to be integrated with different vendor-specific storage management or open-source orchestration frameworks. A CLI tool called dmg is built over the DAOS management API. For further reading on the management API and the dmg tool:

The DAOS library (libdaos) implements the DAOS storage model and is primarily targeted at application and I/O middleware developers who want to store datasets in DAOS containers. It can be used from any node connected to the fabric used by the targeted DAOS system. The application process is authenticated via the DAOS agent (see next section). The API exported by libdaos is commonly called the DAOS API (in contrast to the DAOS management API) and allows managing containers and accessing DAOS objects through different interfaces (e.g. key-value store or array API). The libdfs library emulates POSIX file and directory abstractions over libdaos and provides a smooth migration path for applications that require a POSIX namespace. For further reading on libdaos, bindings for different programming languages and libdfs:

The libdaos and libdfs libraries provide the foundation to support domain-specific data formats like HDF5 and Apache Arrow. For further reading on I/O middleware integration, please check the following external references:


The DAOS agent is a daemon residing on the client nodes. It interacts with the DAOS client library through dRPC to authenticate the application process. It is a trusted entity that can sign the DAOS Client credentials using local certificates. The DAOS agent can support different authentication frameworks and uses a Unix Domain Socket to communicate with the client library. The DAOS agent is written in Go and communicates through gRPC with the control plane component of each DAOS server to provide DAOS system membership information to the client library and to support pool listing.


Network Transport and Communications

As introduced in the previous section, DAOS uses three different communication channels.

gRPC and Protocol Buffers

gRPC provides a bi-directional secured channel for DAOS management. It relies on TLS/SSL to authenticate the administrator role and the servers. Protocol buffers are used for RPC serialization and all proto files are located in the proto directory.


dRPC is a communication channel built over Unix Domain Sockets that is used for inter-process communication. It provides both C and Go interfaces to support interactions between:


  • the daos_agent and libdaos for application process authentication
  • the daos_server (control plane) and the daos_engine (data plane) daemons

As with gRPC, RPCs are serialized via protocol buffers.
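To make the idea of a request/response exchange over a Unix domain socket concrete, here is a minimal Python sketch. It is not the dRPC wire protocol: the 4-byte length prefix, the JSON body (standing in for protocol buffers), and the `handle` function with its `method`/`status` fields are all assumptions chosen to keep the example self-contained.

```python
import json
import socket
import struct


def send_msg(sock, payload: dict) -> None:
    """Send one message as a 4-byte big-endian length followed by the body."""
    body = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the socket")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock) -> dict:
    """Receive one length-prefixed message and decode its JSON body."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def handle(request: dict) -> dict:
    """Hypothetical handler: echo the method name with an OK status."""
    return {"method": request["method"], "status": "ok"}


# On Linux, socketpair() returns two connected Unix domain sockets.
client, server = socket.socketpair()
send_msg(client, {"method": "ping"})        # caller side sends a request
send_msg(server, handle(recv_msg(server)))  # callee side reads and replies
reply = recv_msg(client)
client.close()
server.close()
```

Explicit framing is needed because a stream socket does not preserve message boundaries; a length prefix is one common way to recover them.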


CART is a userspace function shipping library that provides low-latency high-bandwidth communications for the DAOS data plane. It supports RDMA capabilities and scalable collective operations. CART is built over Mercury and libfabric. The CART library is used for all communications between libdaos and daos_engine instances.

DAOS Layering and Services


As shown in the diagram below, the DAOS stack is structured as a collection of storage services over a client/server architecture. Examples of DAOS services are the pool, container, object and rebuild services.


A DAOS service can be spread across the control and data planes and communicates internally through dRPC. Most services have client and server components that can synchronize through gRPC or CART. Cross-service communications are always done through direct API calls. Those function calls can be invoked across either the client or server component of the services. While each DAOS service is designed to be fairly autonomous and isolated, some are more tightly coupled than others. That is typically the case for the rebuild service, which needs to interact closely with the pool, container and object services to restore data redundancy after a DAOS server failure.


While the service-based architecture offers flexibility and extensibility, it is combined with a set of infrastructure libraries that provide a rich software ecosystem (e.g. communications, persistent storage access, asynchronous task execution with dependency graph, accelerator support, ...) accessible to all the DAOS services.


Source Code Structure

Each infrastructure library and service is allocated a dedicated directory under src/. The client and server components of a service are stored in separate files. Functions that are part of the client component are prefixed with dc_ (stands for DAOS Client) whereas server-side functions use the ds_ prefix (stands for DAOS Server). The protocol and RPC format used between the client and server components are usually defined in a header file named rpc.h.

All the Go code executed in the context of the control plane is located under src/control. Management and security are the services spread across the control (Go language) and data (C language) planes and communicating internally through dRPC.

Headers for the official DAOS API exposed to the end user (i.e. I/O middleware or application developers) are under src/include and use the daos_ prefix. Each infrastructure library exports an API that is available under src/include/daos and can be used by any service. The client-side API (with dc_ prefix) exported by a given service is also stored under src/include/daos whereas the server-side interfaces (with ds_ prefix) are under src/include/daos_srv.
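The naming and directory conventions above can be summarized as a small lookup. The sketch below is only a reading aid: `classify_symbol` is a hypothetical helper, and the function names passed to it are made-up examples rather than real DAOS symbols.

```python
def classify_symbol(name: str) -> str:
    """Map a function name to its component and header location by prefix."""
    if name.startswith("dc_"):
        return "client-side service API (src/include/daos)"
    if name.startswith("ds_"):
        return "server-side service interface (src/include/daos_srv)"
    if name.startswith("daos_"):
        return "public DAOS API (src/include)"
    return "unclassified"


# Hypothetical names, used only to exercise each prefix.
examples = {
    name: classify_symbol(name)
    for name in ("dc_example_open", "ds_example_handler", "daos_example_call")
}
```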

Infrastructure Libraries

The GURT and common DAOS (i.e. libdaos_common) libraries provide logging, debugging and common data structures (e.g. hash table, btree, ...) to the DAOS services.

Local NVM storage is managed by the Versioning Object Store (VOS) and blob I/O (BIO) libraries. VOS implements the persistent index in SCM whereas BIO is responsible for storing application data in either NVMe SSD or SCM depending on the allocation strategy. The VEA layer is integrated into VOS and manages block allocation on NVMe SSDs.

DAOS objects are distributed across multiple targets for both performance (i.e. sharding) and resilience (i.e. replication or erasure code). The placement library implements different algorithms (e.g. ring-based placement, jump consistent hash, ...) to generate the layout of an object from the list of targets and the object identifier.
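As an illustration of one of the techniques named above, here is the generic jump consistent hash algorithm in Python. It is the widely published form of the algorithm, not the DAOS placement library's code, which also has to account for sharding and redundancy; the function name and the idea of using the result directly as a target index are assumptions for this sketch.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit wrap-around


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket index in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step that drives the pseudo-random jumps.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


# Hypothetical use: pick a target index for an object identifier.
object_id = 0x1234ABCD
target_index = jump_consistent_hash(object_id, num_buckets=16)
```

The useful property for placement is stability: when the number of buckets grows from n to n + 1, a key either keeps its bucket or moves to the new bucket n, so only a small fraction of keys is remapped.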


Finally, the replicated service (RSVC) library provides some common code to support fault tolerance. This is used by the pool, container and management services in conjunction with the RDB library, which implements a replicated key-value store over Raft.
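Raft-based replication rests on a majority rule: an update is considered committed once a majority of replicas hold it. The sketch below shows only that arithmetic. It is a generic illustration of the Raft quorum rule, not code from RSVC or RDB, and all function names are hypothetical.

```python
def quorum_size(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    return num_replicas // 2 + 1


def is_committed(acks: int, num_replicas: int) -> bool:
    """True once a majority (counting the leader itself) holds the entry."""
    return acks >= quorum_size(num_replicas)


def tolerated_failures(num_replicas: int) -> int:
    """Replicas that can fail while a majority still remains."""
    return num_replicas - quorum_size(num_replicas)
```

With three replicas the service survives one failure, and with five it survives two, which is why odd replica counts are the usual choice.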


For further reading on those infrastructure libraries, please see:

DAOS Services

The diagram below shows the internal layering of the DAOS services and interactions with the different libraries mentioned above. 


Vertical boxes represent DAOS services whereas horizontal ones are for infrastructure libraries.

For further reading on the internals of each service:

Software Compatibility

Interoperability in DAOS is handled via protocol and schema versioning for persistent data structures.

Protocol Compatibility

Limited protocol interoperability is to be provided by the DAOS storage stack. Version compatibility checks will be performed to verify that:

  • All targets in the same pool run the same protocol version.
  • Client libraries linked with the application may be up to one protocol version older than the targets.

If a protocol version mismatch is detected among storage targets in the same pool, the entire DAOS system will fail to start up and will report failure to the control API. Similarly, connection from clients running a protocol version incompatible with the targets will return an error.
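The two rules above can be sketched as a pair of checks. This assumes protocol versions are plain integers and that a client newer than the targets is rejected; neither detail is stated in the text, and the function names are hypothetical.

```python
def targets_consistent(target_versions) -> bool:
    """All targets in the same pool must run the same protocol version."""
    return len(set(target_versions)) == 1


def client_compatible(client_version: int, target_version: int) -> bool:
    """A client may be at most one protocol version older than the targets.

    Treating a newer client as incompatible is an assumption of this sketch.
    """
    return 0 <= target_version - client_version <= 1


def check_connection(client_version: int, target_versions) -> str:
    """Return a short status string for a client connecting to a pool."""
    if not targets_consistent(target_versions):
        return "error: protocol version mismatch among targets"
    if not client_compatible(client_version, target_versions[0]):
        return "error: client protocol version incompatible"
    return "ok"
```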


PM Schema Compatibility and Upgrade

The schema of persistent data structures may evolve from time to time to fix bugs, add new optimizations or support new features. To that end, the persistent data structures support schema versioning.

Upgrading the schema version is not done automatically and must be initiated by the administrator. A dedicated upgrade tool will be provided to upgrade the schema version to the latest one. All targets in the same pool must have the same schema version. Version checks are performed at system initialization time to enforce this constraint.

To limit the validation matrix, each new DAOS release will be published with a list of supported schema versions. To run with the new DAOS release, administrators will then need to upgrade the DAOS system to one of the supported schema versions. New targets will always be formatted with the latest version. This versioning scheme only applies to data structures stored in persistent memory and not to block storage, which only stores user data with no metadata.
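The schema constraints described in this section can be expressed as a short validation routine. This is a sketch under assumptions: schema versions are modeled as integers, the supported list is passed in by the caller, and `check_pool_schema` and its return strings are hypothetical rather than part of any DAOS tool.

```python
def check_pool_schema(target_schema_versions, supported_versions) -> str:
    """Validate the schema versions of all targets in one pool."""
    versions = set(target_schema_versions)
    if not versions:
        raise ValueError("pool has no targets")
    if len(versions) != 1:
        # All targets in the same pool must share one schema version.
        return "error: schema version mismatch among targets"
    (version,) = versions
    if version not in supported_versions:
        # The administrator must run the upgrade explicitly.
        return "error: upgrade required"
    return "ok"


def format_version_for_new_target(supported_versions) -> int:
    """New targets are formatted with the latest supported version."""
    return max(supported_versions)
```

For example, with supported versions {2, 3}, a pool whose targets are all at version 1 is reported as needing an upgrade, while a newly added target would be formatted at version 3.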




