
Micron 9400 NVMe SSDs Explore Big Accelerator Memory Using NVIDIA Technology

John Mazzie | January 2024

Datasets used for training continue to grow in scale, exceeding billions of parameters. While some models can fit in system memory completely, larger models cannot. In these cases, data loaders need to access models located on flash storage through various methods. One such method is to store memory-mapped files on SSDs. This allows the data loader to access a file as if it were in memory, but the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big accelerator Memory (BaM) and the GPU-Initiated Direct Storage (GIDS) data loader come in.
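To make that baseline concrete, here is a minimal sketch of the memory-map approach in Python. The file name, node count, and feature width are illustrative placeholders (not the IGB layout); the point is that every feature lookup is serviced by the CPU and OS page cache before the data ever reaches the GPU.

```python
# Minimal sketch of the memory-map baseline described above
# (assumed file name and sizes; not the actual IGB layout).
import numpy as np
import torch

NUM_NODES, FEAT_DIM = 100_000_000, 1024        # illustrative sizes only

# Map the on-SSD feature file so it can be indexed like an in-memory array.
features = np.memmap("features.bin", dtype=np.float32,
                     mode="r", shape=(NUM_NODES, FEAT_DIM))

def gather_features(node_ids: torch.Tensor) -> torch.Tensor:
    # Each fancy index is serviced by CPU page faults and the OS page cache;
    # only then is the gathered block copied over to the GPU.
    batch = torch.from_numpy(features[node_ids.numpy()])
    return batch.cuda(non_blocking=True)
```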

What are BaM and GIDS?

 

BaM is a system architecture that takes advantage of the low latency, extremely high throughput, high density, and endurance of SSDs. BaM's goal is to provide efficient abstractions that enable GPU threads to make fine-grained accesses to datasets on SSDs and achieve much higher performance than solutions that require the CPU(s) to issue storage requests on behalf of the GPU. BaM acceleration uses a custom storage driver designed specifically to let the inherent parallelism of GPUs access storage devices directly. BaM differs from NVIDIA Magnum IO™ GPUDirect® Storage (GDS) in that BaM doesn't rely on the CPU to prepare the communication from GPU to SSD.

Micron's previous work with NVIDIA GDS can be found here:

The GIDS data loader is built on the BaM subsystem to address the memory capacity requirements of GPU-accelerated graph neural network (GNN) training while also masking storage latency. GIDS accomplishes this by storing the graph's feature data on the SSD, since this data is typically the largest part of the total graph dataset for large-scale graphs. The graph structure data, which is usually much smaller than the feature data, is pinned in system memory to enable fast GPU graph sampling. Lastly, the GIDS data loader allocates a software-defined cache in GPU memory for recently accessed nodes in order to reduce storage accesses.
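For context, a conventional DGL neighbor-sampling loop looks roughly like the sketch below. The graph, feature tensor, and fan-outs are toy placeholders rather than the IGB training code; with GIDS, the feature lookup at the end of the loop would instead be served by GPU-initiated reads from the SSDs, while the sampler keeps using the structure pinned in system memory.

```python
# Rough shape of a standard DGL sampling/training loop (toy graph and sizes;
# the real setup uses the IGB dataset and a GNN model).
import dgl
import torch

g = dgl.rand_graph(1_000_000, 10_000_000)            # structure stays in system memory
g.ndata["feat"] = torch.randn(g.num_nodes(), 128)    # with GIDS this tensor lives on the SSDs
train_nids = torch.arange(100_000)

sampler = dgl.dataloading.NeighborSampler([10, 15])  # 2-hop fan-out (illustrative)
loader = dgl.dataloading.DataLoader(g, train_nids, sampler,
                                    batch_size=1024, shuffle=True, num_workers=0)

for input_nodes, output_nodes, blocks in loader:
    # Sampling above only touched the in-memory structure. The lookup below is
    # the feature-aggregation step that GIDS redirects to the SSDs.
    feats = g.ndata["feat"][input_nodes].cuda(non_blocking=True)
    # ... GNN forward/backward over `blocks` and `feats` would go here ...
```

Only the feature-fetch step changes in the GIDS-accelerated runs; sampling and the model's forward/backward pass are unaffected, which matches the per-phase breakdown shown later in Table 1.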

GNN Training with GIDS

 

To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. This dataset is 2.28TB in size and does not fit in system memory on most platforms. We timed the training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as shown in Figure 1 and Table 1.
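The per-phase numbers in Table 1 come from simple wall-clock timing around each stage of the loop. The sketch below shows one way to take such measurements; the sampling, feature-fetch, and training functions are stand-ins, not the actual GIDS/DGL calls.

```python
# One way to collect the per-phase times reported in Table 1. The three
# stage functions below are placeholders, not the actual GIDS/DGL calls.
import time
import torch

def timed(fn, *args):
    torch.cuda.synchronize()                 # include pending GPU work
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

def sample_batch():            return torch.randint(0, 1_000_000, (50_000,))
def fetch_features(node_ids):  return torch.randn(len(node_ids), 1024, device="cuda")
def train_step(feats):         return feats.sum()

sampling_s = aggregation_s = training_s = 0.0
for _ in range(100):                                   # 100 iterations, as in Table 1
    nids, dt = timed(sample_batch);          sampling_s    += dt
    feats, dt = timed(fetch_features, nids); aggregation_s += dt
    _, dt = timed(train_step, feats);        training_s    += dt

print(f"sampling {sampling_s:.2f}s  feature aggregation {aggregation_s:.2f}s  "
      f"training {training_s:.2f}s")
```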

Figure 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations

 

 

                        GIDS (4 SSDs)   GIDS (2 SSDs)   GIDS (1 SSD)   Memory-Map Abstraction
Sampling                     4.75            4.93            4.08              4.65
Feature Aggregation          8.57           15.9            31.6           1,130
Training                     1.98            1.97            1.87              2.13
End-to-End                  15.3            22.8            37.6           1,143

Table 1: GIDS Training Time (seconds) for IGB-Heterogeneous Full Dataset - 100 Iterations


The first part of the training is graph sampling, which the GPU performs by accessing the graph structure data held in system memory (shown in blue). This value varies little across the different test configurations because the structure stored in system memory does not change between these tests.

The other portion is the actual training time (shown in green on the far right). This part is highly dependent on the GPU, and as expected it changes little between the test configurations.

The most important part, and where we see the largest difference, is feature aggregation (shown in gold). Because the feature data in this system is stored on the Micron 9400 SSDs, scaling from 1 to 4 Micron 9400 SSDs drastically improves (reduces) the feature aggregation time: roughly a 3.7x improvement going from 1 SSD to 4 SSDs (31.6 seconds down to 8.57 seconds).

We also included a baseline measurement, which uses a memory-map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method of accessing the feature data requires the CPU software stack instead of direct access by the GPU, it shows how inefficient the CPU software stack is at keeping the GPU saturated during training. The improvement in feature aggregation over the baseline is 35.76x using GIDS with 1 Micron 9400 NVMe SSD and over 131x with 4 Micron 9400 NVMe SSDs. Another view of this data is shown in Figure 2 and Table 2, which show the effective bandwidth and IOPS achieved during these tests.
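Those speedups fall straight out of the feature-aggregation times in Table 1; the short check below simply re-derives them.

```python
# Re-deriving the speedups above from the Table 1 feature-aggregation times
# (seconds for 100 iterations).
times = {"Memory-map baseline": 1130.0,
         "GIDS, 1 SSD": 31.6,
         "GIDS, 2 SSDs": 15.9,
         "GIDS, 4 SSDs": 8.57}

baseline = times["Memory-map baseline"]
for config, t in times.items():
    print(f"{config:20s} {t:8.2f} s   {baseline / t:7.2f}x vs. baseline")
# GIDS, 1 SSD -> ~35.8x; GIDS, 4 SSDs -> ~131.9x

print(f"4 SSDs vs. 1 SSD: {times['GIDS, 1 SSD'] / times['GIDS, 4 SSDs']:.2f}x")  # ~3.69x
```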

Figure 2: Effective Bandwidth and IOPS for GIDS Training vs. Baseline

 

 

                             DGL Memory Map   GIDS (1 SSD)   GIDS (2 SSDs)   GIDS (4 SSDs)
Effective Bandwidth (GB/s)        0.194            6.9            13.8            25.6
Achieved IOPS (M/s)               0.049            1.7             3.4             6.3

Table 2: Effective Bandwidth and IOPS for GIDS Training vs. Baseline


As datasets continue to grow, a shift in paradigm is needed to train these models in a reasonable amount of time and to take advantage of the improvements provided by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more of these types of systems in the future.

Test System

 

Component        Details
Server           Supermicro® 4124GS-TNR
CPU              2x AMD EPYC™ 7702 (64-core)
Memory           1TB Micron DDR4-3200
GPU              NVIDIA A100 80GB (memory clock: 1512 MHz, SM clock: 1410 MHz)
SSDs             4x Micron 9400 MAX 6.4TB
OS               Ubuntu 22.04 LTS, kernel 5.15.0-86
NVIDIA Driver    535.113.01
Software Stack   CUDA 12.2, DGL 1.1.2, PyTorch 2.1 running in an NVIDIA Docker container

Principal Storage Solutions Engineer

John Mazzie

John is a Member of the Technical Staff in the Data Center Workload Engineering group in Austin, TX. He graduated from West Virginia University in 2008 with an MSEE, with an emphasis on wireless communications. John worked for Dell on its MD3 series of storage arrays on both the development and sustaining sides. John joined Micron in 2016, where he has worked on Cassandra, MongoDB, and Ceph, as well as other advanced storage workloads.