
Micron 9400 NVMe SSDs Explore Big Accelerator Memory Using NVIDIA Technology

John Mazzie | January 2024

Datasets used for training continue to grow in scale, exceeding billions of parameters. While some models can fit in system memory completely, larger models cannot. In these cases, data loaders need to access models located on flash storage through various methods. One such method is to store memory-mapped files on SSDs. This allows the data loader to access a file as if it were in memory, but the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big accelerator Memory (BaM) and the GPU-Initiated Direct Storage (GIDS) data loader come in.
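To make that baseline concrete, here is a minimal sketch of the memory-map approach in Python. The file name, node count, and feature width are illustrative placeholders (not the IGB layout); the point is that every feature lookup is serviced by the CPU and OS page cache before the data ever reaches the GPU.

```python
# Minimal sketch of the memory-map baseline described above
# (assumed file name and sizes; not the actual IGB layout).
import numpy as np
import torch

NUM_NODES, FEAT_DIM = 100_000_000, 1024        # illustrative sizes only

# Map the on-SSD feature file so it can be indexed like an in-memory array.
features = np.memmap("features.bin", dtype=np.float32,
                     mode="r", shape=(NUM_NODES, FEAT_DIM))

def gather_features(node_ids: torch.Tensor) -> torch.Tensor:
    # Each fancy index is serviced by CPU page faults and the OS page cache;
    # only then is the gathered block copied over to the GPU.
    batch = torch.from_numpy(features[node_ids.numpy()])
    return batch.cuda(non_blocking=True)
```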

What are BaM and GIDS?

 

BaM is a system architecture that takes advantage of the low latency, extremely high throughput, high density, and endurance of SSDs. BaM's goal is to provide efficient abstractions that enable GPU threads to make fine-grained accesses to datasets on SSDs and achieve much higher performance than solutions that require the CPU(s) to issue storage requests on behalf of the GPU. BaM acceleration uses a custom storage driver designed specifically to let the inherent parallelism of GPUs access storage devices directly. BaM differs from NVIDIA Magnum IO™ GPUDirect® Storage (GDS) in that BaM doesn't rely on the CPU to prepare the communication from GPU to SSD.

Micron's previous work with NVIDIA GDS can be found here:

The GIDS data loader is built on the BaM subsystem to address the memory capacity requirements of GPU-accelerated graph neural network (GNN) training while also masking storage latency. GIDS accomplishes this by storing the graph's feature data on the SSD, since this data is typically the largest part of the total graph dataset for large-scale graphs. The graph structure data, which is usually much smaller than the feature data, is pinned in system memory to enable fast GPU graph sampling. Lastly, the GIDS data loader allocates a software-defined cache in GPU memory for recently accessed nodes in order to reduce storage accesses.
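For context, a conventional DGL neighbor-sampling loop looks roughly like the sketch below. The graph, feature tensor, and fan-outs are toy placeholders rather than the IGB training code; with GIDS, the feature lookup at the end of the loop would instead be served by GPU-initiated reads from the SSDs, while the sampler keeps using the structure pinned in system memory.

```python
# Rough shape of a standard DGL sampling/training loop (toy graph and sizes;
# the real setup uses the IGB dataset and a GNN model).
import dgl
import torch

g = dgl.rand_graph(1_000_000, 10_000_000)            # structure stays in system memory
g.ndata["feat"] = torch.randn(g.num_nodes(), 128)    # with GIDS this tensor lives on the SSDs
train_nids = torch.arange(100_000)

sampler = dgl.dataloading.NeighborSampler([10, 15])  # 2-hop fan-out (illustrative)
loader = dgl.dataloading.DataLoader(g, train_nids, sampler,
                                    batch_size=1024, shuffle=True, num_workers=0)

for input_nodes, output_nodes, blocks in loader:
    # Sampling above only touched the in-memory structure. The lookup below is
    # the feature-aggregation step that GIDS redirects to the SSDs.
    feats = g.ndata["feat"][input_nodes].cuda(non_blocking=True)
    # ... GNN forward/backward over `blocks` and `feats` would go here ...
```

Only the feature-fetch step changes in the GIDS-accelerated runs; sampling and the model's forward/backward pass are unaffected, which matches the per-phase breakdown shown later in Table 1.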

GNN Training with GIDS

 

To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. This dataset is 2.28TB in size and does not fit in system memory on most platforms. We timed the training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as shown in Figure 1 and Table 1.
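The per-phase numbers in Table 1 come from simple wall-clock timing around each stage of the loop. The sketch below shows one way to take such measurements; the sampling, feature-fetch, and training functions are stand-ins, not the actual GIDS/DGL calls.

```python
# One way to collect the per-phase times reported in Table 1. The three
# stage functions below are placeholders, not the actual GIDS/DGL calls.
import time
import torch

def timed(fn, *args):
    torch.cuda.synchronize()                 # include pending GPU work
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

def sample_batch():            return torch.randint(0, 1_000_000, (50_000,))
def fetch_features(node_ids):  return torch.randn(len(node_ids), 1024, device="cuda")
def train_step(feats):         return feats.sum()

sampling_s = aggregation_s = training_s = 0.0
for _ in range(100):                                   # 100 iterations, as in Table 1
    nids, dt = timed(sample_batch);          sampling_s    += dt
    feats, dt = timed(fetch_features, nids); aggregation_s += dt
    _, dt = timed(train_step, feats);        training_s    += dt

print(f"sampling {sampling_s:.2f}s  feature aggregation {aggregation_s:.2f}s  "
      f"training {training_s:.2f}s")
```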

Figure 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations

 

 

                        GIDS (4 SSDs)   GIDS (2 SSDs)   GIDS (1 SSD)   Memory-Map Abstraction
Sampling                     4.75            4.93            4.08              4.65
Feature Aggregation          8.57           15.9            31.6           1,130
Training                     1.98            1.97            1.87              2.13
End-to-End                  15.3            22.8            37.6           1,143

Table 1: GIDS Training Time (seconds) for IGB-Heterogeneous Full Dataset - 100 Iterations


The first part of the training is graph sampling, which the GPU performs by accessing the graph structure data held in system memory (shown in blue). This value varies little across the different test configurations because the structure stored in system memory does not change between these tests.

The other portion is the actual training time (shown in green on the far right). This part is highly dependent on the GPU, and as expected it changes little between the test configurations.

The most important part, and where we see the largest difference, is feature aggregation (shown in gold). Because the feature data in this system is stored on the Micron 9400 SSDs, scaling from 1 to 4 Micron 9400 SSDs drastically improves (reduces) the feature aggregation time: roughly a 3.7x improvement going from 1 SSD to 4 SSDs (31.6 seconds down to 8.57 seconds).

We also included a baseline measurement, which uses a memory-map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method of accessing the feature data requires the CPU software stack instead of direct access by the GPU, it shows how inefficient the CPU software stack is at keeping the GPU saturated during training. The improvement in feature aggregation over the baseline is 35.76x using GIDS with 1 Micron 9400 NVMe SSD and over 131x with 4 Micron 9400 NVMe SSDs. Another view of this data is shown in Figure 2 and Table 2, which show the effective bandwidth and IOPS achieved during these tests.
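Those speedups fall straight out of the feature-aggregation times in Table 1; the short check below simply re-derives them.

```python
# Re-deriving the speedups above from the Table 1 feature-aggregation times
# (seconds for 100 iterations).
times = {"Memory-map baseline": 1130.0,
         "GIDS, 1 SSD": 31.6,
         "GIDS, 2 SSDs": 15.9,
         "GIDS, 4 SSDs": 8.57}

baseline = times["Memory-map baseline"]
for config, t in times.items():
    print(f"{config:20s} {t:8.2f} s   {baseline / t:7.2f}x vs. baseline")
# GIDS, 1 SSD -> ~35.8x; GIDS, 4 SSDs -> ~131.9x

print(f"4 SSDs vs. 1 SSD: {times['GIDS, 1 SSD'] / times['GIDS, 4 SSDs']:.2f}x")  # ~3.69x
```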

Figure 2: Effective Bandwidth and IOPS for GIDS Training vs. Baseline

 

 

                             DGL Memory Map   GIDS (1 SSD)   GIDS (2 SSDs)   GIDS (4 SSDs)
Effective Bandwidth (GB/s)        0.194            6.9            13.8            25.6
Achieved IOPS (M/s)               0.049            1.7             3.4             6.3

Table 2: Effective Bandwidth and IOPS for GIDS Training vs. Baseline


As datasets continue to grow, a shift in paradigm is needed to train these models in a reasonable amount of time and to take advantage of the improvements provided by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more of these types of systems in the future.

Test System

 

Component        Details
Server           Supermicro® 4124GS-TNR
CPU              2x AMD EPYC™ 7702 (64-core)
Memory           1TB Micron DDR4-3200
GPU              NVIDIA A100 80GB (memory clock: 1512 MHz, SM clock: 1410 MHz)
SSDs             4x Micron 9400 MAX 6.4TB
OS               Ubuntu 22.04 LTS, kernel 5.15.0-86
NVIDIA Driver    535.113.01
Software Stack   CUDA 12.2, DGL 1.1.2, PyTorch 2.1 running in an NVIDIA Docker container

Principal Storage Solutions Engineer

John Mazzie

John is a Member of the Technical Staff in the Data Center Workload Engineering group in Austin, TX. He graduated from West Virginia University in 2008 with an MSEE, with an emphasis on wireless communications. John worked for Dell on its MD3 series of storage arrays on both the development and sustaining sides. John joined Micron in 2016, where he has worked on Cassandra, MongoDB, and Ceph, as well as other advanced storage workloads.