Training dataset sizes continue to grow, and models now exceed billions of parameters. While some models can fit in system memory completely, larger models cannot. In these cases, data loaders need to access model data located on flash storage through various methods. One such method is storing memory-mapped files on SSDs. This allows the data loader to access a file as if it were in memory, but the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big accelerator Memory (BaM)* and the GPU-Initiated Direct Storage (GIDS)* data loader come in.
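To make the memory-map abstraction concrete, here is a minimal sketch in Python. The file name, record layout, and helper function are hypothetical; the point is that every access goes through the CPU page cache and software stack, which is exactly the overhead BaM and GIDS are designed to avoid.

```python
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 8  # one little-endian float64 "feature" per node (illustrative)

# Create a small stand-in for a feature file that would live on an SSD.
path = os.path.join(tempfile.mkdtemp(), "features.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(struct.pack("<d", float(i)))

def load_feature(mm: mmap.mmap, node_id: int) -> float:
    """Read one node's feature by byte offset, as if the file were in memory."""
    off = node_id * RECORD_SIZE
    return struct.unpack("<d", mm[off:off + RECORD_SIZE])[0]

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    feature_42 = load_feature(mm, 42)    # 42.0
    feature_999 = load_feature(mm, 999)  # 999.0
    mm.close()
```

The slicing syntax hides the fact that each read may trigger a page fault serviced by the OS, which is where the CPU bottleneck described above comes from.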
What are BaM and GIDS?
BaM is a system architecture that takes advantage of the low latency, extremely high throughput, high density, and endurance of SSDs. BaM's goal is to provide efficient abstractions that enable GPU threads to make fine-grained accesses to datasets on SSDs and achieve much higher performance than solutions that require the CPU to issue storage requests on behalf of the GPU. BaM acceleration uses a custom storage driver designed specifically to let the inherent parallelism of GPUs access storage devices directly. BaM differs from NVIDIA Magnum IO™ GPUDirect® Storage (GDS) in that BaM doesn't rely on the CPU to prepare the communication from GPU to SSD.
Micron's previous collaborations with NVIDIA GDS include:
- The Micron® 9400 NVMe™ SSD Performance With NVIDIA Magnum IO GPUDirect Storage Platform
- Micron Collaboration With Magnum IO GPUDirect Storage Brings Industry-Disruptive Innovation to AI & ML
The GIDS dataloader is built on the BaM subsystem to address the memory capacity requirements of GPU-accelerated graph neural network (GNN) training while also masking storage latency. GIDS accomplishes this by storing the graph's feature data on the SSD, since this data is typically the largest part of the total graph dataset for large-scale graphs. The graph structure data, which is usually much smaller than the feature data, is pinned in system memory to enable fast GPU graph sampling. Lastly, the GIDS dataloader allocates a software-defined cache in GPU memory for recently accessed nodes in order to reduce storage accesses.
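The effect of that software-defined cache can be illustrated with a small sketch. This is not the actual GIDS implementation (the class, capacity, and feature layout are hypothetical, and GIDS keeps its cache in GPU memory); it only shows how caching recently accessed nodes cuts down on storage reads when some nodes are "hot":

```python
from collections import OrderedDict

class NodeFeatureCache:
    """Hypothetical LRU cache for node features, standing in for the
    software-defined cache GIDS allocates in GPU memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()   # node_id -> feature vector
        self.storage_reads = 0       # counts simulated SSD accesses

    def _read_from_ssd(self, node_id: int):
        self.storage_reads += 1
        return [float(node_id)] * 4  # stand-in feature vector

    def get(self, node_id: int):
        if node_id in self.cache:
            self.cache.move_to_end(node_id)   # mark as recently used
            return self.cache[node_id]
        feat = self._read_from_ssd(node_id)
        self.cache[node_id] = feat
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return feat

cache = NodeFeatureCache(capacity=2)
for nid in [7, 7, 7, 8, 7, 9, 7]:  # node 7 is "hot"
    cache.get(nid)
print(cache.storage_reads)  # 3: nodes 7, 8, 9 each read once; repeats hit cache
```

Seven lookups cost only three storage reads, which is why caching hot nodes matters when every miss is an SSD access.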
Graph neural network training with GIDS
To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. This dataset is 2.28 TB in size and does not fit in the system memory of most platforms. We timed the training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as shown in Figure 1 and Table 1.
Figure 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations
| | GIDS (4 SSDs) | GIDS (2 SSDs) | GIDS (1 SSD) | Memory-Map Abstraction |
| --- | --- | --- | --- | --- |
| Sampling | 4.75 | 4.93 | 4.08 | 4.65 |
| Feature Aggregation | 8.57 | 15.9 | 31.6 | 1,130 |
| Training | 1.98 | 1.97 | 1.87 | 2.13 |
| End-to-End | 15.3 | 22.8 | 37.6 | 1,143 |
Table 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations
The first part of the training is graph sampling, which the GPU performs by accessing the graph structure data held in system memory (shown in blue). This value varies little across the test configurations because the structure stored in system memory does not change between these tests.
The other part is the actual training time (shown in green, rightmost). This portion is highly GPU-dependent, and as expected it changes little across the test configurations.
The most important part, where we see the biggest difference, is feature aggregation (shown in gold). Because the system's feature data is stored on Micron 9400 SSDs, scaling from one to four Micron 9400 SSDs drastically improves (reduces) the feature aggregation processing time: feature aggregation improves by 3.69x when scaling from one SSD to four.
We also included a baseline run, which uses a memory-map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method of accessing the feature data requires the CPU software stack instead of direct access by the GPU, it shows how inefficient the CPU software stack is at keeping the GPU saturated during training. The improvement in feature aggregation over the baseline is 35.76x using GIDS with one Micron 9400 NVMe SSD and over 131x with four Micron 9400 NVMe SSDs. Another view of this data appears in Figure 2 and Table 2, which show the effective bandwidth and IOPS achieved during these tests.
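The speedups quoted above follow directly from the feature aggregation row of Table 1, as a quick sanity check shows:

```python
# Feature aggregation times from Table 1 (same time units across configs).
feature_aggregation = {
    "mmap_baseline": 1130,
    "gids_1_ssd": 31.6,
    "gids_2_ssd": 15.9,
    "gids_4_ssd": 8.57,
}

ssd_scaling = feature_aggregation["gids_1_ssd"] / feature_aggregation["gids_4_ssd"]
vs_baseline_1 = feature_aggregation["mmap_baseline"] / feature_aggregation["gids_1_ssd"]
vs_baseline_4 = feature_aggregation["mmap_baseline"] / feature_aggregation["gids_4_ssd"]

print(f"1 -> 4 SSD scaling: {ssd_scaling:.2f}x")          # 3.69x
print(f"GIDS (1 SSD) vs baseline: {vs_baseline_1:.2f}x")  # 35.76x
print(f"GIDS (4 SSDs) vs baseline: {vs_baseline_4:.2f}x") # 131.86x
```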
Figure 2: Effective Bandwidth and IOPS for GIDS Training vs. the Baseline
| | DGL Memory Map | GIDS (1 SSD) | GIDS (2 SSDs) | GIDS (4 SSDs) |
| --- | --- | --- | --- | --- |
| Effective Bandwidth (GB/s) | 0.194 | 6.9 | 13.8 | 25.6 |
| Achieved IOPS (M/s) | 0.049 | 1.7 | 3.4 | 6.3 |
Table 2: Effective Bandwidth and IOPS for GIDS Training vs. the Baseline
As datasets continue to grow, a paradigm shift is needed to train these models in a reasonable amount of time and to take advantage of the improvements offered by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more systems of this type in the future.
Test System
| Component | Details |
| --- | --- |
| Server | Supermicro® 4124GS-TNR |
| CPU | |
| Memory | 1TB Micron DDR4-3200 |
| GPU | NVIDIA A100 80GB (memory clock: 1512 MHz, SM clock: 1410 MHz) |
| SSDs | 4x Micron 9400 MAX 6.4TB |
| OS | Ubuntu 22.04 LTS, kernel 5.15.0.86 |
| NVIDIA Driver | 535.113.01 |
| Software Stack | CUDA 12.2, DGL 1.1.2, PyTorch 2.1 running in an NVIDIA Docker container |
Reference Links
Big accelerator Memory (BaM) paper & GitHub
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture (arxiv.org)
GitHub - ZaidQureshi/bam
GPU-Initiated Direct Storage (GIDS) paper & GitHub
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses (arxiv.org)
GitHub - jeongminpark417/GIDS
GitHub - IllinoisGraphBenchmark/IGB-Datasets: Largest real-world open-source graph dataset - Work done under the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research.
*Note: NVIDIA Big Accelerator Memory (BaM) and NVIDIA GPU Initiated Direct Storage (GIDS) dataloader are prototype projects from NVIDIA Research and are not intended for general release.