ON-DEMAND SHARED DATA CACHING METHOD, COMPUTER PROGRAM, AND COMPUTER READABLE MEDIUM APPLICABLE FOR DISTRIBUTED DEEP LEARNING COMPUTING
20230236980 · 2023-07-27
Inventors
Cpc classification
G06F2212/6042
PHYSICS
International classification
Abstract
Disclosed are an on-demand shared data caching method, a computer program, and a computer readable medium applicable for distributed deep learning computing. The method includes a step of dynamically building a distributed shared memory cache space, in which a distributed shared memory deployment and data file access management module is added to a deep learning framework so that the distributed shared memory cache space is built from the pooled memory of a plurality of computing nodes of a cluster computer; and a distributed deep learning computing step, in which each computing node overrides a Dataset API of the deep learning framework to execute the distributed deep learning computing. When a data file is read, it is accessed directly if it already exists in the distributed shared memory cache space; otherwise it is obtained from its originally specified directory location and stored in the distributed shared memory cache space.
Claims
1. An on-demand shared data caching method applicable for distributed deep learning computing, comprising the steps of: dynamically building a distributed shared memory cache space, in which a distributed shared memory deployment and data file access management module is added to a deep learning framework to share a part of the memories of a plurality of computing nodes of a cluster computer and build the distributed shared memory cache space; and performing a distributed deep learning computing, in which the cluster computer executes the distributed deep learning computing, the computing nodes override a Dataset API required by the deep learning framework, a data file access rule of the distributed shared memory deployment and data file access management module is added, and all computing nodes continue their execution, and when it is necessary to read a data file, if the data file exists in the distributed shared memory cache space, then the data file will be accessed directly, or else the data file will be obtained from an original specified directory location and stored in the distributed shared memory cache space.
2. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, wherein a resource configuration step is executed before the step of dynamically building a distributed shared memory cache space, in which a job script is written and the quantity of the computing nodes, the quantity of CPUs/GPUs and the size of the distributed shared memory cache space required for running the program are set and sent to a queueing system for configuring resources, and the information of the configured resources is stored into an environment variable for executing the job script, and the environment variable comprises a computing nodes list ($PBS_NODEFILE) and the size of a distributed shared memory cache space ($PBS_GLBMEM), and the queueing system starts executing the program set in the job script on each computing node according to the assigned list of the computing nodes.
3. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 2, wherein the computing nodes list ($PBS_NODEFILE) and the size of the distributed shared memory cache space ($PBS_GLBMEM) in the environment variable are read to set and build the distributed shared memory cache space, and the built distributed shared memory cache space is mounted on a mount point /disfs of each computing node.
4. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, wherein when the step of dynamically building a distributed shared memory cache space is executed, an initial function will be called to perform an initialization, and the initial function is overridden to build the distributed shared memory cache space, and the distributed shared memory deployment and data file access management module uses a Gluster File System (GlusterFS) for execution to produce a RAM disk on the memory of each computing node, and then uses the GlusterFS to connect the RAM disk of each computing node in series to form the distributed shared memory cache space.
5. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 4, wherein the memory is a temporary file system (tmpfs) in a Unix/Linux system.
6. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, wherein the distributed shared memory deployment and data file access management module adopts a remote direct memory access (RDMA) technology.
7. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, wherein the Dataset API required by the deep learning framework comprises TensorFlow (tf.data) and PyTorch (torch.utils.data).
8. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, wherein if the data file does not exist in the distributed shared memory cache space when reading the data file in the distributed deep learning computing step, then the data file will be stored into the distributed shared memory cache space first before it is accessed.
9. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 1, further comprising a step of releasing resources, in which the distributed shared memory cache space is released after the distributed deep learning computing ends.
10. The on-demand shared data caching method applicable for distributed deep learning computing according to claim 9, wherein after the distributed deep learning computing ends, all programs will call a destructor (Finalize function) and override the destructor, such that each computing node unloads its distributed shared memory cache space, and all data files will disappear after the unload, such that the distributed shared memory cache space of the computing node is released.
11. A computer program, installed to a computer, for executing the on-demand shared data caching method applicable for distributed deep learning computing according to claim 1.
12. A computer readable medium, for storing the computer program according to claim 11.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0037] The objectives, technical characteristics and effects of the on-demand shared data caching method, computer program, and computer readable medium applicable for distributed deep learning computing of the present disclosure will become apparent with the detailed description of preferred embodiments accompanied with the illustration of related drawings. It is intended that the embodiments and drawings disclosed herein are to be considered illustrative rather than restrictive.
[0038] With reference to
[0039] In the step of executing the resources configuration step as shown in
[0040] An example of the job script is as follows:
#!/bin/bash
#SBATCH -J job_name      # Job name
#SBATCH --nodes 8        # Number of computing nodes
#SBATCH --gres=gpu:16    # Total GPUs
#SBATCH --memory=256G    # Distributed shared memory cache space (total memory capacity)
python DL_training.py    # Executing deep learning training program
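As a minimal sketch of how the configured resources might be turned into a cache plan, the following Python fragment reads the computing nodes list and the cache size named in claims 2 and 3 ($PBS_NODEFILE and $PBS_GLBMEM). The value format of $PBS_GLBMEM (a string such as "256G"), the even split of the total size over the nodes, and all function names are assumptions made for illustration; the source does not specify them.

```python
import os
import re


def parse_size_gib(text):
    """Parse a size such as '256G' into whole GiB (assumed format; only 'G' is handled)."""
    match = re.fullmatch(r"(\d+)G", text.strip())
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1))


def read_node_list(nodefile_text):
    """Return unique host names in first-seen order (a node file may repeat a host)."""
    nodes = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in nodes:
            nodes.append(host)
    return nodes


def plan_cache(nodefile_text, total_size):
    """Split the requested cache size evenly over the nodes (an assumed policy)."""
    nodes = read_node_list(nodefile_text)
    if not nodes:
        raise ValueError("empty node list")
    return {"nodes": nodes, "per_node_gib": parse_size_gib(total_size) // len(nodes)}


def plan_from_environment(environ=os.environ):
    """Read the two environment variables named in the claims."""
    with open(environ["PBS_NODEFILE"]) as handle:
        return plan_cache(handle.read(), environ["PBS_GLBMEM"])
```

Under these assumptions, the example job script above (8 nodes, 256G in total) would yield a share of 32 GiB per computing node.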
[0041] With reference to
The GlusterFS commands are as follows:
# gluster volume create vol_distributed transport tcp node1:/ramdisk node2:/ramdisk force
# gluster volume start vol_distributed
# apt -y install glusterfs-client
# mount -t glusterfs node1:/vol_distributed /disfs
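The commands above name two nodes explicitly. A sketch of how the same command lines might be generated for an arbitrary node list is given below. The function names are hypothetical, and the tmpfs mount line does not appear in the source; it is the standard Linux form and is included only as an assumption about how the RAM disk of claims 4 and 5 could be produced.

```python
import shlex


def tmpfs_command(size_gib, brick="/ramdisk"):
    """Command that creates a tmpfs RAM disk on one node (assumed step, not in the source)."""
    return ["mount", "-t", "tmpfs", "-o", f"size={size_gib}g", "tmpfs", brick]


def gluster_commands(nodes, brick="/ramdisk", volume="vol_distributed",
                     mount_point="/disfs"):
    """Build the create, start and mount commands shown in the listing above."""
    if not nodes:
        raise ValueError("at least one node is required")
    bricks = [f"{node}:{brick}" for node in nodes]
    create = ["gluster", "volume", "create", volume, "transport", "tcp",
              *bricks, "force"]
    start = ["gluster", "volume", "start", volume]
    # Each node mounts the volume through the first node, as in the listing.
    mount = ["mount", "-t", "glusterfs", f"{nodes[0]}:/{volume}", mount_point]
    return [create, start, mount]


if __name__ == "__main__":
    for command in gluster_commands(["node1", "node2"]):
        print(shlex.join(command))
```

Returning argument lists rather than shell strings lets the caller pass each command to a process launcher without any quoting concerns.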
[0042] With reference to
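The data file access rule recited in claim 1 and claim 8 (access the file directly if it is in the cache space, otherwise fetch it from the original directory and store it in the cache space first) can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the class names are hypothetical, the dataset class only mimics the shape of a map-style dataset such as torch.utils.data.Dataset without importing any framework, and the copy-then-rename step is an added precaution so that other nodes never see a half-written file.

```python
import os
import shutil
import socket


class CachedFileReader:
    """Read-through cache: serve from cache_root, fill it from source_root on a miss."""

    def __init__(self, cache_root, source_root):
        self.cache_root = cache_root      # e.g. the /disfs mount point
        self.source_root = source_root    # the original specified directory

    def read(self, relative_path):
        cached = os.path.join(self.cache_root, relative_path)
        if not os.path.exists(cached):
            # Miss: store the file in the cache space first, then access it.
            source = os.path.join(self.source_root, relative_path)
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            temporary = f"{cached}.{socket.gethostname()}.{os.getpid()}.tmp"
            shutil.copyfile(source, temporary)
            os.replace(temporary, cached)  # publish the complete file in one step
        with open(cached, "rb") as handle:
            return handle.read()


class CachedFileDataset:
    """Map-style dataset whose items are read through the cache."""

    def __init__(self, relative_paths, reader):
        self.relative_paths = list(relative_paths)
        self.reader = reader

    def __len__(self):
        return len(self.relative_paths)

    def __getitem__(self, index):
        return self.reader.read(self.relative_paths[index])
```

In a real framework the dataset class would subclass the framework's own Dataset type, so that the training program keeps using the same API while file reads are redirected to the cache space.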
[0043] In the step of releasing resources, the distributed shared memory cache space is released after the distributed deep learning computing ends. Specifically, after the distributed deep learning computing ends, every program calls its destructor (Finalize function), which is overridden so that each computing node 1 unmounts its distributed shared memory cache space 2. All cached data files disappear with the unmount, and the distributed shared memory cache space 2 of the computing node is thereby released. In this way, the distributed shared memory cache space dynamically forms an On-Demand Global Cached Memory according to the requirements of a computing job, and the cache space is released immediately after the job completes, so that it does not occupy the system memory space permanently.
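The build-then-release lifecycle can be illustrated with a short Python sketch. Note that it uses a context manager in place of the overridden Finalize function described above, which is a substitution made for brevity; the function name, the injected run parameter, and the example unmount commands are assumptions.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def shared_cache(setup_commands, teardown_commands, run=subprocess.check_call):
    """Run the setup commands, yield to the training job, then always tear down."""
    for command in setup_commands:
        run(command)
    try:
        yield
    finally:
        # Release the cache space even if the training job raised an error.
        for command in teardown_commands:
            run(command)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    setup = [["mount", "-t", "glusterfs", "node1:/vol_distributed", "/disfs"]]
    teardown = [["umount", "/disfs"], ["umount", "/ramdisk"]]
    with shared_cache(setup, teardown, run=print):
        print("training runs here")
```

Because a tmpfs holds its files in memory, unmounting the RAM disk discards the cached data files, which matches the behavior described for the release step.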
[0044] In the embodiment as shown in
[0045] The on-demand shared data caching method applicable for distributed deep learning computing is executed by the computer program installed on the cluster computer, and the computer program can be stored in a computer readable medium.
[0046] While the invention has been described by means of specific embodiments, numerous modifications and variations could be made thereto by those skilled in the art without departing from the scope and spirit of the invention as set forth in the claims.