How Google Is Aiming at a Trillion-Parameter Model (PaLM): Page-by-Page Review

Pathways: Asynchronous Distributed Dataflow for ML, and PaLM: Scaling Language Modeling with Pathways

Mandar Karhade, MD. PhD.
Towards AI

--

This time I had planned to deviate from my usual one-paper-one-review approach and cover two papers in a single review. The reason: the Pathways paper (Pathways: Asynchronous Distributed Dataflow for ML) is highly technical, while its companion (PaLM: Scaling Language Modeling with Pathways) applies that architecture to an actual language model. The applied paper makes for good discussion and is full of interesting information. However, after seeing the article grow into a 15-minute read, I decided to split the two back into one paper per review.

Abstract from the Abstracts

Pathways: Asynchronous Distributed Dataflow for ML

We present the design of a new large scale orchestration layer for accelerators. Our system, PATHWAYS, makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows PATHWAYS to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that PATHWAYS can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network. https://arxiv.org/abs/2203.12533

First paper

Important points

PATHWAYS explicitly orchestrates accelerators (compute accelerators such as TPUs, Tensor Processing Units, and GPUs) to improve the utilization of available FLOPs (floating-point operations) and thereby reduce total resource consumption (time, money, electricity, etc.). It is a new paradigm that uses shared dataflow graphs of asynchronous operations to effectively coordinate both data and model state across heterogeneous parallel computations.

Abbreviations and brief information

SPMD: Single Program Multiple Data. This is the current standard architecture for large language model training. Generally speaking, it cannot fully utilize large islands of compute resources (i.e., it is less flexible for heterogeneous compute), and recent advances in large-scale ML computing have reached the limits of what SPMD offers.
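To make the idea concrete, here is a minimal SPMD sketch in JAX (my own illustration, not from the paper): the same function is replicated across every local device, each device working on its own shard of the data.

```python
import functools
import jax
import jax.numpy as jnp

# One program, many data shards: pmap replicates `step` across all local devices.
@functools.partial(jax.pmap, axis_name="d")
def step(x):
    # Identical code runs on every device; a collective keeps them in lockstep.
    return jax.lax.psum(jnp.sum(x ** 2), axis_name="d")

n = jax.local_device_count()
sharded = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # leading axis = devices
out = step(sharded)  # every device returns the same global sum
```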

MPI programming model: an Application Programming Interface that defines a model of parallel computing in which each parallel process has its own local memory, and data must be shared explicitly by passing messages between processes. Like SPMD, the model is too restrictive for both users and the underlying system.
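For contrast, a minimal message-passing sketch using mpi4py (my example, not the paper's): each rank owns its local memory, and data moves only through explicit sends and receives.

```python
# Run with: mpirun -n 2 python demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Data is shared only by an explicit message, never implicitly.
    comm.send({"weights": [0.1, 0.2]}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received:", data)
```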

MoE model: the Mixture-of-Experts model (Shazeer et al., 2017) uses computational sparsity, which increases the heterogeneity of computation needs across accelerators.
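A toy top-1 gating forward pass (my illustration, not Shazeer et al.'s actual design) shows where the sparsity and heterogeneity come from: each token is routed to a single expert, so different experts, possibly living on different accelerators, see different amounts of work.

```python
import jax
import jax.numpy as jnp

def moe_forward(x, gate_w, expert_ws):
    """x: [tokens, d]; gate_w: [d, n_experts]; expert_ws: [n_experts, d, d]."""
    logits = x @ gate_w                                  # gating scores per token
    choice = jnp.argmax(logits, axis=-1)                 # top-1 expert per token
    one_hot = jax.nn.one_hot(choice, expert_ws.shape[0])  # [tokens, experts]
    # Computed densely here for clarity; a real MoE dispatches only each
    # expert's chosen tokens, which is exactly the sparse, uneven workload.
    all_outs = jnp.einsum("td,edf->tef", x, expert_ws)
    return jnp.einsum("tef,te->tf", all_outs, one_hot)   # keep the chosen expert only

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))
y = moe_forward(x, jax.random.normal(key, (16, 4)), jax.random.normal(key, (4, 16, 16)))
```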

MPMD: Multiple Program Multiple Data. Relying on large islands of homogeneous accelerators connected over high-bandwidth interconnects is expensive and wasteful. MPMD allows more flexible computation by mapping sub-parts of a program onto smaller, more readily available islands of accelerators, which improves resource utilization (Xiao et al., 2020).

This paper compares the PATHWAYS framework's performance with state-of-the-art (SOTA) ML systems as a preparatory move for the ML workloads of the future. The framework is specifically developed for the efficient execution of programs spanning multiple "pods" of TPUs, which Google uses for ML acceleration. The paper's flow is as follows:

  1. Limitations of current distributed ML systems
  2. PATHWAYS' support for a flexible programming model
  3. The PATHWAYS architecture
  4. Lastly, how shared dataflow and asynchronous gang-scheduling address the key limitations from #1

Another factor to keep in mind is the drive toward standardization on foundation models (Bommasani et al., 2021), which are trained on large data at scale but can share model state with multiple downstream tasks, training runs, etc., thereby allowing effective parallelism.

Motivation behind PATHWAYS

Simply put: to increase the computational efficiency and parallelism of large language model training tasks. As an example of today's status quo, multi-controller systems such as JAX or PyTorch SPMD dispatch work over fast PCIe lanes, which are leagues faster than DCN connections, while queuing accelerator computations proceeds as an independent, asynchronous process.

Programming Model in PATHWAYS

PATHWAYS uses a single-controller model, which makes it easier to separate tasks and coordinate them through a dedicated control plane. PATHWAYS is built to run on TPUs; however, JAX today cannot scale beyond a single TPU pod due to restrictions on data sharing over XLA. A multiple-controller model can run the same code over the ICI (Inter-Core Interconnect) links within a pod. Since PATHWAYS can communicate over both ICI and DCN, it allows JAX programs to scale, for the first time, to multiple TPU pods containing many thousands of TPU cores.
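For reference, this is roughly what the multi-controller pattern looks like in today's JAX (a sketch with placeholder addresses, not PATHWAYS itself): the same script runs on every host, and each process drives only its local accelerators.

```python
import jax

# Wire this process into a multi-host cluster; the address, process count,
# and process id below are placeholders that differ per deployment.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # hypothetical coordinator host
    num_processes=4,
    process_id=0,                         # each host passes its own id
)
print(jax.process_index(), jax.local_device_count(), jax.device_count())
```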

Introduction of Tracer Program

By default, each compiled function is converted into a standalone PATHWAYS program containing just one (sharded) computation, meaning that if a user wants to run many functions back to back, a separate Python call and RPC from client to coordinator is required for each function. The authors therefore introduce a program tracer (Figure 2) that a user can wrap around a block of Python code calling many compiled functions. The tracer generates a single PATHWAYS program in which each compiled function is represented by a computation node in a dataflow graph.
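A hedged sketch of how such a tracer might look from user code; `pathways.trace` is a hypothetical name I am using for illustration, since the paper does not publish a client API.

```python
import jax

@jax.jit
def layer_a(x):
    return x * 2

@jax.jit
def layer_b(x):
    return x + 1

# Without a tracer: each call below is its own client->coordinator RPC.
# With something like the paper's tracer (hypothetical API below), the whole
# block would be captured once and dispatched as a single PATHWAYS program:
#
# with pathways.trace() as program:   # hypothetical
#     y = layer_a(x)
#     z = layer_b(y)
# program.run()                       # one RPC, one dataflow graph
```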

Resource manager in PATHWAYS

A PATHWAYS backend consists of a set of accelerators grouped into tightly coupled islands that are, in turn, connected to each other over DCN (Figure 3). PATHWAYS has a "resource manager", which is responsible for the centralized management of devices across all of the islands. A client may ask for "virtual slices" of an island with specific 2D or 3D mesh shapes that suit its communication pattern. Each virtual slice contains "virtual devices" that allow the client to express how computations are laid out on the mesh. The resource manager dynamically assigns physical devices to virtual devices, satisfying the desired interconnect topology, memory capacity, etc.
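Today's JAX offers an analogous abstraction for expressing mesh layouts; this is standard JAX sharding, not the PATHWAYS resource manager itself, and it assumes eight devices are available.

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Ask for a logical 2D mesh of devices, then name its axes.
devices = mesh_utils.create_device_mesh((2, 4))   # assumes 8 devices
mesh = Mesh(devices, axis_names=("data", "model"))

# Describe how an array should be laid out across that mesh.
sharding = NamedSharding(mesh, PartitionSpec("data", "model"))
x = jax.device_put(jax.numpy.ones((8, 8)), sharding)
```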

I will skip over the finer details of the PATHWAYS system architecture, but it involves a resource manager, a client, a coordination layer implemented using PLAQUE, and gang-scheduled dynamic dispatch. In the paper's dispatch figure, one can appreciate how asynchronous dispatch allows parallel computation, recovering FLOPs that would otherwise be wasted while the host waits.
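Asynchronous dispatch is easy to see in plain JAX, whose calls return as soon as the work is enqueued; a minimal sketch:

```python
import jax.numpy as jnp

x = jnp.ones((4096, 4096))
y = jnp.dot(x, x)       # returns a future-like array immediately
z = jnp.dot(y, y)       # enqueued behind y without waiting for it
z.block_until_ready()   # only here does the host actually wait
```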

Lastly, from the paper's benchmark figure, we can see that PATHWAYS outperforms other single-controller systems regardless of the number of hosts used (x-axis). The suffix -O means OpByOp, -C chained, and -F fused: the user code contains separate computations dispatched one by one, a series of computations executed across all nodes, or a series of computations executed as a single fused program, respectively. One can also notice that, for small computations, PATHWAYS' computations per second remain significantly lower in both the 16-host and 512-host configurations: PATHWAYS matches JAX throughput only for computations longer than about 2.3 ms on 16 hosts with 128 TPUs, and 35 ms on 512 hosts with 2048 TPUs.
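A rough way to feel the OpByOp-versus-fused gap on your own machine; this is a toy timing sketch of my own, not the paper's benchmark.

```python
import time
import jax
import jax.numpy as jnp

def op_by_op(x):
    for _ in range(100):
        x = x * 1.0001          # 100 separate dispatches, each with overhead
    return x

fused = jax.jit(op_by_op)       # one compiled program, one dispatch

x = jnp.ones((256, 256))
for fn, name in [(op_by_op, "OpByOp"), (fused, "Fused")]:
    fn(x).block_until_ready()   # warm-up / compile outside the timer
    t0 = time.perf_counter()
    fn(x).block_until_ready()
    print(name, time.perf_counter() - t0)
```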

Large-Scale Model Performance

The researchers tested T5 models and concluded that JAX and PATHWAYS performance was identical for the encoder-decoder architecture used in text-to-text NLP tasks. In other words, the overhead of PATHWAYS was effectively masked by the size and complexity of the computation, so one benefits from the PATHWAYS architecture without paying an extra cost in overhead. The same was true for decoder-only tasks, and system throughput scaled linearly with the number of TPUs.

Conclusions

PATHWAYS matches SOTA multi-controller performance without inheriting the limitations of single-tenant SPMD systems. It extends JAX program functionality and improves resource management. PATHWAYS allows cluster management at the level of pods, enabling multi-tenant sharing, virtualization, and elasticity tailored to the workloads. It is efficient in leveraging resources for concurrent workloads and utilizes efficient pipelined execution, which forms a solid basis for future research.

I must admit that this was one of the most difficult papers for me to read, mainly due to my lack of familiarity with the intricacies of architecting such a massive training job. I felt I needed to read this paper to understand why the PaLM model, with its 540B parameters, is going to be a significant landmark in the NLP journey. I am still learning each day, so if there is something I have missed, please feel free to correct me. And if you learned something with me, do drop a like, follow, etc. (links below).
