Introduction

Machine Learning systems have played a pivotal role in the rapid adoption of AI in the world today. This domain is essential for solving future problems and for making current architectures more efficient. This is crucial considering that companies are reactivating nuclear power plants to power AI in the real world.

Along with the progress in AI from small neural networks to large language models, there has been a corresponding growth in the size of datasets. Big data arrived, and AI today relies on these internet-scale datasets. After all, doesn't ChatGPT just do pattern-matching over the internet?

Moreover, compute capabilities have been scaling exponentially. Just last year (2024), NVIDIA released a new super-chip architecture, Blackwell, with 97 billion transistors that can reach up to 1.4 exaFLOPs! The largest supercomputer was barely able to reach 1 exaFLOP. All this power in the palm of your hand…

Richard Sutton once observed that search and learning are the methods that scale best with growing computational power.

Consider the year 2012: AlexNet made waves, showing SOTA capabilities on images. It used stochastic gradient descent, dropout, convolutional networks, and careful initialization techniques. Without ML systems (CUDA, etc.), the implementation would have been around 44,000 lines of code with days of training! With these systems (JAX, PyTorch, TensorFlow) in place, you can achieve the same result in about 100 lines within hours of training.

In Practice

In industry, problems are typically of the form: improve the self-driving car's pedestrian detection to X-percent accuracy within a Y-ms latency budget. For an ML engineer, the general approach is to design a better model with better learning efficiency, followed by hyper-parameter tuning, pruning, and distillation. An ML systems engineer takes the best model from the ML researchers and specializes the implementation for the target hardware platform to reduce latency, streamlining the entire process from development to deployment.

Overview

From ad-hoc methods with diverse models, optimization algorithms, and data pre-processing techniques, we have arrived at a common recipe: an iterative, convergent optimization algorithm. As our models have become more and more specialized, computational resources have scaled exponentially.

Through the course of this article, we will cover deep learning basics, computational graphs, Autodiff, ML frameworks, GPUs, CUDA and collective communication.

There are more topics related to the ones discussed here:

  • ML for systems
  • ML Hardware design

Unfortunately, there is no single textbook for this content; the material is drawn from miscellaneous research papers.

Background

DL Computation

The idea is to compose stackable layers into a parameterized function and train it iteratively:

\[\theta^{(t + 1)} = f\left(\theta^{(t)}, \nabla_L(\theta^{(t)}, D^{(t)})\right)\]

A model is a parameterized function that maps inputs to predictions. The parameters are optimized using methods like SGD, Newton's method, etc. A loss function gives feedback on how well the model is performing and guides the optimization.

With these basic definitions, we will build abstractions that cover most of the models used today (it is not possible to build systems that support every model). A quick refresher of the important models:

  • CNNs - Learnable filters convolved across images to learn spatial features. The top 3 breakthrough architectures were AlexNet, ResNet, and U-Net. What are the important components in CNNs?
    • Convolution (1D, 2D, 3D)
    • Matmul
    • Softmax
    • Element-wise operations - ReLU, add, sub, pooling, normalization, etc.
  • Recurrent Neural Networks - Many problems in nature are many-to-many. RNNs maintain an internal state that is updated as a sequence is processed. Arbitrary inputs and outputs can be generated, and any neural network can be used in the RNN architecture. The top 3 breakthrough architectures were - Bidirectional RNNs, LSTMs, GRU. What are the important components in RNNs?
    • Matmul
    • Element-wise non-linear - ReLU, Sigmoid, Tanh
    • Underlying MLP
    RNNs have a problem of forgetting (\(0.9 \cdot 0.9 \cdots \approx 0\)). Additionally, they lack parallelizability - both the forward and backward passes are \(O(\text{sequence length})\).
  • Transformers (Attention + MLP) - Treat representations of each element in the sequences as queries to access and incorporate information from a set of values. Transformers have an encoder part (BERT most famous) and a decoder part (GPT most famous). Along with these, DiT is one of the top 3 models. What are the important components in Transformers?
    • Attention - matmul, softmax, normalization (see the minimal sketch after this list)
    • MLP
    • Layernorm, GeLU, etc.
  • Mixture of Experts - Voting from many experts is better than one expert. The latest LLMs are mostly MoEs - Grok, Mixtral, DeepSeek-V3. A router (matmul, softmax) is the novel component in MoE, and it is what makes the system design difficult.
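
To make the list above concrete, here is a minimal sketch of the attention primitive (single head, no masking; NumPy is assumed and the shapes are purely illustrative), showing that it reduces to matmuls plus a softmax:

import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # matmul: similarity of queries and keys
    weights = softmax(scores)                 # softmax over the keys
    return weights @ V                        # matmul: weighted sum of the values

Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 8)
out = attention(Q, K, V)                      # shape (4, 8): one output per query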

Machine Learning Systems

As mentioned before, the three pillars of these systems are data, model, and compute. The goal is to express as many models as possible using one set of programming interfaces by connecting math primitives.

Computational Dataflow Graph

A representation of how data flows through a program. A node represents a computation (operator) and an edge represents a data dependency (the direction data flows). A node can also represent the input/output tensor of an operator.

Example: Deep learning with TensorFlow v1

import tinyflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # Forward declaration of the input
y_ = tf.placeholder(tf.float32, [None, 10])   # Ground-truth labels (added for completeness)
W = tf.Variable(tf.zeros([784, 10]))          # Model parameters (added for completeness)
y = tf.nn.softmax(tf.matmul(x, W))            # Model prediction (added for completeness)
learning_rate = 0.01                          # (added for completeness)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))  # Loss function declaration
W_grad = tf.gradients(cross_entropy, [W])[0]  # Automatic differentiation
train_step = tf.assign(W, W - learning_rate * W_grad)  # SGD update rule

sess = tf.Session()
for i in range(1000):
    batch_xs, batch_ys = get_batch()          # hypothetical helper that fetches a mini-batch
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})  # Real execution happens here

This DAG representation opens up all possibilities of optimizations. However, creating such a graph doesn’t allow flexibility - once a graph is defined, it cannot be changed based on the input.

Example: PyTorch

PyTorch also uses computational graphs, but it creates them on the fly. Previously, we defined the graph and then executed it. Symbolic declaration vs. imperative programming. Define-then-run vs. define-and-run. C++ vs. Python.
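
For contrast, here is a minimal define-and-run sketch in PyTorch (a hypothetical toy setup mirroring the tinyflow snippet above; the shapes, learning rate, and loss choice are illustrative, not from the original text):

import torch
import torch.nn.functional as F

x = torch.randn(64, 784)                 # a random mini-batch standing in for images
labels = torch.randint(0, 10, (64,))     # random class labels
W = torch.zeros(784, 10, requires_grad=True)

loss = F.cross_entropy(x @ W, labels)    # the graph is recorded as these ops execute
loss.backward()                          # reverse-mode autodiff over the recorded graph
with torch.no_grad():
    W -= 0.01 * W.grad                   # SGD update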

What are the pros and cons?

|            | Good                                                       | Bad                                                                          |
| ---------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------- |
| Symbolic   | Easy to optimize, much more efficient (can be 10x faster)  | The programming style can be counter-intuitive, hard to debug, less flexible |
| Imperative | More flexible, easy to program and debug                   | Less efficient and more difficult to optimize                                 |

How does TensorFlow work in Python then? TensorFlow only uses Python as the interface language; the declared graph is handed off to the backend for execution.

Apart from these two famous frameworks, there were more, like Caffe, DyNet, and MXNet (which can switch between both styles). Recently, JAX (which, like TensorFlow, builds on the XLA compiler) has been getting more popular.

Just-in-time (JIT) compilation

Ideally, we want define-and-run during development and define-then-run during deployment. How do we combine both? PyTorch introduced a compilation path through the torch.compile() decorator. So is there an issue with JIT? It captures static graphs, so conditionals or loops in the code that depend on the data are a problem.
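
As a hedged illustration (assuming PyTorch 2.x; the function f and its inputs are hypothetical), a data-dependent branch inside a compiled function forces the tracer to split the graph and guard on the condition:

import torch

@torch.compile               # JIT-capture the function into graphs
def f(x):
    if x.sum() > 0:          # data-dependent branch: causes a graph break / guard
        return x * 2
    return x - 1

print(f(torch.ones(4)))      # first call compiles the "positive" path
print(f(-torch.ones(4)))     # the other branch is taken: another guarded graph is compiled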

Static vs Dynamic models

Static graphs are defined and optimized only once, and execution follows the fixed computation. Dynamic graphs, on the other hand, depend on the input; for the system they are harder to express (complex control-flow logic), harder to debug, and harder to implement efficiently.

As seen above, LSTMs are an example of handling the dynamics inherent in natural-language problems: the unrolled graph depends on the input sequence length.

How to handle dynamics?

  • Just do define-and-run and forget about JIT - the most popular option, unfortunately :(
  • Introduce Control Flow Ops -
    • Example: Switch and Merge. These can be added as computational primitives in the graph to introduce dynamics.
    • These ideas are natural across all programming languages - conditionals and loops. However, the problem with this approach is that graphs become complex, and more importantly, how do we do backpropagation? What is the gradient of "switch"? The TensorFlow team has been working on this.
  • Piecewise compilation and guards - This approach is more widely adopted than control-flow ops.
    • Case 1: A graph accepting input shapes of \([x, c1, c2]\) where \(x\) is variable. The solution is to compile a specialized graph for different values of \(x\) (bucketed to powers of 2); a minimal sketch follows below.
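
Below is a minimal, self-contained sketch of this bucketing idea (NumPy stands in for a real compiler backend; _bucket, _compile_for, and run are hypothetical names):

import numpy as np

_cache = {}

def _bucket(n):
    return 1 << (n - 1).bit_length()      # round n up to the next power of two

def _compile_for(shape):                  # stand-in for a real compiler backend
    print(f"compiling a specialized kernel for shape {shape}")
    return lambda a: np.tanh(a) @ np.ones((shape[1], 8))

def run(x):                               # x has shape [n, c] with n variable
    n, c = x.shape
    b = _bucket(n)
    if (b, c) not in _cache:              # guard: compile once per bucketed static shape
        _cache[(b, c)] = _compile_for((b, c))
    padded = np.zeros((b, c)); padded[:n] = x   # pad up to the static bucket size
    return _cache[(b, c)](padded)[:n]     # run the specialized kernel, strip the padding

run(np.random.randn(5, 4))                # compiles for bucket (8, 4)
run(np.random.randn(7, 4))                # reuses the (8, 4) kernel, no recompilation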

So far, we have seen representations that express the forward computations using primitives. But, how do we represent backward computations?

Autodiff (AD)

Derivatives can be computed numerically from first principles. However, this approach is slow, since we have to evaluate the function twice, \(f(\theta + \epsilon)\) and \(f(\theta)\), per input dimension, and it is also error prone, since the \(O(\epsilon^2)\) terms of the Taylor expansion are dropped.
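
A quick sketch of the numerical approach (central differences; NumPy is assumed and numerical_grad is a hypothetical helper):

import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e.flat[i] = eps
        grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * eps)   # central difference
    return grad

f = lambda t: (t ** 2).sum()
print(numerical_grad(f, np.array([1.0, -2.0, 3.0])))   # approximately [2, -4, 6]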

To optimize derivative calculation, we pre-store the gradients of primitives and map the chain rule onto the computational graph. There are two ways of doing this:

  1. Calculating the derivative from left (inside) to right (outside) in a network - from inputs to outputs
  2. Calculating it from right to left - from outputs to inputs

Both are valid approaches and we will discuss them in detail.

Forward Mode Autodiff

We start from the input nodes and derive the gradients all the way to the output nodes. Cons:

  • For \(f: \mathbb{R}^n \to \mathbb{R}^k\), we need \(n\) forward passes to get the gradients with respect to each input.
  • However, it is usually the case that \(k = 1\) (the loss) and \(n\) is very large.

If this is confusing, think of it this way - we want the gradient of the output with respect to all parameters in order to update the weights. Forward mode, however, computes in each pass the derivatives of all values with respect to a single input.
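
A minimal dual-number sketch of forward mode (the Dual class is hypothetical; each pass carries the derivative with respect to a single seeded input):

class Dual:
    """A value together with its derivative w.r.t. one chosen input (the tangent)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)

def f(x, w):
    return x * w + w                     # toy function: f = x*w + w

x, w = Dual(3.0, 0.0), Dual(2.0, 1.0)    # seed w's tangent with 1, x's with 0
print(f(x, w).dot)                       # 4.0 = df/dw = x + 1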

Reverse Mode Autodiff

We define the quantity adjoint \(\bar v_i = \frac{\partial y}{\partial v_i}\). We then compute each \(\bar v_i\) in the reverse topological order of the graph. This way, we can simply do one backward pass to get the necessary gradients.
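
A minimal reverse-mode sketch over a tiny scalar expression graph (the Node/mul/add names are hypothetical), computing adjoints in reverse topological order exactly as defined above:

class Node:
    def __init__(self, value, parents=(), grad_fns=()):
        self.value, self.parents, self.grad_fns = value, parents, grad_fns
        self.adjoint = 0.0    # dy/d(this node), filled in during the backward pass

def mul(a, b):
    return Node(a.value * b.value, (a, b), (lambda g: g * b.value, lambda g: g * a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (lambda g: g, lambda g: g))

def topo_order(out):
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(out)
    return order

def backward(out):
    out.adjoint = 1.0                                  # dy/dy = 1
    for node in reversed(topo_order(out)):             # reverse topological order
        for parent, grad_fn in zip(node.parents, node.grad_fns):
            parent.adjoint += grad_fn(node.adjoint)    # accumulate adjoints

x, w = Node(3.0), Node(2.0)
y = add(mul(x, w), w)          # y = x*w + w
backward(y)
print(x.adjoint, w.adjoint)    # 2.0 4.0  (dy/dx = w, dy/dw = x + 1)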

In some scientific scenarios, we can have \(k \gg n\), where the forward mode can be more efficient.

What are the size bounds of the backward graph as compared to the neural network?

We construct backward graphs symbolically so that they can be reused multiple times.

Backpropagation vs. Reverse-mode AD

In old frameworks like Caffe/cuda-convnet, backward computations were performed through the forward graph itself. Newer frameworks like TensorFlow and PyTorch construct the backward graph explicitly. The reasons to do so are:

  1. Explicit graphs allow backward computation with any input values, and even give the flexibility to compute gradients of gradients (see the sketch after this list).
  2. Having an explicit backward graph can help optimization!
  3. Gradient update rules can be efficiently implemented.
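
A small PyTorch illustration of point 1 (the function \(y = x^3\) is a toy example): with create_graph=True, the backward graph itself can be differentiated again:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
(g,) = torch.autograd.grad(y, x, create_graph=True)   # dy/dx = 3x^2 = 12, kept as a graph
(gg,) = torch.autograd.grad(g, x)                      # d2y/dx2 = 6x = 12
print(g.item(), gg.item())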

Gradient update rules

Typically done via gradient descent, the weights are updated with the gradients using the following simplified rule:

\[f(\theta, \nabla_L) = \theta - \eta \nabla_L\]

Architecture Overview of ML systems

The aim is to make the systems fast, scalable, memory-efficient, run on diverse hardware, energy efficient and easy to program/debug/deploy. Phew.

We have discussed dataflow and Autodiff graphs. However, there are numerous things that can be added to these - graph optimization, parallelization, runtime memory, operator optimizations and compilation.

Graph Optimization

The goal is to rewrite the original graph \(G\) into a faster graph \(G'\).

Consider the following motivating example: typically, a convolution is followed by batch normalization. Instead of performing batch normalization separately, just fold it into the convolution's weights so everything happens in one step!
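
A hedged sketch of that folding, assuming an inference-time convolution with per-output-channel batch-norm statistics (the function name and tensor layouts are illustrative):

import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    # W: [out_channels, in_channels, kh, kw]; b, gamma, beta, mean, var: [out_channels]
    scale = gamma / np.sqrt(var + eps)         # per-output-channel BN scale
    W_folded = W * scale[:, None, None, None]  # scale each output channel's filter
    b_folded = (b - mean) * scale + beta       # fold the shift into the bias
    return W_folded, b_folded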

Note that some steps can become slower based on the hardware, but you get the general idea.

Similarly, in attention calculations, the code is typically written with a concatenated vector of queries, keys, and values. This version is faster, which can be understood through arithmetic intensity, the ratio of #operations to #bytes. For example, an element-wise addition has an intensity of \(1/3\) (two loads and one store). Fusing multiple arithmetic operations reduces the loads and stores by bringing each operand in only once, improving the arithmetic intensity.
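
A back-of-the-envelope helper, counting per-element memory accesses the same way as the addition example above (the function is illustrative, not from the original text):

def arithmetic_intensity(ops, loads, stores):
    # per-element ops divided by per-element memory accesses, as in the 1/3 example above
    return ops / (loads + stores)

print(arithmetic_intensity(1, 2, 1))   # a + b                          -> 1/3
print(arithmetic_intensity(2, 4, 2))   # unfused: y = a + b, then z = y * c -> 1/3
print(arithmetic_intensity(2, 3, 1))   # fused:   z = (a + b) * c           -> 1/2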

So how do we optimize graphs?

We write rules or templates that recognize opportunities to simplify graphs. The latest libraries also implement auto-discovery of optimizations; we shall study these.

Parallelization

The goal is to parallelize the graph computation over multiple devices. Note that devices can be connected with fast links (e.g., NVLink between GPUs) and slow links (e.g., the network across machines), with up to a 10x performance difference. Ideally, we do not want to hand-write partitioning rules for every new model that comes up. Given these communication patterns, distributing the work is not an easy problem, so we shall discuss how to partition the computational graph over a device cluster.

Runtime and Scheduling

How do we schedule the compute, communication and memory in a way that execution is as fast as possible, communication is overlapped with compute and is subject to memory constraints?

Operator Implementations

The goal of this layer is to get the fastest possible implementation of matmuls and other operators, for different hardware, different precisions, and different shapes.

NVIDIA releases a new GPU generation roughly every 2 years, and the operators have to be rewritten every time! Notably, models used to be trained in 32-bit floating point, but researchers are now pushing toward lower and lower precisions.

Now, we shall delve into each of these architectures.

Operator Optimization and Compilation

The goal is to maximize arithmetic intensity. In general, there are three ways to speed up operators.

Vectorization

A vectorized implementation is faster because of the hardware - SIMD units, cache behavior, etc. TensorFlow and PyTorch have this built in.
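
An illustrative (not rigorous) comparison in NumPy, assuming the vectorized call maps onto SIMD/BLAS kernels underneath:

import time
import numpy as np

a, b = np.random.randn(1_000_000), np.random.randn(1_000_000)

t0 = time.perf_counter()
out = np.empty_like(a)
for i in range(a.size):        # scalar loop: one element at a time
    out[i] = a[i] + b[i]
t1 = time.perf_counter()
out_vec = a + b                # vectorized: the whole array in one call
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.5f}s")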

Refactoring data layout

This is again related to how data is stored in memory. For example, C++ stores matrices in row-major order, so accessing a matrix column by column can be 10x slower! Keep this in mind while writing code to lower cache misses and reduce pointer chasing.

ML systems don't store tensors in purely row- or column-major order but in a strides format: A[i, j, ...] = A.data[offset + i*A.strides[0] + j*A.strides[1] + ...]. It is a generalization of row- and column-major storage and offers more flexibility - the layout can be adapted to the batch size or other parameters of a neural network.

Strides separate the underlying storage from the view of the tensor. Consider the following operations:

  1. slice - simply changing the offsets and shape will output the slice without any copying involved.
  2. transpose - modifying the strides will transpose the tensor without any copying! For example:
         M.strides()                   # (24, 12, 4, 1)
         Mt = M.permute((1, 2, 3, 0))  # re-order the strides; no data is copied
         Mt.strides()                  # (12, 4, 1, 24)
    
  3. broadcast - Suppose we have to extend a tensor's data across a dimension to perform an operation with another tensor; simply setting the stride to 0 in the appropriate dimension is enough! Again, no copying.

Many more operations can be performed without copying the data, simply by modifying the strides; a stride-0 broadcast example follows below.
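
For instance, NumPy exposes this stride trick directly through as_strided (illustrative only; as_strided bypasses the usual safety checks):

import numpy as np
from numpy.lib.stride_tricks import as_strided

v = np.array([1.0, 2.0, 3.0])
b = as_strided(v, shape=(4, 3), strides=(0, v.strides[0]))   # stride 0 along the new axis
print(b)                          # four identical "rows", all aliasing v's storage
print(np.shares_memory(b, v))     # True: no data was copied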

However, strides also have an issue - memory access may become non-contiguous, and many vectorized ops require contiguous storage.

Summary

To make operators efficient, we have seen the following tactics -

  1. Vectorization - leverage platform-specific vectorized functions that process many elements at once
  2. Data layout - the strides format, which allows zero-copy views and enables fast array manipulations
  3. Parallelization on CPUs

These were techniques for general operations. However, we can optimize certain operators further using their special properties.

Matmul optimization

The brute-force approach takes \(\mathcal O(n^3)\). The best approach humans know is \(\mathcal O(n^{2.371552})\)!

How to improve the speed in practice then? Recall that we are trying to increase AI = #ops/#bytes.

Memory Hierarchy

If everything ran in registers, things would be super fast. But that is expensive. Remember that the L1 cache has about 0.5 ns latency, the L2 cache about 7 ns, and DRAM about 200 ns (400x slower than L1!)

Let us analyze the arithmetic intensity of matmul considering the different layers of memory.

  1. We can move data directly into registers in every iteration of the inner loop.
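
A hedged sketch of cache blocking (tiling) in NumPy, where each tile of A, B, and C is reused while it stays resident in fast memory, raising arithmetic intensity over the naive triple loop (tile size and shapes are illustrative):

import numpy as np

def tiled_matmul(A, B, tile=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # the three tiles below form the working set kept in cache/registers
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)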

GPUs and accelerators

Recall that parallelizing operations across threads is super useful! CPUs have some level of parallelism through SIMD operations (vectorization) but they are limited. Building on the same idea, GPUs were born.

When we started out, the number of ALUs was limited by the physical space on the chip. As technology improved, we moved from a 70nm process all the way to a 3nm process; that is, we can fit roughly 20x more cores in the same area! The majority of the area on a CPU is consumed by control logic and cache, and Jensen's idea was: ditch those and fill the die with cores.

Graphics Processing Units (GPUs) are tailored for matrix and tensor operations. The basic idea is to use tons of ALUs (individually weak but specialized) with massive parallelism (SIMD on steroids).

There are other hardware accelerators like the Tensor Processing Unit (TPU) and other application-specific integrated circuits (ASICs). The common theme across all of these is the same - specialized cores. What are specialized cores? They can only perform certain computations, but within that scope they can be super powerful.

Companies have also tried reducing precision while maintaining the same model quality. Additionally, they tune the distribution of different components for specific workloads.

Why does quantization work in ML systems?