
GPUs

Last updated: 6/25/2025

Table of Contents:

  1. Von Neumann Architecture
    1. Intro
    2. Von Neumann Bottleneck
    3. Von Neumann Solutions (Caching, Memory Hierarchies, Multi-Core CPUs, Specialized Hardware, Alternative Architectures, Software Optimizations, In-Memory Computing)
  2. CPUs
  3. Parallelism Terms
    1. Process vs Thread
    2. Concurrency vs Parallelism
    3. Multithreading vs Multitasking vs Multiprocessing
  4. Hardware Accelerators (GPUs, FPGAs, ASICs, DPUs, NPUs, TPUs, Neural Engines)
  5. Principles of GPUs
  6. Virtual Machines in Azure (Which GPU to choose)
🗝️
The first time I wrote about CPUs and GPUs was on Nebo in September 2021; I came back to the topic in February 2024 on Notion.

CPUs

  • The greatest benefit of CPUs is their flexibility. You can load any kind of software on a CPU for many different types of applications.
  • For example, you can use a CPU for word processing on a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.
  • A CPU loads values from memory, performs a calculation on the values, and stores the result back in memory for every calculation.
    • Memory access is slow compared to the calculation speed —> this limits the total throughput of CPUs. (This is called the Von Neumann Bottleneck, discussed in the Von Neumann Architecture section above.)
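The load/compute/store cycle described above can be sketched in a few lines. This is a toy illustration (a dict standing in for memory), not a real CPU model: the point is that every calculation pays two memory reads and one memory write, and those memory trips, not the arithmetic, are the slow part.

```python
# Minimal sketch (illustrative only): every CPU calculation follows
# the same load -> compute -> store pattern against memory.
memory = {"a": 2, "b": 3, "result": None}

def cpu_step(mem):
    x = mem["a"]          # load first operand from memory (slow)
    y = mem["b"]          # load second operand from memory (slow)
    z = x + y             # compute (fast relative to the loads)
    mem["result"] = z     # store the result back to memory (slow)
    return z

cpu_step(memory)
print(memory["result"])   # 5
```

Caches and registers exist precisely to cut down how often this pattern has to reach all the way to main memory.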

Physical Core:

What is a physical core? A physical core, also referred to as a processing unit, is an actual hardware core within the CPU. A single physical core may correspond to one or more logical cores.

  • multiprocessor, i.e. a computer with more than one central processor.
  • multi-core processor, i.e. a single processor with two or more independent cores.
  • Conventional CPUs had only a single core. However, by the mid-2000s, the concept of multiple cores was introduced to enhance processing capacity.
  • These are the actual CPUs installed in the physical servers that make up the cloud infrastructure.
  • Each physical CPU has a certain number of cores and threads that can execute instructions.
  • Single —> Double —> Quad —> Hexa —> Octa —> Deca Cores
💡
We discuss multithreading in more detail later. There we also cover how Intel & AMD provide hardware threading (hyper-threading / SMT) that can actually run two threads on the same core.

Virtualized CPUs (vCPUs):

  • In Azure, a virtual CPU (vCPU) equates to a core on a physical machine.
    • For example, if you choose a VM SKU with 4 vCPUs, your VM will be allocated 4 cores from the physical machine.
    • In other words, the cloud provider allocates 4 vCPUs from the available physical CPU resources to your VM instance.
    • The virtualization layer manages the allocation of physical CPU resources among the VMs running on the server.
  • Each VM instance runs its operating system and applications, isolated from other VM instances on the same physical server.
  • IMPORTANT: It’s a portion or share of the underlying physical CPU.
  • In the cloud, vCPUs are essentially unlimited (a single Azure VM can go up to 416 vCPUs). They should really be called virtual cores, though.
  • Note that a typical laptop or desktop has a single CPU with a fixed number of cores; you cannot increase that number.
🗝️
The difference between cores and vCPUs is that cores are physical processors, while vCPUs are virtual processors that are created by the hypervisor and presented to the VM as if they were physical processors.
  • Azure virtual machines are a great choice with VM options up to 416 vCPUs and 12 TB of memory and storage IOPS up to 3.7 million.
    • The virtualized computing power allocated to a VM can be up to 416 vCPUs
  • Example: NCasT4_v3-series
    • They are powered by Nvidia Tesla T4 GPUs and AMD EPYC 7V12(Rome) CPUs.
    • It provides 4 sizes, as in the table:
    Size                  | vCPU | Memory (GiB) | Temp storage (SSD) GiB | GPU | GPU memory (GiB) | Max data disks | Max NICs / Expected network bandwidth (Mbps)
    Standard_NC4as_T4_v3  | 4    | 28           | 180                    | 1   | 16               | 8              | 2 / 8000
    Standard_NC8as_T4_v3  | 8    | 56           | 352                    | 1   | 16               | 16             | 4 / 8000
    Standard_NC16as_T4_v3 | 16   | 110          | 352                    | 1   | 16               | 32             | 8 / 8000
    Standard_NC64as_T4_v3 | 64   | 440          | 2880                   | 4   | 64               | 32             | 8 / 32000
    • If we look at the documentation of AMD EPYC 7V12, we find that one processor of AMD EPYC 7V12 can have 64 cores!
    • However, they can scale down the use, and provide your VM with only 4 cores of this CPU (as in Standard_NC4as_T4_v3).
  • More about this when we get to Azure VMs.
  • You can check how many cores and logical processors your machine has in Windows —> Task Manager —> Performance

Logical Cores:

  • They are explained later under multithreading and hyper-threading.
  • In general, if a physical processor can run 2 threads in parallel, we say this single processor has TWO logical cores.
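Besides Task Manager, you can check the logical-processor count programmatically. A small sketch using only the standard library; note that `os.cpu_count()` reports logical processors (hardware threads), not physical cores:

```python
# os.cpu_count() returns the number of LOGICAL processors.
# On a 4-core CPU with hyper-threading it typically returns 8.
import os

logical = os.cpu_count()
print(f"Logical processors: {logical}")

# Getting the PHYSICAL core count portably needs a third-party
# library such as psutil (psutil.cpu_count(logical=False)); the
# standard library alone does not expose it.
```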

Parallelism Terms

Hardware Accelerators

Hardware accelerators are specialized hardware components designed to offload specific computational tasks from the CPU, thereby improving performance and efficiency for targeted workloads.

  1. Graphics Processing Units (GPUs)
    • They are general purpose, designed for various tasks (discussed more here: )
    • Tasks: 3D graphics rendering, Video editing and encoding, Virtual Reality (VR) and Augmented Reality (AR), Computer-aided design (CAD), Medical imaging, Scientific Computing and Research, Deep Learning, Data Science and Big Data Analytics, Financial Modeling, and Crypto Currency Mining
    • Highly parallel structure, Excellent at tasks that can be parallelized
    • Typically more expensive than a CPU
  2. Field-Programmable Gate Arrays (FPGAs):
    • Reconfigurable hardware that can be programmed to perform various tasks efficiently, often used for custom logic, signal processing, and high-performance computing.
    • They are often used in applications requiring high-performance computing and rapid prototyping.
    • Somewhat less efficient than ASICs for specific tasks
    • Complex in programming and configuring
  3. Application-Specific Integrated Circuits (ASICs):
    • ASICs are custom-designed integrated circuits optimized for specific tasks or applications.
    • They offer high performance and energy efficiency but lack the flexibility of programmable accelerators like FPGAs.
    • Example: Bitcoin mining ASICs (like Bitmain Antminer S19 series)
  4. Data Processing Units (DPUs):
    • Specializing in data-centric workloads, such as networking, storage offloading, and security operations in data centers
    • They are commonly used in storage systems, data centers, and cloud computing environments.
      • Niche market; not typically used in consumer products.
      • Requires specialized software support
    • Microsoft acquired Fungible, a maker of data processing units, for around $190 million to bolster Azure.
    • DPUs are also known as "IPUs" or "SmartNICs".
    • It’s a new class of programmable processors that will join CPUs and GPUs as one of the three pillars of computing (NVIDIA says).
    • A DPU can be used to improve data center infrastructure by increasing efficiency, enhancing data processing speed, and reducing workload on CPUs, leading to faster and more reliable data processing.
    • Key vendors in the DPU market include NVIDIA, Marvell, Fungible (acquired by Microsoft), Broadcom, Intel, Resnics, and AMD Pensando.
      https://www.datacenterknowledge.com/data-center-faqs/data-processing-units-what-are-dpus-and-why-do-you-want-them
  5. Network Processing Units (NPUs):
    • NPUs are specialized processors designed to offload network-related tasks such as packet processing, routing, and traffic management.
    • They are commonly used in routers, switches, and network appliances.
  6. Neural Processing Units (NPUs):
    • A chip that is specifically designed to run neural networks.
    • An NPU speeds up neural network operations like matrix multiplies and convolutions, which is different from how a GPU speeds up graphics.
    1. Tensor Processing Units (TPUs):
      • TPUs are custom ASICs developed by Google in 2016 specifically for accelerating machine learning workloads (NN inference and training).
      • Extremely efficient for specific tasks like tensor manipulation
      • Can be much faster than CPUs and GPUs for machine learning applications.
      • Cloud TPU is a web service that makes TPUs available as scalable computing resources on Google Cloud.
      • Google's TPUs are primarily available as part of their cloud computing services through the Google Cloud Platform (GCP). This means you access TPUs remotely rather than having physical TPUs installed on-premises or at home.
    2. Apple's Neural Engine (ANE)
      • ANE is the marketing name for a series of specialized compute cores (AI Accelerators) designed for running deep neural networks on Apple devices.
      • It's much faster than the CPU or GPU! And it's more energy efficient.
      • Running your models on the ANE will leave the GPU free for doing graphics tasks, and leave the CPU free for running the rest of your app.
      • The Neural Engine was first introduced in the A11 Bionic chip for the iPhone 8, which was released in 2017.
      • Apple’s Neural Engine is used to power a variety of ML and AI features on Apple devices, including:
        • Face ID: The Neural Engine is used to identify your face and unlock your iPhone.
        • Animoji: The Neural Engine is used to animate your facial expressions in Animoji and Memoji.
        • Live Text: The Neural Engine is used to recognize text in images and videos.
        • Smart HDR: The Neural Engine is used to improve the dynamic range of photos.
        • Cinematic mode: The Neural Engine is used to create a shallow depth of field effect in videos.
      • Programmers can use Neural Engine via Core ML API.
      • Apple Silicon —> “Core ML is designed to seamlessly take advantage of powerful hardware technology including CPU, GPU, and Neural Engine, in the most efficient way to maximize performance while minimizing memory and power consumption”.
      • Not all models can run on the ANE because not all layer types are supported. But if a model can run on the ANE, prefer it over the GPU or CPU on devices that have one.
      • Unfortunately, Apple isn't giving third-party developers any guidance on how to optimize their models to take advantage of the ANE. 
      • If we have Core ML, then what’s MPS or MLX?
  7. Neuromorphic Processors: Neuromorphic processors are specialized hardware designed to mimic the structure and function of biological neural networks. They are particularly well-suited for tasks involving pattern recognition, sensor data processing, and cognitive computing.
  8. Cryptographic Accelerators: Cryptographic accelerators are specialized hardware components designed to accelerate cryptographic operations such as encryption, decryption, and hashing. They are commonly used in security applications and cryptographic protocols.
  9. Vision Processing Units (VPUs): VPUs are specialized processors designed to accelerate computer vision tasks such as image recognition, object detection, and video analytics. They are commonly used in surveillance systems, autonomous vehicles, and robotics.
  10. Quantum Computing Units (QPUs) :
    • QPUs are the core computational units in quantum computers, which use quantum bits (qubits) instead of classical bits for computation.
    • Applications: Cryptography, Complex Simulations, Optimization Problems, Future applications in Machine Learning and Drug Discovery
    • Can theoretically solve complex problems much faster than classical computers.
    • Extremely expensive + Very early stage technology
    • Extremely delicate; prone to errors due to quantum decoherence
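To make the accelerator list above concrete: the core operation NPUs, TPUs, and VPUs speed up is the multiply-accumulate pattern behind convolutions and matrix multiplies. A naive pure-Python sketch of a 2D convolution (the function name and tiny inputs are my own illustration); accelerators win by running thousands of these multiply-accumulate steps in parallel instead of one at a time:

```python
# Naive 2D convolution: the multiply-accumulate workload that
# NPUs/TPUs execute massively in parallel in hardware.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1       # output height (valid padding)
    ow = len(image[0]) - kw + 1    # output width
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]                   # sums each 2x2 window's diagonal
print(conv2d(image, kernel))        # [[6.0, 8.0], [12.0, 14.0]]
```

Every output cell is independent of the others, which is exactly why this maps so well onto parallel hardware.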
👌
Of course, you can also consider multi-core CPUs as Hardware accelerators 😄
👌
While CPUs and GPUs are general-purpose and versatile, TPUs, DPUs, QPUs, ASICs, and FPGAs are more specialized but can offer significant advantages in their specific domains.

Principles of GPUs

  • Graphics processing units (GPUs) are specialized processing cores that you can use to speed up computational processes.
  • IMPORTANT: These cores were initially designed to process images and visual data. Their ability to handle parallel processing tasks efficiently has led to their adoption in a wide range of fields, beyond just gaming, to enhance other computational processes.
  • Here are some of the key areas where GPUs are used:

    Traditional Graphics and Gaming:

    • 3D graphics rendering: This is where GPUs excel, handling tasks like lighting, shading, and texture mapping to create realistic and immersive visuals in games, movies, and animations.
    • Video editing and encoding: GPUs can accelerate the processing of large video files, making editing and encoding faster and smoother.
    • Virtual Reality (VR) and Augmented Reality (AR): GPUs are essential for powering the demanding graphics processing required for VR and AR experiences.

    Professional and Scientific Applications:

    • Computer-aided design (CAD): GPUs enable smoother and more realistic 3D modeling and rendering in CAD software, improving design workflows.
    • Medical imaging: GPUs accelerate the processing of medical images like X-rays, CT scans, and MRIs, aiding in faster diagnosis and treatment.
    • Scientific Computing and Research: Simulations, complex calculations, and data analysis in various scientific fields benefit from the parallel processing power of GPUs.
      • Fields such as computational biology, physics, chemistry, and climate modeling rely on GPUs to perform simulations and data analysis. The parallel computing power of GPUs helps researchers process large datasets and run complex simulations more quickly.
    • Financial modeling and risk analysis: High-performance computing with GPUs allows for faster and more accurate financial modeling and risk assessments.

    Emerging Technologies:

    • Deep learning: The parallel processing capabilities of GPUs are well-suited for deep learning algorithms, which are used in various applications like image recognition, natural language processing, and speech recognition.
    • Data Science and Big Data Analytics: Tasks such as data visualization, statistical analysis, and big data processing can be significantly accelerated using GPUs. Many data science libraries and frameworks have GPU-accelerated implementations for tasks like matrix operations, speeding up computations.
    • Cryptocurrency mining: Certain cryptocurrencies rely on complex calculations that GPUs can efficiently perform, making them popular among miners. (Ex: Proof-of-Work algorithms)

  • High parallelism: GPUs have thousands of cores capable of parallel computations, making them efficient for executing large matrix operations common in neural networks.
  • Programmability: GPUs have flexible programming models and extensive libraries like CUDA and TensorFlow, making them easier to use for various neural network implementations.
  • Wide availability: GPUs are readily available and well-integrated with existing computing infrastructure.
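A minimal sketch of the large matrix multiplies mentioned above, the dominant operation in neural network workloads. This runs on the CPU via NumPy (assumed available); GPU libraries such as CuPy or PyTorch (with `.to("cuda")`) expose the same operation but dispatch it across thousands of GPU cores:

```python
# One large, highly parallelizable matrix multiply - the building
# block of neural network inference and training.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 512))   # e.g. a batch of activations
b = rng.standard_normal((512, 128))   # e.g. a layer's weight matrix

c = a @ b                             # 256*128 independent dot products
print(c.shape)                        # (256, 128)
```

Each of the 256×128 output entries can be computed independently, which is what lets a GPU's thousands of cores all work at once.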

What about the 6th generation of TPUs, Trillium?



GPUs vs TPUs vs LPUs

Storage & Connectivity Accelerators

  1. NVMe Storage 
  2. Computational Storage
  3. PCIe Protocol

Classes of Computers

  1. PCs
  2. Servers
  3. Supercomputers

Workstation & Cluster

Choose from the largest GPU catalog in the world. Leverage the latest NVIDIA GPUs including Ampere A100s with up to 8 GPUs.

Video RAM (VRAM)

  • Virtual RAM is something else; we discussed it under paging and swap files.

AMD vs ARM vs NVIDIA vs Intel? Which is a company, and which is an architecture?

GPU Kernel

TPU Pod


Virtual GPU

Ray Tracing

MPS and MLX

Parallel Hardware and Parallel software

  • Serial Hardware (HW):
    • runs (more or less) a single job at a time.
  • Each location in memory can store either (1) Data (D) or (2) an Instruction (I)
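A small sketch tying this to the parallelism terms from the table of contents: the same independent tasks run serially and via a thread pool. The task function and sizes are my own illustration; note that for CPU-bound pure-Python code the GIL makes threads interleave rather than truly run in parallel, and multiprocessing is the usual escape hatch:

```python
# Serial vs. thread-pool execution of independent tasks.
from concurrent.futures import ThreadPoolExecutor

def task(n):
    return n * n  # stand-in for an independent unit of work

# Serial: one task at a time, like serial hardware.
serial = [task(n) for n in range(8)]

# Concurrent: tasks scheduled across a pool of threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(task, range(8)))

print(serial == threaded)  # True: same results, different scheduling
```

The results are identical; only the scheduling differs, which is the essence of the concurrency-vs-parallelism distinction.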

RISC vs CISC

NVIDIA GRID technology

/subscriptions/ed907bc6-4266-4c03-b824-041e1965bcea/resourceGroups/gpu-rg/providers/Microsoft.Compute/virtualMachines/fuse-gpu-1

azureuser@20.122.54.74

Ubuntu Data Science