
Tensor Operations (PyTorch)

Last updated: 12/13/2025

Contiguity (Row-Major Order, Fortran & MatLab Column-Major Order, .stride, Traversing Memory to Print Tensor, data_ptr, What is Contiguous, Importance & Benefits, Common Operations for Contiguity & Non-Contiguity)

I stopped to fix index_select, then need to finish vdot, then atomicadd.



Dimension Reordering Operations




Summary of Multiplication and Product Functions in PyTorch

| Multiplication Name | Function | Explanation | Symbol | Vector or Matrix |
| --- | --- | --- | --- | --- |
| Tensor Multiplication | `.matmul`, `@` | Performs matrix multiplication or batch matrix multiplication | A·B, AB | Vectors/Matrix |
| Matrix Multiplication | `.mm` | Performs matrix multiplication only on 2D matrices | A·B | 2D Matrix |
| Hadamard Product / Element-Wise Multiplication | `.mul`, `.multiply`, `*` | Performs element-wise multiplication | A ∗ B, A ⊙ B | Vectors/Matrix |
| Dot Product / Inner Product / Scalar Product | `.dot`, `.inner` | Multiplies matching elements and sums them into a scalar | a ⋅ b | Vectors; `.inner` accepts matrices but contracts the last dim only |
| Outer Product | `.outer` | Multiplies every element of a with every element of b | a ⊗ b, a ∧ b | Vectors only |
| Kronecker Product | `.kron` | Block-wise generalization of the outer product | A ⊗ B | Vectors/Matrix |
| Cross Product | `.cross` | 3D vector cross product | a × b | Vectors only |
| Cartesian Product | `.cartesian_prod` | Returns a 2D tensor of all combination pairs (no multiplication) | A × B | Vectors only |
To use the matrix-multiplication-style operations (matmul, mm, dot, …), both tensors must be of the same data type.
  • If one is float32 and the other is float64, PyTorch raises a runtime error about mismatched types (element-wise ops like mul instead type-promote to the wider dtype).
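A quick sketch of the dtype rules described above: matmul with mismatched dtypes errors, while element-wise multiplication type-promotes.

```python
import torch

a = torch.ones(2, 2, dtype=torch.float32)
b = torch.ones(2, 2, dtype=torch.float64)

# matmul with mismatched dtypes raises a RuntimeError
try:
    torch.matmul(a, b)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Casting one side first fixes it
c = torch.matmul(a, b.to(torch.float32))  # result is float32

# Element-wise mul type-promotes instead of erroring
d = a * b  # result is float64
```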
🔑

✅ Best Practice:

Use float32 unless you explicitly need:

  • Mixed precision (float16) for speed
  • Double precision (float64) for numerical stability
  • Complex numbers for Fourier or signal ops
  • More details here: for more data types like torch.bfloat16 or torch.qint8, etc.
  • All data types are here:




💡
The main bottleneck in modern deep learning is usually Memory Bandwidth (moving data), not Compute (doing the math).

Fused operations reduce the number of memory read/write cycles, leading to significant speedups.

Kernel Fusion: The primary benefit is saving Memory Bandwidth.


.addr // computes the outer product of two vectors and adds it to a matrix input: out = β·input + α·(vec1 ⊗ vec2), in one fused call
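A minimal sketch of the fused pattern just mentioned (outer product + add in one kernel) compared to doing the same thing in two separate ops:

```python
import torch

M = torch.zeros(2, 3)
v1 = torch.tensor([1., 2.])
v2 = torch.tensor([1., 2., 3.])

# Fused: out = beta * M + alpha * (v1 ⊗ v2); beta=1, alpha=1 by default
out = torch.addr(M, v1, v2)

# Equivalent unfused version: outer product first, then add
reference = M + torch.outer(v1, v2)
```

Both produce the same values; the fused call avoids materializing and re-reading the intermediate outer-product tensor.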

I still need to complete tensordot, cross, outer, kronecker, vdot, and so on…

torch.nn.Linear

torch.vdot
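A minimal sketch of .vdot: for real vectors it behaves exactly like .dot, but for complex vectors it conjugates its first argument.

```python
import torch

# Real vectors: vdot == dot
a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])
real_result = torch.vdot(a, b)   # 1*4 + 2*5 + 3*6 = 32

# Complex vectors: the FIRST argument is conjugated
x = torch.tensor([1 + 2j])
y = torch.tensor([3 + 4j])
complex_result = torch.vdot(x, y)  # conj(1+2j) * (3+4j) = (1-2j)(3+4j) = 11 - 2j
```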


.outer (Outer Product)

  • It’s very SIMPLE.
  • It takes just 1 dimensional tensors.
  • If input is a vector of size n and vec2 is a vector of size m, then out must be a matrix of size (n×m).
    python
    v1 = torch.tensor((1, 2, 3, 4)) # Size [4] --> treated as a column (4, 1)
    v2 = torch.tensor((1, 2, 3))    # Size [3] --> treated as a row (1, 3)
    result = torch.outer(v1, v2)    # Size [4, 3]
    
    

    If we have $\mathbf{u} = \begin{pmatrix}u_1 \\ u_2 \\ u_3\end{pmatrix}, \quad \mathbf{v} = \begin{pmatrix}v_1 \\ v_2 \\ v_3\end{pmatrix} \Rightarrow \mathbf{u} \otimes \mathbf{v} = \mathbf{u}\mathbf{v}^T = \begin{pmatrix}u_1 \\ u_2 \\ u_3\end{pmatrix}\begin{pmatrix}v_1 & v_2 & v_3\end{pmatrix} = \begin{pmatrix}u_1 v_1 & u_1 v_2 & u_1 v_3 \\ u_2 v_1 & u_2 v_2 & u_2 v_3 \\ u_3 v_1 & u_3 v_2 & u_3 v_3\end{pmatrix}$, with a shape $(3\times1)(1\times3) = (3\times3)$

    🔑
    The outer product $a \otimes b$ is the same as writing $ab^T$ in matrix format.

    However, when working with 1D vectors, we typically omit the transpose for simplicity, since ⊗ already implies an outer product.

  • To do the outer product using matmul, write the vectors in matrix (column) format:
    python
    v1 = torch.tensor((1, 2, 3, 4)).reshape(4, 1) # written as a column matrix
    v2 = torch.tensor((1, 2, 3)).reshape(3, 1)    # written as a column matrix
    outer = torch.matmul(v1, v2.T)                # (4, 1) @ (1, 3) → (4, 3)
    
    python
    a = torch.tensor([[1],
                      [2],
                      [3]])
    b = torch.tensor([[1],
                      [2],
                      [3]])
    
    # INNER PRODUCT: aᵀ @ b → scalar (as a 1×1 matrix)
    inner_product = torch.matmul(a.T, b)
    
    # OUTER PRODUCT: a @ bᵀ → 3×3 matrix
    outer_product = torch.matmul(a, b.T)

Outer product is symmetric up to transposition:

$a \otimes b \ne b \otimes a$
$a \otimes b = (b \otimes a)^T$

Low-Rank Matrix Factorization

  • It's a way to approximate a big matrix using two smaller matrices — so you save space, reduce noise, or learn meaningful patterns.

Use Cases:

  1. Constructing Covariance Matrices
    • Remember that $\text{Var}(X_1) = (x_1 - \mu_1)^2$ and $\text{Cov}(X_1, X_2) = (x_1 - \mu_1)(x_2 - \mu_2)$ (per-sample contributions, averaged over the samples).
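A small sketch of this use case: the sample covariance matrix is the average of per-sample outer products of centered data, which is exactly what the vectorized `Xc.T @ Xc` computes.

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 3)   # 100 samples, 3 features
mu = X.mean(dim=0)
Xc = X - mu               # centered samples

# Covariance as an average of per-sample outer products (x_i - mu)(x_i - mu)^T
n = X.shape[0]
cov_outer = sum(torch.outer(Xc[i], Xc[i]) for i in range(n)) / (n - 1)

# Same thing, vectorized
cov_matmul = Xc.T @ Xc / (n - 1)
```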

.kron (Kronecker Product) (Matrix Direct Product) (Generalized Outer Product)

  • The Kronecker product (denoted $A \otimes B$) is a generalization of the outer product from vectors to matrices.
  • Given an $(m\times n)$ matrix $A$ and a $(p \times q)$ matrix $B$, the Kronecker product is $C = A \otimes B$ with shape $(mp \times nq)$.
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \Rightarrow A \otimes B = \begin{bmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{bmatrix} = \begin{bmatrix} a_{11} b_{11} & a_{11} b_{12} & a_{12} b_{11} & a_{12} b_{12} \\ a_{11} b_{21} & a_{11} b_{22} & a_{12} b_{21} & a_{12} b_{22} \\ a_{21} b_{11} & a_{21} b_{12} & a_{22} b_{11} & a_{22} b_{12} \\ a_{21} b_{21} & a_{21} b_{22} & a_{22} b_{21} & a_{22} b_{22} \end{bmatrix}$$
python
A = torch.tensor([[1, 2],
                  [3, 4]])

B = torch.tensor([[0, 5],
                  [6, 7]])

C = torch.kron(A, B)  # shape (4, 4)

python
mat1 = torch.eye(2)
mat2 = torch.arange(1, 5).reshape(2, 2)
torch.kron(mat1, mat2)  # block-diagonal: copies of mat2 along the diagonal


🔑
The Kronecker product is a way to turn a small matrix into a bigger matrix by multiplying its elements with another matrix.

The Kronecker product of two matrices A and B means: “Take every number in matrix A, and replace it with that number multiplied by all of matrix B.”

💡
Outer product is a special case of the Kronecker product when inputs are vectors.
Kronecker product is not commutative:
$A \otimes B \ne B \otimes A$
  • In matrix-matrix multiplication, we learned that each element of $C$ is $C_{ij} = \sum_{k=1}^n A_{ik} \cdot B_{kj}$.
  • In the Kronecker product, the output index along each dimension $t$ is $k_t = i_t \cdot b_t + j_t$ for $0 \le t \le n$, where $i_t$ indexes $A$, $j_t$ indexes $B$, and $b_t$ is the size of $B$ along dimension $t$.
    • It means: move to the block selected by $i_t$ (each block is $b_t$ wide), then offset by $j_t$ inside that block.
  • ✳️ If one tensor has fewer dimensions —> PyTorch will automatically add extra dimensions (using .unsqueeze()) to the smaller one.
    python
    a = torch.tensor([[1, 2]])    # shape: (1, 2)
    b = torch.tensor([3, 4, 5])   # shape:    (3)
    result = torch.kron(a, b)     # PyTorch treats b as shape (1, 3) # Final result shape (1x1, 2x3) = (1, 6)

Basic Properties

  • It does not matter where we place multiplication with a scalar: $(\alpha A) \otimes B = A \otimes (\alpha B) = \alpha (A \otimes B)$
  • Taking the transpose before carrying out the Kronecker product yields the same result as doing so afterwards: $(A \otimes B)^\top = A^\top \otimes B^\top$
  • The Kronecker product is associative: $(A \otimes B) \otimes C = A \otimes (B \otimes C)$
  • The Kronecker product is right-distributive: $(A + B) \otimes C = A \otimes C + B \otimes C$
  • The Kronecker product is left-distributive: $A \otimes (B + C) = A \otimes B + A \otimes C$
  • The product of two Kronecker products yields another Kronecker product: $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$
  • The trace of the Kronecker product of two matrices is the product of the traces of the matrices: $\operatorname{tr}(A \otimes B) = \operatorname{tr}(A)\operatorname{tr}(B)$
    • The trace of a square matrix is the sum of its diagonal elements.
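A quick numerical check of three of the properties above (transpose, mixed product, and trace), using small hand-picked matrices:

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0., 5.], [6., 7.]])
C = torch.tensor([[1., 0.], [2., 1.]])
D = torch.tensor([[3., 1.], [0., 2.]])

# Transpose property: (A ⊗ B)^T == A^T ⊗ B^T
lhs_t = torch.kron(A, B).T
rhs_t = torch.kron(A.T, B.T)

# Mixed-product property: (A ⊗ B)(C ⊗ D) == (AC) ⊗ (BD)
lhs_m = torch.kron(A, B) @ torch.kron(C, D)
rhs_m = torch.kron(A @ C, B @ D)

# Trace property: tr(A ⊗ B) == tr(A) * tr(B)
trace_lhs = torch.trace(torch.kron(A, B))
trace_rhs = torch.trace(A) * torch.trace(B)
```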

Use Cases:

  1. Repeating a Pattern (Matrix Tiling)
    • You’re building a large image or grid that follows a pattern — like tiles on a floor.
    • Imagine a pattern like
      python
      [0 1]
      [1 0]

      And you want to repeat it across a 4x4 area, scaled differently in each region (Brightness, Weights, …).

    • Think of it like “Take this small checkerboard and copy it multiple times into a bigger grid — but scale each copy differently."
      python
      # Tile pattern (e.g., black/white checkerboard)
      tile = torch.tensor([[0, 1],
                           [1, 0]], dtype=torch.float32)
      
      # Control pattern (e.g., brightness or number of times to repeat)
      pattern = torch.tensor([[1, 2],
                              [3, 4]], dtype=torch.float32)
      
      # Kronecker product to scale and tile
      result = torch.kron(pattern, tile)  # 4x4: each region is the tile scaled by pattern[i, j]
  2. Combining Systems (Quantum / State Expansion)
    • You have two small systems (like coin flips or binary states), and want to model the combined system.
    • You want to simulate both together as one big system with 4 states (2 × 2).
    • This is how quantum computing combines qubits.
      python
      # 2-state systems (e.g., [1, 0] is "on", [0, 1] is "off")
      sys_A = torch.tensor([[1.], [0.]])  # A is ON
      sys_B = torch.tensor([[0.], [1.]])  # B is OFF
      
      # Combined system (A ⊗ B)
      combined = torch.kron(sys_A, sys_B)
  3. Building Smart Weight Matrices (Machine Learning) 🧠
    • You’re training a neural network, and one of your layers has a huge matrix of weights.
    • You realize:
      • It’s slow
      • Takes too much memory
      • But most of the values are based on a smaller pattern.
    • You can use Kronecker product to build that big matrix from small ones.
      python
      # Base patterns
      A = torch.tensor([[1, 2],
                        [3, 4]], dtype=torch.float32)
      
      B = torch.tensor([[0.1, 0.2],
                        [0.3, 0.4]], dtype=torch.float32)
      
      # Build a large structured weight matrix
      W = torch.kron(A, B)
      
      # Input to multiply (size must match)
      x = torch.randn(4)  # W is 4x4
      
      # Apply the layer
      y = W @ x
  4. Efficient Neural Network Layer
    • A paper called “Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters” discusses this idea.
    • Instead of training normal FC layers, we will have PHM layer (Parameterized Hypercomplex Multiplication).
    • Each layer has a matrix $\mathbf{H}$ built from Kronecker products of small matrices, $\mathbf{H} = \sum_{i=1}^{n} \mathbf{A}_i \otimes \mathbf{S}_i$, and such a construction reduces the number of parameters to $1/n$.
    • "Instead of learning 1000 values, I'll learn 10 values and repeat them smartly to build the big thing."
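A toy sketch of the PHM idea (the names A, S and the shapes here are illustrative, not the paper's exact setup): the full weight is assembled as a sum of Kronecker products of small learned factors.

```python
import torch

n = 2                      # hypercomplex "dimension" (n=4 would correspond to quaternions)
d_in, d_out = 4, 4

# Small learnable factors (illustrative shapes)
A = torch.randn(n, n, n)                   # n matrices of shape (n, n)
S = torch.randn(n, d_out // n, d_in // n)  # n matrices of shape (d_out/n, d_in/n)

# PHM weight: sum of Kronecker products, instead of a dense (d_out, d_in) matrix
H = sum(torch.kron(A[i], S[i]) for i in range(n))  # shape (d_out, d_in)

x = torch.randn(d_in)
y = H @ x
```

Only the small factors are trained; the big matrix H is reconstructed on the fly, which is where the 1/n parameter saving comes from.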

Manual Implementation of Kronecker Product

python
def manual_kron(A: torch.Tensor, B: torch.Tensor):
    a_rows, a_cols = A.shape
    b_rows, b_cols = B.shape

    # Shape of the output: (a_rows * b_rows, a_cols * b_cols)
    C = torch.zeros((a_rows * b_rows, a_cols * b_cols), dtype=A.dtype)

    for i in range(a_rows):
        for j in range(a_cols):
            # Multiply the scalar A[i, j] by the full matrix B
            C[i*b_rows:(i+1)*b_rows, j*b_cols:(j+1)*b_cols] = A[i, j] * B

            # Note that we loop over matrix A, and for every element of it,
            # we generate a (b_rows, b_cols) block in the output

    return C

Fully Vectorized Kronecker Product (2D matrices only)

axes=([1, 2], [1, 2]) tells NumPy to sum over the 2nd and 3rd dimensions of both arrays.
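One way to vectorize the Kronecker product with pure broadcasting, as a sketch (in practice torch.kron itself should be preferred):

```python
import torch

def kron2d(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Kronecker product of two 2D matrices via broadcasting (no loops)."""
    m, n = A.shape
    p, q = B.shape
    # (m, 1, n, 1) * (1, p, 1, q) -> (m, p, n, q), where [i, k, j, l] = A[i, j] * B[k, l]
    blocks = A[:, None, :, None] * B[None, :, None, :]
    # Collapsing (i, k) -> row i*p + k and (j, l) -> col j*q + l gives the Kronecker layout
    return blocks.reshape(m * p, n * q)

A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[0, 5], [6, 7]])
result = kron2d(A, B)
```

The reshape works because the contiguous (m, p, n, q) layout already interleaves the indices exactly the way the Kronecker block structure requires.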






.tensordot (Tensor Dot Product) (Generalized Contraction) HARD

  • Many PyTorch users get stuck when they have to move beyond simple 2D matrix multiplication (torch.matmul or @) into 3D, 4D, or 5D tensors.
👌🏻
Everyone loves matrix multiplication when it’s just 2D. It’s clean, it’s simple.

But what happens when you are doing deep learning and you hit 4D image tensors? Suddenly, you find yourself desperately using .view(), .permute(), and .transpose() just to get dimensions to line up so you can use matmul. It’s messy, error-prone, and hard to read six months later.

What if I told you there is a single function that handles multiplying complex tensors without ever needing to reshape them first? Today, we are mastering torch.tensordot.

  • What is Tensor Contraction?
    • Don't let the name scare you. "Contraction" just means we are choosing specific dimensions (axes) from two different tensors, multiplying the elements along those axes, and summing them up.
    • Because we sum them up, those dimensions "contract" —> they disappear from the final output.
  • .tensordot takes three main arguments: your first tensor, your second tensor, and, most importantly, the dims parameter.
  • Basic Syntax: torch.tensordot(A, B, dims)
  • The magic happens in dims.
  1. Integer: number of last axes of A and first axes of B to contract (Not Explicit Axis Index).
  2. A tuple of two lists: explicit axes from A and B, dims=([List A], [List B]).
    • List A: The indices of the dimensions in the first tensor you want to contract.
    • List B: The indices of the dimensions in the second tensor you want to contract against them.

Case 0: Dot Product .dot

python
A = torch.tensor([1, 2, 3]) # (3,)
B = torch.tensor([4, 5, 6]) # (3,)
result = torch.tensordot(A, B, dims=1) # Contract 3 from A with the 3 from B
print(result) # tensor(32)

Case 1: 2D Matrix Multiplication .mm

python
A = torch.randn(3, 4)    
B = torch.randn(4, 5)

torch.tensordot(A, B, dims=1).size() # torch.Size([3, 5]) — same as A @ B

Case 2: dims = 0 → Outer Product

  • dims = 0 means: “do NOT contract anything.”
  • No summation, just multiplication one by one.
python
A = torch.randn(4)    
B = torch.randn(5)

torch.tensordot(A, B, dims=0).size() # torch.Size([4, 5]), same as in normal outer product

# If A = [1, 2, 3, 4]
# and B = [6, 7, 8, 9, 10]
# then result[i][j] = A[i] * B[j], e.g. the first row is [6, 7, 8, 9, 10]

  • Now let’s say I have A of size (3,4), and B of size (5), and I said torch.tensordot(A, B, dims=0) , what will happen?
  • Say we have A is [[1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12]]
  • And B is [1, 2, 3, 4, 5]
  • Since dims=0 means no contraction, we again get an outer product.
  • We will take the first row from A and perform outer product with B, then second row from A, and outer product with B, and so on…
python
A = torch.randn(3, 4)
B = torch.randn(5)
torch.tensordot(A, B, dims=0).size() # torch.Size([3, 4, 5])

  • So real quick: if A = torch.randn(3, 4); B = torch.randn(4, 5), then torch.tensordot(A, B, dims=0) results in a 4-D tensor by taking the outer product of every element of A with every element of B, which gives a tensor of shape (3, 4, 4, 5).
  • Even though both tensors have a matching dimension of size 4, dims=0 treats them as distinct independent axes. It does not attempt to align or broadcast them.
👌🏻
Outer product of two matrices produces a 4-D tensor. You cannot perform this using .outer().
👌🏻
When dims=0, tensordot creates an outer product: every element of A is multiplied with every element of B, producing a tensor whose shape is the concatenation of A’s shape and B’s shape.
$\text{Shape}(A) + \text{Shape}(B) \rightarrow \text{Result Shape}$
👌🏻
You cannot use dims=([ ], [ ]) to get an outer product.

The only correct way to get an outer product with tensordot is: dims = 0 .

Case 3 — dimension size 1

python
A = torch.tensor([[1, 2, 3]])   # shape (1, 3)
B = torch.tensor([[4, 5, 6]])   # shape (1, 3)

result = torch.tensordot(A, B, dims=1)  # RuntimeError!
  • dims=1 tries to contract the last dimension of A (size 3) with the first dimension of B (size 1); since the sizes don't match, PyTorch raises an error rather than broadcasting.

Case 4 — explicit axes (Matrix Multiplication)

  • Here, we say: contract the last axis of A (axis = 1) with the first axis of B (axis = 0).
  • It’s like normal matrix multiplication.
python
A = torch.randn(2, 3)
B = torch.randn(3, 2)
result = torch.tensordot(A, B, dims = ([1], [0])) # dims = 1

print(result) # Size: [2, 2]

Case 5 — explicit axes (Double Contraction) (A:BA :B)

  • Tensor contractions can be thought of as the higher-dimensional equivalent of matrix-matrix multiplications.
  • The symbol “:” means double contraction (also called double dot product).
  • Double Contraction is the tensor-analogue of the dot product but applied twice.
  • Dot product = contract one index from each matrix.
  • Double dot product = contract two indices from each matrix.
  • A double dot product between two tensors of orders m and n results in a tensor of order $\text{order}(A{:}B) = m + n - 4$ (order here means .dim()), because two axes are removed from each tensor.
👌🏻
I spent two days to get to this understanding:

We are not doing matrix multiplication here per se; all we want is tensor contraction. This moves our thinking to the dot product: vectors of matching lengths that we multiply element by element and sum to get a scalar.

Let’s understand through an example:

  • If $A$ has size $(4, 3, 2)$, which is 4 blocks of $3 \times 2$, and $B$ has size $(2, 3, 5)$, which is 2 blocks of $3 \times 5$, let's say we want to do torch.tensordot(A, B, dims = ([2, 1], [0, 1])).
  • You might think the matrices are organized very well for matrix multiplication, where we want rows $\times$ columns, i.e., $(3 \times 2)$ vs $(2 \times 3)$.
    • This is the biggest point of confusion with tensor contractions. We are not thinking of this as matrix multiplication, but rather as vector dot products.
  • We requested contracting using this mapping:

    | A axis | B axis | size |
    | --- | --- | --- |
    | 2 (size 2) | 0 (size 2) | ✓ matches |
    | 1 (size 3) | 1 (size 3) | ✓ matches |

  • [IMPORTANT] But how does tensordot actually execute this contraction?
  • To perform this operation efficiently, PyTorch (and NumPy) follows a three-step process: Permute $\rightarrow$ Reshape $\rightarrow$ Matrix Multiply.
    • You need to study this to understand how it is different from .reshape/.view.
  • Our goal is to have something like $(4, 6) \times (6, 5)$, so applying @ to them is easy.
  • The answer that first comes to mind is: just flatten — A.reshape(A.shape[0], -1); B.reshape(-1, B.shape[2]) — but it's not that simple; let's understand why this needs a permute first.
  • Let’s say we have the following:
    python
    A = torch.randint(low=0, high=10, size=(4, 3, 2))   # integers 0–9
    B = torch.randint(low=0, high=10, size=(2, 3, 5))   # integers 0–9
    result = torch.tensordot(A, B, dims = ([2, 1], [0, 1]))
    
    # Let's say
    A : (4, 3, 2)
    tensor([[[0, 1],
             [5, 7],
             [9, 9]],
    
            [[2, 7],
             [3, 9],
             [4, 0]],
    
            [[2, 8],
             [4, 4],
             [7, 4]],
    
            [[1, 0],
             [5, 4],
             [8, 4]]])
             
    B: (2, 3, 5)
    tensor([[[2, 7, 1, 8, 2],
             [9, 3, 6, 7, 3],
             [3, 0, 5, 4, 8]],
    
            [[3, 9, 1, 5, 1],
             [2, 6, 7, 7, 5],
             [5, 5, 1, 5, 2]]])         
  • In a normal 2D matrix multiplication, we loop row by row over A and column by column over B; in each iteration we perform a dot product to get one result cell:
    $$\text{Result}[i, j] = \sum_{k=0}^{K-1} \mathbf{A}[i, k] \times \mathbf{B}[k, j]$$

  • We have:
    • A.shape = (4, 3, 2) # think A[a, j, i]
    • B.shape = (2, 3, 5) # think B[i, j, b]
  • We want to perform:
    • result = torch.tensordot(A, B, dims=([2, 1], [0, 1]))
  • We know the output should be of size (4, 5), as two dimensions have been contracted from each tensor.
  • Let's trace result[0,0] explicitly:
    • a = 0 (the output row index — A's kept axis 0)
    • b = 0 (the output column index — B's kept axis 2)
    • i ∈ {0, 1}
    • j ∈ {0, 1, 2}
  • Now, since we need to double contract, the formula sums over i and then over j:
    $$\text{result}[0,0] = \sum_{i}\sum_{j} A[0,j,i] \cdot B[i,j,0]$$
  • Let's perform the sum with the values above: for i = 0 we get $0\cdot2 + 5\cdot9 + 9\cdot3 = 72$; for i = 1 we get $1\cdot3 + 7\cdot2 + 9\cdot5 = 62$; total $72 + 62 = 134$, which matches result[0,0] in the final output.

  • $\text{Result}[i, j, k, l] = \sum_{m=0}^{1} \mathbf{A}[i, j, m] \times \mathbf{B}[m, k, l]$ (this is the single-contraction form, contracting only A's last axis with B's first).

The usefulness of permute and reshape functions is that they allow a contraction between a pair of tensors (which we call a binary tensor contraction) to be recast as a matrix multiplication.

Batch matrix multiplication is a special case of a tensor contraction.

python
# 1. Permute so the contracted axes line up in the same (j, i) order
A_perm = A.permute(0, 1, 2)   # (4, 3, 2) — already (kept, j, i), no change needed
B_perm = B.permute(1, 0, 2)   # (3, 2, 5) — now (j, i, kept)

# 2. Reshape: group the contracted axes together
A_mat = A_perm.reshape(4, 6)  # (4, 3*2)
B_mat = B_perm.reshape(6, 5)  # (3*2, 5)

# 3. Matrix multiply
result = A_mat @ B_mat        # (4, 5)
  1. Flatten (reshape) is basically grouping the double contraction into one shot
    • PyTorch effectively flattens the tensors so the contracted dimensions are grouped on the "inside."
    • Flattening / reshaping A: we keep dim 0 and combine dims 1 and 2: $(4, 3, 2) \rightarrow (4, 3 \times 2) \rightarrow \text{matrix size } (4, 6)$
    • Flattening / reshaping B (after permuting it to $(3, 2, 5)$ so the contracted axes match A's order): we combine dims 0 and 1 and keep dim 2: $(3, 2, 5) \rightarrow (3 \times 2, 5) \rightarrow \text{matrix size } (6, 5)$
  2. Now just multiply, and we get the final result of shape $(4, 5)$.
python
# Final Result
tensor([[134, 111, 134, 170, 141],
        [ 82, 140, 110, 151,  97],
        [113, 142, 101, 160, 108],
        [ 99,  66, 103, 123, 109]])


  • .tensordot generalizes:
    • inner product .dot
    • outer product .outer
    • matrix multiply .mm
    • batch .matmul
    • multi-axis contraction .tensordot
    • Einstein summation patterns .einsum
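The claims in this list can be checked directly: tensordot with the right dims reproduces dot, outer, and mm.

```python
import torch

a = torch.arange(3.)        # (3,)
b = torch.arange(3.) + 1    # (3,)
M = torch.randn(3, 4)
N = torch.randn(4, 5)

# inner product: dims=1 contracts the single shared axis
ok_dot = bool(torch.allclose(torch.tensordot(a, b, dims=1), torch.dot(a, b)))

# outer product: dims=0 contracts nothing
ok_outer = bool(torch.allclose(torch.tensordot(a, b, dims=0), torch.outer(a, b)))

# 2D matrix multiply: dims=1, or the explicit-axes form dims=([1], [0])
ok_mm = bool(torch.allclose(torch.tensordot(M, N, dims=1), M @ N))
ok_mm_explicit = bool(torch.allclose(torch.tensordot(M, N, dims=([1], [0])), M.mm(N)))
```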

What is the full geometric intuition behind tensor contractions???

Tensor contraction can be between a 3D and a 2D tensor (the operands don't have to be of the same order).

https://www.youtube.com/watch?v=RxbL5i8gczg (Log from Tensor Contraction Section).

🔑

tensordot vs einsum

The Rule of Thumb

  • Default to einsum: use it for 95% of your code (modeling code, layers, loss functions). It is self-documenting and handles permutations automatically.

  • For readability & documentation: the equation bij,bjk->bik documents itself. You can instantly see it is a batch matrix multiplication. The equivalent tensordot requires you to mentally map indices to axis numbers.

  • einsum supports 3+ tensors; tensordot is binary.

  • Use tensordot when you know the contraction pattern and want the fastest GEMM-like (General Matrix Multiply) path.
  • Use tensordot when you want guaranteed "single matmul" behavior, because a single einsum may be decomposed internally into several matmuls, plus adds, plus permutes, plus buffer allocations, etc.
    • This is because tensordot's behavior is always:
      1. Permute A and B so that the contracted axes are grouped together (if needed).
      2. Reshape both into 2D matrices.
      3. Perform one matrix multiplication (GEMM) and reshape the result back.
    • If you know your contraction is really "one big GEMM", tensordot makes that structure explicit and reduces the risk of the framework doing something surprising.

  • tensordot is constrained —> only contracts equal-length dimensions.
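The bij,bjk->bik equation from above, checked against torch.bmm:

```python
import torch

A = torch.randn(10, 3, 4)   # batch of 10 (3x4) matrices
B = torch.randn(10, 4, 5)   # batch of 10 (4x5) matrices

# Self-documenting batch matrix multiplication
out_einsum = torch.einsum('bij,bjk->bik', A, B)
out_bmm = torch.bmm(A, B)
```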




.repeat (Whole Blocks)

  • Similar to numpy.tile()

.repeat_interleave (Individual Elements)

  • Similar to numpy.repeat()
  • Repeat elements of a tensor.
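The difference between the two in one small example: .repeat tiles the whole tensor, .repeat_interleave repeats each element in place.

```python
import torch

t = torch.tensor([1, 2, 3])

whole_blocks = t.repeat(2)             # tile the whole tensor: [1, 2, 3, 1, 2, 3]
per_element = t.repeat_interleave(2)   # repeat each element:   [1, 1, 2, 2, 3, 3]
```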



Slicing


What we studied so far does not Cover any of (torch.nn) Classes and Methods

  • All of the following things can appear in a computation graph, but not all of them are “trainable building blocks”.
  • We can divide them into three categories:
    • [1] TRAINABLE LAYERS
    • [2] OPERATIONAL BUILDING BLOCKS (part of the graph but not trainable)
    • [3] SUPPORT / META COMPONENTS (NOT part of the graph)

[1] TRAINABLE LAYERS

These absolutely become part of the real neural network structure.

[2] OPERATIONAL BUILDING BLOCKS

These do create computation graph connections, but they don’t hold weights, so they don’t show up in .parameters().

[3] SUPPORT / META COMPONENTS

These do not create operations in the graph. They are helpers.

A simple analogy (you'll never forget this). Think of building a robot:

The brain & muscles (trainable modules)

Conv, Linear, Embedding, Attention, LSTM…

The joints and wiring (ops without parameters)

ReLU, Pooling, Softmax, Dropout, reshape, matmul…

The toolbox (utilities)

ModuleList, prune, weight_norm, Lazy modules…


torch.gather
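A minimal sketch of what gather does along dim=1: out[i][j] = input[i][index[i][j]], i.e., the index tensor picks a column for each output position.

```python
import torch

x = torch.tensor([[1, 2],
                  [3, 4]])
idx = torch.tensor([[0, 0],
                    [1, 0]])

# out[i][j] = x[i][idx[i][j]]  (dim=1 means the index picks a column)
out = x.gather(dim=1, index=idx)
```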

Reproducibility (Seeds)

python
import torch
import random
import numpy as np

seed = 42

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if using multi-GPU

Floating Point Associativity atomicAdd

Floating-point addition is not “order independent”

For real numbers (math world), we have: (a+b)+c=a+(b+c)(a + b) + c = a + (b + c)

But for floating-point numbers, that is not always true because of rounding.

Example:
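A concrete float32 case where the grouping changes the answer: adding 1.0 to 1e8 is lost to rounding (the gap between adjacent float32 values near 1e8 is larger than 1), so the two groupings disagree.

```python
import torch

a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

left = (a + b) + c   # (1e8 - 1e8) + 1 = 1.0
right = a + (b + c)  # -1e8 + 1 rounds back to -1e8, so 1e8 + (-1e8) = 0.0
```

This is exactly why GPU atomicAdd, which accumulates in a nondeterministic order across threads, can give slightly different results from run to run.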