Contiguity (Row-Major Order, Fortran & MATLAB Column-Major Order, .stride, Traversing Memory to Print Tensor, data_ptr, What is Contiguous, Importance & Benefits, Common Operations for Contiguity & Non-Contiguity)
I stopped to fix index_select, then need to finish vdot, then atomicAdd.
Dimension Reordering Operations
Summary of Multiplication and Product Functions in PyTorch
| Multiplication Name | Function | Explanation | Symbol | Vector or Matrix |
|---|---|---|---|---|
| Tensor Multiplication | `.matmul`, `@` | Performs matrix multiplication or batch matrix multiplication | @ | Vectors/Matrices |
| Matrix Multiplication | `.mm` | Performs matrix multiplication only on 2D matrices | —— | 2D Matrices |
| Hadamard Product / Element-Wise Multiplication | `.mul`, `.multiply`, `*` | Performs element-wise multiplication | ⊙ | Vectors/Matrices |
| Dot Product / Inner Product / Scalar Product | `.dot`, `.inner` | `.dot` accepts vectors only; `.inner` also accepts matrices but contracts over the last dim only | · | Vectors/Matrices |
| Outer Product | `.outer` | —— | ⊗ | Vectors only |
| Kronecker Product | `.kron` | Generalization of the outer product from vectors to matrices | ⊗ | Vectors/Matrices |
| Cross Product | `.cross` | —— | × | Vectors only |
| Cartesian Product | `.cartesian_prod` | Returns a 2D tensor of all combination pairs (no multiplication) | × | Vectors only |
To use any of these operations, both tensors must be of the same data type.
- Even if one is float32 and the other is float64, PyTorch will raise a RuntimeError about mismatched dtypes.
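A minimal sketch of the failure and the usual fix (casting with `.to()`); the exact error message may vary by PyTorch version:

```python
import torch

a = torch.randn(2, 3)                        # float32 (default)
b = torch.randn(3, 2, dtype=torch.float64)   # float64

# a @ b  # RuntimeError: tensors must have the same dtype

c = a @ b.to(torch.float32)  # cast first, then multiply
```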
✅ Best Practice:
Use float32 unless you explicitly need:
- Mixed precision (float16) for speed
- Double precision (float64) for numerical stability
- Complex numbers for Fourier or signal ops
- More data types, like torch.bfloat16 or torch.qint8, are listed in the PyTorch documentation on data types.
Kernel Fusion: the primary benefit is saving memory bandwidth. Fused operations reduce the number of memory read/write cycles, leading to significant speedups.
- `addr` // outer product, then add to the matrix input (fused)
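A quick sketch of `torch.addr`, which fuses the outer product and the addition (it computes input + vec1 ⊗ vec2, with optional `beta`/`alpha` scaling):

```python
import torch

M = torch.zeros(4, 3)
v1 = torch.arange(1., 5.)   # size [4]
v2 = torch.arange(1., 4.)   # size [3]

# One fused call instead of torch.outer(v1, v2) followed by a separate add
out = torch.addr(M, v1, v2)  # M + v1 ⊗ v2, shape [4, 3]
```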
I still need to complete tensordot, cross, outer, kronecker, vdot, and so on…
torch.nn.Linear
torch.vdot
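This section is still a stub (see the to-do note above); as a minimal sketch, `torch.vdot` is a 1D dot product that conjugates its first argument, which only matters for complex dtypes:

```python
import torch

a = torch.tensor([1 + 2j, 3 - 1j])
b = torch.tensor([2 + 0j, 1 + 1j])

torch.vdot(a, b)  # conj(a) · b  (conjugates the first argument)
torch.dot(a, b)   # a · b        (no conjugation)
```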
.outer ⊗ (Outer Product)
- It’s very SIMPLE.
- It takes only 1-dimensional tensors.

- If `input` is a vector of size n and `vec2` is a vector of size m, then `out` must be a matrix of size (n × m).

```python
v1 = torch.tensor((1, 2, 3, 4))  # Size [4] --> treated as [4, 1]
v2 = torch.tensor((1, 2, 3))     # Size [3] --> treated as [1, 3]
torch.outer(v1, v2)              # Size [4, 3]
```

- If we have vectors $u$ (size n) and $v$ (size m), then $u \otimes v$ is a matrix with shape $(n \times m)$.
🔑 The outer product is the same as writing $u v^{\top}$ in matrix format. However, when working with 1D vectors, we typically omit the transpose for simplicity, since ⊗ already implies an outer product.
- To do the outer product using `matmul`, we can do the following:

```python
v1 = torch.tensor((1, 2, 3, 4)).reshape(4, 1)  # written in matrix format
v2 = torch.tensor((1, 2, 3)).reshape(3, 1)     # written in matrix format
outer = v1 @ v2.T                              # shape [4, 3]
```

```python
a = torch.tensor([[1], [2], [3]])
b = torch.tensor([[1], [2], [3]])

# INNER PRODUCT: aᵀ @ b → scalar
inner_product = torch.matmul(a.T, b)

# OUTER PRODUCT: a @ bᵀ → 3×3 matrix
outer_product = torch.matmul(a, b.T)
```
Outer product is symmetric up to transposition: $(u \otimes v)^{\top} = v \otimes u$.
Low-Rank Matrix Factorization
- It's a way to approximate a big matrix using two smaller matrices — so you save space, reduce noise, or learn meaningful patterns.
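A minimal sketch of this idea, using `torch.linalg.svd`: a rank-k approximation is a sum of k outer products of singular vectors, scaled by the singular values:

```python
import torch

M = torch.randn(100, 80)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)

k = 10  # keep the top-10 components
# Sum of k outer products: M ≈ Σ s_i * (u_i ⊗ v_i)
M_approx = sum(S[i] * torch.outer(U[:, i], Vh[i, :]) for i in range(k))

# Storage: 100*80 = 8000 values vs. k*(100 + 80 + 1) = 1810 values
print(torch.linalg.matrix_norm(M - M_approx))  # approximation error
```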
Use Cases:
- Constructing Covariance Matrices
- Remember that $\Sigma = \mathbb{E}\big[(x - \mu)(x - \mu)^{\top}\big]$, and each term $(x - \mu)(x - \mu)^{\top}$ is an outer product.
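A small sketch of this view: the sample covariance is the average of per-sample outer products (ignoring the n−1 bias correction here for simplicity):

```python
import torch

X = torch.randn(500, 3)   # 500 samples, 3 features
mu = X.mean(dim=0)
Xc = X - mu               # centered samples

# Average of outer products, one per sample
cov = sum(torch.outer(x, x) for x in Xc) / len(Xc)

# Same thing in one shot
cov_fast = Xc.T @ Xc / len(Xc)
print(torch.allclose(cov, cov_fast, atol=1e-5))  # True
```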
.kron ⊗ (Kronecker Product) (Matrix Direct Product) (Generalized Outer Product)
- The Kronecker product (denoted $\otimes$) is a generalization of the outer product from vectors to matrices.
- Given an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$, the Kronecker product $A \otimes B$ will be a matrix with a shape of $(mp) \times (nq)$.
```python
A = torch.tensor([[1, 2],
                  [3, 4]])
B = torch.tensor([[0, 5],
                  [6, 7]])
torch.kron(A, B)  # shape (4, 4)

mat1 = torch.eye(2)
mat2 = torch.arange(1, 5).reshape(2, 2)
torch.kron(mat1, mat2)  # block-diagonal: mat2 in the top-left and bottom-right blocks
```
The Kronecker product of two matrices A and B means: “Take every number in matrix A, and replace it with that number multiplied by all of matrix B.”
- In Matrix-Matrix Multiplication, we learned that each element of $C = AB$ is $c_{ij} = \sum_k a_{ik} b_{kj}$.
- In the Kronecker product, $(A \otimes B)$ is built from blocks: the $(i, j)$ block is $a_{ij} B$.
- It means: move to the block at position $(i, j)$ of the output, and fill it with $a_{ij}$ times the whole matrix $B$.
- ✳️ If one tensor has fewer dimensions —> PyTorch will automatically add extra dimensions (using `.unsqueeze()`) to the smaller one.

```python
a = torch.tensor([[1, 2]])   # shape: (1, 2)
b = torch.tensor([3, 4, 5])  # shape: (3,)
result = torch.kron(a, b)    # PyTorch treats b as shape (1, 3)
# Final result shape: (1*1, 2*3) = (1, 6)
```
Basic Properties
- It does not matter where we place multiplication with a scalar: $\lambda (A \otimes B) = (\lambda A) \otimes B = A \otimes (\lambda B)$
- Taking the transpose before carrying out the Kronecker product yields the same result as doing so afterwards: $(A \otimes B)^{\top} = A^{\top} \otimes B^{\top}$
- The Kronecker product is associative: $(A \otimes B) \otimes C = A \otimes (B \otimes C)$
- The Kronecker product is right-distributive: $(A + B) \otimes C = (A \otimes C) + (B \otimes C)$
- The Kronecker product is left-distributive: $A \otimes (B + C) = (A \otimes B) + (A \otimes C)$
- The product of two Kronecker products yields another Kronecker product: $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$
- The trace of the Kronecker product of two matrices is the product of the traces of the matrices: $\operatorname{tr}(A \otimes B) = \operatorname{tr}(A)\operatorname{tr}(B)$
- The trace of a square matrix is the sum of its diagonal elements.
Use Cases:
- Repeating a Pattern (Matrix Tiling)
- You’re building a large image or grid that follows a pattern — like tiles on a floor.
- Imagine a pattern like

```python
[[0, 1],
 [1, 0]]
```

and you want to repeat it across a 4×4 area, scaled differently in each region (brightness, weights, …).
- Think of it like: "Take this small checkerboard and copy it multiple times into a bigger grid, but scale each copy differently."

```python
# Tile pattern (e.g., black/white checkerboard)
tile = torch.tensor([[0, 1],
                     [1, 0]], dtype=torch.float32)

# Control pattern (e.g., brightness of each region)
pattern = torch.tensor([[1, 2],
                        [3, 4]], dtype=torch.float32)

# Kronecker product to scale and tile: each pattern entry scales one copy of the tile
big = torch.kron(pattern, tile)  # shape (4, 4)
```
- Combining Systems (Quantum / State Expansion)
- You have two small systems (like coin flips or binary states), and want to model the combined system.
- You want to simulate both together as one big system with 4 states (2 × 2).
- This is how quantum computing combines qubits.

```python
# 2-state systems (e.g., [1, 0] is "on", [0, 1] is "off")
sys_A = torch.tensor([[1.], [0.]])  # A is ON
sys_B = torch.tensor([[0.], [1.]])  # B is OFF

# Combined system (A ⊗ B): shape (4, 1), one entry per joint state
combined = torch.kron(sys_A, sys_B)
```
- Building Smart Weight Matrices (Machine Learning) 🧠
- You’re training a neural network, and one of your layers has a huge matrix of weights.
- You realize:
- It’s slow
- Takes too much memory
- But most of the values are based on a smaller pattern.
- You can use the Kronecker product to build that big matrix from small ones.

```python
# Base patterns
A = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.float32)
B = torch.tensor([[0.1, 0.2],
                  [0.3, 0.4]], dtype=torch.float32)

# Build a large structured weight matrix
W = torch.kron(A, B)  # W is 4x4

# Input to multiply (size must match)
x = torch.randn(4)

# Apply the layer
y = W @ x
```
- Efficient Neural Network Layer
- A paper called “Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters” discusses this idea.
- Instead of training normal FC layers, we will have a PHM (Parameterized Hypercomplex Multiplication) layer.
- Each layer will have the weight matrix $H = \sum_{i=1}^{n} A_i \otimes S_i$, which is a sum of Kronecker products, and such a construction reduces the number of parameters to roughly $1/n$.
- “Instead of learning 1000 values, I’ll learn 10 values and repeat them smartly to build the big thing.”
Manual Implementation of Kronecker Product

```python
def manual_kron(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    a_rows, a_cols = A.shape
    b_rows, b_cols = B.shape
    # Shape of the output: (a_rows * b_rows, a_cols * b_cols)
    C = torch.zeros((a_rows * b_rows, a_cols * b_cols), dtype=A.dtype)
    for i in range(a_rows):
        for j in range(a_cols):
            # Multiply the scalar A[i, j] by the full matrix B
            C[i*b_rows:(i+1)*b_rows, j*b_cols:(j+1)*b_cols] = A[i, j] * B
    return C

# Note that we loop over matrix A, and for every element of it, we generate a block
```
Fully Vectorized Kronecker Product (2D matrices only)
- (Side note on NumPy: `axes=([1, 2], [1, 2])` tells `np.tensordot` to sum over the 2nd and 3rd dimensions of both arrays.)
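A minimal vectorized sketch using broadcasting (my own reconstruction, since the original block was lost in export): insert size-1 axes so every element of A lines up against all of B, then collapse the block axes:

```python
import torch

def vectorized_kron(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    a_rows, a_cols = A.shape
    b_rows, b_cols = B.shape
    # (a_rows, 1, a_cols, 1) * (1, b_rows, 1, b_cols)
    #   -> (a_rows, b_rows, a_cols, b_cols), then merge the block axes
    blocks = A[:, None, :, None] * B[None, :, None, :]
    return blocks.reshape(a_rows * b_rows, a_cols * b_cols)

# Sanity check against the built-in
A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[0, 5], [6, 7]])
print(torch.equal(vectorized_kron(A, B), torch.kron(A, B)))  # True
```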
.tensordot (Tensor Dot Product) (Generalized Contraction) HARD
- Many PyTorch users get stuck when they have to move beyond simple 2D matrix multiplication (`torch.matmul` or `@`) into 3D, 4D, or 5D tensors.

But what happens when you are doing deep learning and you hit 4D image tensors? Suddenly, you find yourself desperately using `.view()`, `.permute()`, and `.transpose()` just to get dimensions to line up so you can use `matmul`. It's messy, error-prone, and hard to read six months later.

What if I told you there is a single function that handles multiplying complex tensors without ever needing to reshape them first? Today, we are mastering `torch.tensordot`.

- What is Tensor Contraction?
- Don't let the name scare you. "Contraction" just means we are choosing specific dimensions (axes) from two different tensors, multiplying the elements along those axes, and summing them up.
- Because we sum them up, those dimensions "contract" —> they disappear from the final output.
- `.tensordot` takes three main arguments: your first tensor, your second tensor, and, most importantly, the `dims` parameter.
- Basic syntax: `torch.tensordot(A, B, dims)`
- The magic happens in `dims`:
  - A single integer n: contract the last n axes of A with the first n axes of B.
  - A tuple of two lists: explicit axes from A and B, `dims=([List A], [List B])`.
    - List A: the indices of the dimensions in the first tensor you want to contract.
    - List B: the indices of the dimensions in the second tensor you want to contract against them.
Case 0: Dot Product `.dot`

```python
A = torch.tensor([1, 2, 3])  # (3,)
B = torch.tensor([4, 5, 6])  # (3,)
result = torch.tensordot(A, B, dims=1)  # contract the 3 from A with the 3 from B
print(result)  # tensor(32)
```

Case 1: 2D Matrix Multiplication `.mm`
```python
A = torch.randn(3, 4)
B = torch.randn(4, 5)
torch.tensordot(A, B, dims=1).size()  # torch.Size([3, 5])
```

Case 2: dims = 0 → Outer Product
- dims = 0 means: “do NOT contract anything.”
- No summation, just multiplication one by one.
```python
A = torch.randn(4)
B = torch.randn(5)
torch.tensordot(A, B, dims=0).size()  # torch.Size([4, 5]), same as a normal outer product

# If A = [1, 2, 3, 4]
# and B = [6, 7, 8, 9, 10]
# then the result will be:
# [[ 6,  7,  8,  9, 10],
#  [12, 14, 16, 18, 20],
#  [18, 21, 24, 27, 30],
#  [24, 28, 32, 36, 40]]
```
- Now let’s say I have A of size (3, 4) and B of size (5), and I call `torch.tensordot(A, B, dims=0)`. What will happen?
- Say A is `[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]`
- And B is `[1, 2, 3, 4, 5]`
- Since `dims=0` means no contraction, this is again an outer product, and the result has shape (3, 4, 5).
- We take the first row of A and perform an outer product with B, then the second row of A with B, and so on…
- So, real quick: if `A = torch.randn(3, 4); B = torch.randn(4, 5)`, then `torch.tensordot(A, B, dims=0)` will result in a 4-D tensor by taking the outer product of every element of A with every element of B, which gives a tensor of shape (3, 4, 4, 5).
- Even though both tensors have a matching dimension of size 4, `dims=0` treats them as distinct, independent axes. It does not attempt to align or broadcast them.
- 🔑 With `dims=0`, tensordot generalizes `.outer()`: every element of A is multiplied with every element of B, producing a tensor whose shape is the concatenation of A's shape and B's shape. Don't reach for `dims=([ ], [ ])` to get an outer product; the only correct way to get an outer product with tensordot is `dims=0`.
Case 3 — dimension size 1

```python
A = torch.tensor([[1, 2, 3]])  # shape (1, 3)
B = torch.tensor([[4, 5, 6]])  # shape (1, 3)
result = torch.tensordot(A, B, dims=1)
print(result)
```

- With `dims=1`, we would need to contract the last dimension of A (size 3) against the first dimension of B (size 1). The sizes do not match, so this raises a RuntimeError: contracted dimensions must be equal.
Case 4 — explicit axes (Matrix Multiplication)
- Here, we say: contract the last axis of A (axis 1) with the first axis of B (axis 0).
- It's like normal matrix multiplication.

```python
A = torch.randn(2, 3)
B = torch.randn(3, 2)
result = torch.tensordot(A, B, dims=([1], [0]))  # same as dims=1
print(result)  # Size: [2, 2]
```

Case 5 — explicit axes (Double Contraction) ($A : B$)
- Tensor contractions can be thought of as the higher-dimensional equivalent of matrix-matrix multiplications.
- The symbol ":" means double contraction (also called double dot product), as in $A : B$.
- Double contraction is the tensor-analogue of the dot product, but applied twice.
- Dot product = contract one index from each tensor.
- Double dot product = contract two indices from each tensor.
- A double dot product between two tensors of orders m and n will result in a tensor of order $m + n - 4$, which is `A.dim() + B.dim() - 4`, because 2 axes have been removed from each tensor.

We are not doing matrix multiplication here per se; all we want is tensor contraction. This moves our thinking back to the dot product: vectors of matching lengths that we multiply element by element and sum to get a scalar.
Let’s understand through an example:
- If $A$ is of size $(4, 3, 2)$, which is 4 blocks of $3 \times 2$, and $B$ is of size $(2, 3, 5)$, which is 2 blocks of $3 \times 5$, let's say we want to do `torch.tensordot(A, B, dims=([2, 1], [0, 1]))`.
- You might think: well, the matrices are organized very well for Matrix Multiplication, where we want rows × columns, i.e., $(3 \times 2)$ blocks vs. $(3 \times 5)$ blocks.
- This is the biggest point of confusion with tensor contractions. We are not thinking of this as Matrix Multiplication, but rather as Vector Dot Products.
- We requested contracting using this mapping:

| A axis | B axis | Check |
|---|---|---|
| 2 (size 2) | 0 (size 2) | ✓ matches |
| 1 (size 3) | 1 (size 3) | ✓ matches |

- [IMPORTANT] But how does tensordot actually execute this contraction?
- To perform this operation efficiently, PyTorch (and NumPy) follows a three-step process: Permute → Reshape → Matrix Multiply.
  - You need to study `permute` to understand how it differs from `.reshape`/`.view`.
- Our goal is to have something like $(4, 6)$ @ $(6, 5)$, so applying `@` to them is easy.
- The answer that immediately comes to mind is: let's just flatten —> `A.reshape(A.shape[0], -1); B.reshape(-1, B.shape[-1])`. But it's not that simple, and let's understand why this needs `permute` first.
A = torch.randint(low=0, high=10, size=(4, 3, 2)) # integers 0–9 B = torch.randint(low=0, high=10, size=(2, 3, 5)) # integers 0–9 result = torch.tensordot(A, B, dims = ([2, 1], [0, 1])) # Let's say A : (4, 3, 2) tensor([[[0, 1], [5, 7], [9, 9]], [[2, 7], [3, 9], [4, 0]], [[2, 8], [4, 4], [7, 4]], [[1, 0], [5, 4], [8, 4]]]) B: (2, 3, 5) tensor([[[2, 7, 1, 8, 2], [9, 3, 6, 7, 3], [3, 0, 5, 4, 8]], [[3, 9, 1, 5, 1], [2, 6, 7, 7, 5], [5, 5, 1, 5, 2]]]) - In a normal 2D matrix multiplication, we loop
rowbyrowfrom A, andcolumnbycolumnfrom B, then in each iteration, we perform the dot product operation to get one cell value. - We have:
A.shape = (4, 3, 2) # think A[a,j, i]B.shape = (2, 3, 5) # think B[i, j, b]
- We want to perform:
result = torch.tensordot(A, B, dims=([2, 1], [0, 1]))
- We know the output should be of size (4, 5), as two dimensions has been contracted from each.
- Let’s trace
result[0,0]explicitly:a = 0(The cellb = 0i ∈ {0, 1}j ∈ {0, 1, 2}
- Now, since we are in need to double contract, the formula is sum over (i) then over (j) because we are doing
- Let’s perform the
The usefulness of permute and reshape functions is that they allow a contraction between a pair of tensors (which we call a binary tensor contraction) to be recast as a matrix multiplication.
Batch matrix multiplication is a special case of a tensor contraction.
```python
# 1. Permute: bring the contracted axes of both tensors into the same (j, i) order
A_perm = A.permute(0, 1, 2)  # (4, 3, 2), same as the original
B_perm = B.permute(1, 0, 2)  # (3, 2, 5), so its (j, i) axes line up with A's
```
- Flatten (Reshape) basically groups the double contraction into a one-shot operation.
- PyTorch effectively flattens the tensors so the contracted dimensions are grouped on the "inside."
- Flattening/Reshaping A: we keep dim 0 and combine dims 1 and 2 → (4, 6).
- Flattening/Reshaping B: we combine dims 0 and 1 and keep dim 2 → (6, 5).
- Now just multiply, and we get the final result of shape (4, 5).
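Putting steps 2 and 3 into code (a sketch continuing the permute snippet above):

```python
# 2. Reshape: group the contracted axes on the "inside"
A_flat = A_perm.reshape(4, 3 * 2)  # (4, 6)
B_flat = B_perm.reshape(3 * 2, 5)  # (6, 5)

# 3. One matrix multiplication
manual = A_flat @ B_flat           # (4, 5)

print(torch.equal(manual, torch.tensordot(A, B, dims=([2, 1], [0, 1]))))  # True
```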
```python
# Final result: (4, 5)
tensor([[134, 111, 134, 170, 141],
        [ 82, 140, 110, 151,  97],
        [113, 142, 101, 160, 108],
        [ 99,  66, 103, 123, 109]])
```

`.tensordot` generalizes:
- inner product `.dot`
- outer product `.outer`
- matrix multiply `.mm`
- batch matmul `.matmul`
- multi-axis contraction `.tensordot`
- Einstein summation patterns `.einsum`
What is the full geometric intuition behind tensor contractions???
Tensor contraction can happen between a 3D and a 2D tensor (the operands don't have to be of the same order).
https://www.youtube.com/watch?v=RxbL5i8gczg (Log from Tensor Contraction Section).
tensordot vs einsum
The Rule of Thumb
- Default to `einsum`: use it for 95% of your code (modeling code, layers, loss functions). It is self-documenting and handles permutations automatically.
- For readability & documentation: the equation `bij,bjk->bik` documents itself. You can instantly see it is a batch matrix multiplication. The equivalent `tensordot` requires you to mentally map indices to axis numbers.
- `einsum` supports 3+ tensors; `tensordot` is binary.
- Use `tensordot` when you know the contraction pattern and want the fastest GEMM-like (General Matrix Multiply) path.
- Use `tensordot` when you want guaranteed "single matmul" behavior, because a single `einsum` may be decomposed internally into several matmuls, plus adds, plus permutes, plus buffer allocations, …
  - This is because tensordot's behavior is always:
    1. Permute A and B so that the contracted axes are contiguous (if needed).
    2. Reshape both into 2D matrices.
    3. Do one matrix multiplication.
- If you know your contraction is really "one big GEMM", `tensordot` makes that structure explicit and reduces the risk of the framework doing something surprising.
- `tensordot` is constrained —> it only contracts equal-length dimensions.
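A small side-by-side sketch of the same 2D contraction in both spellings:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)

out_td = torch.tensordot(A, B, dims=([1], [0]))  # axes by number
out_es = torch.einsum('ij,jk->ik', A, B)         # axes by name

print(torch.allclose(out_td, out_es))  # True
```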
.repeat (Whole Blocks)
- Similar to `numpy.tile()`
.repeat_interleave (Individual Elements)
- Similar to `numpy.repeat()`
- Repeats elements of a tensor.
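A quick sketch contrasting the two:

```python
import torch

x = torch.tensor([1, 2, 3])

# .repeat tiles whole blocks (like numpy.tile)
print(x.repeat(2))             # tensor([1, 2, 3, 1, 2, 3])

# .repeat_interleave repeats individual elements (like numpy.repeat)
print(x.repeat_interleave(2))  # tensor([1, 1, 2, 2, 3, 3])
```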
Slicing
What we studied so far does not cover any of the torch.nn classes and methods.
- All of the following things can appear in a computation graph, but not all of them are “trainable building blocks”.
- We can divide them into three categories:
- [1] TRAINABLE LAYERS
- [2] OPERATIONAL BUILDING BLOCKS (part of the graph but not trainable)
- [3] SUPPORT / META COMPONENTS (NOT part of the graph)
[1] TRAINABLE LAYERS
These absolutely become part of the real neural network structure.
[2] OPERATIONAL BUILDING BLOCKS
These do create computation graph connections, but they don’t hold weights, so they don’t show up in .parameters().
[3] SUPPORT / META COMPONENTS
These do not create operations in the graph. They are helpers.
- The brain & muscles (trainable modules): Conv, Linear, Embedding, Attention, LSTM…
- The joints and wiring (ops without parameters): ReLU, Pooling, Softmax, Dropout, reshape, matmul…
- The toolbox (utilities): ModuleList, prune, weight_norm, Lazy modules…
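A minimal sketch of the distinction between [1] and [2]: only modules that hold weights show up in `.parameters()`:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),  # [1] trainable layer: holds weight & bias
    nn.ReLU(),        # [2] operational block: in the graph, but no weights
    nn.Linear(8, 2),  # [1] trainable layer
)

# Only the Linear layers contribute parameters
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# 0.weight (8, 4), 0.bias (8,), 2.weight (2, 8), 2.bias (2,)
```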
torch.gather
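This section is still a stub; as a minimal sketch of the semantics, for `dim=1` the rule is `out[i][j] = input[i][index[i][j]]`:

```python
import torch

src = torch.tensor([[1, 2],
                    [3, 4]])
idx = torch.tensor([[0, 0],
                    [1, 0]])

# out[i][j] = src[i][ idx[i][j] ]  (dim=1 picks along columns)
print(torch.gather(src, dim=1, index=idx))
# tensor([[1, 1],
#         [4, 3]])
```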
Reproducibility (Seeds)
```python
import torch
import random
import numpy as np

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if using multi-GPU
```

Floating Point Associativity & atomicAdd
Floating-point addition is not “order independent”
For real numbers (math world), we have $(a + b) + c = a + (b + c)$.
But for floating-point numbers, that is not always true because of rounding.
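A tiny demonstration (the classic big-plus-small cancellation case):

```python
import torch

a = torch.tensor(1e20, dtype=torch.float32)
b = torch.tensor(-1e20, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- b + c rounds back to -1e20
```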
