- There are 2 main approaches / schools to statistical inference, frequentist and Bayesian, differing in their interpretation of uncertainty.
- Each one will noisily argue its respective benefits.
- Frequentist Approach
- Dominates the medical literature and consists of null hypothesis significance testing (think P values and confidence intervals).
- Bayesian Approach
- Governed by Bayes’ theorem.
The fundamental difference between these 2 schools is their interpretation of uncertainty and probability:
- The frequentist approach assigns probabilities to data, not to hypotheses.
- The Bayesian approach assigns probabilities to hypotheses, incorporates prior knowledge into the analysis, and updates hypothesis probabilities as more data become available.
Introduction to Frequentist Statistics
- Frequentist statistics is all about probability in the long run. (IMPORTANT)
- For example, the probability of getting heads when flipping a coin in the long run is 0.5; that’s because if we flip the coin many times, we would expect to see heads 50% of the time.
- whereas if we had flipped the coin only a few times we could reasonably expect to observe a different distribution (eg, all heads) just by chance.
- Frequentist statistics focuses on the probability of events occurring over a large number of repetitions.
- Example: A fair coin has a 50% probability of landing heads because, in a large number of flips, we expect approximately half to be heads.
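A quick way to see this "long run" idea is to simulate it. A minimal sketch (the simulation is an illustration, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate fair coin flips: 1 = heads, 0 = tails
for n_flips in (10, 100, 10_000):
    flips = rng.integers(0, 2, size=n_flips)
    print(f"{n_flips:>6} flips -> proportion of heads = {flips.mean():.3f}")

# With 10 flips the proportion can easily be far from 0.5 just by chance;
# with 10,000 flips it settles very close to 0.5 (the long-run frequency).
```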
Sampling Error and Hypothesis Testing
- Hypothesis testing is defined as a detailed protocol / set of steps we follow to make a decision regarding the population by examining a sample from that population.
- If we have a dataset, this dataset is always considered a sample — any single dataset is considered one realization from a population of possible datasets.
- Each dataset will have sampling error. What is that?
- Sampling error is the difference between the characteristics of a sample and the characteristics of the entire population from which that sample was drawn.
- Sampling error arises because each dataset will vary slightly due to random chance.
- Because a sample only represents a portion of the population, it's very unlikely that it will perfectly mirror the whole population.
- Very unlikely = not expected at all.
- This discrepancy is the sampling error.
- The smaller the sample size, the greater the potential for sampling error. A small sample is more likely to be affected by outliers or unusual data points.
- Given the dataset, and acknowledging there is always sampling error, the goal is to make inferences about the underlying population based on the observed sample data.
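To make sampling error concrete, here is a small simulation sketch; the "blood pressure" population and its numbers are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: systolic blood pressure with true mean ~120
population = rng.normal(loc=120, scale=15, size=1_000_000)
true_mean = population.mean()

# Draw samples of different sizes and compare their means with the population mean
for n in (10, 100, 1000):
    sample = rng.choice(population, size=n, replace=False)
    error = sample.mean() - true_mean
    print(f"n={n:>4}: sample mean = {sample.mean():6.2f}, sampling error = {error:+.2f}")

# Smaller samples tend to land farther from the population mean,
# i.e., they show larger sampling error.
```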
Frequentist Inference Steps
You can read the details of Hypothesis Testing here:
1. Assume a Null Hypothesis (Start with "No Effect")
Before looking at any data, we assume that nothing is happening.
Example: Let’s say we’re testing a new medicine to see if it helps people live longer. The null hypothesis would be: “This medicine doesn’t help at all — it has no effect on survival.”
2. Collect Data (Run Your Experiment)
Now, we give the medicine to one group of patients and don’t give it to another group (maybe they get a placebo). Then we measure survival rates.
3. Analyze the Data (Check What Happened)
We look at the results. Maybe the medicine group lived longer. Now the big question is: “If the null hypothesis is true (medicine has no effect), how surprising are these results?”
4. Ask: How Likely Is This Result by Pure Chance?
Let’s imagine flipping a coin 100 times. Getting exactly 50 heads is expected. But if you get 90 heads, you’d be like: “Hmm… maybe this isn’t a fair coin.”
Same idea here: If the survival difference is too big to just be luck, then we say: “Whoa, this data would be super rare if the medicine had no effect.”
But it did happen in our real data, so in the next step we change our mind about the null hypothesis and reject it.
5. Reject the Null Hypothesis (Because It’s Too Surprising)
If the difference is big enough (statistically speaking), we say: “This result is unlikely under the ‘no effect’ idea. So let’s reject that idea.”
That doesn’t 100% prove the medicine works, but it gives strong evidence that it probably does.
In plain terms: we are saying that the result that came out of my dataset would be something extremely rare if the null hypothesis really were the reality. But since it is supposedly that rare, yet it clearly happened, and happened on a dataset that is not small, we need to change our mind about reality and say this is not a rare thing at all; on the contrary, the most likely option is that the alternative is the new truth.
The Role of the p-value
You might’ve heard this word tossed around: the p-value.
- In plain terms: we take the null hypothesis as the current reality, and our dataset turned out to be very rare (a very big surprise) compared with that reality. If we look at the probability of getting data like ours, which contradicts the null hypothesis, we find that probability is very small; suppose we compute it as 0.0001, meaning that most of the time the data and H0 should look alike. Yet despite that tiny probability, our dataset still came out this different from the null hypothesis. That probability is called the p-value.
- It’s a number that tells you how surprising your result is under the null hypothesis.
- The p-value is a conditional probability: P(data at least as extreme as observed | H0). The p-value is called significant when it falls below a pre-chosen significance level α (commonly 0.05).
So if your P-value is small (like 0.001), it’s saying:
“This data would be super rare if there were no real effect. Kinda suspicious, right?”
“This data is very unlikely (unexpected) under the H0. Maybe the null is wrong! We should reject H0 and adopt H1 as the reality”
If your P-value is large (like 0.46), it’s saying:
“Hmm, this could easily happen even if the null is true — nothing weird here.”
“This kind of data happens all the time, even when nothing's going on. Not suspicious.”
Example 1
You’re testing a cancer treatment and find a huge survival difference between the two groups. “If the treatment actually does nothing, there's only a 0.1% chance you'd get data like this by luck. That’s rare.”
🎯 So: It suggests that maybe the treatment does something. You reject the null.
Example 2
Now you run a different study and you find almost no difference in survival. You calculate the p-value and get:
P = 0.46, which means:
“If there’s no real treatment effect, there’s a 46% chance you’d get data like this. Nothing surprising.”
So, it’s like:
🤷 “Yeah, this result could easily happen if there’s no effect. Doesn’t mean there isn’t one, just means your data don’t stand out.”
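Putting steps 1-5 and the two examples together, a hedged sketch using SciPy's two-sample t-test; the survival numbers are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical survival times (months); the treatment shifts the mean upward
placebo   = rng.normal(loc=24, scale=6, size=80)
treatment = rng.normal(loc=28, scale=6, size=80)

# H0: no difference in mean survival between the two groups
t_stat, p_value = stats.ttest_ind(treatment, placebo)

alpha = 0.05
print(f"p-value = {p_value:.4f}")
if p_value < alpha:
    print("Data this extreme would be rare under H0 -> reject H0")
else:
    print("Data are compatible with H0 -> fail to reject H0")
```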
🧠 What the P-value is NOT
This is where people mess up — especially in medicine.
Let’s bust some myths:
- ❌ It’s NOT the chance that your hypothesis is true or false.
The p-value does not address the probability that H0 is true.
A p-value of 0.001 doesn’t mean your treatment “works with 99.9% certainty.”
The p-value measures how surprising your data are if there is no effect.
- ❌ It’s NOT a measure of how important the effect is.
A tiny P-value doesn’t mean the effect is huge — it could be a tiny difference in a giant dataset.
- ❌ A big P-value doesn't prove the null is correct!
It just means your data aren’t surprising under H0 — but that could be because your sample was too small. So it only suggests the null might still be okay.
A real effect can exist even when the evidence is not strong enough to reject H0.
It is important to understand that rejecting the null hypothesis does not prove the alternative hypothesis is true, but that the null hypothesis is unlikely.
Probability statements (eg, P values) can only be made about the data, not about hypotheses or parameters (ie, the treatment effect).
Common P-Hacking Techniques
- Selective Reporting (Cherry-Picking):
- Researchers may choose to report only the results that achieve statistical significance (p < 0.05), while ignoring or suppressing non-significant findings.
- Choosing among several outcome variables measured, reporting only those that yield significant results.
- This creates a biased view of the data.
- Optional Stopping (Early Stopping)
- Researchers continuously check the data and stop collecting once a significant p-value is obtained, rather than adhering to a pre-determined sample size (a simulation of this appears after this list).
- Variable Manipulation:
- This could be changing the way variables are defined or transformed during analysis, until a desired p-value is reached.
- Post-hoc Hypothesizing (HARKing - Hypothesizing After Results are Known)
- Presenting a hypothesis developed after data analysis as if it was predetermined.
- Data Dredging/Fishing (Flexible Data Analysis) (Fishing Expeditions):
- Trying multiple statistical analyses or transforming variables (e.g., log-transformations, dropping outliers) until a significant result emerges.
- It includes
- Splitting data into various subgroups (age groups, gender, etc.)
- Trying different statistical tests —> Switching tests (e.g., from parametric to non-parametric) after observing the data, depending on what yields significance.
- Outlier Removal (Selective Exclusion): Including or excluding certain data points. —> Selectively removing outliers that cause the p value to be higher than the desired threshold. While removing outliers can be valid in some cases, it can also be used to intentionally lower the p value.
- Manipulating Variables / Features / Covariates (Input):
- Adding or removing covariates selectively until achieving significant results.
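To see why optional stopping inflates false positives, here is a small simulation sketch under the null (no real effect at all); the group sizes and peeking schedule are arbitrary assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_studies = 0.05, 1000

def one_study(peek: bool) -> bool:
    """One 'study' with no true effect; optionally peek every 10 subjects per arm."""
    a, b = rng.normal(size=100), rng.normal(size=100)
    if not peek:
        return stats.ttest_ind(a, b).pvalue < alpha
    # Optional stopping: test repeatedly and stop at the first 'significant' p-value
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < alpha
               for n in range(10, 101, 10))

fixed  = np.mean([one_study(peek=False) for _ in range(n_studies)])
peeked = np.mean([one_study(peek=True)  for _ in range(n_studies)])
print(f"False-positive rate, fixed n:           {fixed:.3f}")   # ~0.05
print(f"False-positive rate, optional stopping: {peeked:.3f}")  # noticeably higher
```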
Mitigating P-Hacking:
- Pre-registration: Researchers should clearly pre-register their study design and analysis plan before collecting data, which reduces the flexibility to manipulate results.
- Transparency: Researchers should be transparent about their data analysis methods and report all results, even those that are not statistically significant.
- Emphasis on Effect Size and Confidence Intervals: Rather than relying solely on p-values, researchers should also report effect sizes and confidence intervals, which provide a more complete picture of the findings.
- Replicate findings with independent data.
- Multiple Testing Correction: Adjust for multiple comparisons using techniques like Bonferroni corrections or False Discovery Rate (FDR).
- Use transparent reporting standards (e.g., CONSORT, PRISMA, Registered Reports).
Probability of Getting One False Positive
- You have 20 datasets, and a significance level of 0.05 for rejecting H0.
- For a single test, the probability of avoiding a false positive is 1 - 0.05 = 0.95. Across 20 independent tests, that becomes 0.95^20 ≈ 0.358, so there is only about a 36% chance that you manage to avoid a false positive across all 20 tests.
- Therefore, the probability of getting at least one false positive across 20 tests is 1 - 0.358 ≈ 0.642 (or about 64%).
- If you have 100 datasets, the chance of at least one false positive approaches certainty (>99%).
- To make sense of this number, remember that this probability is across multiple tests, not per test.
- For example: flipping a fair coin, the chance of not getting heads in 1 flip is 0.5, but the chance of not getting heads in any of 10 flips is 0.5^10 ≈ 0.00098, which is less than 0.1%.
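The same arithmetic in a few lines of Python (just reproducing the numbers above):

```python
alpha = 0.05

for m in (1, 20, 100):
    p_all_clean = (1 - alpha) ** m        # every test avoids a false positive
    p_at_least_one = 1 - p_all_clean
    print(f"{m:>3} independent tests: P(>=1 false positive) = {p_at_least_one:.3f}")

# 1 test    -> 0.050
# 20 tests  -> 0.642  (only ~0.358 chance of avoiding all false positives)
# 100 tests -> 0.994
```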
Multiple Testing Correction / Multiple Comparisons Problem
- We said that we might consider a result statistically significant if it is very unlikely to happen under the assumption that H0 is true. The probability of seeing data like this should be very rare, so we pick a small cut-off such as 0.05 or 0.001.
- When we find a dataset with a p-value that low, we say we should start accepting H1 as the reality.
- For each idea, you say:
“If the p-value is less than 0.05, I’ll consider it a significant result!”
- How to manipulate
- Now consider you are a manipulator. You typically set a significance level of 0.05, often denoted as alpha (α), which means there is a 5% chance of flagging a dataset as disagreeing with H0 even when H0 is true.
- In your research, you want to prove something, so you have 20 different datasets. 5% chance is 1 dataset out of 20.
- You try 19 datasets, and they don’t provide small p-value. Nothing useful here to report.
- You try the last dataset, and you find a small p-value of 0.00125.
- You ignore the results from the 19 datasets, and you report: “hey everyone, we actually found that H0 is incorrect, we should reject H0 and go to H1.”
- What you did in the previous step is called Type I Error (False Positive). You rejected H0 but this is not the reality. The reality is that you already found 19 datasets supporting H0 and you ignored them.
- Definition of Type I Error: You say there's an effect (e.g., "Drug A works better than Drug B") when in reality, there is no effect.
- We can protect ourselves from False Positives using (1) Bonferroni Correction (2) False Discovery Rate (FDR)
Bonferroni Correction (easy but strict)
- It controls the family-wise error rate (FWER): the probability of flagging at least one true H0 as significant across all the tests.
🔢 How it works:
- Take your regular significance level (like 0.05).
- Divide it by the number of tests.
For example, with 10 tests: 0.05 / 10 = 0.005.
So now, to call something "significant," the p-value must be less than 0.005 — much more strict. This helps prevent false positives.
✅ Good: Simple and strong protection against false positives
❌ Bad: Can be too strict and miss real effects (false negatives)
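A minimal sketch of the Bonferroni rule; the p-values reuse the 10-drug example given below, so they are illustrative rather than real data:

```python
# p-values from the 10-drug example below
p_values = [0.001, 0.004, 0.009, 0.010, 0.020,
            0.030, 0.040, 0.060, 0.080, 0.090]
alpha = 0.05

# Bonferroni: divide the significance level by the number of tests
threshold = alpha / len(p_values)          # 0.05 / 10 = 0.005

significant = [p for p in p_values if p < threshold]
print(f"Bonferroni threshold = {threshold}")   # 0.005
print(f"Significant p-values: {significant}")  # only 0.001 and 0.004 survive
```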
False Discovery Rate (FDR) (smarter and more balanced)
- In many studies, especially those involving many tests (like testing thousands of genes), it's more useful to capture many true positives—even if that means allowing a small percentage of false positives.
- Scientists often want to test whether any of thousands of genes show significantly different activity with respect to an outcome of interest.
- The idea is that a few false alarms are acceptable if you don't miss most of the true effects.
- FDR is the fraction of false positives among all features flagged as positive
- Controlling FDR: Instead of trying to eliminate all false positives (as in FWER) (which can be too strict and reduce your ability to find true effects), we control the False Discovery Rate. This means we accept that a small, predetermined fraction (e.g., 5%) of our discoveries might be false.
- This method basically adjusts the critical values for each test depending on the number of tests and their rank.
- The new critical values are called (Q-Values)
- FDR lets you detect more real effects than strict methods like Bonferroni.
- Great for situations where you're testing many hypotheses (e.g. 100s or 1000s).
- Tolerates Some False Positives, But Keeps Them in Check
- Instead of trying to eliminate all false discoveries, it limits the proportion of them.
- This makes it more practical when some mistakes are acceptable.
- The Bonferroni correction is very strict, so it flags fewer results as significant and can miss many real effects.
- FDR gives you a middle ground between being too strict and too loose.
- FDR Scales well with large datasets: Especially useful in fields like genomics, neuroscience, and machine learning, where you run tons of tests.
- There are several methods to control FDR and one of the popular methods is Benjamini-Hochberg (BH) procedure.
- It assumes the tests are independent
- Usually we set the FDR to 5%, which means we accept that, on average, about 5% of the results flagged as positive are false positives.
- It’s okay if some are wrong, as long as the rate of wrong ones stays low (like 5%).
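In symbols (a standard way of writing the bullet above, with $V$ = number of false positives and $R$ = total number of tests flagged as significant):

$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R,\,1)}\right] \le q$$

Controlling the FDR at $q = 0.05$ means that, on average, at most 5% of the flagged results are false positives.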
Example
- You're a researcher testing 10 different drugs to see if they help with memory. You do 10 independent tests and get the following p-values:
| Drug | P-value | Rank (i) |
| A | 0.001 | 1 |
| B | 0.004 | 2 |
| C | 0.009 | 3 |
| D | 0.010 | 4 |
| E | 0.020 | 5 |
| F | 0.030 | 6 |
| G | 0.040 | 7 |
| H | 0.060 | 8 |
| I | 0.080 | 9 |
| J | 0.090 | 10 |
- Let’s say we want to control the FDR at 0.05 (that’s our q).
- Sort the p-values
- Assign a rank to each p-value
- Compute the BH threshold for each test
- The threshold for the test with rank i is (i / m) × q, where m is the number of tests and q is the chosen FDR level. Only i changes!
BH threshold = (i / m) × q = (i / 10) × 0.05
- Find the largest rank i where the p-value ≤ its threshold; all tests up to and including that rank are declared significant.
- So under BH, you declare Drugs A through F (6 total) as significant.
Intuition
- Early in the list, only the truly strong effects show up (very small p-values), so we apply a very strict threshold; if a test passes, we move down the list.
- As we go further down, the threshold increases: having already seen a bunch of strong results earlier, we can afford to trust the later ones a bit more because others have already been accepted.
- The more significant results you have early, the more lenient you can be later — it’s like gaining confidence.
- This gives you more discoveries without letting false ones dominate.
- Essentially, the procedure looks at the ordered p-values and finds the "cut-off point" where the p-values start becoming too large relative to their rank to maintain the desired overall proportion of false discoveries (q).
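A minimal sketch of the BH procedure applied to the ten p-values from this example (the full threshold-by-threshold comparison is tabulated below):

```python
import numpy as np

p_values = np.array([0.001, 0.004, 0.009, 0.010, 0.020,
                     0.030, 0.040, 0.060, 0.080, 0.090])
q, m = 0.05, len(p_values)

order = np.argsort(p_values)                # already sorted here, but be safe
sorted_p = p_values[order]
thresholds = (np.arange(1, m + 1) / m) * q  # (i/m) * q for i = 1..m

# BH rule: find the LARGEST rank i with p_(i) <= (i/m)*q;
# every test up to and including that rank is declared significant.
passing = np.where(sorted_p <= thresholds)[0]
k = passing.max() + 1 if passing.size else 0
print(f"Significant: the {k} smallest p-values -> {sorted_p[:k]}")  # 6 of them (A..F)
```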
| Drug | P-value | Rank (i) | BH Threshold (i/10 × 0.05) | Compare | Significant? |
| A | 0.001 | 1 | 0.005 | 0.001 ≤ 0.005 | ✅ Yes |
| B | 0.004 | 2 | 0.010 | 0.004 ≤ 0.010 | ✅ Yes |
| C | 0.009 | 3 | 0.015 | 0.009 ≤ 0.015 | ✅ Yes |
| D | 0.010 | 4 | 0.020 | 0.010 ≤ 0.020 | ✅ Yes |
| E | 0.020 | 5 | 0.025 | 0.020 ≤ 0.025 | ✅ Yes |
| F | 0.030 | 6 | 0.030 | 0.030 ≤ 0.030 | ✅ Yes |
| G | 0.040 | 7 | 0.035 | 0.040 > 0.035 | ❌ No |
| H | 0.060 | 8 | 0.040 | 0.060 > 0.040 | ❌ No |
| I | 0.080 | 9 | 0.045 | 0.080 > 0.045 | ❌ No |
| J | 0.090 | 10 | 0.050 | 0.090 > 0.050 | ❌ No |
Power analysis to avoid p-hacking
- Deciding the required sample size in advance with a power analysis removes the temptation for optional stopping, because the stopping rule is fixed before any data are collected.
Confidence Intervals — A Better Way to Think About Results
Reporting confidence intervals can improve the interpretation of results compared with a P value alone and can give information on the size and direction of an effect [6]. A 95% confidence interval tells us that if we were to repeat the experiment over and over (remember, frequentist statistics are long run), 95% of the computed confidence intervals would contain the true mean [7]. This is different from saying there is a 95% chance the true mean lies within the interval, because frequentist statistics cannot assign probabilities to parameters — the true mean either lies within the interval or it does not [8].
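A sketch of the "repeat the experiment over and over" interpretation: simulate many experiments from a known population and count how often the computed 95% CI actually contains the true mean. The population parameters are arbitrary assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sigma, n, n_experiments = 120, 15, 50, 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, sigma, size=n)
    # 95% confidence interval for the mean, based on the t distribution
    low, high = stats.t.interval(0.95, df=n - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    covered += (low <= true_mean <= high)

print(f"Fraction of CIs containing the true mean: {covered / n_experiments:.3f}")  # ~0.95
```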
Introduction to Bayesian Statistics
- It’s named after the Reverend Thomas Bayes, whose theorem describes a method to update probabilities based on data and past knowledge.
- IMPORTANT: In contrast to the frequentist approach, parameters and hypotheses are seen as probability distributions and the data as fixed.
- This idea is more intuitive because generally the data we collect are the only dataset we have, so it does not necessarily make sense to perform statistical analysis assuming it is one of many potential datasets (as frequentists do), unless we genuinely have many datasets, as in genomics.
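For reference, Bayes' theorem is P(H | D) = P(D | H) · P(H) / P(D): the posterior belief in a hypothesis H is the prior belief updated by how well H explains the data D. A tiny conjugate (Beta-Binomial) updating sketch; the coin example and the prior are illustrative assumptions, not from the original notes:

```python
from scipy import stats

# Prior belief about a coin's probability of heads: Beta(2, 2), mildly centred on 0.5
a_prior, b_prior = 2, 2

# Observed data: 7 heads in 10 flips
heads, flips = 7, 10

# Conjugate update: posterior is Beta(a_prior + heads, b_prior + tails)
a_post, b_post = a_prior + heads, b_prior + (flips - heads)
posterior = stats.beta(a_post, b_post)

print(f"Posterior mean = {posterior.mean():.3f}")              # updated belief about P(heads)
print(f"95% credible interval = {posterior.interval(0.95)}")   # the Bayesian analogue of a CI
```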


✅ Frequentist methods dominate when:
- Large sample sizes:
- Frequentist methods shine with lots of data.
- Confidence intervals and p-values become very reliable.
- Standard, well-established problems:
- If you’re doing t-tests, ANOVA, regression, etc., with no need for prior knowledge, frequentist tools are the go-to.
- No reliable prior information is available:
- Frequentist methods don’t require priors (which can be subjective in Bayesian methods).
- Computational simplicity is key:
- Frequentist approaches are often less computationally intensive than full Bayesian methods.
- Regulatory or academic requirements:
- Many fields (e.g., clinical trials) are traditionally grounded in frequentist methodology.
✅ Bayesian methods dominate when:
- Small or limited data:
- Prior information can help stabilize estimates when data is scarce.
- You have meaningful prior knowledge:
- Bayesian inference allows you to formally incorporate expert knowledge or historical data.
- You need probabilistic interpretation:
- Bayesian outputs (like "there’s a 95% chance the parameter is in this range") are easier to interpret intuitively than frequentist confidence intervals.
- Adaptive learning or online updating:
- Bayesian methods naturally update beliefs as new data comes in (great for real-time or sequential decision-making).
- Hierarchical or complex models:
- Bayesian frameworks handle multi-level or hierarchical models elegantly, where frequentist approaches can get messy.
- Decision-making under uncertainty:
- Bayesian approaches align well with decision theory, especially when actions depend on the probability of outcomes.
Frequentist vs Bayesian: Master Comparison
| Category | Frequentist | Bayesian |
| View of Parameter θ | Fixed but unknown | Random variable |
| View of Data | Random | Fixed (after observing) |
| Probability Means | Long-run frequency | Degree of belief |
| Estimation | - Point Estimate (MLE) - Confidence Interval | - Posterior Distribution - Credible Interval |
| Parameter Inference | Confidence Intervals (CIs) | Credible Intervals |
| Hypothesis Testing | - Null vs Alternative Hypotheses - p-values - t-tests, ANOVA | - Posterior Probabilities - Bayes Factors - Decision rules |
| Prior Knowledge | Not used | Required (Priors) |
| Learning from Data | Estimate fixed parameters from data | Update belief using Bayes' Rule |
| Model Updating | Refit from scratch | Update posterior with new data |
| Prediction | Plug in estimate (e.g. MLE) into model | Average over posterior (Bayesian predictive distribution) |
| Tools | - MLE (Maximum Likelihood Estimation) - Least Squares - Confidence Intervals - Classical hypothesis tests | - Bayes’ Theorem - Priors and Posteriors - MCMC (sampling) - Bayesian Networks |
| Philosophy | Objective, no belief | Subjective, includes beliefs |
| Common Examples | - Linear Regression - Logistic Regression - ANOVA - Chi-squared tests | - Bayesian Linear Regression - Hierarchical Models - Bayesian Inference for proportions - Bayesian Neural Networks |
| Software/Packages | - scikit-learn - statsmodels | - PyMC, Stan, JAGS - BayesPy, TensorFlow Probability |
Frequentist Tools (only)
- t-test, z-test
- p-value
- ANOVA (Analysis of Variance)
- Confidence intervals
- Chi-squared test
- Maximum likelihood estimation (MLE)
- Classical linear regression (OLS)
- Classical logistic regression
- AIC / BIC (for model selection — though sometimes used in Bayesian too)
Bayesian Tools (only)
- Prior / Posterior distributions
- Bayesian updating
- Bayes' Rule
- Bayes Factors (alternative to p-values)
- Credible intervals
- Posterior Predictive Distributions
- Hierarchical Bayesian models
- Bayesian model averaging
- MCMC (Markov Chain Monte Carlo)
- Variational Inference
https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
- Bayesian Classifier, Bayesian Networks, Probability Trees, The first part of Lecture 7 DA and second Part Statistical Learning Pascal (Naive Bayes, Bayes Nets: Representation, Bayes Nets: Inference & D-Separation), EM Algorithm
- Statistical Learning [2]: Multivariate Gaussians, Bayesian Decision Theory, Frequentists vs. Bayesians, Pratik Jain YouTube, Generative/Discriminative models, Credible Interval, Bayesian Estimation
