
Explainable AI (XAI) & Feature Selection/Engineering

Last updated: 6/9/2025
  • The primary goal of Explainable AI (XAI) is to provide a clear understanding of how machine learning models generate their predictions.
  • Model Improvement: Data scientists who develop these models can analyze their behavior, fine-tune parameters, and address prediction errors to enhance accuracy.
  • Regulatory Compliance: Regulations such as GDPR and the Ethics Guidelines for Trustworthy AI require transparency in model decision-making.
  • User Understanding: Individuals impacted by model predictions may seek explanations for the outcomes that affect them.
  • Building Trust: Providing clear explanations fosters greater trust in machine learning, which some perceive as an opaque or intimidating technology.

There are basically two approaches to achieve Interpretability (Interpretable ML vs Explainable ML):

  1. Transparent ML models (Simple Models, Glassbox Models): Interpretable ML, i.e., a model that is understandable without further intervention.
    • Linear Regression
    • Logistic Regression
    • GLM, GAM and more
    • Decision Tree
    • Decision Rules
    • RuleFit
    • Explainable Boosting Machine (New By Microsoft Research)
    • Naive Bayes Classifier
    • KNN
    • You can use a library like https://interpret.ml/docs/lr.html that provides graphs for interpreting the simple models (Local + Global).
  2. BlackBox Models: Explainable ML. You will need a post-hoc technique (LIME, Shapley, Counterfactual Explanations, etc.) on top of the model to explain its complex behavior.

Complex Models & XAI Techniques

  • Complex machine learning models—such as Gradient Boosting, Neural Networks, and Random Forests—are non-parametric methods. They allow the function 𝑓 to adapt its shape freely, capturing intricate patterns and effectively modeling complex relationships between input variables.
  • This flexibility often leads to superior predictive performance compared to traditional statistical models (Linear Regression, Logistic Regression, etc.).
  • However, there is a significant drawback: the exact mathematical form of f is not explicitly available.
  • When we have just 1 or 2 variables we can still draw f in the geometric space, but with more variables we have no way to understand the prediction surface without an explicit form of f.
  • To address this challenge, various techniques have been developed to enhance the interpretability of complex models:
    • Permutation Feature Importance (SKLEARN): Assessing the contribution of each input feature to the model's predictions helps identify which variables significantly influence the output.
    • SHAP Values (SHapley Additive exPlanations): A unified approach that assigns each feature an importance value for a particular prediction, providing insights into individual predictions.
    • LIME (Local Interpretable Model-agnostic Explanations): This technique approximates the complex model locally with an interpretable model to explain individual predictions.
    • Counterfactual Explanations
    • Partial Dependence Plots (PDPs): These plots illustrate the relationship between a selected feature and the predicted outcome, marginalizing over the effects of other features.
    • Accumulated Local Effects (ALE) Plots
    • Individual Conditional Expectation (ICE) Plots
    • Layer-wise Relevance Propagation (LRP)
    • Scoped Rules (Anchors)
    • Morris Sensitivity Analysis

Global vs. Local Interpretability

  • Global interpretability refers to insights into how input features generally contribute to the target variable, either positively or negatively.
    • It allows us to get a sense of the importance or relevance of the predictors in our model, and how they relate to the target variable.
  • Local interpretability refers to insights into why an individual case received its prediction and how the predictor variables contributed to that prediction.
    • This is especially useful in models like RandomForest and XGBoost, because in these kinds of models the contribution of variables to the prediction can vary per individual case.

Partial Dependence Plots (PDP)

https://interpret.ml/docs/pdp.html
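Since this section only links out, here is a minimal, hedged sketch of plotting a PDP with scikit-learn; the fitted estimator model, the data X_val, and the feature indices are placeholders, not from the original note:

python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the prediction on two features (indices 0 and 1 are placeholders)
PartialDependenceDisplay.from_estimator(model, X_val, features=[0, 1])
plt.show()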

Permutation Feature Importance (PFI) (Global)

  • It can be considered part of the Feature Selection Methods (it tells us which features the model relies on the most).
  • It is a model inspection technique that measures the contribution of each feature to a Trained ML model on Tabular Data.
  • As other methods, this technique is particularly useful for non-linear or opaque models.
  • It’s model agnostic, and works with any scoring metric (accuracy, loss, F1, …)
  • It’s also very easy to explain 🙂

Steps:

👌🏻
Note to myself: It’s very very easy.
  1. Take a model that was fit to the training dataset.

    randomforest.fit(X_train, y_train)

  2. Estimate the baseline value of the model using Validation Dataset, and record its accuracy

    acc = randomforest.score(X_val, y_val); print(acc) # 0.99

  3. For Each Feature (Column) j:
    1. Randomly permute (shuffle) the column j across all samples from the Validation Dataset.

       | Age | Rating | Size | House Price (y) |
       |-----|--------|------|-----------------|
       | 20  | 9/10   | 1500 | $300k           |
       | 50  | 7/10   | 2000 | $200k           |
       | 120 | 4/10   | 3000 | $150k           |

       | Age (Shuffled) | Rating | Size | House Price (y) |
       |----------------|--------|------|-----------------|
       | 120            | 9/10   | 1500 | $300k           |
       | 20             | 7/10   | 2000 | $200k           |
       | 50             | 4/10   | 3000 | $150k           |

    2. Record the accuracy of the model using the permuted column (the shuffled table above).

      acc_age = randomforest.score(X_val_perm, y_val); print(acc_age) # 0.85

    3. Compute the Feature Importance as the difference between the baseline and (b).

      0.99 - 0.85 = 0.14

    4. We should repeat steps (a) to (c) over many different random shufflings (permutations), e.g., 50 times, and compute the feature importance as the average difference.
By breaking the relationship between the feature and the target, we determine how much the model relies on that particular feature. A minimal sketch of these steps is shown below.
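A minimal NumPy sketch of steps 1-3 above, assuming model is an already fitted estimator with a .score method and X_val is a NumPy array (with a DataFrame you would shuffle a column by name instead):

python
import numpy as np

def permutation_importance_manual(model, X_val, y_val, n_repeats=50, seed=0):
    """Baseline score, then shuffle one column at a time and measure the drop."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)              # step 2: baseline accuracy (or R^2)
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):                   # step 3: for each feature j
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle column j only
            drops.append(baseline - model.score(X_perm, y_val))
        importances[j] = np.mean(drops)               # average drop = importance of feature j
    return importances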

https://scikit-learn.org/stable/modules/permutation_importance.html
  • On the right, we see the samples ground truth (Blue), and the predicted output (Orange).
  • On the left, we permuted a feature, and we see that the MAE increased to 2.28 from 0.51
  • So, permuting a predictive feature breaks the correlation between the feature and the target, and consequently the model performance decreases.

https://scikit-learn.org/stable/modules/permutation_importance.html
  • On the right, we see the samples ground truth (Blue), and the predicted output (Orange).
  • On the left, we permuted a feature, and we see that almost nothing has changed in the image.
  • So, permuting a non-predictive feature does not significantly degrade the model statistical performance.
  • In this figure, they trained a RandomForestClassifier on the Titanic dataset (whether a passenger survived or not).
  • They added two random features to the Titanic dataset: random_cat (categorical) and random_num (numerical). They are not correlated in any way with the target variable.
  • The x-axis shows how much the accuracy decreases when a feature's values are shuffled.
  • A higher value on the x-axis means the feature is important because permuting (randomizing) it significantly reduces accuracy.
  • Features like sex and pclass have higher values, indicating that they are important for the model.
  • Features like random_cat and random_num have values near zero, confirming that they are not useful for prediction (as expected, since they were added as uncorrelated random features).

Important

Features that are considered of low importance for a bad model (low cross-validation score) could be very important for a good model. Therefore, it is always important to evaluate the predictive power of the model using cross-validation (or a held-out validation dataset) prior to computing importances.

Permutation importance does not reflect the intrinsic predictive value of a feature by itself, but how important this feature is for a particular model. A feature might be more or less important to another model.
💡
Feature importance is the increase in model error when the feature’s information is destroyed. Feature importance provides a highly compressed, global insight into the model’s behavior.
💡
Permutation feature importance does not require retraining the model. Some other methods suggest deleting a feature, retraining the model and then comparing the model error. That takes long time.
💡
In Permutation Feature Importance —> You need access to the true outcome. If you have only unlabeled data – but not the true outcome – you cannot compute the permutation feature importance.
🔑
We permute (rather than replace with arbitrary values) because it is important to sample from the same distribution as the original feature, ensuring the permuted values are realistic. Sampling from a different distribution —> unrealistic scenarios —> misleading importance scores.

Disadvantages

  • Correlated Features Can Cause Misleading Results
    • In PFI, we assume features are independent, which is not always true.
    • If two features are highly correlated, shuffling one of them may not affect model performance because the other still contains the same information.
    • Effect: PFI may underestimate the importance of correlated features, or distribute importance between them arbitrarily.
    • ✅ Example: In a housing price model, if square_feet and number_of_rooms are highly correlated, permuting one may not significantly impact the model's accuracy since the other still provides similar information.
    • Workaround: Use interaction-aware methods like SHAP
    • Workaround 2: Cluster features that are correlated and only keep one feature from each cluster (see the sketch after this list).
  • PFI Measures Importance on a Specific Dataset Split
    • If the test set is small or unrepresentative, the results may not generalize well.
  • Computational Cost for Large Datasets
    • PFI requires multiple model evaluations (each feature shuffle requires a forward pass through the model).
    • Workaround: Reduce n_repeats (number of times permutation is performed) or limit max_samples.
  • Sensitivity to Model Stochasticity
    • If the model is not deterministic (e.g., it contains randomness like dropout in neural networks), the computed importance may vary between runs.
    • Workaround: Use fixed random seeds and increase n_repeats to average out randomness.
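A hedged sketch of Workaround 2 above, along the lines of the scikit-learn permutation-importance example: cluster features by Spearman rank correlation and keep one representative per cluster before computing PFI. X (the raw feature matrix) and the distance threshold t=1.0 are assumptions, not values from the note:

python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Hierarchical clustering on the Spearman rank-correlation matrix
corr = spearmanr(X).correlation                  # X: (n_samples, n_features) array
dist = 1 - np.abs(corr)                          # turn correlation into a distance
linkage = hierarchy.ward(squareform(dist, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")  # threshold is a tuning choice

# Keep one representative feature per cluster, then run permutation importance on that subset
selected = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
print("Selected feature indices:", selected)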

Other Approaches

  • A paper from 2018 suggests splitting the Validation dataset in half and swapping the values of feature j between the two halves, instead of permuting feature j, and then scoring the model.
  • Another method is to expand your Validation dataset into a much larger dataset —> This gives you a more accurate estimate.
    • You have feature j and n samples —> For each sample, turn it into (n-1) samples by replacing the value of j with the values from the remaining samples.
    • We end up with a dataset of n(n-1) samples.
    • This method is recommended if you really want extremely accurate estimates.

Why Validation not Training Dataset

  • If you measure the model error (or accuracy) on the training dataset, the measurement is usually too optimistic, which means that the model seems to work much better than it does in reality.
  • Since the permutation feature importance relies on measurements of the model error, we should use unseen validation or test data.
  • The feature importance based on training data makes us mistakenly believe that features are important for the predictions, when in reality the model was just overfitting and the features were not important at all.

Code

python
sklearn.inspection.permutation_importance
  • n_repeats: Number of times to permute a feature.
  • random_state: control the permutations of each feature. Pass an int to get reproducible results across function calls.
  • sample_weight: Sample weights used in scoring.
  • max_samples: The number of samples to draw from X to compute feature importance in each repeat (without replacement).
  • n_jobs: Since each column's importance is computed separately, this process can be parallelized across multiple CPU cores using n_jobs.
  • scoring:
    • For classifiers, it defaults to accuracy. There are others like recall, precision, F1, roc_auc, log_loss.
    • For regressors, it defaults to r2 (coefficient of determination). There are others like neg_mean_absolute_error, explained_variance, neg_root_mean_squared_error, ….
    • The scoring argument accepts multiple scorers, which is more computationally efficient than sequentially calling permutation_importance several times with a different scorer, as it reuses model predictions.
python
from sklearn.inspection import permutation_importance

r = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:  # importances_mean: mean importance over n_repeats
    # feature_names is assumed to hold the column names of X_val
    print(f"{feature_names[i]:<20} {r.importances_mean[i]:.3f} +/- {r.importances_std[i]:.3f}")

Relation to impurity-based importance in trees

  • NOT STUDIED YET

Counterfactual Explanations (2017) (Local)

  • Definition: Counterfactual is the smallest (minimum) change in the input features, that changes the prediction to another output.
    • Also called “Contrastive Explanations”.
    • They are human-friendly explanations.
  • A counterfactual explanation describes a causal situation in the form: “If X had not occurred, Y would not have occurred”.
    • Even if in reality the relationship between the inputs and the outcome might not be causal, we here consider the inputs as the cause of the prediction.
  • They are local —> Explaining individual instances.
  • Counterfactual explanation is itself a new instance —> It doesn’t have to be from the training dataset.
  • Scenario:
    • Anna wants to rent out her apartment, but she is not sure how much to charge for it, so she decides to train a ML model to predict the rent. (She is an awesome data scientist)
    • After entering all the details about size, location, whether pets are allowed and so on, the model tells her that she can charge 900 EUR. She expected 1000 EUR or more, but she trusts her model.
    • She decided to play with the feature values of the apartment to see how she can improve the value of the apartment.
    • She finds out that the apartment could be rented out for over 1000 EUR, if it were 15 m2 larger. Interesting, but non-actionable knowledge, because she cannot enlarge her apartment.
    • Finally, by tweaking only the feature values under her control (built-in kitchen yes/no, pets allowed yes/no, type of floor, etc.), she finds out that if she allows pets and installs windows with better insulation, she can charge 1000 EUR.
    • Anna has intuitively worked with counterfactuals to change the outcome. >_>

💡
A simple approach to generating counterfactual explanations is searching by trial and error —> randomly change feature values of the instance of interest and stop when the desired output is predicted.
💡

Important

There are better approaches than trial and error. First, we define a loss function. This loss takes as input the instance of interest, a counterfactual and the desired (counterfactual) outcome.

Then, find the counterfactual explanation that minimizes this loss using an optimization algorithm.

Many methods proceed in this way, but differ in their definition of the loss function and optimization method.

Properties and Desiderata of Counterfactual Explanations

  • Closeness to the predefined output
  • Closeness to the input
  • Sparsity (Change only few features)
  • Diversity and multiple explanations
  • Feasibility and Actionability (e.g., a counterfactual that requires decreasing someone's age, or a height of 1.1 m with a weight of 10 kg, is not feasible)

Wachter Method

  • Wachter et al. suggest minimizing the following loss to produce a counterfactual explanation (sketched in code below).
  • Optimization Problem: minimize the distance between the actual instance and the counterfactual while pushing the model's output toward the desired outcome:

    $L(x, x', y', \lambda) = \lambda \cdot (\hat{f}(x') - y')^2 + d(x, x')$

    where x is the actual instance, x' the counterfactual instance, y' the desired outcome, $\hat{f}(x')$ the model output for x', and d a distance in feature space.
  • IMPORTANT: The loss measures how far the predicted outcome of the counterfactual is from the predefined outcome and how far the counterfactual is from the instance of interest.
  • IMPORTANT: The parameter λ balances the distance in prediction (first term) against the distance in feature values (second term).
    • Larger λ —> prefer counterfactuals whose prediction is closer to the desired outcome.
    • Smaller λ —> prefer counterfactuals closer to the original instance.
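A toy sketch of this loss-minimization idea (not the authors' exact algorithm): a scipy black-box optimizer stands in for the paper's optimization scheme, model is assumed to be a fitted classifier with predict_proba, x a single instance, and 0.9 the desired probability:

python
import numpy as np
from scipy.optimize import minimize

def wachter_loss(x_cf, x, y_target, model, lam):
    pred = model.predict_proba(x_cf.reshape(1, -1))[0, 1]   # P(class 1) for the candidate counterfactual
    prediction_term = lam * (pred - y_target) ** 2           # pull the prediction toward the desired outcome
    distance_term = np.sum(np.abs(x_cf - x))                 # stay close to the original instance (L1 distance)
    return prediction_term + distance_term

result = minimize(wachter_loss, x.copy(), args=(x, 0.9, model, 10.0), method="Nelder-Mead")
counterfactual = result.x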

DiCE (Diverse Counterfactual Explanations)

  • Extends Wachter et al. to consider also the properties of Diversity and Feasibility

Relation to Adversarial Examples

Code

Reference:

  • I am using a library called DiCE: Diverse Counterfactual Explanations (DiCE) for ML.
  • DiCE implements counterfactual (CF) explanations that show feature-perturbed versions of the same person who would have been accepted for the credit card, e.g., you would have been accepted if your income were higher by $10,000.
  • This explanation is better than saying you were rejected because you have a poor credit history.
  • It provides "what-if" explanations for model output.
  • Methods:
    • Random
    • Genetic: The genetic algorithm converges quickly, and promotes diverse counterfactuals.
    • KD Tree (for counterfactuals within the training data): Here, DiCE is used to generate CFs for any ML model by finding the closest points in the dataset that yield the desired class as output. We do this efficiently by building KD trees for each class, and querying the KD tree of the desired class to find the k closest counterfactuals from the dataset. The idea behind finding the closest points from the training data itself is to ensure that the counterfactuals displayed are feasible.

    Gradient-based methods

    • An explicit loss-based method described in Mothilal et al. (2020) (Default for deep learning models).
    • A Variational AutoEncoder (VAE)-based method described in Mahajan et al. (2019) (see the BaseVAE notebook).

    The last two methods require a differentiable model, such as a neural network.

python
from utils import DataLoader
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
python
# pip install dice-ml
import dice_ml

# Wrap the data and the trained model (the dataframe/column names below are placeholders for your dataset)
d = dice_ml.Data(dataframe=df_train, continuous_features=continuous_features, outcome_name=target)
m = dice_ml.Model(model=rf, backend="sklearn")
explainer = dice_ml.Dice(d, m, method="random")  # method can be "random", "genetic", or "kdtree"
python
query_instance = X_test[0:1] # Generate CF for a given instance (Remember: It's a local Method)
cf = explainer.generate_counterfactuals(query_instance,
                                        total_CFs=3, # Generate 3 counterfactuals
                                        desired_class="opposite")

cf.visualize_as_dataframe(show_only_changes=True)
  • Imagine if the hours_per_week was negative? Imagine if the gender changed from Male to Female? These are infeasible changes! We need to control 😊
python
# Create feasible (conditional) counterfactuals by restricting the search
# (features_to_vary / permitted_range are DiCE arguments; the column names are placeholders)
cf = explainer.generate_counterfactuals(query_instance, total_CFs=3, desired_class="opposite",
                                        features_to_vary=["education", "hours_per_week"],
                                        permitted_range={"hours_per_week": [20, 60]})
  • If it is a regression problem, we can use the following (model_type="regressor" and a desired_range instead of a desired class; the names below are placeholders):
python
d_housing = dice_ml.Data(dataframe=df_housing, continuous_features=continuous_features_housing, outcome_name=outcome_name)
m_housing = dice_ml.Model(model=model_housing, backend="sklearn", model_type="regressor")
exp_housing = dice_ml.Dice(d_housing, m_housing, method="genetic")
cf_housing = exp_housing.generate_counterfactuals(X_test_housing[0:1], total_CFs=3, desired_range=[200000, 250000])

Disadvantages

  • You must perturb features that are actionable (can be changed) (from possible worlds).
    • Can’t change Gender, Can’t add 3 rooms to a house, Can upgrade kitchen, Can study more.
  • Rashomon Effect: There are multiple counterfactual explanations.
    • Rashomon is a Japanese movie in which the murder of a Samurai is told by different people. Each of the stories explains the outcome equally well, but the stories contradict each other.
    • Therefore, each counterfactual tells a different “story” of how a certain outcome was reached. One counterfactual might say to change feature A, the other counterfactual might say to leave A the same but change feature B, which is a contradiction.
    • This issue of multiple truths can be addressed either by reporting all counterfactual explanations or by having a criterion to evaluate counterfactuals and select the best one.
    • Reporting all counterfactuals allows users to see the range of possible explanations, reducing the risk of bias or cherry-picking a particular explanation that might not be representative.
    • Evaluation Criterion:
      • Metrics such as plausibility (how realistic it is to actually make this change), minimal changes required, or causal consistency.
      • Multiple CFs: Increase income to $50,000 → Credit Card Approved ✅ OR Improve credit score to 650 → Credit Card Approved ✅ OR Reduce Debt-to-Income ratio to 40% → Credit Card Approved ✅
      • Increasing income by $10,000 might not be immediately achievable. [The hardest, least favorable]
      • Improving the credit score by 50 points could be more feasible (e.g., paying off small debts, reducing credit utilization). [If they have time, another reasonable path]
      • Reducing the debt-to-income ratio could be done by paying off some debts or increasing income. [The most actionable and plausible option]
  • Making sure the change is causal not only correlated (Notebook link)

Shapley (1953) (Local)

Let's say we have a model that predicts house prices based on: Size of the house (sq ft), Number of bedrooms, Nearby school rating, and Age of the house

  • If the model predicts a house price of $500,000, Shapley values (Ordered Values) can tell us:
    1. 🏠 "Size of the house" contributes $300,000
    2. 🏫 "School rating" contributes $120,000
    3. 🛏 "Number of bedrooms" contributes $50,000
    4. ⏳ "Age of the house" reduces value by $30,000
  • Shapley is game theory-based framework used to explain the output of machine learning models. It quantifies the contribution of each feature to a model's prediction for a specific instance, ensuring fair allocation of "credit" among features.
  • The Shapley value, coined by Shapley (1953), is a method for assigning payouts (contributions) to players depending on their contribution to the total payout (ML Model Output).
  • It’s a model agnostic Framework.
  • This method is actually coming from Cooperative Game Theory.

Shapley Example

  • Reference: https://christophm.github.io/interpretable-ml-book/shapley.html#shapley
  • You have an ML Model to predict apartment prices. For a specific apartment it predicts $300,000 when the apartment has a size of 1,500 sq ft, an age of 10 years, and a location (neighborhood rating) of 8/10.
  • The average prediction across all houses (Baseline) is $200,000.
  • How much has each feature value contributed to the prediction compared to the average prediction?
💡
The answer is simple for linear regression models. The effect of each feature is the weight of the feature times the feature value. This only works because of the linearity of the model. For more complex models, we need a different solution.
  • In our house example, the feature values size-1500, age-10, rating-8/10 worked together to achieve the prediction of $300,000.
  • IMPORTANT: Our goal is to explain the difference between the actual prediction ($300,000) and the average prediction ($200,000): a difference of +$100,000.
  • In the book's apartment example (features park-nearby, cat-banned, area-50, floor-2nd), the answer could be: park-nearby contributed €30,000; area-50 contributed €10,000; floor-2nd contributed €0; cat-banned contributed -€50,000. The contributions add up to -€10,000: the final prediction minus the average predicted apartment price.
  • The question now is given a sample, how do we calculate the Shapley value of a feature? Let’s say (cat-banned). How do we know its marginal contribution?
    💡
    The Shapley value is the average (Expected) marginal contribution of a feature value across all possible (subsets) coalitions.
    • The definition will be clear after the following explanation.
    • We have cat-banned at hand, and we have 2^3 = 8 subsets (or coalitions) that can happen.
    • Let’s evaluate the contribution of the cat-banned feature value when it is added to a coalition of park-nearby and area-50
    • We simulate that only park-nearby, cat-banned and area-50 are in a coalition by keeping their values.
    • We simulate excluding a value, by getting its value from a random sample (This is because our ML model still expects a value for the excluded features).
    • We randomly select another apartment from the training data and use its value for the floor feature. —> floor-1st.
      • Contribution of Coalition of (park, area) when joined by (cat)
      • The new input is (park-nearby, cat-banned , area-50, floor-1st) —> €310,000
    • Now, we also replace cat-banned with the value of the same random apartment, let’s say cat-allowed.
      • Contribution of Coalition of (park, area) when not joined by (cat)
      • The new input is (park-nearby, cat-allowed , area-50, floor-1st) —> €320,000
    • The contribution of cat-banned was €310,000 - €320,000 = -€10,000 for this coalition.
    • All Possible Coalitions:
      • No feature values
      • park-nearby
      • area-50
      • floor-2nd
      • park-nearby+area-50 (Our Analysis)
      • park-nearby+floor-2nd
      • area-50+floor-2nd
      • park-nearby+area-50+floor-2nd

  • IMPORTANT: We repeat this computation for all possible coalitions that do not contain the feature at hand. —> The Shapley value is the average of all the marginal contributions to all possible coalitions.
    • This is more obvious in the Shapley equation.
💡
For Shapley Regression Values, we can exclude other features by setting their weights to 0, therefore getting prediction without them!
  • Note that a random feature has usually no predictive power.

More Details about Shapley

Figure: Random forest model predicting cervical cancer. With a prediction of 0.57, this woman's cancer probability is above the average prediction, and the Shapley values show which feature values pushed it up or down.
Figure: Random forest to predict the number of rented bikes for a day, given weather and calendar information. These are the Shapley values for day 285. With a predicted 2409 rental bikes, this day is -2108 below the average prediction of 4518. The weather situation and humidity had the largest negative contributions. The temperature on this day had a positive contribution. The sum of Shapley values yields the difference of actual and average prediction (-2108).
💡
The Shapley value is NOT the difference in prediction when we would remove the feature from the model.
💡
Simply removing a feature and comparing the model’s prediction (full model vs. reduced model) gives you only one difference. However, some features rely on others to be useful. A feature’s impact on the prediction might depend on which other features are included. Imagine a Kaggle competition where the Coding Expert writes the code, and the Data Scientist guides building the model. If you remove the Coding Expert, the Data Scientist might struggle, making it unfair to measure the Coding Expert’s contribution in isolation.

Similarly, adding the Coding Expert boosts the Data Scientist’s effectiveness.

The Shapley value fairly calculates each feature’s contribution by considering all possible feature combinations and averaging their impact across different contexts.

💡
The Shapley value is the weighted average contribution of a feature value to the prediction in different coalitions.
  • Later, you can take the absolute value |Shapley| when comparing or ranking, because we just care about the magnitude of each feature.
  • We can even calculate the Shapley values across all samples for each feature, average their absolute values across samples, and report the ranking or a histogram (see the small snippet below).
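A small snippet for that aggregation, assuming shap_values is an (n_samples, n_features) array (e.g., from the SHAP library discussed later) and feature_names holds the column names:

python
import numpy as np

# Global importance = mean |Shapley value| per feature, aggregated over all samples
mean_abs_shap = np.abs(shap_values).mean(axis=0)   # shap_values: (n_samples, n_features)
for i in np.argsort(mean_abs_shap)[::-1]:
    print(f"{feature_names[i]:<25} {mean_abs_shap[i]:.4f}")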

Shapley Formula #1 (Weighted Average of the Marginal Contributions)

  • This function calculates the (Shapley Value) weighted average contribution of a feature value (i) to the prediction in different coalitions.
  • It loops over all subsets S and calculates what happens with the feature and without it: $\left[v(S \cup \{i\}) - v(S)\right]$.
  • The full formula (a brute-force implementation is sketched below) is:

    $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\left[v(S \cup \{i\}) - v(S)\right]$

  • $\frac{|S|!\,(|N|-|S|-1)!}{|N|!}$ is the weighting factor.
  • N is the set of all features. S is any subset (the loop variable).
  • $S \subseteq N \setminus \{i\}$ means we consider every possible subset S that does not include feature i (so we can later join it with {i}).
    • If we have 2 features, then we loop over only 2 coalitions for $x_1$, namely ∅ and $\{x_2\}$; for $x_2$ there are also 2 coalitions, ∅ and $\{x_1\}$.
    • It is tricky, because at first I thought we loop over all coalitions for each feature, which is incorrect.
  • |S| is the number of features in subset S.
  • |N| is the total number of features.
  • The weight looks confusing; it is crafted so that the Shapley value satisfies the properties of Efficiency, Symmetry, Dummy (Null Players), and Linearity.
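A brute-force sketch of this formula, assuming a value function v that maps a coalition (a frozenset of feature names) to the model's expected prediction with only those features present; this is exponential in the number of features, which is exactly why approximations like SHAP exist:

python
from itertools import combinations
from math import factorial

def shapley_value(i, features, v):
    """Exact Shapley value of feature i, given a value function v(frozenset) -> prediction."""
    n = len(features)
    others = [f for f in features if f != i]
    phi = 0.0
    for size in range(n):                                 # coalition sizes 0 .. n-1
        for S in combinations(others, size):              # every coalition S that excludes i
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
    return phi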

Shapley Formula #2 (Weighted

Why Do We Need a Weight?

  • asda
  • asdasd???
  • Ordering of Joining???

LIME (2016) (Local)

  • Local Interpretable Model-Agnostic Explanations.
    • Agnostic as well as SHAPley.
  • Another popular method for explaining machine learning models is LIME.
    • It’s explaining individual predictions of black box machine learning models.
  • We mention it because we will need it for SHAP.
  • LIME works by fitting a so-called local surrogate model (a local substitute, i.e., an interpretable model) that approximates the predictions of the underlying black box model.
  • In contrast to SHAP, LIME can only be used for local interpretation, because the locally fitted surrogate model only applies in proximity of the data point being explained.
  • It works on any black box model —— It works with tabular data, text, images, and graphs.
  • In the following graph, the ML model is non-linear, and it becomes very difficult to explain why the red point is classified as Stroke.
    https://www.youtube.com/watch?v=d6j6bofhj2M&ab_channel=DeepFindr
  • What LIME will do is take a very close look so that we can say, for this specific instance, feature 2 is more critical and mainly determines the output class, while changes in feature 1 have little or no impact. We will be able to use a linear model like $y = w_1 x_1 + w_2 x_2$ to explain which weight has the larger effect.
https://www.youtube.com/watch?v=d6j6bofhj2M&ab_channel=DeepFindr

  • How does it work (IMPORTANT):
    1. Your goal is to understand why the machine learning model made a certain prediction.
    2. LIME generates a new dataset consisting of perturbed samples along with their corresponding predictions using the black box model.
      • These new data points are created by perturbations to the red point. The new feature values can be sampled using the normal distribution of each feature (each with mean and std).
      • Hence to sample using a normal (Gaussian) distributions, LIME needs your training dataset to create new points.
      • For text and images, the way is to turn single words or super-pixels on or off. (Explained Below)
    3. On this new dataset (perturbed samples) LIME then trains an interpretable model, which is weighted by the proximity of the sampled instances to the instance of interest.  (will see the formula below)
    4. The interpretable model can be Linear Model or Decision Tree.
💡
Limiting the interpretable model to these two categories makes it simple to explain which feature had an effect on the output of the model.
  • Linear Models: A linear model expresses the prediction as a weighted sum of features. The weights (or coefficients) are easy to understand, as they directly represent the contribution of each feature.
  • Decision Trees: Shallow decision trees can be visualized and easily interpreted as a sequence of if-then rules.
💡
Fidelity: The surrogate (interpretable) model should closely approximate the predictions of the complex model in the local area.
  • When generating perturbed samples for Text data, the process is straightforward if words are represented as binary features, such as in a bag-of-words model. Just turn some OFF or ON, and generate new samples.
  • In Image data, perturbing individual pixels is not meaningful. Instead, groups of similar pixels, known as superpixels, are perturbed together by "blanking" them—removing them from the image.
    • These superpixels are clusters of interconnected pixels with similar colors, which can be identified using techniques like k-means clustering.

Choosing the Interpretable Model

  • The goal isn’t to achieve the highest predictive performance but to capture the behavior of the complex model in a simple, understandable way.
  • For Linear Models
    • LASSO Regression (L1 Regularized Linear Regression) (THE CHOSEN MODEL):

      LASSO is often preferred because it tends to zero out the coefficients of less important features, resulting in a sparse model. This sparsity makes it easier to identify which features have the most influence on the prediction.

    • Elastic Net Regression:

      This model combines both L1 (LASSO) and L2 (Ridge) regularization. It is particularly useful when features are correlated because it can balance between enforcing sparsity and keeping correlated features in the model. The result is still relatively simple and interpretable.

  • For Decision Trees
    • CART, ID3, or C4.5, By limiting the maximum depth to 2 or 3.
    • Regardless of the algorithm, pruning is key. A pruned tree removes branches that have little power in explaining the local behavior, which keeps the model simple and focused on the most relevant decisions.
💡
Even though these models are simple, when they are trained on data sampled around the instance of interest (weighted by a proximity measure), they can effectively approximate the complex model’s behavior locally.
  • If the relationships are largely linear in the local neighborhood, LASSO or Elastic Net might be more suitable.
    • For more complex local structures, a shallow decision tree might better capture non-linear interactions in an understandable form.
  • With linear surrogate models, you also get to choose k, the number of features you want to keep in your interpretable model.

LIME Generic Formula

$\hat{g} = \operatorname*{arg\,min}_{g \in G} \ \mathcal{L}(f, g, \pi_x) + \Omega(g)$
  • $f$: The original complex black-box model you want to explain.
  • $g$: An interpretable model (for example, a linear model) selected from a set $G$ of possible explanation models.
  • $\pi_x$: A kernel function that assigns a weight to each perturbed sample based on its "closeness" to the original instance $x$.
  • $\mathcal{L}(f, g, \pi_x)$: A loss function measuring how well $g$ approximates $f$, for example a weighted squared error for a linear model.
  • $\Omega(g)$: A regularization term that penalizes the complexity of $g$ to ensure the explanation remains interpretable.
    💡
    For example, $\Omega(g)$ might enforce sparsity (by encouraging many coefficients to be zero) or limit the depth of a decision tree. This ensures that the surrogate model remains simple enough for human interpretation.
  • $G$: The family of possible explanations, for example all possible linear regression models. In practice, almost everyone uses LASSO for LIME.

How do we calculate the proximity function $\pi_x$? (Exponential Kernel Function)

  • A common choice is the Exponential (Smoothing) Kernel Function.
    • It returns a proximity measure. (Opposite of distance)
  • The distance functions used are the Euclidean distance for tabular data and images, and the cosine similarity measure for text data.
    • Python LIME allows you to provide any other distance function.
  • $D^2$: Squaring amplifies the penalty for larger distances, ensuring weights drop sharply for far-away points.
  • Negative exponent: We want to give less importance to far values, so the negative sign is needed in the weight. (Switching from distance to proximity.)
  • Exponential: Maps distances to weights in (0, 1], ensuring:
    • Smooth decay: Weights decrease gradually (no abrupt cutoffs).
    • Non-negativity: All weights are positive. (Y-Axis)
  • Bandwidth (Kernel) (σ)
    • The kernel width determines how large the neighborhood is.
    • A small kernel width means that an instance must be very close to influence the local model.
    • A larger kernel width means that instances that are farther away also influence the model.
    • IMPORTANT: They (heuristically) chose to use a kernel width of $0.75 \cdot \sqrt{n\_features}$ of the training data.
    💡
    We square $\sigma$ to leave the exponent dimensionless (unitless).
    • Say our point of focus is (1,1), and the perturbed samples are: closest (1.5, 1), middle (2, 1.5), and farthest (5, 7).
      • Their distances are 0.5, 1.1, 7.2.
      • Their squared distances are 0.25, 1.25, 52.
    • If we just calculate (dist² / σ²):
      • σ = 0.1 —> σ² = 0.01, so the ratios are 25, 125, 5200 —> with the negative exponential, the weights become 0, 0, 0.
        • That means none of these points are considered in the calculation, and we would need much, much closer points.
        • So, let's try a higher sigma.
      • σ = 0.75 —> σ² = 0.5625, so the ratios are 0.45, 2.2, 92.4 —> with the negative exponential, the weights become 0.6412, 0.1084, 0.
        • This is better, because it now focuses more on the closest point, and not on the farthest.
      • σ = 0.75 · √2 = 1.06 —> σ² = 1.125, so the ratios are 0.2, 1.1, 46.2 —> with the negative exponential, the weights become 0.8007, 0.3292, 0.
        • This is the approach they chose in the implementation of LIME.
        • We see it started giving more importance to the closest point, and to the middle one as well.
      • σ = 3 —> σ² = 9, so the ratios are 0.0278, 0.138, 5.77 —> with the negative exponential, the weights become 0.97, 0.87, 0.0031.
        • We see that as we use a larger width, we divide by a larger and larger number (now 52/9 = 5.77), so the ratios get lower (5.77 compared to 5200).
        • Therefore we get a weight above zero in the exponential, which is 0.0031.
  • Problem of a big width
    When farther points also get weight, the surrogate model will not correctly explain the local point.
    • The black line is the actual data, and we have a point (X).
    • In this example, the best width is 0.1 as it is not influenced by the other black points; it does the regression correctly for the close points.
    • Of course 0.2 is the worst, as it just tries to capture all the data. 0.75 is also bad, but not the worst here.
💡
The width is highly application- and data-dependent, and it is not clear how to pick it correctly. 😊 A small numeric sketch of the kernel weighting and the weighted surrogate fit is shown below.
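A minimal numeric sketch of the exponential kernel from the example above (focus point (1,1); perturbed samples (1.5,1), (2,1.5), (5,7)), followed by a weighted LASSO surrogate fit; black_box_predict is a placeholder for the complex model, not part of the original note:

python
import numpy as np
from sklearn.linear_model import Lasso

x0 = np.array([1.0, 1.0])                                  # instance of interest
Z = np.array([[1.5, 1.0], [2.0, 1.5], [5.0, 7.0]])         # perturbed samples (toy values from the example)

sigma = 0.75 * np.sqrt(Z.shape[1])                         # LIME's heuristic width: 0.75 * sqrt(n_features)
d2 = np.sum((Z - x0) ** 2, axis=1)                         # squared Euclidean distances: 0.25, 1.25, 52
weights = np.exp(-d2 / sigma ** 2)                         # proximity weights: ~0.80, ~0.33, ~0.0

# Fit a sparse local surrogate, weighting the samples by their proximity to x0
y_hat = black_box_predict(Z)                               # predictions of the black-box model (assumed)
surrogate = Lasso(alpha=0.01).fit(Z, y_hat, sample_weight=weights)
print(surrogate.coef_)                                     # local feature effects around x0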

LIME (How to Choose Number of Features) TO STUDY AFTER LASSO

  • Complex models often use many features, but not all contribute equally to the prediction. Selecting a subset helps focus on the most influential factors.
  • Here we want to see what is the best number of features to choose?
  • Option #1: none
    • Use all available features for the explanation.
    • Not recommended unless you have a very limited number of features.
    • It will ignore the parameter num_features
  • Option #2: forward_selection
    • Features are added one by one.
    • At each step, the feature that most improves the interpretable model's fit is chosen.
    • This is costly when num_features is high.
  • Option #3: highest_weights
    • A ridge regression model is fit to the complex model's predictions.
    • The m features with the highest absolute weights (i.e., the largest impact) are selected.
  • Option #4: Lasso
    • Uses the regularization path from a LASSO regression fit.
    • Chooses the m features that are least prone to shrinkage, meaning they are less penalized and likely more important.
  • Option #5: Tree
    • A decision tree is fit with approximately log2(m) splits.
    • This approach uses up to m features, though it might select fewer depending on the splits.
  • Option #6: Auto: automatically selects the method based on the number of features m:
    • Uses forward selection if m ≤ 6.
    • Uses highest weights if m > 6.
    • (See the usage sketch below for how this choice is passed to the explainer.)
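For reference, the Python lime package exposes this choice through the feature_selection argument of the tabular explainer (values such as 'auto', 'forward_selection', 'highest_weights', 'lasso_path', or 'none'); a hedged sketch that reuses the objects defined in the "LIME in CODE" section below:

python
from lime import lime_tabular

# Same explainer as below, but with the feature-selection strategy made explicit
explainer = lime_tabular.LimeTabularExplainer(
    training_data=X_train,
    feature_names=data['feature_names'],
    class_names=data['target_names'],
    mode='classification',
    feature_selection='lasso_path',   # or 'auto', 'forward_selection', 'highest_weights', 'none'
)
exp = explainer.explain_instance(X_test[0], forest_clf.predict_proba, num_features=5)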

LIME in CODE

python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime import lime_tabular

# Load dataset
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train RandomForest classifier
forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)

# Print model accuracy
print(forest_clf.score(X_test, y_test))

# Initialize LIME explainer
explainer = lime_tabular.LimeTabularExplainer(
    training_data=X_train,
    feature_names=data['feature_names'],
    class_names=data['target_names'],
    mode='classification'
)

# Explain 20 instances
for i in range(20):
    true_label = 'benign' if y_test[i] else 'malignant'
    predicted_label_index = forest_clf.predict([X_test[i]])[0]  # Get the predicted class index
    predicted_label = data['target_names'][predicted_label_index]  # Convert index to class name

    print(f"Real: {true_label}")
    print(f"Predicted: {predicted_label}")
    print(dict(zip(data['feature_names'], X_test[i])))

    explanation = explainer.explain_instance(
        data_row=X_test[i],
        predict_fn=forest_clf.predict_proba,
        num_features=30
    )

    fig = explanation.as_pyplot_figure()
    fig.set_size_inches(10, 8)  # Increase figure size

    # Manually set the correct title
    plt.title(f"LIME Explanation - True: {true_label}, Predicted: {predicted_label}")

    plt.show()
    break  # Only explain one instance for now
  • The longer the bar, the stronger the influence of that feature on the classification decision.
  • The feature "worst concave points > 0.16" has a large negative value, meaning it strongly contributes to classifying the sample as malignant.
  • Features with small positive values (green bars) contribute towards classifying the sample as benign, but their influence is weaker.

(Showing first a transparent ML model (Decision Tree Classifier), then building a Random Forest Classifier.)

Problems in LIME

  • Problem #1: The kernel function $\pi_x$ treats distances in all feature directions the same.
    • Someone could say: let's standardize all features to a similar scale. That is actually good and needed!
    • But the problem remains: some of the features might be completely irrelevant to your problem, but still get a large "say" in the distance computation. That is a typical problem in unsupervised learning as well (clustering based on features).
    • Also, the curse of dimensionality strikes: as the feature space becomes increasingly high-dimensional and sparse, it becomes difficult to compute meaningful distances.
  • Problem #2: How to choose the kernel width σ. (BIGGEST ISSUE)
    • Small values of σ mean that only data points extremely close to the instance x receive a significant weight (as we discussed, the decay is exponential).
    • Large values of σ make all points get a similar weight, so we move toward a global surrogate model. It is like giving a weight of 1 to every perturbed sample.
  • Problem #3: The generation step is an open issue!
    • The current implementation generates points all over the space of the X variables, then gives importance only to the close ones.
    • There are some works on local generation techniques for LIME, but still have problems.
  • Problem #4: Only Local Interpretation.
  • Problem #5: Instability
    • It’s shown in research that the explanations of two very close points varied greatly in a simulated setting.
    • Also, if you repeat the sampling process, then the explanations that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.
    • LIME explanations can be manipulated by the data scientist to hide biases (For example by changing the radial basis / Bandwidth / Kernel)
💡
There are plenty of other options that don’t require to specify a neighborhood. Like Shapley values, counterfactual explanations, and what-if analysis.

SHAP (SHapley Additive exPlanations) (2017)

💡
SHAP is a famous approximation of Shapley to make them computable.
  • The authors of SHAP essentially use the Shapley values we studied above, but they propose:
    1. KernelSHAP, an alternative, kernel-based estimation approach for Shapley values
    2. TreeSHAP, an efficient estimation approach for tree-based models.
    3. Multiple Global Interpretation Methods based on aggregations of Shapley values.

SHAP Equation

$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i$
  • This is a way of saying: given any point, you can select which features you want the effect of, and the explanation model adds up their contributions.
  • $f$: The original prediction model to be explained.
  • $g$: The explanation model (an interpretable approximation of the original model).
    • It is a local explanation function for a single instance. See the example below on how to use it.
  • $\phi_0$: Baseline value (the average prediction / expected value of the original model).
  • $M$: The total number of features.
  • $i$: Iterating over features: Age, Degree, Height, etc.
  • $z'$: The coalition vector, e.g., a vector [0,0,1,0,1,1] of size $M$.
    • The coalition vector is also called "simplified features"; the name was chosen because, for image data, the images are not represented on the pixel level but aggregated into superpixels.
    • A value of 1 means the feature is present, and 0 means it is absent.
  • $\phi_i$: The Shapley value of feature $i$: how much feature $i$ changed the output of the model.

Example Setup

  1. Assume we have a complex model that predicts house prices based on two features: Number of bedrooms (x1) and Size in square feet (x2)
  2. The predictions of f(x) for our given point are as follows:

    Since this is a regression problem, to exclude a feature we can just set it to zero. (In the general case we would take the excluded feature's value from a random sample.)

    | Feature Set (Included Features) | Model Prediction f(x) |
    |---------------------------------|-----------------------|
    | No features (baseline)          | $100K                 |
    | Only bedrooms                   | $130K                 |
    | Only square feet                | $160K                 |
    | Both bedrooms & square feet     | $200K                 |
  3. Bedroom Contribution (Shapley Value):
    • Adding bedrooms to the empty coalition, [0,0] → [1,0]:
      • No features: f(∅, ∅) = $100K
      • Bedrooms introduced: f(x₁, ∅) = $130K
      • Contribution = $130K - $100K = $30K
    • Adding bedrooms after square feet is already included, [0,1] → [1,1]:
      • With only square feet → Model predicts $160K.
      • With both features → Model predicts $200K.
      • Contribution = $200K - $160K = $40K.
    • Avg: ($30K + $40K) / 2 = $35K

  4. Size Contribution (Shapley Value):
    • Without any features → Model predicts $100K$.
    • Adding only square feet → Model predicts $160K$.
      • Contribution = $160K - 100K = 60K$.
    • Adding square feet after bedrooms are already included:
      • With only bedrooms → Model predicts $130K$.
      • With both features → Model predicts $200K$.
      • Contribution = $200K - 130K = 70K$.
    • Avg: (60K+70K)/2 = $65K
  5. Form the Additive Explanation Function
    • Now that we have the Shapley values, we can build the additive explanation function (verified in the snippet below).
    • In this case the explanation function is: $g(z') = \phi_0 + \phi_1 z'_1 + \phi_2 z'_2$
    • IMPORTANT: Let's say I want to see the effect of bedrooms only (remember, this model explains one instance only) —> our coalition vector is (1,0) —> g(1,0) = 100K + (35K × 1) + (65K × 0) = 135K.
    • Let's say now the effect of size and bedrooms together —> our coalition vector is (1,1) —> g(1,1) = 100K + (35K × 1) + (65K × 1) = 200K.
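A quick check of these numbers in code, using the prediction table above as the value function (all values are the toy numbers assumed in this example):

python
# Value function from the table above: prediction for each coalition of present features
v = {frozenset(): 100, frozenset({"bedrooms"}): 130,
     frozenset({"sqft"}): 160, frozenset({"bedrooms", "sqft"}): 200}

# With two features, each Shapley value is the average of two marginal contributions
phi_bedrooms = ((v[frozenset({"bedrooms"})] - v[frozenset()]) +
                (v[frozenset({"bedrooms", "sqft"})] - v[frozenset({"sqft"})])) / 2   # (30 + 40) / 2 = 35
phi_sqft = ((v[frozenset({"sqft"})] - v[frozenset()]) +
            (v[frozenset({"bedrooms", "sqft"})] - v[frozenset({"bedrooms"})])) / 2   # (60 + 70) / 2 = 65

# Additive explanation g(z') = phi_0 + phi_1*z1' + phi_2*z2', with phi_0 = baseline = 100
g = lambda z1, z2: 100 + phi_bedrooms * z1 + phi_sqft * z2
print(g(1, 0), g(1, 1))   # 135.0, 200.0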

💡
Shapley and SHAP explain how we arrived at a prediction by comparing it to the Expected Value (i.e., the Average of training dataset predictions).

So the question being answered is: can we explain the difference in output relative to this average, rather than relative to zero, since zero is just an arbitrary reference point?

Additive Explanations

https://interpret.ml/docs/shap.html

Deep SHAP

Waterfall Plot

  • In the first place, SHAP is a method for local interpretation of the model, because SHAP values quantify the contribution of each feature to a single prediction case.
  • However, SHAP can also be used for global interpretation, where the SHAP values of multiple prediction cases are combined or aggregated to get a sense of the more general contribution of the features to the outcome.

Comparing SHAP to LIME

  • At first sight, SHAP seems more versatile: it offers methods for both local and global interpretation of models, and there are multiple options with regard to visualization.
  • In contrast, LIME only offers a method for local interpretation of models, and is more restricted in terms of visualization.
  • However, running time can be an issue for SHAP: computing the SHAP values for a subset of only 5,000 samples already took 3 minutes.
    • Furthermore, the SHAP waterfall plot, takes a very long time for the algorithm to compute the values that indicate the contribution of each variable to the output.
    • But a similar plot in LIME computes much faster.

💡
A good practice is to choose SHAP over LIME when you want to explain your model on a global level, or when you only want to explain a few specific predictions. If you are explaining large volumes of predictions and SHAP gets too slow, LIME can be a nice alternative!

Scikit-learn's tree-based models have the built-in attribute feature_importances_ to get a first idea of which features are important in the model.

learning to rank (LTR) or Collaborative Filtering (CF).

python
import shap
import xgboost
import pandas as pd

# Ensure SHAP visualizations work in a Jupyter Notebook
shap.initjs()

# Load a dataset and fit a model (the dataset/model here are just examples)
X, y = shap.datasets.boston()  # Example dataset (newer SHAP versions may need e.g. shap.datasets.california())
model = xgboost.XGBRegressor().fit(X, y)

# Explain the model's predictions with the tree explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary plot of feature contributions
shap.summary_plot(shap_values, X)

How to Read SHAP Summary Plot

  • The dots in the plot visualize the SHAP values. Each row (feature) normally has one dot per sample.
    • If some features have missing values for certain samples, a row can end up with fewer dots.
    • The same can happen with one-hot-encoded columns: rare categories can cause some rows to represent fewer data points.
  • The color of a dot represents the feature value for the specific data point. If it’s red, then high value for this feature in this datapoint.
  • The position on the x-axis displays the corresponding SHAP value.
    • A high SHAP value indicates a strong positive impact on the output, whereas a negative SHAP value means this feature negatively impacts the output.
  • The higher a feature appears in the summary plot, the more important or relevant the feature is in the model.
💡
When you see a “crowded” area (a large cluster of dots) at a particular SHAP value for a given feature, it means many data points share a similar contribution from that feature to the model’s prediction.
python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import plot_roc_curve

# Specifying our target variable:
target = 'subscr_default'

# Separating our target variable from our features:
y = df[target]
X = df.drop(target, axis = 1)

# Splitting the dataset into a 70% train and 30% test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 17)

# Initialize and fit the model:
rf = RandomForestClassifier(random_state=17)
rf.fit(X_train, y_train)
python
# Specific Point using SHAP

shap.initjs()

j = 49 # index of person with high probability (>0.8)

# X_train_sample is assumed to be a (sub)sample of X_train, used to keep SHAP fast
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train_sample)
shap.force_plot(explainer.expected_value[1], # Note that we use the expected value of class [1] (defaulter) as a base value
                shap_values[1][j, :], # So we also want the SHAP values for class [1]
                X_train_sample.iloc[j, :], 
                matplotlib=True,
                text_rotation=20)

Explainable Boosting Machine

https://www.youtube.com/watch?v=MREiHgHgl0k&ab_channel=MicrosoftDeveloper

The Big Comparison

| Feature | LIME (Local Interpretable Model-agnostic Explanations) | SHAP (SHapley Additive Explanations) | Counterfactual Explanations | Permutation Feature Importance (PFI) | Partial Dependence Plots (PDP) |
|---|---|---|---|---|---|
| Type of Explanation | Local (Instance-level) | Local & Global | Instance-level | Global | Global |
| How It Works | Approximates the model locally with a simpler interpretable model (e.g., linear model). | Uses Shapley values from cooperative game theory to fairly assign importance to features. | Finds alternative feature values that would change the model's decision. | Measures feature importance by randomly shuffling feature values and observing the decrease in model performance when a feature's values are permuted. | Shows the marginal effect of a feature by averaging predictions over a range of values. |
| Year | 2016 | 2017 | 2017 | 2001 | 2001 |
| Computational Cost | Medium | High | High | Low | Medium |
| Model-Agnostic? | Yes | Yes | Yes | Yes | Yes |
| Handles Feature Interactions? | No | Yes | No | No | Yes |
| Interpretability | Easy to understand locally | More complex but theoretically sound | Easy to understand for individuals | Intuitive | Intuitive for continuous features |
| Use Case Example | Explaining a single instance prediction in a black-box model | Fair feature attribution across many samples | Helping users understand why a model made a decision and what could change it | Understanding which features contribute most to the model's performance | Visualizing how one feature impacts the model's output |
| Main Limitation | May be unstable (different explanations for small perturbations) | Computationally expensive for large datasets | Can be unrealistic (some counterfactuals may be impossible in real life) | Assumes independent features (ignores interactions) | Only works well for low-dimensional continuous features |
| Surrogate Model? | Yes. LIME explicitly builds a new local model for each instance explained. | No (uses the original model). It directly computes feature attributions using the original model's output. However: internally, the KernelSHAP algorithm does fit a weighted linear model to estimate Shapley values, but this is an internal step. The final explanation is just the set of feature contributions, not a separate model presented to the user. | No. The original model is used to evaluate outcomes. Counterfactual methods search for a new input instance that yields the desired prediction, often through optimization, without approximating the model itself. | No. We simply shuffle feature columns and observe the effect on the model's predictions. | No. It simply queries the model at many points in the feature space and averages the results. |
| Feature | LIME (Local Interpretable Model-agnostic Explanations) | SHAP (SHapley Additive exPlanations) | SHAPley Values | Counterfactual Explanations | Permutation Feature Importance | PDP (Partial Dependence Plots) |
|---|---|---|---|---|---|---|
| Scope | | | | | | |
| Methodology | | | | Generates alternative scenarios to understand how changes in features would affect the prediction. | | Visualizes the average effect of a feature on the predicted outcome. |
| Model Agnostic | Yes | Yes | Yes | Yes | Yes | Yes |
| Output | Feature importance for a single prediction. | Feature importance for individual predictions and overall model behavior. | Feature importance for individual predictions and overall model behavior. | Alternative feature values that would change the prediction. | Global feature importance ranking. | Visual representation of the average feature effect. |
| Stability | Can be unstable due to random sampling. | More stable, based on a solid theoretical foundation. | More stable, based on a solid theoretical foundation. | Can be sensitive to the method used to generate counterfactuals. | Stable, but can be impacted by correlated features. | Stable, showing average effects. |
| Computational Cost | Relatively low. | Can be computationally expensive, especially for complex models. | Can be computationally expensive, especially for complex models. | Varies depending on the method. | Moderate. | Moderate. |
| Use Cases | Explaining individual predictions, particularly in simpler models. | Understanding feature contributions in complex models, both locally and globally. | Understanding feature contributions in complex models, both locally and globally. | Understanding "what-if" scenarios, especially in critical applications. | Identifying the most important features in a model. | Visualizing and understanding the average relationship between a feature and the prediction. |
| Key Differences | Local explainer, with possible instability. | Provides consistent local and global explanations. | The mathematical foundation of SHAP. | Provides alternate realities. | Provides a global ranking of feature importance based on model performance reduction. | Shows the average change in prediction based on a single feature's value. |
| Technique | Main Approach | Scope | Advantages | Limitations | Use Cases |
|---|---|---|---|---|---|
| LIME (Local Interpretable Model-agnostic Explanations) | Creates a simple, interpretable model that approximates the complex model locally | Local (individual predictions) | - Model-agnostic<br>- Intuitive explanations<br>- Works with any data type | - Unstable explanations<br>- Sensitive to sampling parameters<br>- Explanations limited to local region | - Explaining individual predictions<br>- Text, image, or tabular data |
| SHAP (SHapley Additive exPlanations) | Assigns each feature an importance value based on game theory | Both local and global | - Consistent theoretical foundation<br>- Combines advantages of other methods<br>- Fair attribution of feature importance | - Computationally expensive<br>- Can be slow for complex models<br>- Assumes feature independence | - When consistency and theoretical guarantees are needed<br>- High-stakes decisions requiring rigorous explanations |
| Shapley Values | Measures feature contribution based on game theory concepts | Both local and global | - Mathematically sound<br>- Fair attribution of importance<br>- Accounts for feature interactions | - Extremely computationally expensive<br>- Requires many model evaluations | - Critical applications where exactness matters more than speed |
| Counterfactual Explanations | Identifies minimal changes needed to get a different prediction | Local | - Intuitive "what-if" scenarios<br>- Actionable insights<br>- Doesn't require model internals | - Finding optimal counterfactuals is challenging<br>- May produce unrealistic examples | - Customer-facing explanations<br>- Actionable feedback<br>- Regulatory compliance |
| Permutation Feature Importance | Measures prediction error after shuffling feature values | Global | - Simple to implement<br>- Model-agnostic<br>- Accounts for interactions | - Can be misleading with correlated features<br>- Requires many model evaluations (though no retraining) | - Model development<br>- Feature selection<br>- Understanding overall model behavior |
| PDP (Partial Dependence Plots) | Shows relationship between features and predictions after marginalizing other features | Global | - Visualizes feature effects<br>- Shows non-linear relationships<br>- Relatively simple to implement | - Assumes feature independence<br>- Can be misleading with correlated features<br>- Limited to 1-2 features at a time | - Understanding feature relationships<br>- Detecting non-linear effects<br>- Communicating model behavior |

Feature Selection