⚠️ These are raw notes from an old internal debate on the utility of interpretability. If some thoughts are outdated (or wrong), please reach out.
Consider my credit score. It is apparently quite a big deal in getting access to resources like credit, apartments, loans, fee waivers, etc. For over a year, my score was terribly low, and without precise reasons, I would have been a frustrated customer. In the big picture of science too, our real goal is often to find causal associations. Interpretability, then, is the "informative" bridge between the correlative judgements of a machine learning model and those causal associations.
Part of the "charm" of interpretability comes from the fact that the average human touts being able to justify decisions post-hoc. It is out of the urge to manage social interactions that we seek interpretability1. To really argue for an interpretable machine, it is imperative to clarify (i) when we care about interpretability, and (ii) what it really means to be interpretable.
Interpretability serves those (high-stakes) objectives that we deem important but struggle to model formally.2
Examples of such objectives are safety, fairness, ethics, legality, reliability, robustness, trust, etc. It remains almost impossible to quantify such notions, whose meaning varies from the individual to the group level. Decisions that have consequences for society demand accountability. We may not always be able to enumerate all possible scenarios (e.g. autonomous cars), and need a fallback. Interpretability is that fallback.
Nosedive, a Black Mirror episode, is a dark take on a world where socioeconomic status is decided by ratings. Lacie Pounds, the protagonist, wants to raise her rating, and she can attempt this only because there is an explainable mechanism behind the ratings, just enough to translate into actionable advice.
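This "actionable advice" framing corresponds to counterfactual explanations: the smallest change to an input that would flip the model's decision. A minimal sketch for a hypothetical linear scoring rule (the feature names, weights, and threshold are all made up for illustration):

```python
# Counterfactual explanation for a toy linear score (all numbers hypothetical).
# The score is a weighted sum of features; the decision flips at a threshold.

WEIGHTS = {"on_time_payments": 4.0, "utilization": -2.5, "account_age_years": 1.0}
THRESHOLD = 20.0

def score(features):
    return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)

def counterfactual(features, feature):
    """Smallest change to one feature that pushes the score up to the threshold."""
    gap = THRESHOLD - score(features)
    if gap <= 0:
        return 0.0  # already above the threshold, no change needed
    return gap / WEIGHTS[feature]  # linear model: solve along this one feature

applicant = {"on_time_payments": 3, "utilization": 0.8, "account_age_years": 2}
print(score(applicant))  # 12.0: below the threshold, so the request is denied
print(counterfactual(applicant, "on_time_payments"))  # 2.0: two more on-time payments would flip it
```

For a linear model the counterfactual is a one-line division; for black-box models it requires search, which is part of why opaque scoring systems rarely yield such advice.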
There are at least four entities whose interpretability we could discuss:3 the data, the algorithm, the model found by the algorithm, and the decisions made by the model.
Interpretability for humans is often a post-hoc "description" by example. But with machines, we have a chance at a broader definition - in terms of simulatability, decomposability, and algorithmic transparency.
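To make the first of those terms concrete (a toy sketch, not a formal definition): a model is simulatable if a human can step through it by hand and reproduce its decision. A three-rule scorer like the hypothetical one below is; a million-parameter network is not:

```python
# A simulatable model: a human can trace every decision by reading the rules.
# The rules and thresholds are invented for illustration.
def approve_loan(income, debt, defaults):
    if defaults > 0:           # rule 1: any past default is disqualifying
        return False, "past default on record"
    if debt / income > 0.5:    # rule 2: debt-to-income ratio must stay below 0.5
        return False, "debt-to-income ratio above 0.5"
    return True, "meets all rules"

print(approve_loan(income=50_000, debt=30_000, defaults=0))
# prints (False, 'debt-to-income ratio above 0.5'), since 30_000 / 50_000 = 0.6
```

The same toy also illustrates the other two notions: each rule carries a standalone meaning (decomposability), and the full decision procedure is written down rather than learned opaquely (algorithmic transparency).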
The recent exposition on Foundation Models4 heightens the need for interpretability. Interpretability is no longer limited to understanding the internals of a model; it extends to understanding the model's capabilities. With wider penetration into application domains, it will be critical to understand the mechanisms behind the building blocks of decision making via such models.
In a world where Codex writes a significant chunk of our code, we are on legally shaky ground when it comes to accountability. Interpretability will again become the fallback tool for making legal assessments. Post-hoc interpretations run the risk of being faulty too, but at least having a human-in-the-loop can mitigate obvious risks. Who holds the moral agency for algorithmic misgivings?
Interpretability will be the bridge between researchers and regulatory agencies. If we as a community are to be smart at all about getting through legislation and continuing to build on the success of ML so far, this is going to be an important step. Explainability is now even a legal requirement in the EU.
The grand goal of machine learning, and of AI in general, will always be abstraction and reasoning5. Existing machine learning systems are simply not designed to do that. Interpretability research is again probably the proxy we employ to understand what existing systems are capable of reasoning about.
Suppose you have cancer and you have to choose between a black box AI surgeon that cannot explain how it works but has a 90% cure rate and a human surgeon with an 80% cure rate. Do you want the AI surgeon to be illegal? - Geoffrey Hinton
On the face of it, this tweet is careless. It leaves out a lot of detail and nuance, but that is precisely why it was polarizing. The Great AI Debate at NIPS 2017 took up a similar premise.
See also: Trustworthy ML - Resources.