Most statistical learning theory settings (such as PAC learning and online learning) do not provide good enough bounds in realistic high-stakes settings, where the test data does not always look at the training data and a single mistaken prediction can cause catastrophe. It is important to design predictors that can either reliably predict verifiable facts (such as human behavior over a short time scale) or indicate when their predictions might be unreliable.
Imagine that we have a powerful (but not especially reliable) prediction system for predicting human answers to binary questions. We would like to use it to predict human answers whenever we expect these predictions to be reliable, and avoid outputting predictions wherever we expect the predictions to be unreliable (so we can gather training data instead).
In [active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)), the learner decides which questions to query the human about. For example, it may query the human about the most ambiguous questions, which provide the most information about how the human answers questions. Unfortunately, this may result in asking the human “weird” questions rather than actually informative ones (this problem is documented in this literature survey). There are some strategies for reducing this problem, such as only asking questions sampled from some realistic generative model for questions.
KWIK (“knows what it knows”) learning
In KWIK learning (a variant of online selective sampling, which is itself a hybrid of [active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)) and online learning), a learner sees an arbitrary sequence of questions. The learner has some class of hypotheses for predicting the answers to questions, one of which is good (efficient relative to the learner). For each question, the learner may either output a prediction or ⊥ . If the learner outputs a prediction, the prediction must be within ε of a good prediction. If the learner outputs ⊥ , then it receives the answer to the question.
This is easy to imagine in the case where there are 100 experts, one of which outputs predictions that are efficient relative to the other experts. Upon receiving a question, the learner asks each expert for its prediction of the answer to the question (as a probability). If all predictions by experts who have done well so far are within ε of each other, then the learner outputs one of these predictions. Otherwise, the learner outputs ⊥, sees the human’s answer to the question, and rewards/penalizes the experts according to their predictions. Eventually, all experts either output good predictions or get penalized for outputting enough bad predictions over time.
Unfortunately, it is hard to show KWIK-learnability for hypothesis classes more complicated than a small finite set of experts or a class of linear predictors.
- AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.