Research Questions
A selection of questions I find interesting.
What objectives should a generalist agent pursue when the humans it serves are uncertain, inconsistent, and disagree with each other? Alignment work often treats the empirical human as the target, but empirical humans are deeply flawed. Axiomatic decision theory and social choice can help us reason about “ideal” humans, but they remain largely absent from the standard alignment pipeline.
What does it mean to aggregate principals who disagree about facts, not just values? Social choice mostly assumes agreement on the state of the world; subjective expected utility mostly assumes a single decision-maker. How should disagreement about facts propagate into the aggregate objective?
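To make the difficulty concrete, here is a minimal toy sketch. Every name and number in it is made up for illustration, and the two aggregation rules shown are only two of many one could try. Two principals hold opposite beliefs and opposite stakes in a bet, yet both prefer it to the status quo. Pooling beliefs and summing utilities rejects the bet, while summing each principal's own expected utility accepts it, on the strength of a unanimity that rests on contradictory beliefs.

```python
# Toy illustration only: the states, beliefs, and utilities are invented for this sketch.
STATES = ("s1", "s2")

# Each principal has subjective beliefs over the states and a utility for the bet
# in each state. The status quo is normalized to utility 0 everywhere for everyone.
principals = {
    "A": {"belief": {"s1": 0.9, "s2": 0.1}, "bet_utility": {"s1": 1.0, "s2": -1.2}},
    "B": {"belief": {"s1": 0.1, "s2": 0.9}, "bet_utility": {"s1": -1.2, "s2": 1.0}},
}


def expected_utility(belief, utility):
    """Expected utility of an act under a belief over STATES."""
    return sum(belief[s] * utility[s] for s in STATES)


# 1) Each principal evaluates the bet with their own beliefs and utilities.
own_eu = {
    name: expected_utility(p["belief"], p["bet_utility"])
    for name, p in principals.items()
}

# 2) One candidate aggregate: average the beliefs, sum the utilities, then
#    evaluate the bet as a single Bayesian decision-maker would.
n = len(principals)
pooled_belief = {
    s: sum(p["belief"][s] for p in principals.values()) / n for s in STATES
}
summed_utility = {
    s: sum(p["bet_utility"][s] for p in principals.values()) for s in STATES
}
pooled_eu = expected_utility(pooled_belief, summed_utility)

# 3) Another candidate: sum each principal's own expected utility.
sum_of_own_eu = sum(own_eu.values())

print("own expected utilities:", own_eu)       # both positive: each prefers the bet
print("pooled-belief aggregate:", pooled_eu)   # negative: the aggregate rejects the bet
print("sum of own expected utilities:", sum_of_own_eu)  # positive: accepts the bet
```

Neither rule is obviously right: the first overrides a unanimous preference, and the second endorses a bet that at least one principal must be wrong about.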
Relatedly, what causes the preferences people actually express, and how does that causal structure constrain generalization across people?
How should rewards and time preferences be structured for agents that operate over long horizons? How much should an agent respect precedent, and when should past commitments give way to the preferences of future principals?
When is natural language enough to specify a goal, and when does it stop being enough? Natural language is underspecified by construction, and many disagreements about LM behavior are really disagreements about context that nobody wrote down. Which goals can be pinned down by language given rich enough context, and which need a formal object underneath?
How do you evaluate an agent whose space of possible behaviors is essentially unbounded? Benchmarks miss the long tail; adversarial testing finds failures but not their structure. How do you combine the two into evaluation rigorous enough to drive deployment decisions?
And more recently:
As AI shifts value creation from labor to capital, how do you rebuild taxation and redistribution around capital, and what fills the role that employment plays in leading a purposeful life?
Interesting Papers
A selection of papers I like.
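- Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility (Harsanyi, 1955). If individual and social preferences both satisfy the von Neumann-Morgenstern (VNM) axioms and the social preference respects Pareto, then social utility can be represented as a weighted sum of individual utilities.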
- Collective Choice and Social Welfare (Sen, 1970). The foundations of social choice theory, including some nice discussions about the value of axiomatic work.
- Dynamic Consistency and Non-Expected Utility Models of Choice Under Uncertainty (Machina, 1989). Abandoning the independence axiom forces a choice between dynamic consistency and consequentialism.
- Algorithms for Inverse Reinforcement Learning (Ng & Russell, 2000). The canonical “learn the reward from behavior” formulation.
- Probability Theory: The Logic of Science (Jaynes, 2003). A nice discussion deriving (Bayesian) probability as a system of extended logic.
- Thinking, Fast and Slow (Kahneman, 2011). An accessible synthesis of behavioral economics and Kahneman & Tversky’s work.
- Horde (Sutton et al., 2011) and Universal Value Function Approximators (Schaul et al., 2015). On representing multiple goals, subgoals, and more general measurements using “general” value functions.
- Occam’s Razor Is Insufficient to Infer the Preferences of Irrational Agents (Armstrong & Mindermann, 2018). You cannot recover the preferences of a boundedly rational agent from its behavior alone; you need additional assumptions about how it is irrational.
- Reward-Rational (Implicit) Choice (Jeon, Milli & Dragan, 2020). A nice framework that unifies reward inference from demonstrations, comparisons, corrections, and other feedback modalities.
- On the Expressivity of Markov Reward (Abel et al., 2021) and On the Limitations of Markovian Rewards (Skalse & Abate, 2024). There exist tasks that no scalar Markov reward can capture, including multi-objective and risk-sensitive tasks; these results sharpen the case that the standard MDP formulation is too narrow for general-purpose agents.
- Constitutional AI (Bai et al., 2022). Alignment via written principles an agent can apply to itself.
- Beyond Preferences in AI Alignment (Tan Zhi-Xuan et al., 2024). The case that the preference-maximization framing is itself a constraint on alignment.
- AI Can Help Humans Find Common Ground in Democratic Deliberation (Tessler et al., 2024). The “Habermas Machine”: AI-facilitated deliberation that finds consensus statements among diverse participants.
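
For reference, the aggregation result in the Harsanyi entry above can be written out roughly as follows. This is a sketch of one common formulation, not a quotation from the paper, and the notation is assumed here rather than taken from the list.

```latex
% Notation is assumed for illustration:
% u_1, ..., u_n are individual VNM utilities and U is the social VNM utility,
% all defined on the same set of lotteries \mathcal{L}. Requires amsmath.
% Pareto indifference:
\[
\Big( u_i(p) = u_i(q) \ \text{for all } i \Big) \implies U(p) = U(q)
\qquad \text{for all } p, q \in \mathcal{L}.
\]
% Conclusion: there exist weights a_1, ..., a_n and a constant b such that
\[
U(p) = \sum_{i=1}^{n} a_i \, u_i(p) + b
\qquad \text{for all } p \in \mathcal{L}.
\]
```

Pareto indifference alone does not fix the signs of the weights; stronger Pareto conditions are needed to constrain them.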