Research Questions

A selection of questions I find interesting.

What objectives should a generalist agent pursue when the humans it serves are uncertain and inconsistent, and disagree with each other? Alignment work often treats the empirical human as the target, but empirical humans are deeply flawed. Axiomatic decision theory and social choice can help us reason about “ideal” humans, but they remain largely absent from the standard alignment pipeline.

What does it mean to aggregate principals who disagree about facts, not just values? Social choice mostly assumes agreement on the state of the world; subjective expected utility mostly assumes a single decision-maker. How should disagreement about facts propagate into the aggregate objective?

Relatedly, what causes the preferences people actually express, and how does that causal structure constrain generalization across people?

How should rewards and time preferences be structured for agents that operate over long horizons? How much should an agent respect precedent, and when should past commitments give way to the preferences of future principals?

When is natural language enough to specify a goal, and when does it stop being enough? Natural language is underspecified by construction, and many disagreements about LM behavior are really disagreements about context that nobody wrote down. Which goals can be pinned down by language given rich enough context, and which need a formal object underneath?

How do you evaluate an agent whose space of possible behaviors is essentially unbounded? Benchmarks miss the long tail; adversarial testing finds failures but not their structure. How do you combine the two into an evaluation rigorous enough to drive deployment decisions?

And more recently:

As AI shifts value creation from labor to capital, how do you rebuild taxation and redistribution around capital, and what fills the role that employment plays in leading a purposeful life?

Interesting Papers

A selection of papers I like.