Research

My research is on the normative design of general-purpose AI agents: what objectives should generalist AI agents pursue, how can we evaluate their success, and what does it mean for an agent to be “aligned” with humanity? My work draws on language modeling, reinforcement learning, decision theory, social choice, and causal modeling.

My PhD thesis was on Leveraging Structure to Represent Tasks in Sequential Decision Making (University of Toronto, 2024).

Research statement (2024)

Research questions

Normative Goal Design

The axiomatic approach starts with a set of simple properties and derives powerful conclusions about the agents that satisfy them; e.g., that a “rational” agent is an expected-utility maximizer, or that an agent serving multiple principals must be able to compare and aggregate their utilities. A core focus of my work has been to apply and extend normatively satisfying results from decision theory and social choice to reinforcement-learning agents, where the existence of “rewards” specifying general-purpose objectives had previously been assumed without justification.

  1. Rationalizing Boltzmann Rationality: An Axiomatic Characterization of Entropy-Regularized Policies
    Silviu Pitis
    An axiomatic characterization of the soft Bellman equation: separating environmental chance from agent choice reconciles entropy bonuses with expected utility, and independence-style axioms at decision nodes pin down the softmax form.
  2. Consistent Aggregation of Objectives with Diverse Time Preferences Requires Non-Markovian Rewards
    Silviu Pitis
    When an agent’s principals discount the future at different rates, no Markovian reward can faithfully aggregate their objectives; we derive a practical approach to the resulting non-Markovian reward aggregation.
  3. Objective Social Choice: Using Auxiliary Information to Improve Voting Outcomes
    Silviu Pitis and Michael R. Zhang
    When voters are noisy reflections of an underlying ground truth, and their votes are independent but not identically distributed, we show how auxiliary information can improve vote aggregation.
  4. Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
    Silviu Pitis
    Rationality axioms imply a more expressive RL objective and reward structure than the default fixed-discount MDP return, including state-action-dependent, but not (s, a, s’)-dependent, discounting.
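
As a rough illustration of the intuition behind item 2 above (a toy sketch, not the construction from the paper), consider two principals who discount exponentially at different rates. The weights, discount factors, and function names below are hypothetical and chosen only for illustration. The weight their aggregate places on a reward t steps ahead is a mixture of exponentials, and the one-step ratio of that mixture changes with t, so no single fixed discount factor reproduces it.

```python
def aggregate_discount(weights, gammas, t):
    """Weight the aggregate places on a reward received t steps ahead.

    Assumes each principal i discounts exponentially with factor gammas[i]
    and that the aggregate is a weighted sum of their discounted values.
    """
    return sum(w * g ** t for w, g in zip(weights, gammas))


def one_step_ratios(weights, gammas, horizon):
    """Effective one-step discount d(t + 1) / d(t) for t = 0, ..., horizon - 1."""
    return [
        aggregate_discount(weights, gammas, t + 1)
        / aggregate_discount(weights, gammas, t)
        for t in range(horizon)
    ]


# Hypothetical example: two equally weighted principals, one patient, one impatient.
weights = [0.5, 0.5]
gammas = [0.9, 0.5]
ratios = one_step_ratios(weights, gammas, 5)

# A single principal, for comparison: the ratio is the same at every step.
single = one_step_ratios([1.0], [0.9], 5)

if __name__ == "__main__":
    print("two principals:", [round(r, 3) for r in ratios])
    print("one principal: ", [round(r, 3) for r in single])
```

In this toy setting the effective one-step discount starts at the weighted average of the two factors and drifts toward the more patient principal's factor as t grows, so the aggregate's trade-off between consecutive rewards depends on how much time has elapsed rather than on the current state alone.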

Language-Based Specification & Evaluation

General-purpose language models give AI agents a powerful, human-compatible interface for specifying and interpreting goals, but natural language is inherently underspecified, which can lead to incomplete instructions, disagreement between principals, and misunderstandings on the part of agents. How can deployers express their requirements for AI agents in an author-legible way, how can those requirements be evaluated or enforced at runtime, and how should we evaluate whether current alignment and verification methods are measuring the right things?

  1. Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
    Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar
    VivaBench: a multi-turn benchmark of 1762 physician-curated clinical vignettes that exposes brittle sequential reasoning in medical LMs.
  2. Improving Context-Aware Preference Modeling for Language Models
    Silviu Pitis, Ziang Xiao, Nicolas Le Roux, and Alessandro Sordoni
    Splits reward modeling into context selection and context-conditioned preference, and shows that this can increase annotator agreement. Constructs a reasonable-preference-reversal dataset for training context-aware preference and reward models.
  3. Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
    Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, and Michael R. Zhang
    Using language models to write fine-grained qualitative report cards of a model’s strengths and weaknesses.
  4. Identifying the Risks of LM Agents with an LM-Emulated Sandbox
    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto
    An LM-emulated sandbox that surfaces the long-tailed failure modes of tool-using LM agents.
  5. Large Language Models Are Human-Level Prompt Engineers
    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba
    Language models can self-generate effective instructions given enough coverage of the target behavior.
  6. Failure Modes of Learning Reward Models for LLMs and Other Sequence Models
    Silviu Pitis
    Surveys how learned reward models for LLMs fail (model misspecification, ambiguous preferences, and unidentifiable rewards) and what is needed to fix each.

Structured Generalization in RL

Long-horizon tasks can be broken into smaller, more tractable sub-parts, and the structure of an agent’s environment tells us how. A causal approach decomposes a task into subprocesses that are sufficiently independent that reasoning about them separately improves sample efficiency and enables out-of-distribution generalization. A geometric approach instead embeds tasks in a space whose structure lets an agent reason about subgoals and about the frontier of its own knowledge. This is the empirical and methodological strand of my work.

  1. MoCoDA: Model-based Counterfactual Data Augmentation
    Silviu Pitis, Elliot Creager, Ajay Mandlekar, and Animesh Garg
    Extends the local causal model framework to model-based RL, enabling generalization to unseen states.
  2. Counterfactual Data Augmentation using Locally Factored Dynamics
    Silviu Pitis, Elliot Creager, and Animesh Garg
    A local causal model and counterfactual data augmentation that more than doubles RL sample efficiency.
  3. Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning
    Silviu Pitis*, Harris Chan*, Stephen Zhao, Bradly Stadie, and Jimmy Ba
    Sets intrinsic goals on the frontier of the achieved-goal distribution to explore long-horizon multi-goal tasks.
  4. An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality
    Silviu Pitis*, Harris Chan*, Kiarash Jamali, and Jimmy Ba
    Neural architectures guaranteed to respect the triangle inequality, for asymmetric metric learning.
  5. ProtoGE: Prototype Goal Encodings for Multi-goal Reinforcement Learning
    Silviu Pitis*, Harris Chan*, and Jimmy Ba
    Prototype goal encodings use a finer goal topology to solve coarse multi-goal tasks more efficiently.

Other work

  1. Optimizing a Margin of Safety via Prompt Repair for Large Language Models
    Jessica Tang, Silviu Pitis, Sheila McIlraith · Workshop 2026 · PDF
  2. Canonical Design for Language Agents using Natural Language Reward Models
    Silviu Pitis, Ziang Xiao, Alessandro Sordoni · Workshop 2023 · PDF
  3. Calibrating Language Models via Augmented Prompt Ensembles
    Mingjian Jiang, Yangjun Ruan, Sicong Huang, Saifei Liao, Silviu Pitis, Roger Grosse, Jimmy Ba · Workshop 2023 · Link
  4. Return Augmentation Gives Supervised RL Temporal Compositionality
    Keiran Paster, Silviu Pitis, Sheila McIlraith, Jimmy Ba · Workshop 2022 · Link
  5. Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning
    Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves · AAAI 2020 · arXiv
  6. Source Traces for Temporal Difference Learning
    Silviu Pitis · AAAI 2018 · arXiv
  7. Reasoning for Reinforcement Learning
    Silviu Pitis · Workshop 2017 · PDF
  8. Methods for Retrieving Alternative Contract Language Using a Prototype
    Silviu Pitis · ICAIL 2017 · PDF
  9. Designing Optimal Takeover Defenses
    Silviu Pitis · 2013 · PDF
  10. Examining Expected Utility Theory from Descriptive and Prescriptive Perspectives
    Silviu Pitis · 2010 · PDF