Research | Silviu Pitis

My research is on the normative design of general-purpose AI agents: what objectives should generalist AI agents pursue, how can we evaluate their success, and what does it mean for an agent to be “aligned” with humanity? My work draws on language modeling, reinforcement learning, decision theory, social choice, and causal modeling.

My PhD thesis was on Leveraging Structure to Represent Tasks in Sequential Decision Making (University of Toronto, 2024).

Research statement (2024) · Research questions

Normative Goal Design

The axiomatic approach starts with a set of simple properties and derives powerful conclusions about the agents that satisfy them; e.g., that a “rational” agent is an expected-utility maximizer, or that an agent serving multiple principals must be able to compare and aggregate their utilities. A core focus of my work has been to apply and extend normatively satisfying results from decision theory and social choice to reinforcement-learning agents, where the existence of “rewards” specifying general-purpose objectives had previously been assumed without justification.

Rationalizing Boltzmann Rationality: An Axiomatic Characterization of Entropy-Regularized Policies

Silviu Pitis

An axiomatic characterization of the soft Bellman equation: separating environmental chance from agent choice reconciles entropy bonuses with expected utility, and independence-style axioms at decision nodes pin down the softmax form.

RLC 2026
Consistent Aggregation of Objectives with Diverse Time Preferences Requires Non-Markovian Rewards

Silviu Pitis

When an agent’s principals discount the future at different rates, no Markovian reward can faithfully aggregate their objectives; we derive a practical approach to the resulting non-Markovian reward aggregation.

NeurIPS 2023 arXiv

From a set of intuitively appealing axioms, I show that Markovian aggregation of Markovian reward functions is not possible when the time preference for each objective may vary. It follows that optimal multi-objective agents must admit rewards that are non-Markovian with respect to the individual objectives. This work offers new insights into sequential, multi-objective agency and intertemporal choice, and has practical implications for the design of AI systems deployed to serve multiple generations of principals with varying time preferences.
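
One way to see the intuition, as a minimal sketch rather than the paper's construction: if two principals value the same reward stream but discount at different rates, an equally weighted sum of their discounted utilities places weight γ1^t + γ2^t on the reward at step t. The ratio of successive weights then changes with t, so no single fixed discount factor reproduces it. The discount values and the helper name `aggregate_weights` are illustrative assumptions.

```python
def aggregate_weights(gammas, horizon):
    """Weight the summed objective places on the reward at each step t."""
    return [sum(g ** t for g in gammas) for t in range(horizon)]

gammas = [0.5, 0.9]  # hypothetical discount factors for two principals
weights = aggregate_weights(gammas, horizon=6)

# Effective one-step discount implied by the aggregate at each step.
effective = [weights[t + 1] / weights[t] for t in range(len(weights) - 1)]

for t, g in enumerate(effective):
    print(f"t={t}: effective discount {g:.3f}")
```

In this toy case the effective discount starts at 0.7 and rises toward 0.9, so it depends on how much time has elapsed, which is information the original state does not carry.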
Objective Social Choice: Using Auxiliary Information to Improve Voting Outcomes

Silviu Pitis and Michael R. Zhang

When voters are noisy reflections of an underlying ground truth, whose votes are independent but not identically distributed, we show how auxiliary information can improve vote aggregation.

AAMAS 2020 arXiv HTML Code

How should one combine noisy information from diverse sources to make an inference about an objective ground truth? Past studies typically assume that noisy votes are identically and independently distributed (i.i.d.), but this assumption is often unrealistic. Instead, we assume that votes are independent but not necessarily identically distributed and that our ensembling algorithm has access to certain auxiliary information related to the underlying model governing the noise in each vote.
Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

Silviu Pitis

Rationality axioms imply a more expressive RL objective and reward structure than the default fixed-discount MDP return, including state-action-dependent, but not (s, a, s’)-dependent, discounting.

AAAI 2019 arXiv Poster Slides

Can all “rational” preference structures be represented using the standard RL model (the MDP)? This paper presents a minimal axiomatic framework for rationality in sequential decision making and shows that the implied cardinal utility function is of a more general form than the discounted additive utility function of an MDP. In particular, the framework allows for a state-action-dependent “discount” factor that is not constrained to be less than 1, so long as there is eventual long-run discounting.
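
A minimal sketch of what such a recursion can look like, assuming a tiny deterministic chain with made-up rewards and discounts: the value is computed as V(s) = r(s, a) + γ(s, a) V(s'), where the discount depends on the state-action pair and may exceed 1 at an individual step. The chain, its numbers, and the function name `value` are hypothetical.

```python
# Hypothetical 3-step chain: each entry is (reward, discount, next_state).
# The discount on the first step exceeds 1; later steps discount below 1.
chain = {
    0: (1.0, 1.2, 1),
    1: (2.0, 0.5, 2),
    2: (4.0, 0.0, None),  # terminal step: nothing follows
}

def value(state):
    """Evaluate V(s) = r + gamma(s, a) * V(s') along the chain."""
    if state is None:
        return 0.0
    reward, discount, next_state = chain[state]
    return reward + discount * value(next_state)

print(value(0))
```

Here the product of discounts along the path (1.2 × 0.5 = 0.6) is below 1, which is the kind of eventual discounting the abstract refers to, even though one step amplifies future value.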

Language-Based Specification & Evaluation

General-purpose language models give AI agents a powerful, human-compatible interface for specifying and interpreting goals, but natural language is inherently underspecified, which can lead to incomplete instructions, disagreement between principals, and misunderstandings on the part of agents. How can deployers express their requirements for AI agents in an author-legible way, how can those requirements be evaluated or enforced at runtime, and how should we evaluate whether current alignment and verification methods are measuring the right things?

Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar

VivaBench: a multi-turn benchmark of 1762 physician-curated clinical vignettes that exposes brittle sequential reasoning in medical LMs.

NeurIPS 2025 arXiv
Improving Context-Aware Preference Modeling for Language Models

Silviu Pitis, Ziang Xiao, Nicolas Le Roux, and Alessandro Sordoni

Splits reward modeling into context selection and context-conditioned preference, and shows that this can increase annotator agreement. Constructs a reasonable-preference-reversal dataset for training context-aware preference and reward models.

NeurIPS 2024 arXiv

We propose context-specific preference datasets and conduct experiments to investigate the potential of context-specific preference modeling.
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, and Michael R. Zhang

Using language models to write fine-grained qualitative report cards of a model’s strengths and weaknesses.

Workshop 2024 arXiv

We propose to use LMs to generate Report Cards, which are fine-grained qualitative evaluations of a model’s behaviors, including its strengths and weaknesses, with respect to specific topics or datasets.
Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto

An LM-emulated sandbox that surfaces the long-tailed failure modes of tool-using LM agents.

ICLR 2024 arXiv Code Slides Spotlight
Large Language Models Are Human-Level Prompt Engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba

Language models can self-generate effective instructions given enough coverage of the target behavior.

ICLR 2023 arXiv HTML Code
Failure Modes of Learning Reward Models for LLMs and Other Sequence Models

Silviu Pitis

Surveys how learned reward models for LLMs fail (model misspecification, ambiguous preferences, and unidentifiable rewards) and what is needed to fix each.

Workshop 2023 PDF

Structured Generalization in RL

Long-horizon tasks can be broken into smaller, more tractable sub-parts, and the structure of an agent’s environment tells us how. A causal approach decomposes a task into subprocesses that are sufficiently independent that reasoning about them separately improves sample efficiency and enables out-of-distribution generalization. A geometric approach instead embeds tasks in a space whose structure lets an agent reason about subgoals and about the frontier of its own knowledge. This is the empirical and methodological strand of my work.

MoCoDA: Model-based Counterfactual Data Augmentation

Silviu Pitis, Elliot Creager, Ajay Mandlekar, and Animesh Garg

Extends the local causal model framework to model-based RL, enabling generalization to unseen states.

NeurIPS 2022 arXiv HTML Video Code

Can RL agents generalize to new tasks with unseen states? We extend our local causal model framework to model-based RL and show that this is possible, both theoretically and empirically.
Counterfactual Data Augmentation using Locally Factored Dynamics

Silviu Pitis, Elliot Creager, and Animesh Garg

A local causal model and counterfactual data augmentation that more than doubles RL sample efficiency.

NeurIPS 2020 arXiv Video Code Poster Outstanding Paper (Object-Oriented Learning Workshop, ICML 2020)

We propose a local causal model (LCM) framework that captures the benefits of decomposition in settings where the global causal model is densely connected. We use the framework to design a local Counterfactual Data Augmentation (CoDA) algorithm that expands the available training data with counterfactual samples by stitching together locally independent subsamples from the environment. Empirically, we show that CoDA can more than double the sample efficiency and final performance of reinforcement learning agents in locally factored environments.
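
A toy sketch of the stitching idea, assuming a state that splits into two components that did not interact during either sampled transition. The data layout and the function name `stitch` are hypothetical, and the independence that is simply assumed here would in practice have to be established by a local causal model.

```python
from itertools import permutations

# Each transition maps component name -> (value_before, value_after).
# Assumption: within each transition the two components evolved independently.
t1 = {"agent": ((0, 0), (0, 1)), "block": ((5, 5), (5, 5))}
t2 = {"agent": ((3, 3), (4, 3)), "block": ((7, 7), (7, 8))}

def stitch(a, b):
    """Take the agent sub-transition from `a` and the block one from `b`."""
    return {"agent": a["agent"], "block": b["block"]}

real = [t1, t2]
counterfactual = [stitch(a, b) for a, b in permutations(real, 2)]

for sample in counterfactual:
    print(sample)
```

Two observed transitions yield two additional samples that were never observed but are consistent with the assumed local independence.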
Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning

Silviu Pitis*, Harris Chan*, Stephen Zhao, Bradly Stadie, and Jimmy Ba

Sets intrinsic goals on the frontier of the achieved-goal distribution to explore long-horizon multi-goal tasks.

ICML 2020 arXiv Video Code Best Paper (Adaptive and Learning Agents Workshop, AAMAS 2020)

What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks? Our MEGA and OMEGA agents set achievable goals in sparsely explored areas of the goal space to maximize the entropy of the historical achieved goal distribution. This lets them learn to navigate mazes and manipulate blocks with a fraction of the samples used by prior approaches.
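
A minimal sketch of the goal-selection idea, assuming a one-dimensional goal space and a plain histogram as the density estimate: among goals the agent has already achieved, pick one from the least-visited region. The bin width, the data, and the function name `select_goal` are invented for illustration, and a real agent would need a proper density estimator.

```python
from collections import Counter

def select_goal(achieved_goals, bin_width=1.0):
    """Return a previously achieved goal from the sparsest histogram bin."""
    bins = [int(g // bin_width) for g in achieved_goals]
    counts = Counter(bins)
    # Lowest count = lowest estimated density; ties go to the earliest goal.
    idx = min(range(len(achieved_goals)), key=lambda i: counts[bins[i]])
    return achieved_goals[idx]

achieved = [0.1, 0.2, 0.3, 0.4, 1.1, 1.5, 2.7]  # hypothetical achieved goals
print(select_goal(achieved))
```

Because the selected goal has been reached before, it is known to be achievable, while its low density marks it as lying on the frontier of what the agent has explored.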
An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality

Silviu Pitis*, Harris Chan*, Kiarash Jamali, and Jimmy Ba

Neural architectures guaranteed to respect the triangle inequality, for asymmetric metric learning.

ICLR 2020 arXiv HTML Code

We propose novel neural network architectures, guaranteed to satisfy the triangle inequality, for purposes of (asymmetric) metric learning and modeling graph distances.
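
As a simple illustration of how a functional form can guarantee the property by construction (this is not one of the architectures proposed in the paper), consider d(x, y) = Σ_i max(0, f_i(y) − f_i(x)) for an arbitrary feature map f. Each term satisfies max(0, c − a) ≤ max(0, b − a) + max(0, c − b), so the triangle inequality holds for any f, and the distance is asymmetric. The feature map `embed` below is a hypothetical stand-in for a learned network.

```python
import random

def embed(x):
    """Stand-in for a learned feature map; any function works here."""
    return (x[0] ** 2 - x[1], 3.0 * x[1], x[0] + x[1])

def quasi_distance(x, y):
    """Asymmetric distance: total amount by which features must increase."""
    return sum(max(0.0, fy - fx) for fx, fy in zip(embed(x), embed(y)))

random.seed(0)
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(20)]

# Check d(x, z) <= d(x, y) + d(y, z) on every triple of sampled points.
violations = sum(
    quasi_distance(x, z) > quasi_distance(x, y) + quasi_distance(y, z) + 1e-9
    for x in points for y in points for z in points
)
print(violations)
```

The check is only a sanity test on sampled points; the guarantee itself comes from the form of the distance, not from training.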
ProtoGE: Prototype Goal Encodings for Multi-goal Reinforcement Learning

Silviu Pitis*, Harris Chan*, and Jimmy Ba

Prototype goal encodings use a finer goal topology to solve coarse multi-goal tasks more efficiently.

RLDM 2019 PDF

Other work

Optimizing a Margin of Safety via Prompt Repair for Large Language Models
Jessica Tang, Silviu Pitis, Sheila McIlraith · Workshop 2026 · PDF
Canonical Design for Language Agents using Natural Language Reward Models
Silviu Pitis, Ziang Xiao, Alessandro Sordoni · Workshop 2023 · PDF
Calibrating Language Models via Augmented Prompt Ensembles
Mingjian Jiang, Yangjun Ruan, Sicong Huang, Saifei Liao, Silviu Pitis, Roger Grosse, Jimmy Ba · Workshop 2023 · Link
Return Augmentation Gives Supervised RL Temporal Compositionality
Keiran Paster, Silviu Pitis, Sheila McIlraith, Jimmy Ba · Workshop 2022 · Link
Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning
Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves · AAAI 2020 · arXiv
Source Traces for Temporal Difference Learning
Silviu Pitis · AAAI 2018 · arXiv
Reasoning for Reinforcement Learning
Silviu Pitis · Workshop 2017 · PDF
Methods for Retrieving Alternative Contract Language Using a Prototype
Silviu Pitis · ICAIL 2017 · PDF
Designing Optimal Takeover Defenses
Silviu Pitis · 2013 · PDF
Examining Expected Utility Theory from Descriptive and Prescriptive Perspectives
Silviu Pitis · 2010 · PDF