Silviu Pitis

PhD Student, Machine Learning
University of Toronto
Vector Institute


I am a PhD student with the University of Toronto Machine Learning Group and Vector Institute working with Jimmy Ba. My research focuses on the design of goals, rewards, and abstractions for intelligent agents.

My research is funded by an NSERC CGS-D award and a Vector Research Grant. Previously, I was funded by OGS and UofT FAST scholarships.

I completed my master’s in computer science at Georgia Tech. Before that, I was a lawyer at Kirkland & Ellis in New York, where I worked on large corporate transactions. Before becoming a lawyer, I was a fairly successful online poker player.

I received my J.D. in 2014 from Harvard Law School, where I was a fellow at the Olin Center for Law, Economics, and Business. My undergrad was in finance and economics at the Schulich School of Business in Toronto.


My ultimate research interest lies in the normative design of general-purpose artificial agency: how should we design AIs that solve general tasks and contribute positively to society?

For summaries of my current and past research interests, you’re welcome to browse the research agendas below.

Or check out my papers below. If we share research interests or you have an idea you’d like to collaborate on, I’d be excited to talk to you!


Rational Multi-Objective Agents Must Admit Non-Markov Reward Representations

Silviu Pitis, Duncan Bailey, Jimmy Ba.

Return Augmentation Gives Supervised RL Temporal Compositionality

Keiran Paster, Silviu Pitis, Sheila McIlraith, Jimmy Ba.

Large Language Models Are Human-Level Prompt Engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba. (Arxiv, Website)

MoCoDA: Model-based Counterfactual Data Augmentation

Silviu Pitis, Elliot Creager, Ajay Mandlekar, Animesh Garg. NeurIPS 2022. (Arxiv, Website)

Can RL agents generalize to new tasks with unseen states? Read the Twitter thread to find out.

Counterfactual Data Augmentation using Locally Factored Dynamics

Silviu Pitis, Elliot Creager, Animesh Garg. NeurIPS 2020. Object-Oriented Learning Workshop at ICML 2020 (Outstanding Paper). (Arxiv, Talk, Code, Poster, OOL Workshop)

In this paper we proposed a local causal model (LCM) framework that captures the benefits of decomposition in settings where the global causal model is densely connected. We used our framework to design a local Counterfactual Data Augmentation (CoDA) algorithm that expands available training data with counterfactual samples by stitching together locally independent subsamples from the environment. Empirically, we showed that CoDA can more than double the sample efficiency and final performance of reinforcement learning agents in locally factored environments.
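The stitching idea can be sketched in a few lines (a toy illustration with two independent state components; the dynamics and the component split are invented for this example, and this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy locally factored dynamics: state = (a, b), where the two components
# evolve independently, so the local causal model is disconnected.
def step(state):
    a, b = state
    return np.array([a + rng.normal(0.0, 0.1), b + rng.normal(0.0, 0.1)])

# Collect real transitions (s, s').
real = [(s, step(s)) for s in rng.uniform(-1.0, 1.0, size=(100, 2))]

def coda_swap(t1, t2):
    """Stitch a counterfactual transition by swapping the locally
    independent 'b' component between two real transitions."""
    (s1, ns1), (s2, ns2) = t1, t2
    s_cf = np.array([s1[0], s2[1]])
    ns_cf = np.array([ns1[0], ns2[1]])
    return s_cf, ns_cf

# A counterfactual (s, s') pair consistent with the factored dynamics,
# usable as extra training data.
s_cf, ns_cf = coda_swap(real[0], real[1])
```

Because the two components never interact, the stitched transition is exactly as valid as a real one, which is what lets CoDA expand the training set for free.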

Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning

Silviu Pitis*, Harris Chan*, Stephen Zhao, Bradly Stadie, Jimmy Ba. ICML 2020. Adaptive and Learning Agents Workshop at AAMAS 2020 (Best Paper). (Arxiv, Talk, Code)

What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks? Our MEGA and OMEGA agents set achievable goals in sparsely explored areas of the goal space to maximize the entropy of the historical achieved goal distribution. This lets them learn to navigate mazes and manipulate blocks with a fraction of the samples used by prior approaches.
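The goal-selection rule can be illustrated with a small sketch (a simplification with a hand-rolled kernel density estimate; the actual agents use learned density models and more machinery):

```python
import numpy as np

rng = np.random.default_rng(0)

# Historical achieved goals (2-D): a dense cluster near the origin plus a
# few achieved goals in a sparsely explored region.
achieved = np.vstack([rng.normal(0.0, 0.2, size=(200, 2)),
                      rng.normal(3.0, 0.2, size=(5, 2))])

def kde_density(points, queries, bandwidth=0.3):
    """Simple Gaussian kernel density estimate."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

# MEGA-style rule (a sketch): among previously *achieved* goals, pick the
# one with the lowest density under the achieved-goal distribution -- an
# achievable goal at the frontier of exploration.
density = kde_density(achieved, achieved)
goal = achieved[np.argmin(density)]   # lands in the sparse region
```

Restricting candidates to previously achieved goals keeps the selected goal achievable, while minimizing its density pushes the agent toward the sparse frontier.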

An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality

Silviu Pitis*, Harris Chan*, Kiarash Jamali, Jimmy Ba. ICLR 2020. (Arxiv, OpenReview, Talk, Code)

We propose novel neural network architectures, guaranteed to satisfy the triangle inequality, for purposes of (asymmetric) metric learning and modeling graph distances.
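One way to see why such guarantees are possible (a sketch of the general principle, not the paper's Deep Norm / Wide Norm architectures, which also handle the asymmetric case): any embedding composed with a norm yields a distance that satisfies the triangle inequality by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary (here random) two-layer embedding f; it need not be special.
W1, W2 = rng.normal(size=(8, 3)), rng.normal(size=(4, 8))

def f(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# d(x, y) = ||f(x) - f(y)|| inherits the triangle inequality from the
# norm, for *any* choice of f:
#   d(x, z) <= d(x, y) + d(y, z)
def d(x, y):
    return np.linalg.norm(f(x) - f(y))

x, y, z = rng.normal(size=(3, 3))
assert d(x, z) <= d(x, y) + d(y, z) + 1e-12
```

The paper's contribution is to make the norm itself learnable (and possibly asymmetric) while preserving this guarantee.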

Objective Social Choice: Using Auxiliary Information to Improve Voting Outcomes

Silviu Pitis and Michael R. Zhang. AAMAS 2020. (Arxiv, Talk, Code)

How should one combine noisy information from diverse sources to make an inference about an objective ground truth? Past studies typically assume that noisy votes are identically and independently distributed (i.i.d.), but this assumption is often unrealistic. Instead, we assume that votes are independent but not necessarily identically distributed and that our ensembling algorithm has access to certain auxiliary information related to the underlying model governing the noise in each vote.
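A minimal illustration of why auxiliary information helps (a textbook inverse-variance sketch under Gaussian noise; the constants are invented, and this is not the paper's setting or algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 2.0

# Independent but *not* identically distributed votes: each voter has a
# different noise level, known via auxiliary information.
sigmas = np.array([0.1, 0.5, 1.0, 2.0])
votes = truth + rng.normal(0.0, sigmas)

# Naive i.i.d. assumption: unweighted mean.
naive = votes.mean()

# Using the auxiliary noise information: inverse-variance weighting, the
# minimum-variance unbiased combination for independent Gaussian noise.
w = 1.0 / sigmas ** 2
weighted = (w * votes).sum() / w.sum()
```

The weighted estimate is dominated by the most reliable voter, which is exactly the information an i.i.d. assumption throws away.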

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves. AAAI 2020. (Arxiv, Talk)

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation.
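In tabular form, the bootstrapping structure looks like this (a toy sketch on a deterministic chain; the environment and constants are invented for illustration):

```python
import numpy as np

# Deterministic 3-state chain: s0 -> s1 -> s2 -> s2, reward 1 per step.
n_states, H = 3, 3
next_state = {0: 1, 1: 2, 2: 2}
reward, alpha = 1.0, 0.5

# V[h][s] estimates the sum of rewards over the next h steps from s.
# V[0] is identically zero by definition.
V = np.zeros((H + 1, n_states))

for _ in range(200):
    for s in range(n_states):
        ns = next_state[s]
        for h in range(1, H + 1):
            # Horizon-h value bootstraps from the horizon-(h-1) value of
            # the next state -- never from itself.
            target = reward + V[h - 1][ns]
            V[h][s] += alpha * (target - V[h][s])

# Here V[h][s] converges to the h-step return, i.e. h * reward.
```

Because the bootstrap targets only ever reference strictly shorter horizons, there is no self-referential fixed point to destabilize, which is the source of the stability property described above.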

ProtoGE: Prototype Goal Encodings for Multi-goal Reinforcement Learning

Silviu Pitis*, Harris Chan*, Jimmy Ba. RLDM 2019. (Paper)

Humans can often accomplish specific goals more readily than general ones. Although more specific goals are, by definition, more challenging to accomplish than more general goals, evidence from management and educational sciences supports the idea that “specific, challenging goals lead to higher performance than easy goals”. We find evidence of this same effect for RL agents in multi-goal environments. Our work establishes a new state-of-the-art in standard multi-goal MuJoCo environments and suggests several novel research directions.

Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

Silviu Pitis. AAAI 2019. (Paper, Slides, Poster)

Can all “rational” preference structures be represented using the standard RL model (the MDP)? This paper presents a minimal axiomatic framework for rationality in sequential decision making and shows that the implied cardinal utility function is of a more general form than the discounted additive utility function of an MDP. In particular, the developed framework allows for a state-action dependent “discount” factor that is not constrained to be less than 1 (so long as there is eventual long run discounting).
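As a sketch of the kind of generalization involved (notation mine, not necessarily the paper's):

```latex
% Standard MDP recursion with a constant discount:
V(s) = \max_a \left[ r(s,a) + \gamma \, \mathbb{E}_{s'} V(s') \right],
  \qquad \gamma \in [0, 1)

% Generalized recursion with a state-action dependent discount, where
% \gamma(s,a) may exceed 1 provided discounting holds in the long run:
V(s) = \max_a \left[ r(s,a) + \gamma(s,a) \, \mathbb{E}_{s'} V(s') \right]
```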

Source Traces for Temporal Difference Learning

Silviu Pitis. AAAI 2018. (Paper, Slides)

This paper develops source traces for reinforcement learning agents. Source traces provide agents with a causal model, and are related to both eligibility traces and the successor representation. They allow agents to propagate surprises (temporal differences) to known potential causes, which speeds up learning. One of the interesting things about source traces is that they are time-scale invariant, and could potentially be used to provide interpretable answers to questions of causality, such as “What is likely to cause X?”
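The propagation idea can be sketched in tabular form (a simplification using the successor representation of a known model; the MDP is invented and this is not the paper's exact algorithm):

```python
import numpy as np

# Deterministic 3-state chain: s0 -> s1 -> s2 -> s2.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])

# Discounted expected visitation matrix (successor representation):
# M[x, s] measures how much state s contributes to the return from x.
M = np.linalg.inv(np.eye(3) - gamma * P)

V = np.zeros(3)
alpha = 0.1
for _ in range(500):
    for s in range(3):
        ns = np.argmax(P[s])                      # deterministic next state
        delta = r[s] + gamma * V[ns] - V[s]       # surprise at s
        # Propagate the surprise to every state that leads to s, weighted
        # by its discounted visitation of s.
        V += alpha * delta * M[:, s]
```

Propagating each temporal difference through the model's columns updates known potential causes immediately, rather than waiting for the error to trickle backward one step at a time.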

Reasoning for Reinforcement Learning

Silviu Pitis. NeurIPS HRL Workshop 2017. (Paper, Poster)

This is a short abstract about some ideas that connect implicit understanding (value functions) and explicit reasoning in the context of reinforcement learning.

Methods for Retrieving Alternative Contract Language Using a Prototype

Silviu Pitis. ICAIL 2017. (Best Student Paper). (Paper, Slides)

This paper presents a search engine that finds similar language to a given query (the prototype) in a database of contracts. Results are clustered so as to maximize both coverage and diversity. This is useful for contract drafting and negotiation, administrative tasks and legal research.

An Alternative Arithmetic for Word Vector Analogies

Silviu Pitis. 2016. (Paper)

This paper looks at word vector arithmetic of the type “king - man + woman = queen” and investigates treating the relationships between word vectors as rotations of the embedding space instead of as vector differences. This was a one week project of little practical significance, but with the advent of latent vector arithmetic (e.g., for GANs), it may be worth revisiting.
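A toy illustration of the difference (with hand-constructed 2-D vectors, not real embeddings):

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy 2-D "embeddings" constructed so that the man -> woman relationship
# is a rotation of the space rather than a constant vector offset.
R = rotation(np.pi / 6)
king, man = np.array([1.0, 0.2]), np.array([0.6, 0.5])
queen, woman = R @ king, R @ man

# Standard vector-difference analogy: king - man + woman.
diff_answer = king - man + woman            # does NOT recover queen here

# Rotation analogy: recover the rotation angle from the (man, woman)
# pair and apply it to king.
theta = np.arctan2(woman[1], woman[0]) - np.arctan2(man[1], man[0])
rot_answer = rotation(theta) @ king         # recovers queen exactly
```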

Punitive Damages in International Trade

Silviu Pitis. 2014. (Paper)

What should the structure of rights, remedies, and enforcement look like in an efficient international trade agreement? Do punitive damages have a place?

Designing Optimal Takeover Defenses

Silviu Pitis. 2013. (Paper)

This paper analyzes the economic value of corporate takeover defenses, and argues for designing intermediate takeover defenses that balance (1) the interest of shareholders in management’s exploitation of insider information and (2) the entrenchment interest of management.

Examining Expected Utility Theory from Descriptive and Prescriptive Perspectives

Silviu Pitis. 2010. (Paper)

This paper examines the history and validity of Expected Utility theory, with a focus on its failures as a descriptive model of human decisions.


I was course instructor for the first virtual iteration of Introduction to Machine Learning (CSC 311) in Fall 2020, together with Roger Grosse, Chris Maddison, and Juhan Bae.

I advise a small number of students on research in an informal capacity, sometimes jointly with my advisor Jimmy Ba or labmate Harris Chan. If you are an aspiring researcher and find our work interesting, please email us directly.



When I have time, I enjoy connecting with pretty much anyone over a video call (or coffee if you’re in downtown Toronto).

I’m especially interested in discussing ideas related to:

You can reach me at: