CS 6789 Foundations of RL

Project Guidelines and Important Dates

Projects should be driven by an open-ended question (theoretical or implementation-based), with a clearly stated objective and a well-scoped plan. Projects must be completed in groups of three. Consult the project page for suggested project ideas, and state the intended deliverables as clearly as possible in the proposal. Both the project proposal and the final report must be uploaded to Gradescope, and presentations will be held in person in the lecture hall.

Dates

3/25: Project proposal due (upload to Gradescope)
4/17: Midterm report due (upload to Gradescope)
Student Project Presentation slots: See course dates
5/8: Final report due (upload to Gradescope)

Note that presenting the project in person is mandatory. Students who are attending ICLR should plan to present in the 4/21 slot.

Grading

Midterm report: 5%

Project Presentations: 10%

Final report: 15%

Reports and Presentations

Presentations: Details forthcoming.

Report Format: You must use the NeurIPS LaTeX format.

Midterm Report: Your report should be 2 pages maximum (not including references). It should include a title, team members, an abstract, related work, the problem formulation, and goals.

Final Report: Your report should be 9 pages maximum (not including references). It will be evaluated on the following criteria:

Merit: Do you have sound reasoning for the approach? Is the question well motivated and are you taking a justifiably simple approach or, if you are choosing a more complicated method, do you have sound reasoning for doing this?
Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
Presentation: How well did you explain what you did, present your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing? Did you justify your approach?

Project Ideas

We provide a few project ideas below. Studying existing RL theory papers and reproducing proofs is also a good option for the course project. Experiments for verifying conclusions and testing conjectures are also welcome.

Refined analysis in Tabular MDPs: Conduct a survey of a family of tabular MDP papers with tight regret bounds, e.g., Azar et al., Jin et al., Wang et al.

Comparison between variants of linear MDP models: Conduct a survey of papers that assume some form of linear structure, e.g., Yang and Wang, Jin et al.

Thompson Sampling in RL: Survey Thompson sampling techniques used in RL. This is a good starting point.
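As a warm-up for this topic, the sketch below shows Thompson sampling in its simplest form, a Bernoulli multi-armed bandit with Beta(1, 1) priors. It is not an RL algorithm and is not taken from the linked reference; the function name `thompson_sampling`, the arm means, and the horizon are all hypothetical choices made for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a bandit (illustrative sketch).

    true_means: hypothetical success probability of each arm.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed rewards of 1 per arm
    failures = [0] * k   # observed rewards of 0 per arm
    pulls = [0] * k

    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior
        # (a Beta(1, 1) prior updated with the observed counts).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(k)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda a: samples[a])

        # Simulate a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    print(thompson_sampling([0.2, 0.5, 0.8], horizon=2000))
```

In the RL setting the same idea is usually applied to a posterior over MDP models or value functions rather than over arm means, which is where the surveyed techniques differ.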

Gittins Index: Understand and survey the Gittins index method. This is a framework for Bayes optimal learning for multi-armed bandits. Think about open questions and why extensions are difficult. This is a good starting point.

RL with Constraints: RL with convex and knapsack constraints is studied here for tabular settings. Can you extend it to non-tabular settings such as linear MDPs?

RL with Adversarial Corruption: Exploration in RL with corruption is studied here. Can you think about different attack models and study attack/defense in other RL frameworks such as policy gradient or batch RL?

Policy Gradient: Starting from the analysis of PG/NPG, can you think about how to reuse data in policy optimization to potentially improve its sample complexity?

Policy Gradient with Exploration: Starting from PC-PG, can you think about ways to improve its sample complexity?

Policy Gradient: Starting from this paper, can you think about how to extend the algorithm here to other linear MDP models?

Reward-Free Exploration: Conduct a survey of MDP exploration methods that do not use a reward signal. See Max-Ent exploration as a starting point.

Imitation Learning from many experts: This paper studies learning from multiple experts in the interactive learning setting. Can we learn from multiple experts in non-interactive settings?

Online MDPs with expert advice: Sometimes RL can be done in adversarial contexts. Conduct a survey of online MDP methods (in adversarial settings). See Online MDPs as a starting point. Also, comment on the connections to the NPG analysis.

Statistical Limits of Offline RL: Offline RL seeks to learn a near-optimal policy from a fixed dataset. Recent work such as Wang et al., Wang et al., and Zanette explores the fundamental information-theoretic limits and instabilities in this setting. Can you survey offline RL methods and the estimation techniques used to handle distribution shift? Under what conditions (e.g., low noise or specific coverage) can we circumvent existing lower bounds?

Structural Assumptions and Learnability: What structural properties of an MDP make RL tractable with function approximation? Starting from the concept of Bellman Rank, recent research has unified these ideas into a broader framework of Bilinear Classes. Survey the different structural assumptions (such as Bellman rank and Bilinear rank) and discuss how they enable provably efficient learning in large-scale MDPs.

Hardness of Linear Realizability: If the optimal Q-function is linear in a given feature map, is RL always efficient? Lower bounds from Weisz et al., Weisz et al., and Wang et al. suggest otherwise, even with a constant suboptimality gap. An interesting open question is: if $Q^\pi$ is linear for all policies, is an online lower bound still possible, or does this make the problem tractable? Additionally, explore the near-deterministic case and its impact on learnability.
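For concreteness, the two realizability conditions contrasted above can be written as follows. The notation is assumed here for illustration and may differ from the cited papers: $\phi$ is a known feature map into $\mathbb{R}^d$, $\theta^*$ and $\theta_\pi$ are unknown parameter vectors, and any dependence on the time step is omitted.

```latex
% Linear Q^* realizability: only the optimal Q-function is linear in phi.
\[
  Q^*(s,a) = \langle \theta^*, \phi(s,a) \rangle
  \quad \text{for all } (s,a).
\]

% Linear Q^\pi realizability for all policies: a stronger assumption,
% since it also covers the optimal policy.
\[
  \forall \pi \;\; \exists\, \theta_\pi \in \mathbb{R}^d : \quad
  Q^\pi(s,a) = \langle \theta_\pi, \phi(s,a) \rangle
  \quad \text{for all } (s,a).
\]
```

The second condition implies the first, so the open question is whether the extra structure it provides is enough to rule out the known lower bounds.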