Commit 6a98a11d authored by Samuel Jia Cong Chua

Update 2 files

- /contributions.md
- /proposal.md
# Engineering plan
## Contribution 1
### Motivation
*What is a specific problem that we'd currently like to solve, but can't with the existing science?*
Mean Field Games (MFGs) can model large-scale multi-agent systems, but learning Nash equilibria in MFGs remains a challenging task. Fictitious Play (FP) and Online Mirror Descent (OMD) are two effective strategies for learning equilibria in MFGs. However, FP requires storing all historical best responses and sampling from this pool during execution, while OMD requires averaging historical Q functions, which is impractical when the Q functions are represented by neural networks.
Additionally, the existing literature often assumes that agents always start from a fixed initial distribution.
### Contribution
*What is the additional knowledge that would enable that problem to be solved?*
Inspired by Munchausen RL and OMD, a deep reinforcement learning (DRL) algorithm that learns population-dependent Nash equilibrium policies without averaging over or sampling from historical iterates.
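As a rough illustration of the mechanism we intend to build on, the sketch below computes a Munchausen-style regularized Q target, whose scaled log-policy bonus implicitly performs the historical averaging that OMD would otherwise require. The function name, temperature `tau`, and coefficient `alpha` are assumptions introduced for illustration only; this is not the final algorithm.

```python
import numpy as np

def softmax(q, tau):
    """Temperature-scaled softmax policy pi = softmax(q / tau)."""
    z = q / tau
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def munchausen_target(r, a, q_curr, q_next, gamma=0.99, tau=0.03, alpha=0.9):
    """Illustrative Munchausen-regularized target for a single transition.

    q_curr, q_next: Q-value vectors over actions at the current / next state.
    r: immediate reward, a: index of the action actually taken.
    """
    pi_curr = softmax(q_curr, tau)
    pi_next = softmax(q_next, tau)
    # Munchausen bonus: scaled log-probability of the chosen action
    # (often clipped from below in practice to avoid large negative values).
    bonus = alpha * tau * np.log(pi_curr[a] + 1e-8)
    # Entropy-regularized ("soft") value of the next state.
    soft_v_next = np.sum(pi_next * (q_next - tau * np.log(pi_next + 1e-8)))
    return r + bonus + gamma * soft_v_next
```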
### Results
*What data will be necessary to capture and communicate that new knowledge? That is, what data will demonstrate:*
- *the problem was previously unsolvable,*
- *the problem is now solvable, and*
- *the solution came about through no possible reason other than your own engineering?*
One way to characterize Nash equilibrium policies is through the notion of exploitability, a widely used metric for evaluating convergence and measuring how far a policy is from a Nash equilibrium. Formally, it quantifies how much a single player can gain by deviating from the population's behavior and using a different policy. We will need exploitability measurements of the learned policies under different initial distributions.
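A minimal sketch of how exploitability could be estimated is given below, assuming helpers `mean_field_flow`, `best_response`, and `evaluate_return` that we would have to provide; these names are illustrative, not an existing API.

```python
def exploitability(env, policy, best_response, evaluate_return):
    """Exploitability of `policy` in a finite-horizon MFG (illustrative).

    best_response(env, mu_flow): a policy maximizing the return of a single
        deviating agent against the fixed population flow `mu_flow`.
    evaluate_return(env, pi, mu_flow): expected return of an agent playing
        `pi` while the population follows `mu_flow`.
    """
    # Mean-field flow induced when the whole population follows `policy`.
    mu_flow = env.mean_field_flow(policy)
    # Best a single deviating agent can do against that population.
    j_br = evaluate_return(env, best_response(env, mu_flow), mu_flow)
    # What an agent gets by simply following the population policy.
    j_pi = evaluate_return(env, policy, mu_flow)
    # Non-negative; equals zero exactly at a Nash equilibrium.
    return j_br - j_pi
```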
### Experiment
*What experiment(s) will generate that specific data?*
Four experiments will be conducted in domains that are widely used as MFG benchmarks and in which the existence of an equilibrium is guaranteed. In each experiment, we explore two scenarios. The first scenario, referred to as fixed 𝜇0 in the sequel, follows the common practice in previous work, where the population always starts from a fixed initial distribution. The second scenario, referred to as multiple 𝜇0 in the sequel, examines the effectiveness of the "master" policy when the initial distribution varies. A short illustrative sketch of the dynamics and reward functions is given after the list.
1. Exploration in One Room
Consider a 2D grid world of dimension 11 × 11. The action set is $\mathcal{A}=\{\text{up, down, left, right, stay}\}$. The dynamics are:
\begin{equation}
x_{n+1}=x_n+a_n+\epsilon_n
\end{equation}
where $\epsilon_n$ is an environment noise that perturbs each agent's movement ($\epsilon_n = $ no perturbation with probability $0.9$, and $\epsilon_n$ is one of the four directions with probability $0.025$ for each direction).
The reward function discourages agents from being in a crowded location: $$ r(x, a, \mu)=-\log (\mu(x))-\frac{1}{|\mathcal{X}|}|a| $$
2. Exploration in four connected rooms.
This task has the same reward function and dynamics as the one-room exploration task, but the environment consists of four connected rooms. The goal is to explore every grid point in those rooms. The dimension of the whole map is also 11 × 11.
3. Beach Bar
The Beach Bar environment represents agents moving on a beach towards a bar. The goal of each agent is to avoid the crowd while getting as close as possible to the bar. The dynamics are the same as in the exploration tasks. Here we consider that the bar is located at the center of the beach and that there are walls on the four sides of the domain. The reward function is:
\begin{equation}
r\left(x, a, \mu\right)={d_{bar}}\left(x\right)-\frac{\left|a\right|}{|\mathcal{X}|}-\log \left(\mu\left(x\right)\right)
\end{equation}
where $d_{bar}$ indicates the distance to the bar, the second term penalizes movement so that the agent does not move when it is unnecessary, and the third term penalizes being in a crowded region.
4. Linear-Quadratic
The Linear-Quadratic (LQ) model is a 1D model. The dynamics of a player after taking action $a_n$ are:
\begin{equation}
x_{n+1}=x_n+a_n \Delta_n+\sigma \epsilon_n \sqrt{\Delta_n}
\end{equation}
where $\mathcal{A}=\{-M, \ldots, M\}$ corresponds to moving left, moving right, or staying still. The state space is $\mathcal{X}=\{-L, \ldots, L\}$, so $|\mathcal{X}| = 2L+1$. To add more stochasticity to this model, $\epsilon_n \sim \mathcal{N}(0,1)$ is an additional noise term, discretized over $\{-3 \sigma, \ldots, 3 \sigma\}$, that perturbs the dynamics. The reward function is:
\begin{equation}
r\left(x_n, a_n, \mu_n\right)=\left[-\frac{1}{2}\left|a_n\right|^2+q a_n\left(m_n-x_n\right)-\frac{\kappa}{2}\left(m_n-x_n\right)^2\right] \Delta_n
\end{equation}
where $m_n=\sum_{x \in \mathcal{X}} x \mu_n(x)$ is the first moment of the population distribution. The reward encourages agents to move toward the population's average position, while the quadratic action cost discourages unnecessarily large movements.
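To make these setups concrete, the sketch below implements the noisy grid-world transition and the reward functions as written above. The grid size, wall handling, bar location, and the LQ coefficients `q`, `kappa`, and `dt` are illustrative assumptions rather than the final experimental configuration.

```python
import numpy as np

SIZE = 11  # 11 x 11 grid for the exploration and Beach Bar tasks
ACTIONS = {"stay": (0, 0), "up": (-1, 0), "down": (1, 0),
           "left": (0, -1), "right": (0, 1)}

def step(x, a, rng):
    """Noisy transition x_{n+1} = x_n + a_n + eps_n: no perturbation with
    probability 0.9, one of the four directions with probability 0.025 each."""
    eps = rng.choice(list(ACTIONS), p=[0.9, 0.025, 0.025, 0.025, 0.025])
    nxt = np.array(x) + np.array(ACTIONS[a]) + np.array(ACTIONS[str(eps)])
    nxt = np.clip(nxt, 0, SIZE - 1)  # walls on the domain boundary
    return tuple(int(v) for v in nxt)

def exploration_reward(x, a, mu):
    """r(x, a, mu) = -log(mu(x)) - |a| / |X| (crowd aversion + movement cost)."""
    move = 0.0 if a == "stay" else 1.0
    return -np.log(mu[x] + 1e-12) - move / (SIZE * SIZE)

def beach_bar_reward(x, a, mu, bar=(5, 5)):
    """Beach Bar reward, with the distance term d_bar(x) as written above."""
    move = 0.0 if a == "stay" else 1.0
    d_bar = abs(x[0] - bar[0]) + abs(x[1] - bar[1])
    return d_bar - move / (SIZE * SIZE) - np.log(mu[x] + 1e-12)

def lq_reward(x, a, m, dt=1.0, q=0.01, kappa=0.5):
    """LQ reward, with m the first moment of the population distribution."""
    return (-0.5 * a ** 2 + q * a * (m - x) - 0.5 * kappa * (m - x) ** 2) * dt
```

Here `mu` is assumed to be a dict mapping grid cells to probabilities, and `rng = np.random.default_rng()` supplies the environment noise.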
## Contributions 1a, 1b, 2, 2a, ...
*(same as above)*
### Role in Paper
I created a simple parser component for easy entry of algorithm hyperparameters such as the learning policy, neural network architecture, initial distributions, and environments.
I developed utility functions for effective visualization of experimental data, such as LQVisualization, conversion of distribution hashes to matrices, and displays of how the population evolves under the policies learnt by different algorithms.
I contributed to the implementation and analysis of the experimental setups for the LQ and Beach Bar models with the hyperparameters of interest, and thereafter observed the effectiveness of the master policy in these environments.
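As a rough illustration of the kind of visualization utility involved, the sketch below converts a distribution stored as a hash (assumed here to be a dict keyed by grid coordinates) into a matrix and plots its evolution over time; the function names and storage format are assumptions, not the actual project code.

```python
import numpy as np
import matplotlib.pyplot as plt

def distribution_to_matrix(mu, size=11):
    """Turn a distribution stored as {(row, col): probability} into a
    size x size matrix suitable for heatmap plotting."""
    mat = np.zeros((size, size))
    for (row, col), p in mu.items():
        mat[row, col] = p
    return mat

def plot_evolution(mu_sequence, size=11):
    """Show the population distribution at successive time steps side by side."""
    fig, axes = plt.subplots(1, len(mu_sequence), figsize=(3 * len(mu_sequence), 3))
    for t, (ax, mu) in enumerate(zip(np.atleast_1d(axes), mu_sequence)):
        ax.imshow(distribution_to_matrix(mu, size), cmap="viridis", vmin=0)
        ax.set_title(f"step {t}")
        ax.axis("off")
    plt.show()
```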
# Title:
3-5 word title
Refer to Research Proposal.pdf, Research Personal Statement.pdf, Additional Information.pdf
### Summary:
One or two sentence summary of the problem and your proposed solution approach
### Splash images
One or two graphics that capture and communicate the problem and proposed solution to technical but non-expert audiences. Don't use images that aren't yours, and make sure this figure in isolation still effectively communicates your project summary above.
### Project git repo(s):
Link to repo(s) which will have all project-related content
## Big picture
......