Training ML Agents to Avoid Traditional AI Using Curriculum-Based Reinforcement Learning
The initial goal was to create a machine learning agent that could hide from a traditional AI indefinitely.
Inspired by stealth games like Metal Gear Solid, I wanted the traditional AI to patrol the environment and require the machine learning agent to be both within a field of view and within line of sight inside that field of view. To achieve this, I created a shape in Blender from a scaled cylinder and turned it into a trigger zone. Each step, a raycast is fired from the agent to the traditional AI to check for line of sight. When the agent is within both line of sight and the field of view, the AI updates its target location to the agent's location. If the agent leaves line of sight, the agent's last known location becomes the destination, and after reaching it, the AI returns to patrolling the environment.
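The detection logic above can be sketched in plain Python. This is an illustrative standalone version, not the Unity implementation: the function name, the 2D math, and the cone-angle check (which stands in for the trigger-zone test) are all my own choices here, and the raycast result is simply passed in as a boolean.

```python
import math

def detect_agent(guard_pos, guard_forward, agent_pos, fov_deg, has_line_of_sight):
    """Return True when the agent is inside the guard's field of view
    AND the line-of-sight raycast is unobstructed (illustrative names)."""
    # Vector from the guard to the agent
    dx, dy = agent_pos[0] - guard_pos[0], agent_pos[1] - guard_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return True
    # Angle between the guard's facing direction and the agent
    dot = (guard_forward[0] * dx + guard_forward[1] * dy) / dist
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    in_fov = angle <= fov_deg / 2          # stands in for the trigger-zone test
    return in_fov and has_line_of_sight    # raycast result computed elsewhere

# Guard facing +x with a 90-degree cone: agent slightly ahead is detected
print(detect_agent((0, 0), (1, 0), (5, 1), 90, True))   # True
print(detect_agent((0, 0), (1, 0), (-5, 0), 90, True))  # False: behind guard
```

In the actual Unity setup, the trigger zone replaces the angle check and a physics raycast supplies the line-of-sight boolean; the sketch only shows how the two conditions combine.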
After experimenting with different combinations of hyperparameters, I found a set of values that trained the agent with the most successful policy (or so I thought).
I was unhappy with the agent's behavior and the wildly varying training results, so I looked through the best-practices documentation in the ml-agents repository and decided to try increasing the buffer size. The results showed a more stable, though still not perfectly smooth, increase in cumulative reward. I am going to continue doubling the buffer_size hyperparameter and training until stability or performance tapers off, and I will update this page with my results.
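The doubling experiment is easy to express as a small sketch. The config values below are illustrative placeholders mirroring the shape of an ML-Agents PPO trainer config, not my actual posted parameters:

```python
# Illustrative PPO-style trainer settings (placeholder values only)
base_config = {
    "trainer": "ppo",
    "batch_size": 1024,
    "buffer_size": 10240,   # the ML-Agents docs suggest a multiple of batch_size
    "learning_rate": 3.0e-4,
    "max_steps": 5.0e5,
}

def doubled(config, times):
    """Return a copy of the config with buffer_size doubled `times` times."""
    cfg = dict(config)
    cfg["buffer_size"] = config["buffer_size"] * (2 ** times)
    return cfg

# The schedule for successive training runs
for i in range(3):
    print(doubled(base_config, i)["buffer_size"])  # 10240, 20480, 40960
```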
In the meantime, I posted my improved training parameters below, and the agent's performance appears to show fewer "deer in headlights" incidents.
Given the unstable nature of the training, I decided not to attempt curriculum training for the Hide scenario, although I encourage anyone to try with the code posted on my GitHub, or any modification of my code, and let me know how it goes!
The Limits of Simple Avoidance
Given enough time, the ML agent was always caught by the pursuing AI. The agent never found a policy that allowed it to evade the relatively slow-moving AI indefinitely from every position. This forced me to reconsider the design of the scenario.
The scenario is simply not practical.
If two equally capable people were put into a similar closed environment, and one could only run from the other, the one chasing would probably always win over time. The scenario of endlessly running in a small environment does not exist outside of a game of tag between children, and in that case they typically use an entire park or playground. If this were, say, an intruder scenario, some type of force would be used between the two and one would come out victorious. Or the person hiding would find a hiding space where the seeker could not detect them at all.
I do believe there is an optimal policy by which the ML agent could always avoid the AI in this environment; however, it was not possible to find given the limitations of the current training implementation. So, I decided to try another scenario.
A more realistic scenario seemed to be escaping from the small environment out into a larger one. So, I reworked the scenario to include a "win zone". This also gave the episode an end condition, preventing the simulation from playing out indefinitely (or, more precisely, until the step limit set in the academy).
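The per-step logic of the reworked episode can be sketched as follows. The reward values are assumptions on my part (a +10 escape reward to match the "perfect reward of 10" mentioned later, a catch penalty, and a small living penalty); the real agent runs inside Unity ML-Agents:

```python
# Assumed reward values for the Escape scenario (illustrative only)
WIN_REWARD = 10.0
CAUGHT_PENALTY = -1.0
LIVING_PENALTY = -0.001   # small per-step cost to encourage escaping quickly

def step_reward(in_win_zone, caught, at_step_limit):
    """Return (reward, episode_done) for a single simulation step."""
    if in_win_zone:
        return WIN_REWARD, True        # escaped into the larger environment
    if caught:
        return CAUGHT_PENALTY, True    # pursuing AI reached the agent
    # Living penalty accrues every step; the academy's step limit ends
    # episodes that would otherwise run forever.
    return LIVING_PENALTY, at_step_limit
```

The key design point is that the win zone gives every episode a natural terminal state, so training signal no longer depends on the academy timeout alone.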
After some experimentation with the curriculum, I found it best to let the agent's policy settle at a high mean reward, level out, and then creep up to a very high success rate before increasing the pursuing AI's speed and learning to flee from a faster chaser. It was also better to have the AI start out faster, at a speed of 1.0 rather than the 0.5 starting speed I used in the Hide scenario (hence the name EscapeFaster on GitHub and in the hyperparameters below).
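The curriculum idea reduces to a simple progression rule: hold each lesson until the mean reward is consistently high, then raise the chaser's speed. The speeds and threshold below are illustrative, not my actual curriculum values (ML-Agents expresses this declaratively in a curriculum config keyed on a progress measure):

```python
# Illustrative lesson ladder: chaser speed per curriculum lesson.
# EscapeFaster starts at 1.0 rather than the Hide scenario's 0.5.
LESSON_SPEEDS = [1.0, 1.5, 2.0, 2.5]
REWARD_THRESHOLD = 9.0   # "very high success" gate before advancing

def next_lesson(lesson, mean_reward):
    """Advance to a faster chaser only once the policy is stable and strong."""
    if mean_reward >= REWARD_THRESHOLD and lesson < len(LESSON_SPEEDS) - 1:
        return lesson + 1
    return lesson

# Mock per-evaluation mean rewards: the curriculum waits out the early
# unstable phase, then steps up the chaser speed at each strong evaluation.
lesson = 0
for reward in [4.2, 7.8, 9.3, 9.1, 9.4]:
    lesson = next_lesson(lesson, reward)
print(LESSON_SPEEDS[lesson])  # 2.5
```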
This approach seems to have worked well: the cumulative reward with the chaser at maximum speed reached 9.5, which is about as close to the perfect reward of 10 as the agent can achieve given the living penalty that exists during escape mode.
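As a back-of-the-envelope check, a 9.5 ceiling is consistent with a +10 escape reward eroded by a small per-step living penalty. The penalty magnitude and step count here are hypothetical numbers chosen to illustrate the arithmetic, not measured values:

```python
# Hypothetical numbers showing why cumulative reward tops out below 10
escape_reward = 10.0
living_penalty = -0.001   # assumed per-step cost while the episode runs
steps_to_escape = 500     # assumed steps needed to reach the win zone

cumulative = escape_reward + steps_to_escape * living_penalty
print(cumulative)  # 9.5
```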
The results of training machine learning agents depend so deeply on many factors, including hyperparameter tuning, environment setup, reward design, and training time, that in most cases one is more likely to benefit from a traditional AI built from state machines or behavior trees.
However, there are some benefits one can obtain from using these early versions of reinforcement learning agents:
Random and natural-looking behavior seems to emerge from the training of these agents, or at least behavior that tends to look more natural than that of their hard-coded heuristic AI counterparts.
As seen in the Escape example, optimization in relatively simple scenarios is very possible given the correct hyperparameters and an effective reward/penalty setup combined with a curriculum.