I downloaded the free Space Shooter application from the Unity Asset Store and modified it so that I could insert multiple instances of the game in one scene. I imported the ML-AgentsWithPlugin package, then started experimenting with the Python TensorFlow kit and curriculum learning. While reading this article, please refer to the sources in the Github repository, linked at the bottom of the article.
In the folder, AI Experiment, I created a new scene named SpaceShooter1. The scene has a main camera, an academy, a brain, a simple UI, ten agents, and an event system.
The academy is a game object with the name, SpaceShooterAcademy. It has a component that implements a class of the same name, SpaceShooterAcademy, which inherits the base class, Academy. I initialized the academy game object as shown in the following image.
The property, Max Steps, is set to 5000 steps. The property, Frame to Skip, is set to 4. The academy has two reset parameters, both of which are public properties. The first reset parameter, hazards, defines the types of hazards that can be instantiated in a lesson. Valid values range from 1 to 5, where 1, 2 and 3 are the different asteroid shapes, 4 is the alien ship without missiles, and 5 is the alien ship with missiles. The second reset parameter, hazards_count, is the number of enemies that can be spawned during a lesson. Both of these reset parameters will be overridden by the curriculum learning configuration.
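To illustrate how the curriculum overrides these values, here is a minimal Python sketch. It is not the project's code: the default values and the function name are mine; only the parameter names, hazards and hazards_count, come from the academy.

```python
# Sketch of curriculum overrides for the academy's reset parameters.
# The defaults below are placeholders, not the project's Inspector values.
academy_defaults = {"hazards": 5, "hazards_count": 7}

def resolve_reset_parameters(defaults, curriculum_overrides):
    """Return the parameters for the next reset: curriculum values win."""
    params = dict(defaults)
    params.update(curriculum_overrides)
    return params

# Lesson one spawns one hazard type, two instances:
lesson_one = resolve_reset_parameters(academy_defaults,
                                      {"hazards": 1, "hazards_count": 2})
print(lesson_one)  # {'hazards': 1, 'hazards_count': 2}
```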
The brain is a game object with the name, SpaceNavigatorBrain. It has a component which implements the class, SpaceNavigatorBrain. This class inherits the base class, Brain. The brain has 8 states and 8 actions, as shown in the following image.
The state space type is continuous. The player's position in world space occupies the first three states (x, y, z). The target in front of the player occupies the remaining five states: four states encode the target type in one-hot fashion (boundary, asteroid, alien ship, or alien missile), and the fifth state is the distance to the target.
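As a sketch of what this eight-value layout looks like in code (the names and the one-hot ordering here are my own illustration, not the project's implementation):

```python
# Illustrative 8-value state vector: 3 position values, a 4-way one-hot
# for the target type, and the distance to the target.
TARGET_TYPES = ["boundary", "asteroid", "alien_ship", "alien_missile"]

def build_state(position, target_type, target_distance):
    x, y, z = position
    one_hot = [1.0 if t == target_type else 0.0 for t in TARGET_TYPES]
    return [x, y, z] + one_hot + [target_distance]

state = build_state((0.0, 0.0, 5.0), "asteroid", 12.5)
print(len(state))  # 8
```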
The action space is discrete. Eight actions exist. They are no action (none), up, down, left, right, fire at asteroid, fire at alien ship, and fire at alien missile.
Each agent contains its own instance of the game. Each game has a boundary so that no game objects interfere with adjacent game spaces. Each agent has a component, SpaceShooterNavigatorAgent, which inherits the base class, Agent. The brain, SpaceNavigatorBrain, has been assigned to the public property, Brain. Max Steps is set to 1000, and the game resets automatically. The following image illustrates one of the agents.
The agent class, SpaceShooterNavigatorAgent, implements the reward algorithm. The reward algorithm rewards the player for firing at targets and for hitting targets. Also, the reward algorithm encourages the player to stay in the center of the game's boundaries as much as possible. The reward decreases as the player moves away from the center. Finally, the player is penalized when it is killed.
The curriculum file can be found in the folder, python, and has the name, Curriculum.json. It has the following definitions.
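As a sketch of what a curriculum file in this ML-Agents format looks like: the hazards and hazards_count progressions below match the lessons described next, while the thresholds and the other settings are placeholder values of my own, not necessarily those used in training.

```json
{
  "measure": "reward",
  "thresholds": [10, 20, 30, 40, 50],
  "min_lesson_length": 2,
  "signal_smoothing": true,
  "parameters": {
    "hazards": [1, 2, 3, 4, 5, 5],
    "hazards_count": [2, 3, 4, 5, 6, 7]
  }
}
```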
This curriculum has six lessons. In lesson one, only one hazard type is spawned, and two instances of it are instantiated at random locations in front of the player. In lesson two, two hazard types are spawned and three instances are instantiated. Lesson three spawns three hazard types and four instances; lesson four spawns four hazard types and five instances; lesson five spawns five hazard types and six instances. The last lesson, lesson six, also spawns five hazard types, but with seven instances. The last two lessons are more difficult, as well, because the fifth hazard type is an alien ship that fires missiles.
The mean reward is used as the measure. When the mean reward crosses the current lesson's threshold, the curriculum advances to the next lesson and the game becomes more difficult.
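The advance rule itself is simple. A hedged sketch, with placeholder thresholds and my own function name:

```python
# Advance to the next (harder) lesson once the mean reward crosses the
# current lesson's threshold. Thresholds here are placeholders.
def advance_lesson(lesson, mean_reward, thresholds):
    if lesson < len(thresholds) and mean_reward > thresholds[lesson]:
        return lesson + 1
    return lesson

thresholds = [10, 20, 30, 40, 50]  # placeholder values
print(advance_lesson(0, 15, thresholds))  # 1  (crossed lesson one's threshold)
print(advance_lesson(1, 15, thresholds))  # 1  (lesson two's threshold not reached)
```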
I was not sure how well the results would turn out, because the hazards are spawned in random locations, and move toward the bottom of the game space. So the player has to learn to avoid the hazards, as well as shoot them.
Single Brain Results
I had a couple of interesting issues to resolve with the reward system. First, the agent quickly figured out that it could avoid danger, so it did: it stayed on the edge of the boundaries as much as possible. Second, the agent didn't know whether it should move or fire, so it did both randomly and with little effect.
To solve the first problem, I modified the logic in the agent so that it rewards the brain when the player moves toward the center of the game space and decreases the reward proportionally as the player moves away from the center. Also, the reward for firing at a target is slightly higher than the reward for moving. I was hoping the brain would learn that firing at a target was more valuable than moving away from it.
To solve the second problem, I modified the logic in the agent so that it gets a small reward only when it shoots at a target, and a larger reward when it destroys a target. The brain will shoot at anything, so the agent gives no reward for wild shots.
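Putting both fixes together, the shaping logic can be sketched as follows. The constants are illustrative placeholders, not the values used in SpaceShooterNavigatorAgent; only the relationships between them matter: the center reward shrinks with distance, firing at a target pays more than positioning, destroying a target pays more still, and dying is penalized.

```python
# Illustrative reward shaping, not the project's actual constants.
def step_reward(dist_from_center, max_dist, fired_at_target,
                destroyed_target, died):
    reward = 0.0
    # Positioning reward: largest at the center, zero at the boundary.
    reward += 0.01 * (1.0 - min(dist_from_center / max_dist, 1.0))
    if fired_at_target:       # no reward for wild shots
        reward += 0.02
    if destroyed_target:
        reward += 0.1
    if died:
        reward -= 1.0
    return reward
```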
The following video shows the results with one brain, which was trained with the PPO Jupyter notebook. The notebook's maximum number of steps is set to 150,000.
The player does a pretty good job of avoiding hazards. However, I noticed during game play that the player was destroyed far too often due to indecision: it moved toward a hazard rather than firing at it. Consequently, its learning time was longer and it destroyed fewer targets.
You may notice when you play the first video that the score reaches approximately 1200 points after 5000 steps. I wanted to improve this score, and that is why I conducted the second experiment which uses two brains.
The image that follows shows the tensorboard results for this first experiment. Readings were taken every 1000 steps. The values appear jagged because each new wave of enemies spawned at random positions and then kept moving, which disrupted the brain's learning; as you can see, it recovers. The same dip occurred after every lesson, when additional enemies were added to the wave.
Two Brain Results
I was concerned with what looked like indecision on the part of the first brain. At times it chose a move action rather than a fire action, and the player crashed into a hazard as a result. So I added another brain, a heuristic brain, to the player game object.
The heuristic brain controls the player's weapon system. I added another game object to the academy, with the name, SpaceShooterBrain. This brain has a new agent component of the class, SpaceShooterWeaponsAgent, and a Decision component attached, which is implemented in the class, SpaceShooterDecision.
The SpaceShooterWeaponsAgent uses the same game state as the SpaceShooterNavigatorAgent to determine whether a target exists in its line of fire. But it has only 3 discrete actions. They are fire at asteroid, fire at alien ship, and fire at alien missile. The following image shows the configuration of the SpaceShooterBrain.
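A heuristic decision of this kind can be sketched as follows. The action indices and names are my assumption for illustration, not taken from SpaceShooterDecision; the idea is simply to map the target type in the line of fire to the matching fire action, and otherwise hold fire.

```python
# Heuristic weapons decision sketch: pick the fire action that matches
# the target type currently in the line of fire. Indices are assumed.
FIRE_ACTIONS = {"asteroid": 0, "alien_ship": 1, "alien_missile": 2}

def decide(target_type, target_in_line_of_fire):
    if target_in_line_of_fire and target_type in FIRE_ACTIONS:
        return FIRE_ACTIONS[target_type]
    return None  # hold fire
```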
And the following video shows the results with two brains, trained with the same PPO Jupyter notebook, again with the maximum number of steps set to 150,000.
The player does a much better job of avoiding hazards, because it shoots more of them. You may notice when you play the second video that the score reaches approximately 2300 points after 5000 steps.
The image that follows shows the tensorboard results for this second experiment. You may notice that the value graph starts at a higher initial reward, thanks to the second brain's weapon system.
The following image superimposes the two previous tensorboard graphs. I thought you might find it interesting to see how they compare.
The random nature of the hazard placement and the constantly moving hazards made it difficult for the navigator brain to learn how to behave. Adding the second brain helped because it cleared a path and was unencumbered by the navigation actions.
Also, it was difficult to define an appropriate reward algorithm for this type of game play.
Finally, it was impossible to prevent the player from crashing into hazards. But I was surprised at how well it performed considering the difficulty of the game play. Perhaps I could improve the algorithm by adding a grid of hazards and their positions to the state space. But this enhancement would take more time than I have allotted for this challenge. I'll leave it as an exercise for the reader.
You can find the sources in the following Github repository.