Experiments on juggling
Updated a year ago
I decided to test out the ML-Agents system to learn a bit more about the possibilities of machine learning. My idea was to investigate whether the system could be used for procedural generation. In particular, I wanted to test a single agent that could place blocks in a square grid in order to accomplish a complex goal. However, time was short and I realized the results I wanted to achieve required a lot more work, so I've decided to show what I achieved thus far. The result is a set of experiments on blocks that can move, rotate, and push marbles around.

Simulation Environment

The simulation environment consists of a table, a set of blocks that can move, rotate, and trigger, and a set of marbles that must be juggled around.
Blocks can be of different types. For example, there are 'spring' blocks and 'flipper' blocks. However, to avoid issues with physics, I mostly worked with 'force' blocks, which just apply a velocity impulse to any marble in their trigger area.
The goal of the AI is to 'juggle', which is measured as follows:
  • Marbles are given a large penalty if they hit the floor
  • Marbles are given a large penalty if they exit the 'play' area
  • As long as the simulation continues, a small reward is given at each step (more time juggling = good)
  • Marbles get higher rewards if they go fast, if they go high, and if they go high and fast.
  • Marbles cannot be still for more than 2 seconds (before I added this rule, the blocks would sometimes just try to keep the marbles from moving at all, so as not to lose them)
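The rules above can be sketched as a per-step reward function. This is only an illustration: the `Marble` fields, the `step_reward` function, and every numeric weight are hypothetical and were chosen just to make the example runnable. Ending the episode with a penalty when a marble stays still for too long is also an assumption about how that rule could be enforced.

```python
from dataclasses import dataclass


@dataclass
class Marble:
    height: float        # distance above the floor
    speed: float         # magnitude of the velocity
    in_play_area: bool   # False once the marble has left the 'play' area
    still_time: float    # seconds the marble has spent motionless


# Illustrative weights only (assumptions, not the values used in the experiments).
FLOOR_PENALTY = -1.0
OUT_OF_AREA_PENALTY = -1.0
STILL_PENALTY = -1.0
ALIVE_REWARD = 0.01
HEIGHT_WEIGHT = 0.001
SPEED_WEIGHT = 0.001
HIGH_AND_FAST_WEIGHT = 0.0001
MAX_STILL_SECONDS = 2.0


def step_reward(marbles):
    """Return (reward, done) for one simulation step."""
    reward = ALIVE_REWARD  # small reward for every step the juggling continues
    for m in marbles:
        if m.height <= 0.0:
            return FLOOR_PENALTY, True
        if not m.in_play_area:
            return OUT_OF_AREA_PENALTY, True
        if m.still_time > MAX_STILL_SECONDS:
            return STILL_PENALTY, True
        # Going high, going fast, and going high and fast are all rewarded.
        reward += HEIGHT_WEIGHT * m.height
        reward += SPEED_WEIGHT * m.speed
        reward += HIGH_AND_FAST_WEIGHT * m.height * m.speed
    return reward, False
```

With these placeholder weights the per-step bonus stays small compared to the penalties, so losing a marble always costs more than a few fast steps can earn.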

State space

I experimented with different state spaces:
  • The position, velocity, angular velocity of each marble
  • The position, velocity, angular velocity of the lowest marble (the most 'risky')
  • The position, velocity, orientation and angular velocity of each block
  • The trigger cooldown time of each block
The idea of using only the lowest marble came from wanting to use curriculum learning with a varying number of marbles: with just the lowest one, the size of the state space would always be the same.
In the end, I found that the position and velocity of all marbles were needed to achieve any result. With only the lowest marble in the state, the agent would forget about marbles too soon after launching them into the air, and would not care whether it launched them straight up or in a random (usually bad) direction.
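To make the two marble-state variants concrete, here is a minimal sketch of how the observation vector could be assembled. The dictionary layout, the function names, and the choice of `y` as the vertical axis are assumptions made for the example.

```python
def marble_features(marble):
    """Flatten position, velocity and angular velocity into 9 floats."""
    return [*marble["position"], *marble["velocity"], *marble["angular_velocity"]]


def lowest_marble_observation(marbles):
    """Fixed-size observation: only the lowest (most 'risky') marble."""
    # Assumption: index 1 of the position is the vertical (y) axis.
    lowest = min(marbles, key=lambda m: m["position"][1])
    return marble_features(lowest)


def all_marbles_observation(marbles):
    """Observation that grows with the number of marbles."""
    observation = []
    for marble in marbles:
        observation.extend(marble_features(marble))
    return observation
```

The first variant always yields 9 values regardless of how many marbles are in play, which is what makes it attractive for a curriculum. The second yields 9 values per marble, so its size changes whenever the number of marbles does.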

Action Space

I also experimented with different action spaces, representing the different capabilities of blocks. Here are the actions I considered:
  • Blocks can be placed arbitrarily
  • Blocks can be moved to discrete positions
  • Blocks can be moved continuously
  • Blocks can be rotated freely
Different combinations of these actions led to very different behaviours. In the end, I found that continuous movement achieved better results.
It is interesting to note that the blocks kept 'spamming' their trigger action whenever the marbles were out of range, and instead stopped just in time when a marble got close. I guess this is related to the dynamics of the input value that governs the trigger, since I modeled it as a float between -1 and 1 (with a value > 0 signalling a trigger).
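The trigger decoding described above, combined with a per-block trigger cooldown, could look roughly like this. The class name, the method signature, and the cooldown length are hypothetical. Only the "value > 0 means trigger" convention comes from the description.

```python
class TriggerBlock:
    """Decodes a continuous action in [-1, 1] into a fire / no-fire decision."""

    def __init__(self, cooldown_seconds=0.5):
        # Assumption: the cooldown length is a placeholder value.
        self.cooldown_seconds = cooldown_seconds
        self.cooldown_left = 0.0

    def step(self, trigger_action, dt):
        """Advance time by dt seconds and return True if the block fires."""
        self.cooldown_left = max(0.0, self.cooldown_left - dt)
        if trigger_action > 0.0 and self.cooldown_left == 0.0:
            self.cooldown_left = self.cooldown_seconds
            return True
        return False
```

Under this scheme any positive output fires the block, so firing while no marble is in range costs nothing but the cooldown. That may be part of why 'spamming' the trigger was a cheap default for the agent.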

The importance of being issue-free

I struggled a lot with issues that made it really hard to distinguish a failed training session from buggy behaviour. Often, I would find that the training session had been running with a bug. This was especially apparent with the use of the physics system and with correctly resetting the simulation.
As an example, I did not consider the angular velocity of the marbles at first, neither in the state of the brain nor as a value to be reset. As a consequence, blocks were not able to distinguish between a marble falling straight down and a rolling marble, and failed to account for the added motion on impact.
As another, more interesting example, blocks learned to 'cheat' the physics engine by applying multiple velocity impulses to the same marble, throwing it much higher than allowed. Adding a cooldown effect after a marble was affected by a block solved the issue.
I thus learnt that being bug-free and double checking everything is even more important in such tests, as the cost of a failed training session (in terms of time) is very high and the bug may not be apparent.
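As an illustration of the cooldown fix for stacked impulses, here is a minimal sketch of a per-marble gate that refuses a second impulse for a short time after the first. The class name, the marble identifiers, and the cooldown value are all assumptions.

```python
class ImpulseGate:
    """Allows at most one impulse per marble within a cooldown window."""

    def __init__(self, cooldown_seconds=0.2):
        # Assumption: the cooldown length is a placeholder value.
        self.cooldown_seconds = cooldown_seconds
        self._ready_at = {}  # marble id -> time at which it may be pushed again

    def try_apply(self, marble_id, now):
        """Return True if an impulse may be applied to this marble at time `now`."""
        if now < self._ready_at.get(marble_id, 0.0):
            return False  # still cooling down: ignore the extra impulse
        self._ready_at[marble_id] = now + self.cooldown_seconds
        return True
```

Keeping the cooldown on the marble rather than on the block means that two different blocks cannot stack impulses on the same marble either.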


I realized that testing with machine learning is really time-intensive, so I worked to streamline the iterations by creating a single environment with dozens of parameters. Here you can see the parameters that the agent can be configured with. These parameters govern the simulation environment, and they also dynamically modify the number and nature of the actions and states that are assigned to the brain, allowing for a more organic parametrization. I found that these parameters, paired with a small script that switches between an internal and an external brain when connecting to the Python API and a few modifications to the notebook, let me iterate really fast: I could modify parameters, click one button to start the simulation, and directly see the results in the scene.
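One way to picture parameters that also drive the size of the state and action vectors is a single configuration object that computes both. The sketch below is hypothetical throughout: the field names, the 3D vectors, the Euler-angle orientation, and the two-float movement action are assumptions, not the actual parameters of the environment.

```python
from dataclasses import dataclass

VEC3 = 3  # assumption: positions, velocities and orientations are 3 floats each


@dataclass
class JugglingConfig:
    n_blocks: int = 1
    n_marbles: int = 1
    observe_all_marbles: bool = True       # False: only the lowest marble
    observe_angular_velocity: bool = True
    observe_blocks: bool = True
    observe_cooldown: bool = True
    can_move: bool = True
    can_rotate: bool = True
    can_trigger: bool = True

    def observation_size(self) -> int:
        per_marble = 2 * VEC3  # position + velocity
        if self.observe_angular_velocity:
            per_marble += VEC3
        observed = self.n_marbles if self.observe_all_marbles else 1
        size = per_marble * observed
        if self.observe_blocks:
            # position, velocity, orientation, angular velocity of each block
            size += self.n_blocks * 4 * VEC3
        if self.observe_cooldown:
            size += self.n_blocks  # one cooldown timer per block
        return size

    def action_size(self) -> int:
        per_block = 0
        if self.can_move:
            per_block += 2  # assumption: movement on the table plane
        if self.can_rotate:
            per_block += 1
        if self.can_trigger:
            per_block += 1  # float in [-1, 1], > 0 means trigger
        return self.n_blocks * per_block
```

For example, `JugglingConfig(n_blocks=2, n_marbles=3).observation_size()` grows with both counts, while switching `observe_all_marbles` off keeps the marble part of the state at a fixed size.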

After a set of failed experiments, here are a few experiments that did work.

Experiment 1

In this experiment, the block is able to move, rotate, and trigger freely. The block is tasked with keeping a single randomly spawned marble in the air, possibly launching it up. The simulation ends if the marble is still, or if it touches the floor. The final result is not remarkable: the block always manages to catch the first drop of the marble regardless of its starting position and launches it up, but it rarely manages to catch it again.
The different phases in which the block learns, however, are very interesting. In a first phase, the block only learns to stop the marble from falling by placing itself directly under it. In a second phase, the block learns not only to catch the marble, but also to keep it from dropping to the floor. During a third phase, the block also learns to launch the marble up in the air.

Experiment 2

In this experiment, there are three marbles, and they fall from higher up and with longer delays. My goal was to let the block learn how to launch one marble, then forget about it and look at the second marble, repeating the same steps.
Instead, since I did not randomize the marbles' spawn positions, the block learned to use the first marble to hit the second (one of those moments during which you look at the screen and think "stupid, stupid smart AI").

Experiment 3

I repeated the second experiment with an additional block, but the result was almost the same. The AI learnt to use one block to launch the second marble towards the third, but forgot about the first marble. Note that this is just one agent controlling two blocks, and not two agents.

Experiment 4

I finally used two blocks and two marbles, which the AI liked the best. It learnt to use the left block to control the left marble and the right block to control the right one. However, it kept losing marbles after the first launch.

Future Work

I intend to keep working on the system and achieve a complex juggling behaviour.
I also experimented with curriculum learning, allowing the number of blocks, number of marbles and the size of the movement grid to be changed. However, I could not find any difference between a normal training session and a curriculum one, so this needs further investigation (I suspect I am missing some part of the implementation).
Apart from that, I would like to add obstacles and see how the AI can react to their presence during juggling.
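For reference, the curriculum idea mentioned above (changing the number of blocks, the number of marbles, and the size of the movement grid) can be sketched as a list of lessons with reward thresholds. The lesson values, the thresholds, and the `next_lesson` helper are all made up for illustration and do not reproduce the ML-Agents curriculum format.

```python
# Hypothetical lessons, from easiest to hardest.
LESSONS = [
    {"n_blocks": 1, "n_marbles": 1, "grid_size": 3},
    {"n_blocks": 1, "n_marbles": 2, "grid_size": 5},
    {"n_blocks": 2, "n_marbles": 3, "grid_size": 7},
]

# Mean reward needed to leave lesson i (one fewer entry than LESSONS).
THRESHOLDS = [5.0, 10.0]


def next_lesson(lesson, mean_reward):
    """Return the lesson index to use after observing `mean_reward`."""
    if lesson < len(THRESHOLDS) and mean_reward >= THRESHOLDS[lesson]:
        return lesson + 1
    return lesson
```

In a setup like this, a curriculum run whose thresholds are never reached would stay on the first lesson and could look exactly like a normal run, so logging the lesson index during training may be a cheap first check.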
Michele Pirovano
PhD, Freelance Game Developer - Programmer