Project Summary

The goal of this project was to teach an agent to ride a horse across difficult terrain efficiently and effectively, measured by speed and by the health retained along the way. The core focus was on the unique interactions a horse has with the terrain in Minecraft: automating horse riding is nuanced because of the horse's increased speed, wider turns, mounting and dismounting, and jumping. We therefore set out to teach an agent to ride a horse and navigate through different kinds of terrain in Minecraft. This is applicable to real-life problems such as obstacle avoidance in navigation, pathfinding, and backtracking.

This project is well suited to reinforcement learning because the agent can be trained toward the specific goal of learning how to ride a horse by maximizing reward in the Minecraft environment. The agent is continually penalized for negative actions such as hitting obstacles and losing health, and rewarded for positive actions like reaching various checkpoints. Rather than coding a simple pathfinding algorithm, we wanted the agent to discover and retain knowledge of what impedes proper navigation on a horse in Minecraft. Unlike when walking in Minecraft, overly careful navigation around obstacles negates the main benefit of riding a horse, which is its speed.

The action space is continuous and includes speeding up and slowing down, turning, and moving forward, back, left, and right. The observation space is discrete: the flattened 2x5x5 grid of blocks immediately surrounding the agent. The horse's attributes are kept strictly average, with max health set to 40, movement speed base set to 0.275, and jump strength base set to 0.8.
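As a rough illustration, these spaces could be declared with Gym as follows; this is a minimal sketch, and the specific control layout and the block-to-integer mapping are our assumptions, not necessarily the exact ones used:

```python
import numpy as np
from gym import spaces

# Continuous controls; an illustrative layout, not necessarily the exact one used:
# [move (-1 back, +1 forward), strafe (-1 left, +1 right), turn (-1, +1), jump (0, 1)]
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

# Discrete observations: the flattened 2x5x5 grid of blocks around the agent,
# with each block name mapped to a small integer id (e.g. 0 = air, 1 = cactus, ...).
GRID_SIZE = 2 * 5 * 5  # 50 blocks
observation_space = spaces.Box(low=0, high=255, shape=(GRID_SIZE,), dtype=np.int32)
```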

Through Malmo and RLlib, we aimed to use reinforcement learning to train an agent to mount a horse and navigate through difficult terrain, as well as to lead the horse when the agent is dismounted, for example by falling into water. The agent needs to learn which actions inhibit effective navigation, and the tradeoff between navigating slowly through obstacles and taking occasional damage while optimizing the speed at which the horse can travel.

Approach

We started this project by developing a terrain generator that simulates the kind of terrain an agent might realistically encounter. For instance, cacti are a common obstacle in Minecraft because they injure both the agent and the horse, so the agent needs to navigate carefully, whether by reducing its riding speed or by steering around them. Currently, the code generates cacti and a bounding box of gold that marks when the agent has successfully navigated out of the cactus field, simulating an objective a Minecraft player would likely run into. We also took into account obstacles from different biomes: cacti are commonly found in desert biomes, patches of trees (which horses cannot pass under) in forest biomes, and lakes in plains biomes. We combined these features into a single piece of terrain to train the agent to navigate through all of them.
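A simplified sketch of how such a field could be emitted as Malmo mission XML using the standard DrawBlock/DrawCuboid drawing decorators is shown below; the dimensions, cactus density, coordinates, and function name are illustrative assumptions rather than our exact generator:

```python
import random

def generate_terrain_xml(half=10, density=0.05, y=2):
    """Return Malmo <DrawingDecorator> XML: scattered cacti inside a gold boundary."""
    blocks = []
    # Scatter single cactus blocks across the interior of the field.
    for x in range(-half + 1, half):
        for z in range(-half + 1, half):
            if random.random() < density:
                blocks.append(f'<DrawBlock x="{x}" y="{y}" z="{z}" type="cactus"/>')
    # Gold boundary marking the edge of the obstacle field (the exit the agent seeks).
    for x1, z1, x2, z2 in [(-half, -half, half, -half), (-half, half, half, half),
                           (-half, -half, -half, half), (half, -half, half, half)]:
        blocks.append(f'<DrawCuboid x1="{x1}" y1="{y - 1}" z1="{z1}" '
                      f'x2="{x2}" y2="{y - 1}" z2="{z2}" type="gold_block"/>')
    return "<DrawingDecorator>\n" + "\n".join(blocks) + "\n</DrawingDecorator>"
```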

Jumping, turning, and movement speed can all be defined as continuous actions, which gives the agent a large state and action space. Since the environment is non-deterministic and the space is large, we opted to use RLlib's implementation of Proximal Policy Optimization (PPO). PPO is a deep policy gradient method, which helps circumvent some of the issues we faced in navigating the terrain properly given the variety of actions the agent can take.
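A rough sketch of how such a trainer might be configured with RLlib is shown below, assuming an older Ray (~1.x) and Gym API; the `HorseRidingEnv` class here is only an illustrative stand-in for the Malmo-backed environment, not our actual implementation:

```python
import gym
import ray
from ray.rllib.agents import ppo  # older ray.rllib.agents API

class HorseRidingEnv(gym.Env):
    """Illustrative stand-in for the Malmo horse-riding environment."""
    def __init__(self, env_config):
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))
        self.observation_space = gym.spaces.Box(0, 255, shape=(50,))
    def reset(self):
        return self.observation_space.sample()
    def step(self, action):
        return self.observation_space.sample(), 0.0, True, {}

ray.init()
trainer = ppo.PPOTrainer(env=HorseRidingEnv, config={
    "framework": "torch",
    "num_workers": 0,   # Malmo instances are heavyweight; sample in the driver process
    "clip_param": 0.2,  # the epsilon in PPO's clipped surrogate objective
})
for i in range(10):
    print(i, trainer.train()["episode_reward_mean"])
```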

The update function for PPO can be described by the clipped surrogate objective \(L^{CLIP}(\theta)=\hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]\), where \(r_t(\theta)\) is the probability ratio between the new and old policies and \(\hat{A}_t\) is the advantage estimate at time \(t\).
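As a concrete illustration, the clipped term can be computed elementwise over a batch; below is a minimal NumPy sketch of this objective (the function and variable names are ours, not RLlib's internals):

```python
import numpy as np

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """PPO clipped surrogate objective L^CLIP, averaged over a batch.

    ratios:     r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    advantages: advantage estimates A_hat_t
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```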

The agent is rewarded positively for actions such as exiting the terrain in a timely manner and for mounting and staying on the horse, and penalized for lack of progress out of the terrain and for injuring the horse. To reward progress, the agent receives a reward at mission end that depends on the time it took to leave the area of obstacles. The agent receives a positive reward for staying on the horse and a small negative reward for dismounting. Finally, the agent is penalized for each half heart of health that the horse loses.

We took three different approaches to our reward function. Our first approach was to reward the agent at the end of a mission with a multiplier based on the time the mission took. This model had some flaws: it rewarded the agent for finishing early regardless of the mission's outcome, so if the mission terminated early because the agent died, the agent was still rewarded. The rewards were also sparse, making it difficult for the agent to make progress toward reducing its navigation time. Our second approach was to reward incrementally, with the incremental reward shrinking over time. However, because these rewards accumulate at every step, the agent learned to go slower and stretch out the mission while still succeeding. So while this fixed the problem of rewarding the agent for dying to end the mission, it rewarded the agent for progressing slowly. Our third and final approach was a distance-based reward function: the agent is rewarded based on how far it has progressed from its starting point toward the edges of the terrain, encouraging it to exit, while still keeping the usual penalties for dismounting from the horse or letting the horse take damage.

The reward function is \(r = 1 \cdot t_{\textsf{on\_horse}} - 0.5 \cdot t_{\textsf{off\_horse}} - h + 10 \cdot \textsf{dist\_from\_origin} + (100 - 0.1 \cdot tk)\), where \(t_{\textsf{on\_horse}}\) is the time spent on the horse, \(t_{\textsf{off\_horse}}\) is the time spent off the horse, \(h\) is the amount of health the horse has lost, \(\textsf{dist\_from\_origin}\) is the distance the agent has travelled from the origin, and \(tk\) is the number of ticks the mission took until it ended, which measures the time required for the agent to reach the goal of leaving the obstacle terrain. This reward function optimizes for both speed and progress from the starting point, encouraging the agent to leave the area of difficult terrain in a time-efficient manner.
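Read as code, the formula might look like the sketch below, under our assumption that the on/off-horse and distance terms accrue each step and the tick-based bonus is applied once at mission end; the function and parameter names are illustrative:

```python
def step_reward(on_horse, health_lost, dist_from_origin):
    """Per-step reward: time on/off the horse, horse damage, and progress from the origin."""
    time_term = 1.0 if on_horse else -0.5
    return time_term - health_lost + 10.0 * dist_from_origin

def terminal_bonus(total_ticks):
    """Time bonus added when the mission ends; shrinks the longer the mission ran."""
    return 100.0 - 0.1 * total_ticks
```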

Evaluation

We measured effectiveness in horse riding by whether or not the agent made it to the goal and by how much health was lost along the way. This was accomplished by accumulating negative rewards for damaging the horse, killing the horse, and timing out before reaching the goal, and positive rewards for reaching gold-block checkpoints and the end goal. Overall returns can be seen in the images below:

In the left image, the overall return is negative in the early stages of training, but the agent improves quickly: it makes many mistakes at first, yet learns how to mount the horse early on. When it comes to learning how to navigate around obstacles, however, the returns have difficulty growing, as shown in the right image, which uses the reward policy we had previously implemented. We readjusted the reward policy to avoid sparse rewards and to change what the rewards were based on, as described in the Approach section.

In the image below, we can visualize the improvement of the model as it is trained with the new reward policy. There is a visible improvement in performance as the model optimizes not for how quickly a mission ends, but for the progress the agent makes toward leaving the terrain. The agent learns through exploration which obstacles to avoid and which can be tolerated. Since our focus is more on obstacle detection than on navigation itself, the agent gradually learns how to ride a horse effectively with minimal error and optimized speed.

Qualitative evaluation can be performed by comparing the agent's performance on different types of terrain and by observing how well it does at different points during the training process. For example, we can compare how quickly the agent finds the horse at the beginning versus later in training: the agent mounts the horse much more quickly at the end than at the start. We can also visually assess the agent's performance, as in the images below.

On the left, the baseline model is in proximity to the horse but walks around aimlessly rather than mounting it and exiting the obstacle territory. On the right, the working model is able to mount the horse and navigate through the cacti. The model takes into account that injuring the horse carries a negative reward, while also recognizing that maintaining the horse's current travelling speed is positively rewarded.

Resources Used