Status Video

Project Summary

The goal of this project is to train an agent that can ride a horse effectively in Minecraft. When riding a horse in a normal Minecraft game, the player needs to steer the horse to avoid obstacles that can injure it, as well as water, which dismounts the rider. In this project, the agent is placed in an almost-open world consisting of obstacles and varied terrain. The goal is to travel as far as possible through this world while minimizing the amount of health the horse loses.

Approach

We started this project by developing a terrain generator that simulates the kind of terrain an agent might encounter. For instance, cacti are a common obstacle that a Minecraft player needs to deal with because they injure both the agent and the horse, so the agent must navigate carefully, whether by reducing riding speed or by steering around the cacti. Currently, the code generates a cactus field and a bounding box of gold blocks that marks when the agent has successfully navigated out of the cactus terrain, simulating an objective a Minecraft player would likely run into.
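As a rough illustration of how such a course can be drawn through Malmo's mission XML, a generator could emit `<DrawBlock>` elements like the sketch below. The coordinates, density, y-level, and function name are illustrative assumptions, not our exact generator.

```python
import random

def draw_obstacle_course(x_min, x_max, z_min, z_max, y=2, cactus_density=0.05):
    """Return Malmo <DrawBlock> elements for a random cactus field
    surrounded by a gold-block border marking the edge of the obstacle area.

    The bounds, density, and y-level here are illustrative defaults.
    """
    xml = []
    for x in range(x_min, x_max + 1):
        for z in range(z_min, z_max + 1):
            on_border = x in (x_min, x_max) or z in (z_min, z_max)
            if on_border:
                # Gold border: crossing it means the agent has left the terrain.
                xml.append('<DrawBlock x="{}" y="{}" z="{}" type="gold_block"/>'.format(x, y - 1, z))
            elif random.random() < cactus_density:
                # Cacti damage both the agent and the horse on contact.
                xml.append('<DrawBlock x="{}" y="{}" z="{}" type="cactus"/>'.format(x, y, z))
    return "\n".join(xml)

# The returned string would be spliced into the mission's <DrawingDecorator>.
```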

Jumping, turning, and movement speed can all be defined as continuous actions, which gives the agent a large, continuous action space on top of an already large state space. Since the environment is not deterministic and the state space is large, we opted to use RLlib's implementation of Proximal Policy Optimization (PPO). PPO is a deep policy-gradient method, which helps circumvent some of the issues we were facing in navigating the terrain properly given the variety of actions the agent can take.
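A minimal sketch of how PPO can be set up through RLlib is shown below. It assumes the classic `ray.rllib.agents` API and a hypothetical custom Gym environment class, `HorseRideEnv`, wrapping the Malmo mission; the hyperparameter values are placeholders rather than our tuned configuration.

```python
import ray
from ray.rllib.agents import ppo

# HorseRideEnv is assumed to be our custom gym.Env wrapper around the Malmo
# mission (continuous move/turn/jump actions, grid-based observations).
from horse_ride_env import HorseRideEnv

ray.init()

# Hyperparameters below are illustrative, not our exact values.
trainer = ppo.PPOTrainer(env=HorseRideEnv, config={
    "env_config": {},          # passed to HorseRideEnv.__init__
    "framework": "torch",
    "num_gpus": 0,
    "num_workers": 0,          # Malmo drives a single Minecraft client
    "train_batch_size": 4000,
    "clip_param": 0.2,         # the epsilon in the clipped objective below
    "lr": 5e-5,
})

for i in range(100):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```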

The clipped surrogate objective that PPO optimizes can be written as:

\[L^{CLIP}(\theta)=\hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right]\]
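As a concrete illustration of the clipping (a sketch, not our training code, which relies on RLlib's internal implementation), the per-sample surrogate can be computed as:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective for one sample.

    ratio     = pi_theta(a|s) / pi_theta_old(a|s)   (r_t(theta) above)
    advantage = advantage estimate A_hat_t
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# With epsilon = 0.2, a sample whose ratio jumps to 1.5 on a positive
# advantage is capped at 1.2 * advantage, which keeps updates conservative.
```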

The agent is rewarded positively for actions such as exiting the terrain in a timely manner and for mounting and staying on the horse, and rewarded negatively for lack of progress out of the terrain and for injuring the horse. To reward progress, the agent receives a reward at mission end that depends on how long it took to leave the area of obstacles. The agent receives a positive reward for staying on the horse and a small negative reward for dismounting it. Finally, the agent is penalized for each half heart of health that the horse loses. The reward is given by \(r = 1 \cdot (t_{\textsf{on_horse}}) - 0.5 \cdot (t_{\textsf{off_horse}}) - h + (100 - 0.1 \cdot tk)\), where \(t_{\textsf{on_horse}}\) is the time spent on the horse, \(t_{\textsf{off_horse}}\) is the time spent off the horse, \(h\) is the hearts of health the horse has lost, and \(tk\) is the number of ticks the mission took until it ended, which measures how long the agent needed to reach the goal of leaving the obstacle terrain.
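A sketch of how this reward could be evaluated at mission end is shown below; the weights follow the formula above, while the function and variable names are our own labels.

```python
def mission_reward(ticks_on_horse, ticks_off_horse, hearts_lost, total_ticks):
    """Reward from the formula above:
    r = 1 * t_on_horse - 0.5 * t_off_horse - h + (100 - 0.1 * tk)
    """
    time_on_horse_reward = 1.0 * ticks_on_horse
    dismount_penalty = -0.5 * ticks_off_horse
    health_penalty = -hearts_lost                  # horse hearts lost
    completion_bonus = 100.0 - 0.1 * total_ticks   # faster exit => larger bonus
    return time_on_horse_reward + dismount_penalty + health_penalty + completion_bonus
```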

Evaluation

We decided to measure riding effectiveness by whether the agent makes it to the goal and how much health the horse loses along the way. This is accomplished by accumulating negative rewards for damaging the horse, killing the horse, or timing out before reaching the goal, and positive rewards for reaching gold-block checkpoints and the end goal. Overall returns can be seen in the images below:

In the left image, the agent quickly learns how to mount the horse and receives rewards for doing so. But when it comes to learning how to navigate around obstacles, the returns struggle to grow, as shown in the right image. This can likely be offset by adjusting how the reward is shaped: instead of a sparse reward given, for instance, only for the time required to complete the mission, we can re-balance the rewards between mounting the horse and navigating around obstacles.

Qualitative evaluation can be performed by comparing the agent's performance on different types of terrain. In the current state of the project, we have only tested one terrain that contains most of the obstacles we expect the agent to encounter. In the future we plan to add easier cases with fewer obstacles, in which we would expect the agent to perform very well, as well as more difficult cases that might cause the model to struggle. We can also visually assess the agent's performance: some of this qualitative evaluation is shown in the video above, and the screenshots below show the improvement in the model.

On the left, the baseline model is in proximity of the horse but walks around aimlessly rather than mounting the horse and exiting the obstacle territory. On the right, the working model is able to mount the horse and navigate through the cacti. Although the model does not yet avoid injuring the horse entirely or ride at an optimal speed, this is a clear improvement over the baseline.

Remaining Goals and Challenges

Currently, our prototype is limited because the terrain the agent navigates through is missing several kinds of obstacles, and the agent does not yet know how to avoid or pass through them. We would also like to train the agent to speed up or slow down around certain obstacles. For example, if the agent is in a more open area and is not surrounded by any obstacles, it can increase its speed, whereas if it is surrounded by many obstacles, it should slow down to avoid being injured. The next goals in this project are adding bodies of water and expanding the current size of the area the agent is in. To navigate a body of water, the agent will need to dismount the horse, cross the water, and then mount the horse again.

Additionally, throughout the training process we ran into issues where the agent or horse would sometimes spawn underneath a tree or other obstacle. This poses a problem when the agent tries to mount the horse and navigate through the terrain, because either the agent or the horse gets stuck underneath the obstacle. To avoid this issue, we need to make sure that the agent spawns surrounded by air blocks.
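One possible check is sketched below. It assumes the mission XML requests an `ObservationFromGrid` covering the blocks around the agent; the grid name, its extent, and the restart logic are assumptions rather than our current implementation.

```python
import json

def spawn_is_clear(world_state, grid_name="nearby"):
    """Return True if every block in the requested observation grid is air.

    Assumes the mission XML includes an <ObservationFromGrid> element whose
    <Grid name="nearby"> covers the blocks around and above the agent; the
    grid name and extent are illustrative.
    """
    if not world_state.observations:
        return False
    obs = json.loads(world_state.observations[-1].text)
    blocks = obs.get(grid_name, [])
    return len(blocks) > 0 and all(b == "air" for b in blocks)

# If the check fails, the mission can be restarted (or the agent repositioned)
# so neither the agent nor the horse begins stuck under a tree or cactus.
```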

We would also like to test the different algorithms that RLlib has to offer. Currently we are using PPO; other algorithms we would like to try include A3C, which can also handle continuous action spaces, something our project requires.

One challenge we will likely face is having enough time to train the model while comparing various algorithms. We noticed that improvement is quick in the initial steps of training, but the agent tends to stagnate and stop learning in the later portions of training, typically several hours in. We will try to mitigate this by running different algorithms on each of our computers in parallel and then comparing the results.

Resources Used