Final Report

Video

Demo Video

Project Summary

In this project, we train an AI to play a minigame that combines features of Beat Saber (similar to the map in this YouTube video: Zedd & Jasmine Thompson - Funny (Minecraft Music Video | Beat Synchronized!)) with features of Dancing Line (similar to this YouTube video: Dancing Line | The Piano %100 10/10 Gems).

The AI's task is to hit the blocks along the railroad while riding on it, using a sword of the same color as each block. A correct hit increases the AI's score, while a miss or a hit with a wrong-colored sword decreases it. The AI takes the game frame as input and performs the "switch tool" and "attack on the correct side" actions accordingly.

Algorithm visualization

Considering the difficulty of the problem and the time needed to train a good AI, we built two versions of this project. The first version (7-direction) has blocks coming from 7 directions (Top, Upper left, Middle left, Lower left, Upper right, Middle right, Lower right), with the remaining 2 directions (Front, Bottom) left empty for the agent to move through.

Algorithm visualization

The second version (2-direction + Redstone) involves only two directions (Left, Right), but to make the task more challenging, the AI encounters multiple branching railways and has to choose the correct one (the wrong ones lead to lava). The AI must hit the controlling redstone at the proper time to switch railways and stay on the correct one. The AI is rewarded for staying alive and punished when it falls into lava.

Algorithm visualization

This problem/setting needs an AI/ML algorithm because it is quite hard for humans to play this game perfectly in Minecraft: the cart moves relatively fast and blocks appear frequently. A simple if-else algorithm could be implemented; however, given how fast the agent moves and how fast the blocks approach, such an agent would require a huge 3-D observation space to capture the upcoming items, which is inefficient and memory-consuming. An AI/ML algorithm that learns features from a smaller 2-D game frame is therefore preferable. To play such a game, the AI needs convolutional neural networks and reinforcement learning to learn when blocks are approaching and where and when to hit them in order to earn points and stay alive.

Approach

Rewards

To reward our agent during training, we consider four factors: completion of the task, the time the agent survives, the number of correct hits, and the number of wrong hits. We considered two ways to reward survival time. The first was the number of time ticks taken, but time ticks turned out to be very unstable: each episode records slightly different tick counts even when all of them complete the task, which adds a lot of noise during training. We therefore reward the agent for each redstone hit instead, since a redstone hit switches the agent onto the correct railroad and leads to longer survival. For task completion, the agent gains a reward if it finishes the task and is punished if it dies (fails to switch railways). The reward function can be written as the following linear combination (a short code sketch of this shaping follows the component breakdown below):

$$R = w_{\text{correct}} \cdot N_{\text{correct hits}} - w_{\text{wrong}} \cdot N_{\text{wrong hits}} + w_{\text{redstone}} \cdot N_{\text{redstone hits}} + w_{\text{complete}} \cdot \mathbb{1}[\text{mission completed}] - w_{\text{death}} \cdot \mathbb{1}[\text{died}]$$

For hitting blocks and death (both 7-direction and 2-direction + Redstone):

For mission completion (2-direction + Redstone):
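
The sketch below shows how this reward shaping can be expressed in code. It is a minimal illustration only: the weight values are placeholders, not the exact numbers used in our missions.

```python
# Minimal sketch of the reward shaping described above.
# The weights are illustrative placeholders, not the exact values used in our missions.
def episode_reward(correct_hits, wrong_hits, redstone_hits, completed, died):
    reward = 0.0
    reward += 1.0 * correct_hits    # hit a block with the same-color sword
    reward -= 1.0 * wrong_hits      # miss, or hit with the wrong-color sword
    reward += 2.0 * redstone_hits   # each redstone hit keeps the cart on the correct rail
    if completed:                   # reached the end of the 2-direction + Redstone map
        reward += 10.0
    if died:                        # failed to switch railways and fell into lava
        reward -= 10.0
    return reward
```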

Action spaces

7-direction:

2-direction + Redstone Branching:
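
As a concrete illustration of the two discrete action sets, the sketch below lists one possible encoding. The exact action names and encodings used in our missions may differ, and the idle action "no_op" is an assumption for illustration.

```python
# Hedged sketch of the two discrete action sets; exact encodings may differ.
SEVEN_DIRECTION_ACTIONS = [
    "attack_top",
    "attack_upper_left", "attack_middle_left", "attack_lower_left",
    "attack_upper_right", "attack_middle_right", "attack_lower_right",
    "switch_tool",   # swap to the sword of the other color
    "no_op",         # assumed idle action
]

TWO_DIRECTION_REDSTONE_ACTIONS = [
    "attack_left",   # hit a block (or the redstone switch) on the left
    "attack_right",  # hit a block (or the redstone switch) on the right
    "switch_tool",
    "no_op",
]
```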

Observation / Information for the AI
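
The agent observes the rendered 2-D game frame. A minimal sketch of such an observation space follows; the 84x84 RGB resolution is an illustrative assumption, not the exact frame size used in our missions.

```python
import numpy as np
import gym

# Hedged sketch: the observation is the downscaled RGB game frame.
# The 84x84 resolution is an assumption for illustration.
observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)
```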

Model

Deep Q Network

Neural network structure

3-layer Convolutional Neural Network

3-layer Fully Connected Neural Network

NN Structure
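
The sketch below illustrates the structure described above: a 3-layer CNN followed by a 3-layer fully connected head that outputs one Q-value per action. The filter counts, kernel sizes, hidden widths, and the 84x84 input resolution are illustrative assumptions, not necessarily the exact values we used.

```python
import torch
import torch.nn as nn

# Hedged sketch of the DQN network: 3 conv layers + 3 fully connected layers.
# Layer sizes and the 84x84 input frame are illustrative assumptions.
class DQNNet(nn.Module):
    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes an 84x84 input
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_actions),            # one Q-value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```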

Comparisons with the past approaches

Structure Changes

At the beginning: QNetwork model, a 6-layer feed-forward neural network. No framework.

Later: PPO in default setting. Framework: RLlib.

Present: DQN model with CNN and FNN layers. Framework: RLlib.
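
A minimal sketch of the present setup (DQN trained through RLlib) is shown below. The environment id "MalmoMinigame-v0" is a placeholder and the hyperparameters are illustrative, not the exact values we used.

```python
import ray
from ray.rllib.agents.dqn import DQNTrainer

# Hedged sketch of training DQN with RLlib; env id and hyperparameters are placeholders.
ray.init()
trainer = DQNTrainer(env="MalmoMinigame-v0", config={
    "framework": "torch",
    "num_workers": 0,   # a single Minecraft instance
    "gamma": 0.99,
    "lr": 1e-4,
})
for _ in range(100):
    result = trainer.train()
    print(result["episode_reward_mean"])
```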

Version Changes and Concerns

7-direction:

2-direction + redstone:

Rewards changes

Punishment for attacks

Rewards for falling into the lava or reaching the destination

Rewards for time living vs. rewards for hitting redstone

Model and Framework Changes

Map Design Changes

Evaluation

Quantitative Evaluation

The total score the agent receives after completing each episode is one evaluation criterion. The higher the score, the longer the agent survives and the better its performance. The score consists of several parts: staying alive (with a penalty for failing to hit the lever and riding into lava), correct hits (hitting a block with the tool of the same color), and hit timing. We use random agents as the first baseline and human-player scores as the second baseline, and train the agent to perform better than random play and approach human performance.

7-Direction

Rewards for Random Agent (Baseline)

Algorithm visualization

Rewards for Our AI

Algorithm visualization

2-Direction + Redstone

Rewards for Random Agent (Baseline)

Algorithm visualization

Rewards for Our AI

Algorithm visualization

Qualitative Evaluation

We examine the result videos, look at how each agent makes its decisions, and compare the performance of the agents against each other.

7-Direction

Random Agent
Our AI
Human Player

2-Direction w/ Redstone Branching

Random Agent
Our AI
Human Player

Comments for Qualitative Evaluation

Clear improvements over the random agents:

Still not as good as human players:

Resources Used