Status

Video Summary

Demo Video

Project Summary

We are planning to train an AI to play a minigame that integrates features from Beat Saber (similar to the map in this YouTube video: Zedd & Jasmine Thompson - Funny (Minecraft Music Video | Beat Synchronized!)) but also includes features similar to Dancing Line (similar to this YouTube video: Dancing Line | The Piano %100 10/10 Gems).

The AI's task is to hit the blocks along the railroad while riding on it, using a sword of the same color as the block, at the proper time (i.e., right before the block passes the agent). A correct and precisely timed hit increases the AI's score, while a miss or a hit with a sword of the wrong color decreases it. The AI takes the game frame as input and performs "aim" (turn and look/pitch), "switch tool", and "attack" actions accordingly.
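As a rough sketch, these three action groups could map to Malmo commands as shown below, assuming ContinuousMovementCommands and InventoryCommands are enabled in the mission XML; the helper function names and command values are illustrative, not our exact code.

```python
# Illustrative mapping from the agent's action groups to Malmo commands.
def aim(agent_host, turn_speed, pitch_speed):
    agent_host.sendCommand(f"turn {turn_speed}")    # yaw left/right
    agent_host.sendCommand(f"pitch {pitch_speed}")  # look up/down

def switch_tool(agent_host, slot):
    agent_host.sendCommand(f"hotbar.{slot} 1")      # press the hotbar key
    agent_host.sendCommand(f"hotbar.{slot} 0")      # release the hotbar key

def attack(agent_host):
    agent_host.sendCommand("attack 1")              # start swinging the sword
    agent_host.sendCommand("attack 0")              # stop swinging
```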

To make the task more challenging, the AI will encounter multiple railways and has to choose the correct one (the wrong ones lead into lava). The AI must hit the controlling redstone at the proper time to switch railways and ride on the correct one.

Approach

At the current stage, we are using reinforcement learning with a Q-learning algorithm. For the QNetwork model, we use PyTorch with a 6-layer feed-forward neural network, but we are considering switching to Keras and frameworks such as RLlib in the future. The structure of our model is as follows: linear, ReLU, linear, ReLU, linear, Softmax. The observation the agent receives is the nine blocks near the agent, one block above the ground; these blocks include the redstone that controls the railroads and the wool blocks to hit. As we optimize the design of our map, we may change the range of the observation. The model has 9 outputs, corresponding to the actions the agent can choose. As we further increase the difficulty of the problem and add more choices to the agent's inventory, we will also increase the number of outputs.
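Below is a minimal sketch of this architecture in PyTorch. The observation encoding, hidden width, and the names OBS_SIZE, HIDDEN_SIZE, and NUM_ACTIONS are illustrative assumptions, not the exact values used in our code.

```python
import torch
import torch.nn as nn

# Sketch of the 6-layer QNetwork described above:
# linear -> ReLU -> linear -> ReLU -> linear -> Softmax.
OBS_SIZE = 9        # nine nearby blocks, encoded as one value each (assumption)
HIDDEN_SIZE = 64    # hypothetical hidden width
NUM_ACTIONS = 9     # nine actions the agent can choose

class QNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_SIZE, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, NUM_ACTIONS),
            nn.Softmax(dim=-1),
        )

    def forward(self, obs):
        # obs: (batch, OBS_SIZE) tensor of encoded block types
        return self.net(obs)
```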

We change how the default reward system in Malmo is used because we need to consider whether the agent hits the block with the tool of the same color. When the agent successfully hits a block, it collects the item mined from that block, so we use the RewardForCollectingItem handler as a signal sender: Malmo is configured to send a different reward for each collected item. In the function that receives and processes these rewards, we treat them as signals and check whether each signal matches the tool in the agent's hand, which is stored in a global variable. At the current stage, the reward function simply adds rewards or subtracts penalties, and the final score is returned when an action ends. In future stages, we may make it more complicated and change when the agent learns its score.
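The matching step could look roughly like the sketch below, which assumes Malmo's timestamped reward objects (with getValue()) and a hypothetical reward-to-color mapping; the actual reward values and scoring in our code differ.

```python
# Hypothetical mapping from the reward value Malmo sends (via
# RewardForCollectingItem) to the color of the block that was hit.
REWARD_TO_COLOR = {
    1.0: "red",
    2.0: "blue",
}

current_tool_color = "red"   # global variable updated on "switch tool"

def process_rewards(rewards, score):
    """Treat collected-item rewards as signals and score the hits."""
    for r in rewards:                        # r: a Malmo timestamped reward
        color = REWARD_TO_COLOR.get(r.getValue())
        if color is None:
            continue                         # not a signal we care about
        if color == current_tool_color:
            score += 1                       # hit with the matching tool
        else:
            score -= 1                       # hit with the wrong tool
    return score
```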

The QNetwork is updated every 200 frames. Gradient descent and backpropagation are applied, and the weights in the state_dict are updated accordingly.
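A simplified version of this update step is sketched below, reusing the QNetwork sketch from above. The replay buffer, batch size, discount factor, and optimizer settings are illustrative assumptions, and no separate target network is shown.

```python
import random
import torch
import torch.nn as nn

GAMMA = 0.9          # assumed discount factor
UPDATE_EVERY = 200   # update interval in frames
BATCH_SIZE = 32      # assumed batch size

q_net = QNetwork()                                   # class sketched above
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
replay_buffer = []   # list of (obs, action, reward, next_obs, done) tuples

def update_q_network():
    batch = random.sample(replay_buffer, BATCH_SIZE)
    obs, actions, rewards, next_obs, done = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    # Q-value of the action actually taken in each sampled transition.
    q_values = q_net(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + GAMMA * q_net(next_obs).max(dim=1).values * (1 - done)
    loss = loss_fn(q_values, target)   # TD error
    optimizer.zero_grad()
    loss.backward()                    # backpropagation
    optimizer.step()                   # gradient step updates the state_dict weights
```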

Evaluation

Quantitatively, the total score the agent receives after completing each episode is one evaluation criterion: the higher the score, the longer the agent survives and the better its performance. The score consists of several parts, including survival (a penalty for not hitting the lever and riding into lava), correct hits (hitting a block with the tool of the same color), and hit timing. We will use a random agent as the first baseline and a human player's score as the second baseline, and train the agent to outperform both random and manual play. In later stages, more tools and levers will be added to the map, and the evaluation process will start again.
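As a small illustration of the quantitative comparison, the snippet below averages per-episode scores and checks them against the two baselines; all numbers are placeholders, not measured results.

```python
# Placeholder evaluation sketch: compare the trained agent's mean
# episode score against the random-agent and human-player baselines.
def mean_score(episode_scores):
    return sum(episode_scores) / len(episode_scores)

agent_scores = [12, 15, 9, 18]                    # hypothetical per-episode totals
random_baseline = mean_score([-5, -3, -6, -4])    # hypothetical
human_baseline = 20                               # hypothetical

avg = mean_score(agent_scores)
print(f"agent {avg:.1f} | random {random_baseline:.1f} | human {human_baseline}")
print("beats random:", avg > random_baseline, "| beats human:", avg > human_baseline)
```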

For qualitative evaluation, we will check the final result and examine how the agent makes decisions. The decisions include turn, attack, pitch, and hotbar switching. If the agent mostly makes the necessary decisions, it passes the qualitative evaluation; if not, it will need improvements beyond what the quantitative evaluation suggests.

Remaining Goals and Challenges

For challenges:

For the solutions we currently have:

Therefore, for goals and future plans:

Resources Used