How I spent 1000 leftover GPU-hours trying things
The trained RL agent finding its way out of being surrounded by mobs. It uses speed potions it had saved up, lands melee attacks, eats to restore health, and shoots its bow.
Training RL agents for complex games is often slow due to the rendering costs of the environment. A few years ago, with the original goal of learning Rust, I built a survival game that sidesteps this limitation. The turn-based mechanics and headless simulation of 2d-minecraft enable fast execution and easy vectorization for reinforcement learning. Still, the game offers relatively rich gameplay, with crafting progression, resource management, combat, and dozens of distinct actions to choose from. This makes it significantly more challenging than classic Atari benchmarks, while remaining fast enough for rapid experimentation.
The environment is a 2.5D tiled world with hostile mobs that exhibit varied behaviors. For example, fast purple enemies coordinate to attack only in groups, while green round ones will detonate and destroy walls if they can't find a short path to the player. Useful items can be crafted from collected materials, and defeated mobs sometimes drop powerful consumables. To play the game well, the agent must learn to craft essential equipment and become adept at fighting large numbers of mobs.
This post describes (the more informative fraction of) the experiments I ran while training a neural network to play this game better than I can myself. Through trial, error, and ablations, I found that with the right training setup, PPO surpasses human performance within a few hours of training on consumer hardware.
Rather than raw pixels, the agent receives structured observations: the player's state (health, position, game time), nearby mobs and loot items, inventory contents, and its most recent actions.
Crucially, all of this information (and more) is available to a human player via the game's graphics.
The game is turn-based, with around 30 discrete actions, including crafting items, moving, using consumables, and placing blocks.
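For concreteness, here is a rough sketch of what this interface could look like as Gymnasium spaces. The field names, shapes, and action breakdown are my own illustrative assumptions based on the descriptions in this post, not the game's actual API:

```python
# Illustrative only: field names, sizes, and the action breakdown are assumptions.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "player": spaces.Box(-np.inf, np.inf, shape=(8,)),          # HP, position, game time, ...
    "inventory": spaces.Box(0, np.inf, shape=(32,)),            # item counts
    "nearby_mobs": spaces.Box(-np.inf, np.inf, shape=(16, 6)),  # per-mob features
    "nearby_loot": spaces.Box(-np.inf, np.inf, shape=(16, 4)),  # per-item features
    "recent_actions": spaces.MultiDiscrete([30] * 4),           # the last 4 actions taken
})
action_space = spaces.Discrete(30)  # move, craft, use consumable, place block, ...
```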
The policy network follows a two-stage architecture: a Feature Extractor processes structured observations into a fixed-size embedding, which is then passed through a Residual MLP to produce action logits and value estimates.
The feature extractor uses specialized encoders for each modality, injecting inductive bias into the model to accelerate training. For example, nearby mobs and loot items are aggregated using multi-head attention pooling, allowing the network to focus on relevant entities regardless of their order in the observation.
A slightly simplified representation of the feature extractor.
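As a concrete illustration of the attention pooling, here is a minimal PyTorch sketch of how a variable-size set of entities (mobs or loot) can be pooled into a fixed-size vector with a learned query. Dimensions and details are assumptions, not the exact implementation used here:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pools a variable-size set of entity features (mobs, loot) into one vector.
    A single learned query attends over the entities, so the result does not
    depend on their order. Dimensions are illustrative."""

    def __init__(self, entity_dim: int, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(entity_dim, embed_dim)             # per-entity encoder
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, entities, padding_mask=None):
        # entities: (batch, num_entities, entity_dim)
        # padding_mask: (batch, num_entities), True marks padded slots to ignore
        keys = self.proj(entities)
        query = self.query.expand(entities.shape[0], -1, -1)
        pooled, _ = self.attn(query, keys, keys, key_padding_mask=padding_mask)
        return pooled.squeeze(1)                                 # (batch, embed_dim)
```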
The policy (or value) network processes the concatenated features through a residual MLP with 3 to 12 blocks. Each block expands to a hidden dimension, applies a nonlinearity, projects back, and adds the result to the block's input via a residual connection.
The value network has an identical architecture, with a separate set of weights. Both networks share the same feature extractor.
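In PyTorch terms, each block could look roughly like this sketch; the widths are illustrative, and the activation choice is compared further below:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Expand, apply a nonlinearity, project back, and add to the input.
    Widths are illustrative; the activation choice is compared further below."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),   # expand to the hidden dimension
            nn.GELU(),                    # nonlinearity
            nn.Linear(hidden_dim, dim),   # project back to the residual width
        )

    def forward(self, x):
        return x + self.net(x)            # residual connection
```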
With the architecture in place, the agent starts playing the game by making practically random moves. Each action it takes earns a reward, and over millions of iterations it learns which decisions lead to higher scores. The experiments below track this learning process and show which design choices made the biggest difference.
The first and most important optimization is scaling the inputs. Neural networks struggle with large and unbounded values like game time, player position, or health points.
I applied log-scaling to unbounded observations (like time and position), normalized HP relative to the maximum, and addressed other features similarly. This single change makes the difference between a network that learns and one that doesn't.
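As a concrete example, this is roughly what the scaling amounts to; the field names (`game_time`, `position`, `hp`, `max_hp`) are hypothetical placeholders:

```python
import numpy as np

def scale_observations(obs: dict) -> dict:
    """Illustrative scaling with hypothetical field names: log-compress unbounded
    values, express bounded ones as a fraction of their maximum."""
    scaled = dict(obs)
    scaled["game_time"] = np.log1p(obs["game_time"])                  # unbounded -> log scale
    scaled["position"] = np.sign(obs["position"]) * np.log1p(np.abs(obs["position"]))
    scaled["hp"] = obs["hp"] / obs["max_hp"]                          # bounded -> [0, 1]
    return scaled
```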
This plot tracks learning progress over time. The x-axis shows environment steps: how many in-game actions the agent has taken across all training episodes up to that point. The y-axis shows the average game score of episodes completed at that point in training. Lines are smoothed with a 10-point running average, and the shaded regions indicate the standard deviation across runs in the respective group. As the agent learns, it survives longer and scores higher, so an upward trend indicates improvement. Without observation scaling, the agent stays flat near a low score, as it never learns to play.
Having scaled the inputs, let's now scale the outputs. The reward signal in this environment varies greatly: surviving a turn gives +1, killing a zombie gives +20, and mining a diamond gives +50. Moreover, as the agent progresses, episode returns increase by orders of magnitude. Without normalization, this destabilizes training.
The fix is to track a running estimate of return variance and divide each reward by its standard deviation. This keeps returns in a consistent range regardless of how well the agent is doing.
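A simplified sketch of such a normalizer, tracking the variance of the discounted return online; the discount factor and the particular variance estimator are assumptions, not necessarily the exact code used in these experiments:

```python
import numpy as np

class RewardNormalizer:
    """Divides each reward by a running estimate of the std of discounted returns.
    Simplified sketch; discount and variance estimator are assumptions."""

    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0            # discounted return of the current episode so far
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def __call__(self, reward: float, done: bool) -> float:
        self.ret = self.gamma * self.ret + reward
        # online (Welford-style) update of the return variance
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        if done:
            self.ret = 0.0
        return reward / (np.sqrt(self.var) + self.eps)
```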
Before adding reward normalization, I had runs with otherwise decent hyperparameters completely collapse after hours of seemingly okay training. After introducing it, the collapses stopped entirely and training became significantly faster.
Both results use an otherwise final training setup, including the components discussed in the following sections.
Normalizing observations and rewards sure worked well. What else can be normalized? Layer outputs! The best runs use LayerNorm throughout the network. Removing it substantially slows down training, and even makes it plateau earlier.
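Concretely, one plausible placement is a pre-norm residual block, i.e. the block sketched earlier with a LayerNorm in front of the expansion layer; the exact placement is my assumption:

```python
import torch.nn as nn

class ResidualBlockLN(nn.Module):
    """The same block as above, with a LayerNorm at the start (pre-norm).
    The exact placement of the LayerNorm is an assumption."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),            # normalize the block input
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)
```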
LayerNorm helps with gradient flow, but so does the choice of the nonlinearity. Tanh is, or at least was until recently, a popular choice for RL networks. However, since our policy network is relatively deep compared to standard networks for simple environments, vanishing gradients may be an issue. ReLU is the standard activation function in many domains of deep learning, but is said to suffer from the dying ReLU problem. GELU has become the default in modern architectures, so I compared all three along with Leaky ReLU.
GELU consistently outperforms Leaky ReLU and tanh across network depths, but performs on par with ReLU, which shows no signs of dying within the first 100M environment steps. The computational overhead of GELU is negligible in this use case, so I continue using it for the remaining experiments.
* No 4-block tanh run was performed with reward normalization. Earlier experiments without reward normalization showed the 4-block tanh network training worse than GELU but better than the 8-block tanh, consistent with vanishing gradients being less severe in shallower networks.
The reward signal for the agent consists entirely of the game score, with one exception. The agent receives an additional reward for each new action it discovers within the game. For example, whenever the agent first gathers resources to craft a specific item, obtains a consumable, or mines a new type of block, it receives the bonus reward.
This intrinsic motivation helps bootstrap learning before the extrinsic game score becomes informative, greatly accelerating progress at early stages.
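A sketch of how such a discovery bonus can be layered on top of the game score; the event-name interface and the bonus value are assumptions for illustration:

```python
class DiscoveryBonus:
    """Adds a one-time bonus the first time the agent triggers a new kind of event
    (crafting a specific item, obtaining a consumable, mining a new block type, ...).
    The event-name interface and the bonus value are illustrative assumptions."""

    def __init__(self, bonus: float = 10.0):
        self.bonus = bonus
        self.seen = set()

    def __call__(self, game_reward: float, events: list) -> float:
        new_events = [e for e in events if e not in self.seen]
        self.seen.update(new_events)
        return game_reward + self.bonus * len(new_events)
```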
After the initial training runs, I continued training from saved checkpoints of the most successful runs to push the agent's performance further and see if it reaches a plateau.
Reducing the learning rate by an order of magnitude is beneficial for this second step, while reducing the entropy coefficient has little effect on performance. Earlier experiments showed that increasing the entropy coefficient over 0.01 significantly slows down training in the first 64M steps.
As we can see from these runs, the agent does in fact reach a plateau around a game score of 30 thousand. In the evaluation setting, some of the checkpoints score 35k+ on average, and individual games reach into 55k+ territory.
The first logged game score for continuation runs is artificially low because those data points are dominated by shorter (and therefore lower-scoring) episodes, an artifact of the iteration size.
Not every idea panned out. Some changes that seemed promising had no measurable impact on training.
Each episode, the agent is initialized with one of 4 random starting loadouts, including a fighter and an archer. The fighter starts with a sword and meat (a consumable that restores health), while the archer begins with a bow and 10 arrows. In both cases, 3 zombies are spawned near the player, with some zombies having reduced health. Despite these varied starting conditions, some of which encourage fighting, the loadouts did not appear to impact the speed of early training.
When the agent reaches a milestone (e.g., crafting a pickaxe), the game state is saved. On subsequent episode resets, there's a chance of loading from a recent checkpoint instead of starting from scratch. The idea was to let the agent practice late-game skills without having to replay the early game every time. In practice, this curriculum stopped being useful once the network was training quickly and consistently.
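For reference, the mechanism amounts to something like the following; `save_state` and `load_state` are assumed hooks into the game, and the restart probability is illustrative:

```python
import random

class CheckpointCurriculum:
    """On reset, sometimes restart from a saved mid-game snapshot instead of a
    fresh game. `save_state` / `load_state` are assumed hooks into the game, and
    the restart probability is illustrative."""

    def __init__(self, env, restart_prob: float = 0.25, max_saved: int = 16):
        self.env = env
        self.restart_prob = restart_prob
        self.saved = []                      # recent milestone snapshots
        self.max_saved = max_saved

    def on_milestone(self):
        self.saved.append(self.env.save_state())
        self.saved = self.saved[-self.max_saved:]

    def reset(self):
        if self.saved and random.random() < self.restart_prob:
            return self.env.load_state(random.choice(self.saved))
        return self.env.reset()
```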
The residual blocks use orthogonal initialization with a low gain on the second linear layer, making each block start close to an identity mapping. This technique is meant to improve gradient flow in deep networks. However, at the scale of 8 residual blocks, this initialization choice had no noticeable effect on training dynamics or final performance.
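In code, this amounts to something like the following sketch; the gain values are illustrative:

```python
import torch.nn as nn

def init_block(block: nn.Sequential, out_gain: float = 0.01):
    """Orthogonal init with a small gain on the second linear layer, so the block
    starts close to an identity mapping. Gain values are illustrative."""
    first, second = [m for m in block if isinstance(m, nn.Linear)]
    nn.init.orthogonal_(first.weight, gain=2 ** 0.5)   # standard gain for the expansion
    nn.init.orthogonal_(second.weight, gain=out_gain)  # near-zero contribution at init
    nn.init.zeros_(first.bias)
    nn.init.zeros_(second.bias)
```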
The biggest wins came from normalization: scaling observations, normalizing rewards, and LayerNorm in the residual blocks. These changes transformed unstable training into consistent progress. GELU performed well, though plain ReLU worked just as effectively. Discovery rewards helped bootstrap early learning, but curriculum tricks like checkpoint restarting added little once the fundamentals were in place.
The training plateaus around 30k score, with occasional episodes reaching 55k+. The limiting factor is likely the lack of recurrence: the agent is almost stateless, seeing only its past 4 actions, which prevents any real planning. It's also unclear whether indefinite survival is even possible: speed potions become essential for late-game combat, and the agent may eventually need more than it can loot from defeated mobs.
Mostly, this was just fun. There were plenty more heuristics and optimizations that went into getting consistent training — things too small to warrant their own section but that added up to a setup that just worked. So in the end, the trained agent plays better than I do, which was the goal. A thousand GPU-hours well spent.