How I spent 1000 leftover GPU-hours trying things
The trained RL agent finding its way out of being surrounded by mobs. It uses speed potions it had saved up, lands melee attacks, eats to restore health, and shoots its bow.
Training RL agents for complex games is often slow due to the rendering costs of the environment. A few years ago, with the original goal of learning Rust, I built a survival game that sidesteps this limitation. The turn-based mechanics and headless simulation of 2d-minecraft enable fast execution and easy vectorization for reinforcement learning. Still, the game offers relatively rich gameplay, with crafting progression, resource management, combat, and dozens of distinct actions to choose from. This makes it significantly more challenging than classic Atari benchmarks, while remaining fast enough for rapid experimentation.
The environment is a 2.5D tiled world with hostile mobs that exhibit varied behaviors. For example, fast purple enemies coordinate to attack only in groups, while green round ones will detonate and destroy walls if they can't find a short path to the player. Useful items can be crafted from collected materials, and defeated mobs sometimes drop powerful consumables. To play the game well, the agent must learn to craft essential equipment and become adept at fighting large numbers of mobs.
This post describes (the more informative fraction of) the experiments I ran while training a neural network to play this game better than I can myself. Through trial, error, and ablations, I found that with the right training setup, PPO surpasses human performance within a few hours of training on consumer hardware.
Rather than raw pixels, the agent receives structured observations: the player's state (health, position, game time), nearby mobs and loot items, inventory contents, and its most recent actions.
Crucially, all of this information (and more) is available to a human player via the game's graphics.
The game is turn-based, with around 30 discrete actions, including crafting items, moving, using consumables, and placing blocks.
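For concreteness, here is a rough sketch of what this interface could look like as Gymnasium spaces. The field names, shapes, and action breakdown are my own illustrative assumptions based on the descriptions in this post, not the game's actual API:

```python
# Illustrative only: field names, sizes, and the action breakdown are assumptions.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "player": spaces.Box(-np.inf, np.inf, shape=(8,)),          # HP, position, game time, ...
    "inventory": spaces.Box(0, np.inf, shape=(32,)),            # item counts
    "nearby_mobs": spaces.Box(-np.inf, np.inf, shape=(16, 6)),  # per-mob features
    "nearby_loot": spaces.Box(-np.inf, np.inf, shape=(16, 4)),  # per-item features
    "recent_actions": spaces.MultiDiscrete([30] * 4),           # the last 4 actions taken
})
action_space = spaces.Discrete(30)  # move, craft, use consumable, place block, ...
```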
The policy network follows a two-stage architecture: a Feature Extractor processes structured observations into a fixed-size embedding, which is then passed through a Residual MLP to produce action logits and value estimates.
The feature extractor uses specialized encoders for each modality, injecting inductive bias into the model to accelerate training. For example, nearby mobs and loot items are aggregated using multi-head attention pooling, allowing the network to focus on relevant entities regardless of their order in the observation.
A slightly simplified representation of the feature extractor.
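As a concrete illustration of the attention pooling, here is a minimal PyTorch sketch of how a variable-size set of entities (mobs or loot) can be pooled into a fixed-size vector with a learned query. Dimensions and details are assumptions, not the exact implementation used here:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pools a variable-size set of entity features (mobs, loot) into one vector.
    A single learned query attends over the entities, so the result does not
    depend on their order. Dimensions are illustrative."""

    def __init__(self, entity_dim: int, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(entity_dim, embed_dim)             # per-entity encoder
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, entities, padding_mask=None):
        # entities: (batch, num_entities, entity_dim)
        # padding_mask: (batch, num_entities), True marks padded slots to ignore
        keys = self.proj(entities)
        query = self.query.expand(entities.shape[0], -1, -1)
        pooled, _ = self.attn(query, keys, keys, key_padding_mask=padding_mask)
        return pooled.squeeze(1)                                 # (batch, embed_dim)
```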
The policy (or value) network processes the concatenated features through a residual MLP with 3 to 12 blocks. Each block expands to a hidden dimension, applies a nonlinearity, projects back, and adds the result to the block's input via a residual connection.
The value network has an identical architecture, with a separate set of weights. Both networks share the same feature extractor.
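In PyTorch terms, each block could look roughly like this sketch; the widths are illustrative, and the activation choice is compared further below:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Expand, apply a nonlinearity, project back, and add to the input.
    Widths are illustrative; the activation choice is compared further below."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),   # expand to the hidden dimension
            nn.GELU(),                    # nonlinearity
            nn.Linear(hidden_dim, dim),   # project back to the residual width
        )

    def forward(self, x):
        return x + self.net(x)            # residual connection
```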
With the architecture in place, the agent starts playing the game by making practically random moves. Each action it takes earns a reward, and over millions of iterations it learns which decisions lead to higher scores. The experiments below track this learning process and show which design choices made the biggest difference.
The first and most important optimization is scaling the inputs. Neural networks struggle with large and unbounded values like game time, player position, or health points.
I applied log-scaling to unbounded observations (like time and position), normalized HP relative to the maximum, and addressed other features similarly. This single change makes the difference between a network that learns and one that doesn't.
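As a concrete example, this is roughly what the scaling amounts to; the field names (`game_time`, `position`, `hp`, `max_hp`) are hypothetical placeholders:

```python
import numpy as np

def scale_observations(obs: dict) -> dict:
    """Illustrative scaling with hypothetical field names: log-compress unbounded
    values, express bounded ones as a fraction of their maximum."""
    scaled = dict(obs)
    scaled["game_time"] = np.log1p(obs["game_time"])                  # unbounded -> log scale
    scaled["position"] = np.sign(obs["position"]) * np.log1p(np.abs(obs["position"]))
    scaled["hp"] = obs["hp"] / obs["max_hp"]                          # bounded -> [0, 1]
    return scaled
```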
This plot tracks learning progress over time. The x-axis shows environment steps: how many in-game actions the agent has taken across all training episodes up to that point. The y-axis shows the average game score of episodes completed at that point in training. Lines are smoothed with a 10-point running average, and the shaded regions indicate the standard deviation across runs in the respective group. As the agent learns, it survives longer and scores higher, so an upward trend indicates improvement. Without observation scaling, the agent stays flat near a low score, as it never learns to play.
Having scaled the inputs, let's now scale the outputs. The reward signal in this environment varies greatly: surviving a turn gives +1, killing a zombie gives +20, and mining a diamond gives +50. Moreover, as the agent progresses, episode returns increase by orders of magnitude. Without normalization, this destabilizes training.
The fix is to track a running estimate of return variance and divide each reward by its standard deviation. This keeps returns in a consistent range regardless of how well the agent is doing.
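A simplified sketch of such a normalizer, tracking the variance of the discounted return online; the discount factor and the particular variance estimator are assumptions, not necessarily the exact code used in these experiments:

```python
import numpy as np

class RewardNormalizer:
    """Divides each reward by a running estimate of the std of discounted returns.
    Simplified sketch; discount and variance estimator are assumptions."""

    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0            # discounted return of the current episode so far
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def __call__(self, reward: float, done: bool) -> float:
        self.ret = self.gamma * self.ret + reward
        # online (Welford-style) update of the return variance
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        if done:
            self.ret = 0.0
        return reward / (np.sqrt(self.var) + self.eps)
```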
Before adding reward normalization, I had runs with otherwise decent hyperparameters completely collapse after hours of seemingly okay training. After introducing it, the collapses stopped entirely and training became significantly faster.
Both results use an otherwise final training setup, including the components discussed in the following sections.
Normalizing observations and rewards sure worked well. What else can be normalized? Layer outputs! The best runs use LayerNorm throughout the network. Removing it substantially slows down training, and even makes it plateau earlier.
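Concretely, one plausible placement is a pre-norm residual block, i.e. the block sketched earlier with a LayerNorm in front of the expansion layer; the exact placement is my assumption:

```python
import torch.nn as nn

class ResidualBlockLN(nn.Module):
    """The same block as above, with a LayerNorm at the start (pre-norm).
    The exact placement of the LayerNorm is an assumption."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),            # normalize the block input
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)
```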
LayerNorm helps with gradient flow, but so does the choice of the nonlinearity. Tanh is, or at least was until recently, a popular choice for RL networks. However, since our policy network is relatively deep compared to standard networks for simple environments, vanishing gradients may be an issue. ReLU is the standard activation function in many domains of deep learning, but is said to suffer from the dying ReLU problem. GELU has become the default in modern architectures, so I compared all three along with Leaky ReLU.
GELU consistently outperforms Leaky ReLU and tanh across network depths, but performs on par with ReLU, which shows no signs of dying within the first 100M environment steps. The computational overhead of GELU is negligible in this use case, so I continue using it for the remaining experiments.
* No 4-block tanh run was performed with reward normalization. Earlier experiments without reward normalization showed the 4-block tanh network training worse than GELU but better than the 8-block tanh, consistent with vanishing gradients being less severe in shallower networks.
The reward signal for the agent consists entirely of the game score, with one exception. The agent receives an additional reward for each new action it discovers within the game. For example, whenever the agent first gathers resources to craft a specific item, obtains a consumable, or mines a new type of block, it receives the bonus reward.
This intrinsic motivation helps bootstrap learning before the extrinsic game score becomes informative, greatly accelerating progress at early stages.
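A sketch of how such a discovery bonus can be layered on top of the game score; the event-name interface and the bonus value are assumptions for illustration:

```python
class DiscoveryBonus:
    """Adds a one-time bonus the first time the agent triggers a new kind of event
    (crafting a specific item, obtaining a consumable, mining a new block type, ...).
    The event-name interface and the bonus value are illustrative assumptions."""

    def __init__(self, bonus: float = 10.0):
        self.bonus = bonus
        self.seen = set()

    def __call__(self, game_reward: float, events: list) -> float:
        new_events = [e for e in events if e not in self.seen]
        self.seen.update(new_events)
        return game_reward + self.bonus * len(new_events)
```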
After the initial training runs, I continued training from saved checkpoints of the most successful runs to push the agent's performance further and see if it reaches a plateau.
Reducing the learning rate by an order of magnitude is beneficial for this second step, while reducing the entropy coefficient has little effect on performance. Earlier experiments showed that increasing the entropy coefficient over 0.01 significantly slows down training in the first 64M steps.
As we can see from these runs, the agent does in fact reach a plateau around a game score of 30 thousand. In the evaluation setting, some of the checkpoints score 35k+ on average, and individual games reach into 55k+ territory.
The first logged game score for continuation runs is artificially low because those data points are dominated by shorter (and therefore lower-scoring) episodes, an artifact of the iteration size.
Not every idea panned out. Some changes that seemed promising had no measurable impact on training.
Each episode, the agent is initialized with one of 4 random starting loadouts, including a fighter and an archer. The fighter starts with a sword and meat (a consumable that restores health), while the archer begins with a bow and 10 arrows. In both cases, 3 zombies are spawned near the player, with some zombies having reduced health. Despite these varied starting conditions, some of which encourage fighting, the loadouts did not appear to impact the speed of early training.
When the agent reaches a milestone (e.g., crafting a pickaxe), the game state is saved. On subsequent episode resets, there's a chance of loading from a recent checkpoint instead of starting from scratch. The idea was to let the agent practice late-game skills without having to replay the early game every time. In practice, this curriculum stopped being useful once the network was training quickly and consistently.
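For reference, the mechanism amounts to something like the following; `save_state` and `load_state` are assumed hooks into the game, and the restart probability is illustrative:

```python
import random

class CheckpointCurriculum:
    """On reset, sometimes restart from a saved mid-game snapshot instead of a
    fresh game. `save_state` / `load_state` are assumed hooks into the game, and
    the restart probability is illustrative."""

    def __init__(self, env, restart_prob: float = 0.25, max_saved: int = 16):
        self.env = env
        self.restart_prob = restart_prob
        self.saved = []                      # recent milestone snapshots
        self.max_saved = max_saved

    def on_milestone(self):
        self.saved.append(self.env.save_state())
        self.saved = self.saved[-self.max_saved:]

    def reset(self):
        if self.saved and random.random() < self.restart_prob:
            return self.env.load_state(random.choice(self.saved))
        return self.env.reset()
```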
The residual blocks use orthogonal initialization with a low gain on the second linear layer, making each block start close to an identity mapping. This technique is meant to improve gradient flow in deep networks. However, at the scale of 8 residual blocks, this initialization choice had no noticeable effect on training dynamics or final performance.
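In code, this amounts to something like the following sketch; the gain values are illustrative:

```python
import torch.nn as nn

def init_block(block: nn.Sequential, out_gain: float = 0.01):
    """Orthogonal init with a small gain on the second linear layer, so the block
    starts close to an identity mapping. Gain values are illustrative."""
    first, second = [m for m in block if isinstance(m, nn.Linear)]
    nn.init.orthogonal_(first.weight, gain=2 ** 0.5)   # standard gain for the expansion
    nn.init.orthogonal_(second.weight, gain=out_gain)  # near-zero contribution at init
    nn.init.zeros_(first.bias)
    nn.init.zeros_(second.bias)
```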
The biggest wins came from normalization: scaling observations, normalizing rewards, and LayerNorm in the residual blocks. These changes transformed unstable training into consistent progress. GELU performed well, though plain ReLU worked just as effectively. Discovery rewards helped bootstrap early learning, but curriculum tricks like checkpoint restarting added little once the fundamentals were in place.
The training plateaus around 30k score, with occasional episodes reaching 55k+. The limiting factor is likely the lack of recurrence: the agent is almost stateless, seeing only its past 4 actions, which prevents any real planning. It's also unclear whether indefinite survival is even possible: speed potions become essential for late-game combat, and the agent may eventually need more than it can loot from defeated mobs.
Mostly, this was just fun. There were plenty more heuristics and optimizations that went into getting consistent training — things too small to warrant their own section but that added up to a setup that just worked. So in the end, the trained agent plays better than I do, which was the goal. A thousand GPU-hours well spent.