Building a LunarLander Agent with REINFORCE in PyTorch
Walk through the full REINFORCE policy-gradient pipeline for LunarLander.
Ed Saunders demonstrates how to set up Gymnasium, model a policy in PyTorch, and train it end-to-end with discounted, normalized returns.
Ed Saunders lays out the full journey of building a LunarLander reinforcement-learning agent in PyTorch, from spinning up Gymnasium to landing the craft reliably. The video is a hands-on tutorial that viewers can follow stage by stage through the pipeline.
The walkthrough starts with environment setup, quick hardcoded rollouts, and side-by-side reward plots to show how the eight-value state vector (position, velocity, tilt, leg sensors) drives four possible thrust commands.

Next comes a compact policy network: an 8 → 128 → 4 multilayer perceptron whose logits flow through a softmax so the lander samples actions instead of repeating the same move. The video explains why that stochasticity keeps exploration alive without overwhelming compute budgets.

Raw episode returns bounce wildly, so Ed introduces discounted returns with γ = 0.99, targets normalized with a small epsilon for numerical stability, and the REINFORCE loss
L = -\sum_t \log \pi_\theta(a_t \mid s_t) \, G_t
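A minimal sketch of such a policy network in PyTorch (the 8 → 128 → 4 layer sizes come from the video; the class and variable names here are my own):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """8-dim state -> 128 hidden units -> 4 action logits."""
    def __init__(self, obs_dim=8, hidden=128, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)  # raw logits; softmax happens inside Categorical

policy = Policy()
obs = torch.randn(8)  # stand-in for a LunarLander observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()            # stochastic sampling keeps exploration alive
log_prob = dist.log_prob(action)  # stored for the REINFORCE loss
```

Sampling from `Categorical` rather than taking the argmax is what prevents the lander from repeating the same move every step.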
Ed trains the policy with RMSprop while discussing gradient intuition, log-prob tracking, and learning-rate tweaks that keep training stable as touchdowns improve.
The session wraps with pointers on where to take the project next—from advantage estimators to actor-critic methods—so beginners and experienced engineers alike leave with clear next steps.
Topics covered in the video
The video covers:
- Gymnasium setup, baseline rollouts, and reward visualisations
- Anatomy of the 4-action control space and 8-value state vector
- Policy MLP design (8 → 128 → 4), logits-to-softmax flow, and action sampling
- Discounted and normalized returns with γ = 0.99
- REINFORCE objective tracking, log-prob capture, and RMSprop training loop
- Practical tuning tips for stable landings and ideas for advanced follow-ups
The full code used in the video is available here: https://github.com/teambrookvale/rl-zero-to-hero
Feel free to reach out with any comments and questions—always keen to hear how your lander is doing.