Founding Research Scientist (Long-Horizon RL) at Lanturn
Location: San Francisco (preferred) / Remote (US)
Compensation: $200K base + 0.5–1% equity
Type: Full-time · Founding Team
At Lanturn, we are building the next generation of reinforcement learning systems for real-world agents. Our focus is on enabling AI systems to learn from behavioral data and long-horizon workflows, through:
- High-fidelity RL environments
- Synthetic data generation
- Closed-loop training systems
We are looking for a Founding RL Researcher to push the frontier of:
- Long-horizon RL
- Environment design
- Post-training for agents
About us:
Lanturn is building the end-to-end behavioral learning stack for AI systems. We believe current approaches to RL and post-training are limited by short-horizon optimization, weak or proxy reward signals, and a lack of grounded environments. Our approach is to build closed-loop RL systems in which environments, data, training, and evaluation are tightly integrated and grounded in real-world behavioral data.
The role:
As a Founding RL Researcher, you will lead efforts to develop novel reinforcement learning algorithms and environments for training autonomous agents. You will work across:
- Algorithm design
- Environment modeling
- Training systems
- Evaluation frameworks
This role sits at the intersection of:
- Frontier-lab-style RL research (environments + algorithms)
- Modern LLM post-training (RLHF, preference optimization)
Key responsibilities:
- Design and implement RL systems for long-horizon tasks (10–100+ steps)
- Develop and extend modern post-training methods:
  - PPO, DPO, ORPO
  - GRPO / GRPO++ and ranking-based optimization methods
- Build RL environments grounded in real-world workflows
- Work on meta-RL and adaptive learning systems:
  - Generalization across tasks
  - Rapid adaptation to new environments
- Design reward systems for:
  - Behavioral correctness
  - Efficiency and robustness
- Develop evaluation frameworks aligned with real-world outcomes
- Collaborate with engineering teams to scale training systems
Ideal candidate:
You are a researcher with strong theoretical grounding and real-world system intuition, capable of working on open-ended problems in RL. You thrive in environments where:
- Problems are not well-defined
- Systems must be built from first principles
- Research directly translates into deployed systems
Minimum qualifications:
- Experience at a top-tier AI lab or company: OpenAI, DeepMind, Anthropic, FAIR, or equivalent
- Strong background in reinforcement learning and post-training systems
- Experience training large-scale models (LLMs or similar)
- Strong programming skills (Python, PyTorch/JAX)
Preferred qualifications:
- Experience with long-horizon RL or sequential decision-making systems
- Experience designing or working with RL environments
- Familiarity with preference optimization (DPO, ORPO), RLHF pipelines, and automated RL environment generation
- Experience with meta-RL / adaptive learning systems
- Strong publication record in top-tier ML conferences
Core technical skills:
- Deep understanding of policy gradient methods (PPO and beyond), KL-regularized optimization, and credit assignment in long-horizon settings
- Experience with cascading RL pipelines (SFT → RL → evaluation), distributed training systems, and the stability and scaling challenges they raise
- Strong intuition for exploration vs. exploitation, reward shaping vs. reward learning, and trajectory-level optimization
What makes this role unique?
- Focus on long-horizon behavioral learning, not short-form RLHF
- Environment design and generation treated as a first-class problem
- Opportunity to define GRPO++-style next-generation algorithms and publish at NeurIPS
Why join Lanturn?
- Founding ownership (0.5–1% equity)
- Work on unsolved problems in RL and agent systems
- High autonomy and research freedom
- Direct impact on how real-world AI systems are trained
- Work directly with second-time founders who have worked with various big tech companies and enterprises
If you’ve worked on RL at a top lab or have production RL experience, and you want to push beyond current paradigms into real-world, long-horizon intelligence, this is your opportunity.