AI Infrastructure Engineer

Stealth Post-LLM Startup | Los Altos, California, United States | 1mo ago

full-time | on-site | entry | 3+ years

skills: python, kubernetes, aws, gcp, docker, firecracker, gvisor, terraform, opentelemetry, postgres, redis, object storage, anthropic sdk, openai sdk, vllm, vector extensions


The Mission

Dyssonance is reimagining how AI thinks. We are building an AI capable of memory, dynamic reasoning, and evolving beliefs. Founded by leaders from DeepMind and Google, we are a small, elite team solving the foundational challenges of cognition.

Product engineers need to move as fast as the model changes. Researchers need to test ten ideas a day instead of a few a week. We are building the infrastructure that makes both possible.

The Role

We need an engineer who builds the substrate that our product engineers and our researchers both stand on—the pipelines, sandboxes, and agent infrastructure that turn intent into shipped code and shipped experiments.

As our AI Infrastructure Engineer, you will own the systems that automate development and research. On the product side, that's the dev loops, CI, and agent-assisted pipelines that let us ship infrastructure and products to users. On the research side, that's the training substrate, eval harnesses, and experiment orchestration that keep our researchers in flow. The same primitives serve both, and you will design them that way.

What You Will Build

Research Infra: High-throughput orchestration for training runs, evals, and ablations on GPU fleets. Artifact tracking, deterministic replay, and cost attribution—so a researcher can launch a sweep in one command and trust the numbers that come back.
Product Development Infra: The CI, deploy, and preview-environment pipelines that let product engineers ship dozens of times a day. Agent-assisted code review, auto-generated tests, and the plumbing that lets coding agents open PRs against our repo safely.
Sandboxed Execution: Isolated, reproducible environments where agent-generated code can compile, run, and be evaluated at scale without torching the host or the budget. The same sandbox serves product agents and research agents.
The Agent Control Plane: APIs, queues, and observability for running thousands of concurrent agents—whether they're fixing bugs in the product, running experiments on the model, or somewhere in between. Traces, interventions, and replay for every step.
The Dev Substrate: Internal tooling that binds it together—secrets, datasets, cost dashboards, experiment registries. The command center for a lab that ships a product.

The Stack

Orchestration: Python, Kubernetes.
Agents & Models: Anthropic SDK, OpenAI SDK, vLLM, in-house checkpoints.
Data & State: Postgres (with vector extensions), Redis, object storage for artifacts.
Infra: AWS/GCP, Docker, Firecracker / gVisor for sandboxing, Terraform.
Observability: OpenTelemetry, structured traces for every agent step.

Who You Are

An Infra Native: You have built systems that run thousands of jobs a day without a human in the loop, and you know the difference between something that looks autonomous in a demo and something that stays up on a Saturday.
A Force Multiplier for Both Sides: You are as comfortable unblocking a product engineer who needs a faster preview environment as you are unblocking a researcher who needs a cleaner ablation pipeline. You don't pick sides between "ship the product" and "do the research"; your infra serves both.
Production-Grade: 3+ years of engineering experience, with a track record of shipping systems that don't silently corrupt state. Clean interfaces, instrumentation, and correctness under concurrency are reflexes, not checklist items.
Agentic by Default: You already use coding agents in your daily workflow. You have a point of view on where they fail, and you want to fix it at the infra layer. You can look at projects like autoagent or autoresearch and immediately see what it would take to make those primitives production-grade for a frontier lab.
High Agency: You spot a bottleneck, you fix it. You don't wait for a ticket, and you don't wait for permission to delete the thing that isn't working.

Bonus Points

Built developer platforms, CI systems, or deploy pipelines at a company that shipped fast.
Built agent frameworks, eval harnesses, or experiment trackers (MLflow, Weights & Biases, LangGraph, Inspect, etc.).
Experience running large training or inference workloads on GPU clusters.
Comfort with sandboxing (Firecracker, gVisor, nsjail) and the security model of running untrusted code at scale.
Open-source contributions.

Requirements

3+ years of engineering experience.
Willing to commute to Los Altos, California.

Benefits

Health insurance.
