OpenAI has taken an unusually transparent step by publishing a detailed technical description of how the Codex CLI coding agent works under the hood. The post, authored by OpenAI engineer Michael Bolin, offers one of the clearest looks yet at how a production-grade AI agent orchestrates large language models, tools, and user input to perform real-world software development tasks.
At the core of Codex is what OpenAI calls the agent loop: a repeating cycle that alternates between model inference and tool execution. Each cycle begins when Codex constructs a prompt from structured input (system instructions, developer constraints, user messages, environment context, and the available tools) and sends it to the OpenAI Responses API for inference.
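To make the structure concrete, here is a minimal sketch of what such a request payload might look like. The field names, values, and the build_prompt helper are illustrative assumptions, not Codex's actual internals.

```python
# Illustrative sketch of the structured input assembled on each cycle.
# The roles follow those described in the post; the specific values and
# the build_prompt helper are hypothetical, not Codex internals.

def build_prompt(history: list[dict]) -> dict:
    """Assemble the full request payload for one inference call."""
    return {
        "instructions": [
            {"role": "system", "content": "You are a coding agent running in a CLI."},
            {"role": "developer", "content": "Only modify files inside the workspace."},
        ],
        # Environment context (cwd, OS, sandbox policy) travels with the prompt.
        "environment": {"cwd": "/workspace/project", "sandbox": "workspace-write"},
        # Prior user/assistant/tool messages, replayed in full on every request.
        "messages": history,
        # Tool definitions; their order is kept stable to preserve prompt caching.
        "tools": [
            {"type": "function", "name": "shell", "description": "Run a shell command"},
            {"type": "function", "name": "read_file", "description": "Read a file"},
        ],
    }


if __name__ == "__main__":
    payload = build_prompt([{"role": "user", "content": "Fix the failing unit test"}])
    print(payload["tools"][0]["name"])  # -> shell
```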
The model output can take one of two forms. It may generate an assistant message intended for the user, or it may request a tool invocation, such as running a shell command, reading a file, or invoking a planning or search tool. When a tool call is requested, Codex executes it locally (within defined sandbox limits), appends the result to the prompt, and queries the model again. The loop continues until the model produces a final assistant message, signaling the end of the turn.
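The loop itself can be condensed to a few lines. In the sketch below, call_model and run_tool are hypothetical stand-ins for the Responses API round trip and the sandboxed tool executor; the real implementation handles streaming, errors, and approval policies that are omitted here.

```python
def agent_loop(history: list[dict], call_model, run_tool) -> str:
    """Alternate between inference and tool execution until the model
    returns a plain assistant message."""
    while True:
        output = call_model(history)  # one inference round trip
        if output["type"] == "tool_call":
            result = run_tool(output["name"], output["arguments"])
            # Record both the call and its result, then query the model again.
            history.append(output)
            history.append({"type": "tool_result", "content": result})
        else:
            # A final assistant message ends the turn.
            return output["content"]


if __name__ == "__main__":
    # Stub model: request one shell call, then answer.
    replies = iter([
        {"type": "tool_call", "name": "shell", "arguments": "ls"},
        {"type": "message", "content": "The repo contains three files."},
    ])
    print(agent_loop([], lambda history: next(replies), lambda name, args: "a.py b.py c.py"))
```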
While this high-level pattern is common to many AI agents, OpenAI's documentation stands out for its specificity. Bolin walks through how prompts are assembled element by element, how roles (system, developer, user, assistant) determine priority, and how even minor design choices, such as the order of tools in a list, can have a significant impact on performance.
One of the most significant architectural decisions is Codex's fully stateless interaction model. Instead of relying on server-side conversation memory via the optional previous_response_id parameter, Codex resends the entire conversation history on every request. This approach simplifies infrastructure and enables zero data retention (ZDR) for customers who require strict privacy guarantees.
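In practice, the stateless pattern looks roughly like the following sketch, written against the openai Python SDK. The model name and message contents are placeholders, and the commented-out alternative shows the previous_response_id style that Codex deliberately avoids.

```python
# A minimal sketch of the stateless pattern using the openai Python SDK.
# The payload Codex actually sends is far richer; this only contrasts
# resending the full transcript with relying on server-side state.
from openai import OpenAI

client = OpenAI()
history = [{"role": "user", "content": "Summarize the build failure"}]

# Stateless: the whole transcript travels with every request, so the server
# keeps no conversation state (compatible with zero data retention).
response = client.responses.create(
    model="gpt-5",   # model name is illustrative
    input=history,
    store=False,     # do not persist the response server-side
)
history.append({"role": "assistant", "content": response.output_text})

# The alternative Codex avoids: previous_response_id asks the server to
# remember the prior turn, which requires server-side retention.
# follow_up = client.responses.create(
#     model="gpt-5",
#     previous_response_id=response.id,
#     input=[{"role": "user", "content": "Now propose a fix"}],
# )
```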
The disadvantage is obvious: request sizes grow with each interaction, leading to a quadratic increase in transferred data. OpenAI mitigates this with aggressive prompt caching, which lets the model reuse computation as long as each new prompt exactly extends the previous one, keeping the earlier prompt as an unchanged prefix. When caching works, inference cost scales linearly rather than quadratically.
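A quick back-of-the-envelope calculation shows why this matters. The per-turn figure below is made up, and cached prefix tokens are discounted rather than free, but the contrast between quadratic and roughly linear growth holds.

```python
# Back-of-the-envelope scaling comparison (all figures are assumptions).

TOKENS_PER_TURN = 2_000   # new tokens appended on each turn
TURNS = 50

# Without caching, turn N reprocesses all N turns' worth of tokens.
without_cache = sum(TOKENS_PER_TURN * turn for turn in range(1, TURNS + 1))

# With an exact-prefix cache hit, only the newly appended suffix is
# processed at full cost on each turn.
with_cache = TOKENS_PER_TURN * TURNS

print(f"tokens processed without caching: {without_cache:,}")   # 2,550,000
print(f"tokens processed with prefix caching: {with_cache:,}")  # 100,000
```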
However, caching imposes strict discipline on the system. Changing tools mid-conversation, switching models, modifying sandbox permissions, or even reordering tool definitions can cause cache misses and a sharp drop in performance. Bolin notes that early support for Model Context Protocol (MCP) tooling exposed exactly this kind of fragility, forcing the team to carefully redesign how dynamic tooling updates are handled.
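Conceptually, the constraint is a prefix check: each new request may only append to the previous one. The snippet below illustrates that rule; it is a simplification, not the API's actual cache-keying logic.

```python
def is_prefix_extension(previous: list[dict], current: list[dict]) -> bool:
    """True if `current` merely appends items to `previous`."""
    return current[: len(previous)] == previous


base = [
    {"role": "system", "content": "coding agent"},
    {"tool": "shell"},
    {"tool": "read_file"},
    {"role": "user", "content": "run the tests"},
]

appended = base + [{"role": "assistant", "content": "All tests pass."}]
reordered = [base[0], base[2], base[1], base[3]]  # swapped tool definitions

print(is_prefix_extension(base, appended))   # True  -> cache can be reused
print(is_prefix_extension(base, reordered))  # False -> cache miss
```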
This rapid growth also runs into another hard constraint: the model's context window. Because both input and output tokens count toward this limit, a long-running agent making hundreds of tool calls risks exhausting its usable context.
To solve this problem, Codex uses automatic conversation compaction. When the token count exceeds a configurable threshold, Codex replaces the full conversation history with a condensed representation generated via a dedicated responses/compact API endpoint. Crucially, this compacted context contains an encrypted payload that preserves the model's implicit understanding of its past interactions, allowing it to continue reasoning coherently without access to the full raw history.
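A sketch of how such threshold-triggered compaction might be wired up is shown below. The responses/compact endpoint is named in the post, but the threshold value, the compact_endpoint callable, and the shape of the compacted item are assumptions made for illustration.

```python
COMPACT_THRESHOLD = 200_000  # hypothetical token budget that triggers compaction


def maybe_compact(history: list[dict], token_count: int, compact_endpoint) -> list[dict]:
    """Replace the raw transcript with a compacted stand-in once the running
    token count crosses the threshold."""
    if token_count < COMPACT_THRESHOLD:
        return history
    # e.g. a call to the responses/compact endpoint that returns a single
    # summary item carrying an encrypted payload of the model's prior context.
    compacted_item = compact_endpoint(history)
    return [compacted_item]
```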
Earlier versions of Codex required users to trigger compaction manually; today the process is automatic and largely invisible, a significant usability improvement as agents take on longer and more complex tasks.
OpenAI has historically been reluctant to publish deep technical detail about flagship products like ChatGPT. Codex, however, is treated differently. The result is a rare, honest account of the trade-offs involved in building a real-world AI agent: performance versus privacy, flexibility versus cache efficiency, autonomy versus security. Bolin is not afraid to describe mistakes, inefficiencies, or hard-won lessons, reinforcing the message that today's AI agents are powerful, but far from magical.
Beyond Codex itself, the post serves as a blueprint for anyone building agents on top of modern LLM APIs. It highlights emerging best practices, including stateless design, stable prompt prefixes, and explicit context management, that are quickly becoming industry standards.
















