The canonical agent pattern
Most "AI runs code" demos break the moment the model needs to install a package, debug a traceback, or hold state across turns. With stateful sessions, that loop just works.
1. Agent generates code — the LLM writes Python that uses pandas to summarize a CSV.
2. Agent calls session_execute — SandboxAPI runs the code in an isolated gVisor sandbox and returns ModuleNotFoundError: No module named 'pandas'.
3. Agent reads the error and calls session_install_packages — {"manager":"pip","packages":["pandas"]} installs in <3s using the cached index.
4. Agent retries — same session, same variables. The CSV the agent read in step 2 is still in memory; the retry succeeds and the output goes back to the user. (The full recovery loop is sketched below.)
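Here is that recovery loop as a minimal sketch. It reuses only the SDK calls shown in the full example in the next section (sessions.create, sessions.execute, sessions.install, sessions.close); detecting the missing module by matching the ModuleNotFoundError text in stderr is our own heuristic, not something the API does for you, and the CSV path and column name are just illustrative.

import os
import re
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])
session = sb.sessions.create(language="python3", idle_ttl=600)

code = "import pandas as pd; print(pd.read_csv('/tmp/sales.csv')['revenue'].sum())"

# Step 2: the first run fails because pandas isn't installed yet.
result = sb.sessions.execute(session.id, code=code)

# Step 3: pull the module name out of the traceback and install it.
missing = re.search(r"No module named '([\w\.]+)'", result.stderr or "")
if missing:
    sb.sessions.install(session.id, manager="pip", packages=[missing.group(1)])
    # Step 4: retry in the same session; earlier state is still in memory.
    result = sb.sessions.execute(session.id, code=code)

print(result.stdout)
sb.sessions.close(session.id)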
Code example: OpenAI tool calling + Python SDK
Here's the entire wiring you need. The agent picks the tool, the SDK calls SandboxAPI, the result goes back to the model. Wrap it in a loop and you have a code interpreter.
import json
import os

from openai import OpenAI
from sandboxapi import SandboxAPI

client = OpenAI()
sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

# Open a session up front so state persists across tool calls
session = sb.sessions.create(language="python3", idle_ttl=600)

tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Run Python code in a stateful sandbox. Variables, files, packages persist across calls.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "install_packages",
            "description": "Install pip packages in the current session.",
            "parameters": {
                "type": "object",
                "properties": {"packages": {"type": "array", "items": {"type": "string"}}},
                "required": ["packages"],
            },
        },
    },
]

messages = [{"role": "user", "content": "Read the CSV at /tmp/sales.csv and report total revenue."}]

# Agent loop: keep calling the model until it answers without requesting a tool
while True:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg.model_dump())

    if not msg.tool_calls:
        print(msg.content)
        break

    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        if call.function.name == "execute_python":
            result = sb.sessions.execute(session.id, code=args["code"])
            content = result.stdout + (("\n" + result.stderr) if result.stderr else "")
        elif call.function.name == "install_packages":
            sb.sessions.install(session.id, manager="pip", packages=args["packages"])
            content = "Installed " + ", ".join(args["packages"])
        else:
            content = "Unknown tool: " + call.function.name
        # Return the tool result to the model so it can decide the next step
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": content,
        })

sb.sessions.close(session.id)
Why SandboxAPI fits AI agents
- Sessions — variables, files, and installed packages persist between tool calls, so the agent doesn't redo work (a short sketch follows this list).
- Package install — the agent can pip install what it needs in <3s. No pre-baking the image.
- Modern runtimes — Python 3.12, Node 22, .NET 9. Whatever your model trained on works.
- gVisor isolation — when an agent generates code, you don't trust the code. We don't either.
- MCP-native — Claude Desktop, Cursor, Cline, VS Code — drop in a JSON config, you're done.
- Streaming output — show the agent's stdout in real time as long-running scripts execute.
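To make the Sessions point concrete, here's a minimal sketch. It uses only the sessions.create/execute/close calls from the example above; the variable names and values are illustrative.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])
session = sb.sessions.create(language="python3", idle_ttl=600)

# First tool call: build up some state.
sb.sessions.execute(session.id, code="rows = [12.5, 7.25, 30.0]\ntotal = sum(rows)")

# A later tool call hits the same interpreter: `total` is still defined.
result = sb.sessions.execute(session.id, code="print(f'total revenue: {total}')")
print(result.stdout)  # total revenue: 49.75

sb.sessions.close(session.id)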
Common patterns
Code interpreter for any LLM
Wrap the SDK in two tool definitions (execute_python, install_packages) and you have a fully featured code interpreter for any model that supports function calling.
Self-correcting agent
When the LLM produces broken code, the traceback feeds back as the next message. With sessions, the fix doesn't lose context. Most coding tasks converge in 1–3 iterations.
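One way to wire that feedback loop, as a sketch: it assumes the client, sb, and session objects from the example above, and the three-attempt cap and the assumption that the model replies with bare code (no prose, no fences) are ours.

def run_with_self_correction(prompt: str, max_attempts: int = 3) -> str:
    """Let the model fix its own tracebacks; the session keeps earlier state."""
    messages = [{"role": "user", "content": prompt}]
    result = None
    for _ in range(max_attempts):
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = resp.choices[0].message.content  # assumed: the model returns bare code
        result = sb.sessions.execute(session.id, code=code)
        if not result.stderr:
            return result.stdout
        # Feed the traceback back as the next message and let the model retry.
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": "That raised:\n" + result.stderr + "\nFix the code and run it again.",
        })
    return result.stdout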
Multi-language pipelines
The agent can drop down to Bash to inspect a file, switch to Python to analyze it, and emit JSON output. Each language gets its own session — sessions are cheap.
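A sketch of a two-language pipeline under those assumptions: the python3 language identifier matches the example above, but the bash identifier is an assumption, so check the runtime list for the exact name.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

# One session per language ("bash" is assumed; "python3" is from the example above).
shell = sb.sessions.create(language="bash", idle_ttl=600)
py = sb.sessions.create(language="python3", idle_ttl=600)

# Bash step: peek at the file.
peek = sb.sessions.execute(shell.id, code="head -n 3 /tmp/sales.csv")

# Python step: analyze it and emit JSON.
analyze = sb.sessions.execute(py.id, code=(
    "import csv, json\n"
    "with open('/tmp/sales.csv') as f:\n"
    "    rows = list(csv.DictReader(f))\n"
    "print(json.dumps({'row_count': len(rows)}))"
))

print(peek.stdout)
print(analyze.stdout)

sb.sessions.close(shell.id)
sb.sessions.close(py.id)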
Auto-grading agent demos
Building a "show me you can solve LeetCode" demo? Use execute_with_expected — pass expected_output, get a wrong_answer status. No string-comparison logic in your code.