
GPT-5.3 Codex: Features, Benchmarks & Migration

Ananay Batra
11 min read

AI coding models are past the cute demo phase. The new bar is boring, unsexy reliability: can it survive a long terminal loop, touch multiple files, and avoid dying in a lint spiral?

OpenAI shipped GPT-5.3-Codex on February 5, 2026. The story here is not a shiny new context window. It’s execution quality: fewer dead ends, better follow-through, and faster turnaround on the kinds of tasks that eat reviewer hours.

Here are the launch numbers OpenAI put front and center:

  • 56.8% - SWE-Bench Pro Public
  • 77.3% - Terminal-Bench 2.0
  • 64.7% - OSWorld-Verified
  • 25% Faster - Inference Speed

Compared with GPT-5.2-Codex, this is a “make the agent less annoying” release. If your team already runs issue-to-patch workflows, the wins show up where you feel pain: unstable patch loops, weak bug writeups, and premature “done” states when tests are flaky.

Official rollout status (February 5, 2026): GPT-5.3-Codex is live across all Codex surfaces (app, CLI, IDE extension, web) for paid ChatGPT users. API availability is coming in the following weeks.

Key Takeaways

  • OpenAI's new coding flagship is live: GPT-5.3-Codex launched on February 5, 2026 across all Codex surfaces (app, CLI, IDE extension, web) for paid ChatGPT plans, with API access announced for the coming weeks.
  • Large jump on terminal and computer-use tasks: OpenAI reports 77.3% on Terminal-Bench 2.0 and 64.7% on OSWorld-Verified, with notable gains over GPT-5.2-Codex.
  • SWE-Bench Pro leadership is incremental: GPT-5.3-Codex scores 56.8% on SWE-Bench Pro Public versus 56.4% for GPT-5.2-Codex, keeping it at the top tier rather than a step-change leap.
  • Codex UX improvements target real engineering pain: The release highlights improved codebase coherence, deep diffs for reasoning transparency, and fixes for lint loops, weak bug explanations, and flaky-test premature completion.
  • First model classified High for cybersecurity: OpenAI classifies GPT-5.3-Codex as High capability in cybersecurity under its Preparedness Framework and pairs the release with its most comprehensive safety stack, including trusted-access controls and a $10M cyber defense credit commitment.

What's New in GPT-5.3-Codex

GPT-5.3-Codex combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2 into a single model that is also 25% faster.

This model is tuned for long-horizon, tool-using work - the stuff that breaks weaker agents. Think: keep context across many steps, revise plans mid-stream, and still land the patch with the tests green.

OpenAI also claims GPT-5.3-Codex is the first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage deployment, and diagnose test results and evaluations during development.

What changed at a product level

  • Agentic Reliability: Fewer breakdowns in multi-file, multi-step execution with stronger long-horizon task completion. This matters when your agent has to touch 6 files, run a migration, update tests, and not forget what it did 30 steps ago.
  • Tool-Use Performance: Major gains on Terminal-Bench 2.0 and OSWorld-Verified with fewer tokens than any prior model. Terminal work is where agents usually faceplant - this is where the big delta is.
  • 25% Faster Inference: Infrastructure and inference stack improvements deliver 25% faster results for all Codex users. If you run lots of small iterations, speed is not a nice-to-have - it’s throughput.
  • Safety Gating: First model classified High for cybersecurity, with OpenAI's most comprehensive safety stack deployed. That classification has downstream implications for access controls and governance.

Benchmark Performance Breakdown

OpenAI’s launch appendix compares GPT-5.3-Codex to GPT-5.2-Codex on coding and agentic execution benchmarks. The pattern is consistent:

  • SWE-Bench moved a little
  • Terminal and computer-use moved a lot

Benchmarks table (OpenAI launch appendix, February 5, 2026)

| Benchmark | GPT-5.3-Codex | GPT-5.2-Codex | Delta |
| --- | --- | --- | --- |
| SWE-Bench Pro Public | 56.8% | 56.4% | +0.4 |
| Terminal-Bench 2.0 | 77.3% | 64.0% | +13.3 |
| OSWorld-Verified | 64.7% | 38.2% | +26.5 |
| Cybersecurity CTF | 77.6% | 67.4% | +10.2 |
| SWE-Lancer IC Diamond | 81.4% | 76.0% | +5.4 |
| GDPval (wins or ties) | 70.9% | - | Matches GPT-5.2 |

Benchmark context: Figures above are published by OpenAI as part of the launch appendix (February 5, 2026). Always validate with your own repos and CI environment before committing to model-routing changes.

OpenAI also notes that GPT-5.3-Codex achieves its SWE-Bench Pro scores with fewer output tokens than any prior model. If you pay per token, that’s a quiet win - cost per accepted patch can drop even before API pricing is posted.
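As a back-of-envelope check, cost per accepted patch is just total token spend divided by accepted patches. The sketch below is illustrative only: the `PatchRunStats` shape, the token counts, and the price per million output tokens are all placeholders, not OpenAI pricing (which was unposted at launch).

```ts
// Illustrative only: back-of-envelope cost per accepted patch.
// `pricePerMTokens` is a placeholder assumption, not real OpenAI pricing.
interface PatchRunStats {
  attempts: number;        // total agent runs
  accepted: number;        // patches that passed review
  tokensPerRun: number;    // average output tokens per run
  pricePerMTokens: number; // assumed dollars per 1M output tokens
}

function costPerAcceptedPatch(s: PatchRunStats): number {
  const totalCost = (s.attempts * s.tokensPerRun * s.pricePerMTokens) / 1_000_000;
  return totalCost / s.accepted;
}

// Same pass rate, fewer output tokens per run: cost per accepted patch drops.
const current = costPerAcceptedPatch({ attempts: 100, accepted: 60, tokensPerRun: 40_000, pricePerMTokens: 10 }); // ≈ $0.67
const leaner = costPerAcceptedPatch({ attempts: 100, accepted: 60, tokensPerRun: 30_000, pricePerMTokens: 10 }); // $0.50
```

This is why "fewer output tokens than any prior model" matters even before pricing lands: the denominator (accepted patches) stays fixed while the numerator shrinks.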

If your workload is mostly small edits on clean tickets, expect modest improvements. If your workload is long tool loops and cross-file coordination, these deltas are big enough to justify piloting immediately.

Codex Workflow Upgrades

OpenAI paired model improvements with UX changes aimed at the stuff that slows teams down in the real world.

Deep Diffs

“Deep diffs” are basically reviewer empathy. Instead of dumping a patch and hoping your staff engineer figures out why, Codex provides deeper change explanations so reviewers can see the reasoning behind edits, not just the surface-level diff. This is how you reduce “LGTM but I don’t trust it” feedback loops.

Interactive Steering

This is the big one if you manage agents like junior devs. You can steer the agent mid-task without losing context - ask questions, debate approaches, and redirect in real time. That’s the difference between “agent runs for 8 minutes then fails” and “agent runs for 8 minutes and you course-correct at minute 2”.

Stronger Follow-Up

OpenAI calls out improved interaction quality for cloud threads and pull request comments, reducing re-prompt overhead. Translation: fewer “no, I meant the other file” follow-ups and less babysitting.

Regression Fixes Called Out by OpenAI

  • Reduced non-deterministic linting loops that repeatedly touched the same files without progress.
  • Improved bug-analysis responses that previously lacked concrete supporting evidence.
  • Lowered premature completion behavior in flaky-test scenarios, where agents previously exited too early.

Access, Rollout, and Pricing

GPT-5.3-Codex is available with paid ChatGPT plans across every Codex surface: app, CLI, IDE extension, and web. OpenAI says API access is coming soon, but not on day one - so if you rely on API-based production pipelines, plan for a short gap.

Availability table (as of February 5, 2026)

| Channel | Status on February 5, 2026 | Notes |
| --- | --- | --- |
| Codex (ChatGPT) | Available now | App, CLI, IDE extension, and web for paid plans |
| OpenAI API | Coming weeks | No exact public date announced at launch |
| Pricing details | Pending API rollout | Finalize cost modeling after API pricing is posted |

If you need immediate production-grade APIs today, keep GPT-5.2-Codex as your active default and run GPT-5.3-Codex in pilot channels until pricing and API SLAs are published.

Safety and Cybersecurity Governance

OpenAI published a dedicated system card for GPT-5.3-Codex and ties it to its Preparedness Framework.

The headline: GPT-5.3-Codex is the first model OpenAI classifies as High capability for cybersecurity-related tasks under this framework. That triggers what they describe as their most comprehensive safety deployment stack to date.

What OpenAI is emphasizing

  • System Card Disclosure: OpenAI shares deployment rationale, benchmark context, and safety assumptions specific to GPT-5.3-Codex. This is useful if you need to justify vendor risk decisions internally.
  • High Cyber Capability: First model OpenAI classifies as High capability for cybersecurity under its Preparedness Framework. That’s a policy and governance flag, not just a benchmark flex.
  • Trusted Access Path: Advanced cybersecurity use cases are gated through vetted trusted-access workflows. In practice, expect more friction for certain categories of requests and workflows.

OpenAI is also investing in ecosystem-level defenses alongside the model release, including:

  • Trusted Access for Cyber, a pilot program to accelerate cyber defense research
  • An expanded private beta of Aardvark, their security research agent and first Codex Security product
  • A $10M commitment in API credits to accelerate cyber defense for open-source software and critical infrastructure

Organizations engaged in good-faith security research can apply through OpenAI's Cybersecurity Grant Program.

Security team guidance: treat vendor-level safety classification as baseline, not replacement, for your own secure SDLC controls and approval gates.

For a deeper look at the model lineage leading to this release, see our GPT-5.2-Codex model guide.

Migration Playbook from GPT-5.2-Codex

If you already have GPT-5.2-Codex in production, don’t “flip the switch” because a blog post said so. Migrate like you’d migrate a database - with evals, guardrails, and a rollback button.

1. Build a representative eval queue

Use historical issues covering refactors, flaky tests, and terminal-heavy debugging rather than toy tasks. If your eval set is 20 easy tickets, you’ll learn nothing and ship a false sense of confidence.
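One way to sketch that sampling is to bucket historical issues by category and cap how many you take from each, so easy ticket types can't drown out the hard ones. The `Issue` shape and category names below are assumptions for illustration, not a real tracker schema.

```ts
// Sketch: build a representative eval queue from historical issues.
// Category names and the Issue shape are hypothetical placeholders.
type Category = "refactor" | "flaky-test" | "terminal-debug" | "small-edit";

interface Issue {
  id: string;
  category: Category;
}

function buildEvalQueue(history: Issue[], perCategory: number): Issue[] {
  // Group issues by category.
  const byCat = new Map<Category, Issue[]>();
  for (const issue of history) {
    const bucket = byCat.get(issue.category) ?? [];
    bucket.push(issue);
    byCat.set(issue.category, bucket);
  }
  // Take at most `perCategory` from each bucket so the queue stays balanced.
  return [...byCat.values()].flatMap((bucket) => bucket.slice(0, perCategory));
}
```

With 20-30 tasks total, a cap of 5-8 per category keeps the queue small enough to review by hand while still covering every failure mode you care about.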

2. Compare completion reliability, not just pass rate

Pass rate is a vanity metric if the model takes 4 retries, touches 12 files, and still needs a human to unroll the mess. Track reruns, dead-end loops, and reviewer rework to capture true throughput impact.
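A minimal sketch of what "reliability, not just pass rate" could look like as metrics. The `TaskRun` fields here are hypothetical; map them onto whatever your agent harness actually logs.

```ts
// Sketch: reliability metrics beyond raw pass rate.
// Field names are hypothetical; adapt to your own run logs.
interface TaskRun {
  passed: boolean;
  retries: number;        // reruns before the final result
  reviewerEdits: number;  // manual edits needed after the agent's patch
}

interface Reliability {
  passRate: number;
  avgRetries: number;
  cleanAcceptRate: number; // passed with zero reviewer edits
}

function reliability(runs: TaskRun[]): Reliability {
  const passed = runs.filter((r) => r.passed);
  const clean = passed.filter((r) => r.reviewerEdits === 0);
  return {
    passRate: passed.length / runs.length,
    avgRetries: runs.reduce((sum, r) => sum + r.retries, 0) / runs.length,
    cleanAcceptRate: clean.length / runs.length,
  };
}
```

`cleanAcceptRate` is the number to watch: two models can tie on pass rate while one needs twice the reviewer rework to get there.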

3. Keep a reversible fallback route

Maintain GPT-5.2-Codex as a failover path during early rollout, then tighten traffic split only after stable outcomes. You want a clean “route back” when the new model hits an edge case in your stack.

4. Prepare API migration now

Even before API access arrives, pre-wire config toggles, observability dashboards, and cost-alert budgets. The best time to add switches is before you’re under pressure.

```ts
// config/model-routing.ts
const MODEL_CONFIG = {
  // Toggle when API access is confirmed
  codex: {
    // model: "gpt-5.2-codex", // Previous default
    model: "gpt-5.3-codex", // Updated default
    fallback: "gpt-5.2-codex", // Keep as failover
  },
  maxRetries: 3,
  timeoutMs: 120_000,
};
```
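One way to consume a routing config like that is a small failover wrapper: try the primary model, route back to the failover on error. `callModel` below is a hypothetical stand-in for your API client, not a real OpenAI SDK call.

```ts
// Sketch: failover wrapper around a hypothetical model client.
// `callModel` stands in for whatever API client you actually use.
async function withFallback(
  callModel: (model: string) => Promise<string>,
  primary: string,
  fallback: string,
): Promise<string> {
  try {
    return await callModel(primary);
  } catch {
    // Clean "route back" when the new model hits an edge case in your stack.
    return callModel(fallback);
  }
}
```

In production you would also want to log which path served each request, so the traffic split you report during the pilot reflects reality rather than config intent.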

For broader patterns on managing multi-model routing, see our AI agent orchestration workflows guide.

Competitive Context and Positioning

OpenAI frames GPT-5.3-Codex as a stronger coding agent against other frontier models. In practice, model choice still depends on task mix, budget constraints, and your existing tooling ecosystem.

Here’s the decision framing that actually helps in a real org:

| Decision Area | GPT-5.3-Codex Position | What to Verify Internally |
| --- | --- | --- |
| Long-horizon coding tasks | Strong launch metrics | Throughput per reviewer hour on your real backlogs |
| Terminal + computer-use work | Largest reported delta | Failure rate in shell-heavy CI and integration scripts |
| General model economics | API pricing not yet posted | Total cost per accepted patch after API rollout |
| Cross-vendor strategy | Best in mixed-model stacks | Routing policy across OpenAI, Claude, and Gemini surfaces |

For direct alternatives, see our coverage of Claude Opus 4.6 and broader comparison posts focused on coding-model tradeoffs. For a wider landscape view, our AI coding tools comparison covers additional alternatives.

Implementation Checklist

If you want to turn launch news into something your team can ship this week, do this:

  • Select 20-30 representative tasks from recent engineering sprints.
  • Run GPT-5.2-Codex vs GPT-5.3-Codex in parallel where possible.
  • Track accepted patches, reruns, and manual reviewer edits.
  • Keep security and compliance review in the loop for trusted access workflows.
  • Prepare an API switchover plan once OpenAI posts model pricing and availability.

What This Means for Engineering Teams

GPT-5.3-Codex looks like a meaningful release for teams running agentic engineering workflows at scale. The benchmark pattern suggests small gains on classic coding tasks and large gains on terminal and computer-use workloads where previous models often stalled.

The right move is not immediate global replacement. It’s a measured rollout with hard evals, CI guardrails, and clear fallback routes. If your workloads match the model’s strongest benchmarks, you’ll likely see faster cycle time and less reviewer fatigue.

A quick note from EzUGC

Different domain, same lesson: iteration speed wins. In UGC ads, paying ~$200 to a creator for 1 video slows you down. With EzUGC, teams generate AI UGC videos for about ~$5 each and iterate instantly - no back-and-forth, no scheduling, no “can you redo the hook”.

If you’re shipping creative every week, try EzUGC and keep your iteration loop tight.

What is GPT-5.3-Codex and when was it released?

GPT-5.3-Codex is OpenAI’s latest coding-focused model. OpenAI launched it on February 5, 2026 across all Codex surfaces (app, CLI, IDE extension, web) for paid ChatGPT plans, with API access announced for the coming weeks.

What are the key benchmark numbers for GPT-5.3-Codex?

OpenAI reported:

  • 56.8% on SWE-Bench Pro Public
  • 77.3% on Terminal-Bench 2.0
  • 64.7% on OSWorld-Verified
  • 77.6% on Cybersecurity CTF
  • 81.4% on SWE-Lancer IC Diamond
  • 70.9% on GDPval (wins or ties)

Is GPT-5.3-Codex available in the API right now?

Not as of the launch note in this article. On February 5, 2026, OpenAI listed API access as “Coming weeks” with no exact public date announced at launch.

How does GPT-5.3-Codex compare to GPT-5.2-Codex?

On SWE-Bench Pro Public it’s a small bump (56.8% vs 56.4%, +0.4). The bigger jumps are on tool-use benchmarks:

  • Terminal-Bench 2.0: 77.3% vs 64.0% (+13.3)
  • OSWorld-Verified: 64.7% vs 38.2% (+26.5)
  • Cybersecurity CTF: 77.6% vs 67.4% (+10.2)

OpenAI also says it’s 25% faster on inference.

What Codex product improvements ship with this release?

OpenAI highlights:

  • Deep Diffs (better explanations behind changes)
  • Interactive Steering (redirect mid-task without losing context)
  • Stronger Follow-Up (better PR comment and cloud thread interactions)

They also called out fixes for lint loops, weak bug evidence, and premature completion in flaky tests.

What does OpenAI say about cybersecurity safety for GPT-5.3-Codex?

OpenAI ties the release to a system card and its Preparedness Framework. GPT-5.3-Codex is the first model they classify as High capability for cybersecurity tasks, and advanced cyber use cases are gated through trusted-access workflows. They also announced a $10M commitment in API credits to accelerate cyber defense work.

Should teams switch immediately from GPT-5.2-Codex?

If you’re API-dependent, probably not until API access and pricing are published. The practical approach is to keep GPT-5.2-Codex as default, run GPT-5.3-Codex in pilots, and only shift traffic after you have stable results on your own repos and CI.

What is the best rollout strategy for agencies and engineering teams?

Run a controlled eval:

  • Use a representative queue (refactors, flaky tests, terminal-heavy debugging)
  • Measure reliability (reruns, dead ends, reviewer edits), not just pass rate
  • Keep a rollback path (GPT-5.2-Codex as failover)
  • Pre-wire toggles, dashboards, and budgets before API access lands

If you’re building anything that needs fast iteration loops (engineering or marketing), the meta-play is the same: measure, route traffic, keep a rollback, and optimize for throughput. And if you want that same iteration speed on UGC ads, start with EzUGC.

Tags: UGC, AI

Written by

Ananay Batra

Founder & CEO - Listnr AI | EzUGC