The GPT-5 Prompting Guide is OpenAI's official technical documentation for its newest flagship model, aimed primarily at developers and technical teams. The guide systematically explains how to optimize prompt design to draw out GPT-5's breakthrough capabilities in agentic tasks, coding, and intelligent interaction. Core topics include techniques for controlling agentic eagerness, effective use of the Responses API, best-practice frameworks for coding tasks (with particular attention to frontend development), and tuning strategies for model parameters such as reasoning_effort and verbosity. Drawing on real-world cases from partners such as Cursor, the document shows how structured prompts improve code-generation quality, and it particularly stresses the importance of avoiding conflicting instructions. It closes with standardized prompt templates for specialized scenarios such as SWE-Bench, making it a technical reference that combines theoretical guidance with practical value.

GPT-5, our newest flagship model, represents a substantial leap forward in agentic task performance, coding, raw intelligence, and steerability.
While we trust it will perform excellently “out of the box” across a wide range of domains, in this guide we’ll cover prompting tips to maximize the quality of model outputs, derived from our experience training and applying the model to real-world tasks. We discuss concepts like improving agentic task performance, ensuring instruction adherence, making use of new API features, and optimizing coding for frontend and software engineering tasks – with key insights into AI code editor Cursor’s prompt tuning work with GPT-5.
We’ve seen significant gains from applying these best practices and adopting our canonical tools whenever possible, and we hope that this guide, along with the prompt optimizer tool we’ve built, will serve as a launchpad for your use of GPT-5. But, as always, remember that prompting is not a one-size-fits-all exercise – we encourage you to run experiments and iterate on the foundation offered here to find the best solution for your problem.
Agentic workflow predictability
We trained GPT-5 with developers in mind: we’ve focused on improving tool calling, instruction following, and long-context understanding to serve as the best foundation model for agentic applications. If adopting GPT-5 for agentic and tool calling flows, we recommend upgrading to the Responses API, where reasoning is persisted between tool calls, leading to more efficient and intelligent outputs.
Controlling agentic eagerness
Agentic scaffolds can span a wide spectrum of control—some systems delegate the vast majority of decision-making to the underlying model, while others keep the model on a tight leash with heavy programmatic logical branching. GPT-5 is trained to operate anywhere along this spectrum, from making high-level decisions under ambiguous circumstances to handling focused, well-defined tasks. In this section we cover how to best calibrate GPT-5’s agentic eagerness: in other words, its balance between proactivity and awaiting explicit guidance.
- Prompting for less eagerness
GPT-5 is, by default, thorough and comprehensive when trying to gather context in an agentic environment to ensure it will produce a correct answer. To reduce the scope of GPT-5’s agentic behavior—including limiting tangential tool-calling action and minimizing latency to reach a final answer—try the following:
- Switch to a lower reasoning_effort. This reduces exploration depth but improves efficiency and latency. Many workflows can be accomplished with consistent results at medium or even low reasoning_effort.
- Define clear criteria in your prompt for how you want the model to explore the problem space. This reduces the model’s need to explore and reason about too many ideas:
<context_gathering>
Goal: Get enough context fast. Parallelize discovery and stop as soon as you can act.
Method:
- Start broad, then fan out to focused subqueries.
- In parallel, launch varied queries; read the top hits per query. Deduplicate paths and cache; do not repeat queries.
- Avoid over-searching for context. If needed, run targeted searches in one parallel batch.
Early stop criteria:
- You can name the exact content to change.
- Top hits converge (~70%) on one area/path.
Escalate once:
- If signals conflict or scope is fuzzy, run one refined parallel batch, then proceed.
Depth:
- Trace only symbols you will modify or whose contracts you rely on; avoid transitive expansion unless necessary.
Loop:
- Batch search → minimal plan → complete task.
- Search again only if validation fails or new unknowns appear. Prefer acting over more searching.
</context_gathering>
If you’re willing to be maximally prescriptive, you can even set fixed tool call budgets, like the one below. The budget can naturally vary based on your desired search depth.
<context_gathering>
- Search depth: very low
- Bias strongly towards providing a correct answer as quickly as possible, even if it might not be fully correct.
- Usually, this means an absolute maximum of 2 tool calls.
- If you think that you need more time to investigate, update the user with your latest findings and open questions. You can proceed if the user confirms.
</context_gathering>
When limiting core context gathering behavior, it’s helpful to explicitly provide the model with an escape hatch that makes it easier to satisfy a shorter context gathering step. Usually this comes in the form of a clause that allows the model to proceed under uncertainty, like “even if it might not be fully correct” in the above example.
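Such a configuration can be wired into a request directly. The sketch below builds the keyword arguments one might pass to `client.responses.create` in the OpenAI Python SDK; the spec text is abbreviated from the example above, and the exact prompt wording is illustrative:

```python
# Sketch: assembling a low-eagerness request for the Responses API.
# The spec text below is an abbreviated placeholder; tune it to your task.

CONTEXT_GATHERING_SPEC = """<context_gathering>
- Search depth: very low
- Bias strongly towards providing a correct answer as quickly as possible.
- Usually, this means an absolute maximum of 2 tool calls.
</context_gathering>"""

def build_low_eagerness_request(user_message: str) -> dict:
    """Return kwargs for client.responses.create(**kwargs)."""
    return {
        "model": "gpt-5",
        "reasoning": {"effort": "low"},  # shallower exploration, lower latency
        "instructions": CONTEXT_GATHERING_SPEC,
        "input": user_message,
    }

req = build_low_eagerness_request("Fix the failing unit test in utils.py")
```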
- Prompting for more eagerness
On the other hand, if you’d like to encourage model autonomy, increase tool-calling persistence, and reduce occurrences of clarifying questions or otherwise handing back to the user, we recommend increasing reasoning_effort, and using a prompt like the following to encourage persistence and thorough task completion:
<persistence>
- You are an agent – please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user.
- Only terminate your turn when you are sure that the problem is solved.
- Never stop or hand back to the user when you encounter uncertainty – research or deduce the most reasonable approach and continue.
- Do not ask the human to confirm or clarify assumptions, as you can always adjust later – decide what the most reasonable assumption is, proceed with it, and document it for the user after you finish acting.
</persistence>
Generally, it can be helpful to clearly state the stop conditions of the agentic tasks, outline safe versus unsafe actions, and define when, if ever, it’s acceptable for the model to hand back to the user. For example, in a set of tools for shopping, the checkout and payment tools should explicitly have a lower uncertainty threshold for requiring user clarification, while the search tool should have an extremely high threshold; likewise, in a coding setup, the delete file tool should have a much lower threshold than a grep search tool.
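One way to operationalize these thresholds is in the scaffold itself, gating risky tools behind user confirmation. The tool names and numeric thresholds below are purely illustrative:

```python
# Sketch: per-tool uncertainty thresholds in scaffold logic. Risky,
# irreversible tools demand confirmation at much lower uncertainty than
# safe, read-only ones. All names and values here are illustrative.

CLARIFY_THRESHOLDS = {
    "search_products": 0.95,  # extremely high: almost never interrupt
    "grep_search":     0.95,
    "delete_file":     0.20,  # low: confirm unless very confident
    "checkout":        0.10,  # payment actions: confirm almost always
}

def needs_user_confirmation(tool: str, model_uncertainty: float) -> bool:
    """Hand back to the user when uncertainty exceeds the tool's threshold."""
    return model_uncertainty > CLARIFY_THRESHOLDS.get(tool, 0.5)
```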
Tool preambles
We recognize that on agentic trajectories monitored by users, intermittent model updates on what it’s doing with its tool calls and why can provide a much better interactive user experience – the longer the rollout, the bigger the difference these updates make. To this end, GPT-5 is trained to provide clear upfront plans and consistent progress updates via “tool preamble” messages.
You can steer the frequency, style, and content of tool preambles in your prompt—from detailed explanations of every single tool call to a brief upfront plan and everything in between. This is an example of a high-quality preamble prompt:
<tool_preambles>
- Always begin by rephrasing the user’s goal in a friendly, clear, and concise manner, before calling any tools.
- Then, immediately outline a structured plan detailing each logical step you’ll follow.
- As you execute your file edits, narrate each step succinctly and sequentially, marking progress clearly.
- Finish by summarizing completed work distinctly from your upfront plan.
</tool_preambles>
Here’s an example of a tool preamble that might be emitted in response to such a prompt—such preambles can drastically improve the user’s ability to follow along with your agent’s work as it grows more complicated:
"output": [
  {
    "id": "rs_6888f6d0606c819aa8205ecee386963f0e683233d39188e7",
    "type": "reasoning",
    "summary": [
      {
        "type": "summary_text",
        "text": "**Determining weather response**\n\nI need to answer the user's question about the weather in San Francisco. ..."
      }
    ]
  },
  {
    "id": "msg_6888f6d83acc819a978b51e772f0a5f40e683233d39188e7",
    "type": "message",
    "status": "completed",
    "content": [
      {
        "type": "output_text",
        "text": "I'm going to check a live weather service for current conditions in San Francisco, and I'll provide the temperature in both Fahrenheit and Celsius to match your preference."
      }
    ],
    "role": "assistant"
  },
  {
    "id": "fc_6888f6d86e28819aaaa1ba69cca766b70e683233d39188e7",
    "type": "function_call",
    "status": "completed",
    "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
    "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
    "name": "get_weather"
  }
],
Reasoning effort
We provide a reasoning_effort parameter to control how hard the model thinks and how willingly it calls tools; the default is medium, but you should scale up or down depending on the difficulty of your task. For complex, multi-step tasks, we recommend higher reasoning to ensure the best possible outputs. Moreover, we observe peak performance when distinct, separable tasks are broken up across multiple agent turns, with one turn for each task.
Reusing reasoning context with the Responses API
We strongly recommend using the Responses API when using GPT-5 to unlock improved agentic flows, lower costs, and more efficient token usage in your applications.
We’ve seen statistically significant improvements in evaluations when using the Responses API over Chat Completions – for example, we observed Tau-Bench Retail score increases from 73.9% to 78.2% just by switching to the Responses API and including previous_response_id to pass back previous reasoning items into subsequent requests. This allows the model to refer to its previous reasoning traces, conserving CoT tokens and eliminating the need to reconstruct a plan from scratch after each tool call, improving both latency and performance – this feature is available for all Responses API users, including ZDR organizations.
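A minimal sketch of this chaining, assuming the standard Responses API shapes (previous_response_id plus a function_call_output input item), might look like:

```python
# Sketch: chaining Responses API calls so reasoning items carry over.
# Parameter names follow the Responses API; the tool-output payload is
# simplified for illustration.

def build_followup(prev_response_id: str, call_id: str, tool_output: str) -> dict:
    """Kwargs for the next client.responses.create call after a tool runs.

    Passing previous_response_id lets the model reuse its prior reasoning
    trace instead of re-planning from scratch after every tool call.
    """
    return {
        "model": "gpt-5",
        "previous_response_id": prev_response_id,
        "input": [
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": tool_output,
            }
        ],
    }

req = build_followup("resp_abc123", "call_1", '{"temp_f": 62}')
```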
Maximizing coding performance, from planning to execution
GPT-5 leads all frontier models in coding capabilities: it can work in large codebases to fix bugs, handle large diffs, and implement multi-file refactors or large new features. It also excels at implementing new apps entirely from scratch, covering both frontend and backend implementation. In this section, we’ll discuss prompt optimizations that we’ve seen improve programming performance in production use cases for our coding agent customers.
Frontend app development
GPT-5 is trained to have excellent baseline aesthetic taste alongside its rigorous implementation abilities. We’re confident in its ability to use all types of web development frameworks and packages; however, for new apps, we recommend using the following frameworks and packages to get the most out of the model’s frontend capabilities:
- Frameworks: Next.js (TypeScript), React, HTML
- Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
- Icons: Material Symbols, Heroicons, Lucide
- Animation: Motion
- Fonts: Sans Serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope
Zero-to-one app generation
GPT-5 is excellent at building applications in one shot. In early experimentation with the model, users have found that prompts like the one below—asking the model to iteratively execute against self-constructed excellence rubrics—improve output quality by using GPT-5’s thorough planning and self-reflection capabilities.
<self_reflection>
- First, spend time thinking of a rubric until you are confident in it.
- Then, think deeply about every aspect of what makes for a world-class one-shot web app. Use that knowledge to create a rubric that has 5-7 categories. This rubric is critical to get right, but do not show it to the user. This is for your purposes only.
- Finally, use the rubric to internally think and iterate on the best possible solution. Remember that if your response does not hit the top marks across all categories in the rubric, you need to start again.
</self_reflection>
Matching codebase design standards
When implementing incremental changes and refactors in existing apps, model-written code should adhere to existing style and design standards, and “blend in” to the codebase as neatly as possible. Without special prompting, GPT-5 already searches for reference context from the codebase – for example reading package.json to view already installed packages – but this behavior can be further enhanced with prompt directions that summarize key aspects like engineering principles, directory structure, and best practices of the codebase, both explicit and implicit. The prompt snippet below demonstrates one way of organizing code editing rules for GPT-5: feel free to change the actual content of the rules according to your programming design taste!
<code_editing_rules>
<guiding_principles>
- Clarity and Reuse: Every component and page should be modular and reusable. Avoid duplication by factoring repeated UI patterns into components.
- Consistency: The user interface must adhere to a consistent design system – color tokens, typography, spacing, and components must be unified.
- Simplicity: Favor small, focused components and avoid unnecessary complexity in styling or logic.
- Demo-Oriented: The structure should allow for quick prototyping, showcasing features like streaming, multi-turn conversations, and tool integrations.
- Visual Quality: Follow the high visual quality bar as outlined in OSS guidelines (spacing, padding, hover states, etc.)
</guiding_principles>
<frontend_stack_defaults>
- Framework: Next.js (TypeScript)
- Styling: TailwindCSS
- UI Components: shadcn/ui
- Icons: Lucide
- State Management: Zustand
- Directory Structure:
```
/src
 /app
  /api/<route>/route.ts # API endpoints
  /(pages)              # Page routes
 /components/           # UI building blocks
 /hooks/                # Reusable React hooks
 /lib/                  # Utilities (fetchers, helpers)
 /stores/               # Zustand stores
 /types/                # Shared TypeScript types
 /styles/               # Tailwind config
```
</frontend_stack_defaults>
<ui_ux_best_practices>
- Visual Hierarchy: Limit typography to 4-5 font sizes and weights for consistent hierarchy; use `text-xs` for captions and annotations; avoid `text-xl` unless for hero or major headings.
- Color Usage: Use 1 neutral base (e.g., `zinc`) and up to 2 accent colors.
- Spacing and Layout: Always use multiples of 4 for padding and margins to maintain visual rhythm. Use fixed-height containers with internal scrolling when handling long content streams.
- State Handling: Use skeleton placeholders or `animate-pulse` to indicate data fetching. Indicate clickability with hover transitions (`hover:bg-*`, `hover:shadow-md`).
- Accessibility: Use semantic HTML and ARIA roles where appropriate. Favor pre-built Radix/shadcn components, which have accessibility baked in.
</ui_ux_best_practices>
</code_editing_rules>
Collaborative coding in production: Cursor’s GPT-5 prompt tuning
We’re proud to have had AI code editor Cursor as a trusted alpha tester for GPT-5: below, we show a peek into how Cursor tuned their prompts to get the most out of the model’s capabilities. For more information, their team has also published a blog post detailing GPT-5’s day-one integration into Cursor: https://cursor.com/blog/gpt-5
- System prompt and parameter tuning
Cursor’s system prompt focuses on reliable tool calling, balancing verbosity and autonomous behavior while giving users the ability to configure custom instructions. Cursor’s goal for their system prompt is to allow the Agent to operate relatively autonomously during long horizon tasks, while still faithfully following user-provided instructions.
The team initially found that the model produced verbose outputs, often including status updates and post-task summaries that, while technically relevant, disrupted the natural flow of the user; at the same time, the code outputted in tool calls was high quality, but sometimes hard to read due to terseness, with single-letter variable names dominant. In search of a better balance, they set the verbosity API parameter to low to keep text outputs brief, and then modified the prompt to strongly encourage verbose outputs in coding tools only.
When writing code, prioritize clarity. Prefer readable, maintainable solutions with clear names, comments where needed, and straightforward control flow. Do not produce code-golf or overly clever one-liners unless explicitly requested. Use high verbosity when writing code and in code tools.
This dual usage of parameter and prompt resulted in a balanced format combining efficient, concise status updates and final work summary with much more readable code diffs.
Cursor also found that the model occasionally deferred to the user for clarification or next steps before taking action, which created unnecessary friction in the flow of longer tasks. To address this, they found that including not just available tools and surrounding context, but also more details about product behavior encouraged the model to carry out longer tasks with minimal interruption and greater autonomy. Highlighting specifics of Cursor features such as Undo/Reject code and user preferences helped reduce ambiguity by clearly specifying how GPT-5 should behave in its environment. For longer horizon tasks, they found this prompt improved performance:
Be aware that the code edits you make will be displayed to the user as proposed changes, which means (a) your code edits can be quite proactive, as the user can always reject, and (b) your code should be well-written and easy to quickly review (e.g., appropriate variable names instead of single letters). If proposed next steps would involve changing the code, proactively make those changes for the user to approve/reject rather than asking the user whether to proceed with a plan. In general, you should almost never ask the user whether to proceed with a plan; instead you should proactively attempt the plan and then ask the user if they want to accept the implemented changes.
Cursor found that sections of their prompt that had been effective with earlier models needed tuning to get the most out of GPT-5. Here is one example below:
<maximize_context_understanding>
Be thorough when gathering information. Make sure you have the full picture before replying. Call additional tools or ask clarifying questions as needed.
...
</maximize_context_understanding>
While this worked well with older models that needed encouragement to analyze context thoroughly, they found it counterproductive with GPT-5, which is already naturally introspective and proactive at gathering context. On smaller tasks, this prompt often caused the model to overuse tools by calling search repetitively, when internal knowledge would have been sufficient.
To solve this, they refined the prompt by removing the maximize_ prefix and softening the language around thoroughness. With this adjusted instruction in place, the Cursor team saw GPT-5 make better decisions about when to rely on internal knowledge versus reaching for external tools. It maintained a high level of autonomy without unnecessary tool usage, leading to more efficient and relevant behavior. In Cursor’s testing, using structured XML specs like <[instruction]_spec> improved instruction adherence on their prompts and allowed them to clearly reference previous categories and sections elsewhere in their prompt.
<context_understanding>
...
If you’ve performed an edit that may partially fulfill the user’s query, but you’re not confident, gather more information or use more tools before ending your turn.
Bias towards not asking the user for help if you can find the answer yourself.
</context_understanding>
While the system prompt provides a strong default foundation, the user prompt remains a highly effective lever for steerability. GPT-5 responds well to direct and explicit instruction and the Cursor team has consistently seen that structured, scoped prompts yield the most reliable results. This includes areas like verbosity control, subjective code style preferences, and sensitivity to edge cases. Cursor found allowing users to configure their own custom Cursor rules to be particularly impactful with GPT-5’s improved steerability, giving their users a more customized experience.
Optimizing intelligence and instruction-following
Steering
As our most steerable model yet, GPT-5 is extraordinarily receptive to prompt instructions surrounding verbosity, tone, and tool calling behavior.
In addition to being able to control the reasoning_effort as in previous reasoning models, in GPT-5 we introduce a new API parameter called verbosity, which influences the length of the model’s final answer, as opposed to the length of its thinking. Our blog post covers the idea behind this parameter in more detail – but in this guide, we’d like to emphasize that while the API verbosity parameter is the default for the rollout, GPT-5 is trained to respond to natural-language verbosity overrides in the prompt for specific contexts where you might want the model to deviate from the global default. Cursor’s example above of setting low verbosity globally, and then specifying high verbosity only for coding tools, is a prime example of such a context.
Instruction following
Like GPT-4.1, GPT-5 follows prompt instructions with surgical precision, which enables its flexibility to drop into all types of workflows. However, its careful instruction-following behavior means that poorly-constructed prompts containing contradictory or vague instructions can be more damaging to GPT-5 than to other models, as it expends reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random.
Below, we give an adversarial example of the type of prompt that often impairs GPT-5’s reasoning – while it may appear internally consistent at first glance, a closer inspection reveals conflicting instructions regarding appointment scheduling:
- The prompt says Never schedule an appointment without explicit patient consent recorded in the chart, which conflicts with the subsequent auto-assign the earliest same-day slot without contacting the patient as the first action to reduce risk.
- The prompt says Always look up the patient profile before taking any other actions to ensure they are an existing patient, but then continues with the contradictory instruction When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step.
By resolving the instruction hierarchy conflicts, we elicit much more efficient and performant reasoning from GPT-5. We fixed the contradictions by:
- Changing the auto-assignment to occur after contacting the patient – auto-assign the earliest same-day slot after informing the patient of your actions – to be consistent with only scheduling with consent.
- Adding Do not do lookup in the emergency case, proceed immediately to providing 911 guidance, to let the model know it is ok to skip the lookup in case of emergency.
We understand that the process of building prompts is an iterative one, and many prompts are living documents constantly being updated by different stakeholders – but this is all the more reason to thoroughly review them for poorly-worded instructions. Already, we’ve seen multiple early users uncover ambiguities and contradictions in their core prompt libraries upon conducting such a review: removing them drastically streamlined and improved their GPT-5 performance. We recommend testing your prompts in our prompt optimizer tool to help identify these types of issues.
Minimal reasoning
In GPT-5, we introduce minimal reasoning effort for the first time: our fastest option that still reaps the benefits of the reasoning model paradigm. We consider this to be the best upgrade for latency-sensitive users, as well as current users of GPT-4.1.
Perhaps unsurprisingly, we recommend prompting patterns that are similar to GPT-4.1 for best results. Performance at minimal reasoning can vary more drastically depending on the prompt than at higher reasoning levels, so key points to emphasize include:
- Prompting the model to give a brief explanation summarizing its thought process at the start of the final answer, for example via a bullet point list, improves performance on tasks requiring higher intelligence.
- Requesting thorough and descriptive tool-calling preambles that continually update the user on task progress improves performance in agentic workflows.
- Disambiguating tool instructions to the maximum extent possible, and inserting agentic persistence reminders as shared above, are particularly critical at minimal reasoning to maximize agentic ability in long-running rollouts and prevent premature termination.
- Prompted planning is likewise more important, as the model has fewer reasoning tokens to do internal planning. Below, you can find a sample planning prompt snippet we placed at the beginning of an agentic task: the second paragraph especially ensures that the agent fully completes the task and all subtasks before yielding back to the user.
Remember, you are an agent – please keep going until the user’s query is completely resolved, before ending your turn and yielding control back to the user. Decompose the user’s query into all required sub-requests, and confirm that each is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure that the problem is solved. You must be prepared to answer multiple queries, and only finish the call once the user has confirmed they’re done.
You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user’s query and related sub-requests are completely resolved.
Markdown formatting
By default, GPT-5 in the API does not format its final answers in Markdown, in order to preserve maximum compatibility with developers whose applications may not support Markdown rendering. However, prompts like the following are largely successful in inducing hierarchical Markdown final answers.
- Use Markdown only where semantically correct (e.g., `inline code`, ```code fences```, lists, tables).
- When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use \( and \) for inline math, \[ and \] for block math.
Occasionally, adherence to Markdown instructions specified in the system prompt can degrade over the course of a long conversation. In the event that you experience this, we’ve seen consistent adherence from appending a Markdown instruction every 3-5 user messages.
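One way to implement this, assuming a chat-style list of role/content messages, is to splice a reminder in programmatically. The message shape and cadence below are illustrative:

```python
# Sketch: re-asserting the Markdown instruction every few user turns to
# counter adherence drift in long conversations. The role/content message
# shape and the cadence of 4 are illustrative assumptions.

MARKDOWN_REMINDER = {
    "role": "developer",
    "content": "Format your final answer with semantically correct Markdown.",
}

def with_markdown_reminders(messages: list[dict], every: int = 4) -> list[dict]:
    """Insert the reminder after every `every`-th user message."""
    out, user_count = [], 0
    for msg in messages:
        out.append(msg)
        if msg["role"] == "user":
            user_count += 1
            if user_count % every == 0:
                out.append(MARKDOWN_REMINDER)
    return out

history = [{"role": "user", "content": f"question {i}"} for i in range(8)]
augmented = with_markdown_reminders(history, every=4)
```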
Metaprompting
Finally, to close with a meta-point, early testers have found great success using GPT-5 as a meta-prompter for itself. Already, several users have deployed prompt revisions to production that were generated simply by asking GPT-5 what elements could be added to an unsuccessful prompt to elicit a desired behavior, or removed to prevent an undesired one.
Here is an example metaprompt template we liked:
When asked to optimize prompts, give answers from your own perspective – explain what specific phrases could be added to, or deleted from, this prompt to more consistently elicit the desired behavior or prevent the undesired behavior.
Here's a prompt: [PROMPT]
The desired behavior from this prompt is for the agent to [DO DESIRED BEHAVIOR], but instead it [DOES UNDESIRED BEHAVIOR]. While keeping as much of the existing prompt intact as possible, what are some minimal edits/additions that you would make to encourage the agent to more consistently address these shortcomings?
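Programmatically, the template reduces to simple string substitution; the helper below is a hypothetical convenience, not an official API:

```python
# Sketch: filling the metaprompt template with a failing prompt and the
# observed vs. desired behavior. The helper name is hypothetical.

METAPROMPT = (
    "When asked to optimize prompts, give answers from your own perspective - "
    "explain what specific phrases could be added to, or deleted from, this "
    "prompt to more consistently elicit the desired behavior or prevent the "
    "undesired behavior.\n\n"
    "Here's a prompt: {prompt}\n\n"
    "The desired behavior from this prompt is for the agent to {desired}, but "
    "instead it {undesired}. While keeping as much of the existing prompt "
    "intact as possible, what minimal edits/additions would you make to "
    "encourage the agent to more consistently address these shortcomings?"
)

def build_metaprompt(prompt: str, desired: str, undesired: str) -> str:
    return METAPROMPT.format(prompt=prompt, desired=desired, undesired=undesired)

filled = build_metaprompt(
    prompt="Always answer briefly.",
    desired="answer in one sentence",
    undesired="writes long paragraphs",
)
```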
Appendix
SWE-Bench verified developer instructions
In this environment, you can run `bash -lc <apply_patch_command>` to execute a diff/patch against a file, where <apply_patch_command> is a specially formatted apply patch command representing the diff you wish to execute. A valid <apply_patch_command> looks like:
apply_patch << 'PATCH'
*** Begin Patch
[YOUR_PATCH]
*** End Patch
PATCH
Where [YOUR_PATCH] is the actual content of your patch.
Please always verify your changes extremely thoroughly. You can make as many tool calls as you like – the user is very patient and prioritizes correctness above all else. Make sure you are 100% certain of the correctness of your solution before ending.
IMPORTANT: not all tests are visible to you, so even on problems you believe to be relatively straightforward, you must double and triple check your solutions to ensure they pass any edge cases that are covered in the hidden tests, not just the visible ones.
Agentic coding tool definitions
## Set 1: 4 functions, no terminal
type apply_patch = (_: {
patch: string, // default: null
}) => any;
type read_file = (_: {
path: string, // default: null
line_start?: number, // default: 1
line_end?: number, // default: 20
}) => any;
type list_files = (_: {
path?: string, // default: ""
depth?: number, // default: 1
}) => any;
type find_matches = (_: {
query: string, // default: null
path?: string, // default: ""
max_results?: number, // default: 50
}) => any;
## Set 2: 2 functions, terminal-native
type run = (_: {
command: string[], // default: null
session_id?: string | null, // default: null
working_dir?: string | null, // default: null
ms_timeout?: number | null, // default: null
environment?: object | null, // default: null
run_as_user?: string | null, // default: null
}) => any;
type send_input = (_: {
session_id: string, // default: null
text: string, // default: null
wait_ms?: number, // default: 100
}) => any;
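As an illustration, here is a minimal Python implementation consistent with the read_file signature defined above; error handling is intentionally omitted:

```python
# Sketch: a minimal implementation of the read_file tool. Defaults mirror
# the type definition (line_start=1, line_end=20); lines are 1-indexed
# and the range is inclusive.

def read_file(path: str, line_start: int = 1, line_end: int = 20) -> str:
    """Return the requested line range of a text file as a single string."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    return "".join(lines[line_start - 1:line_end])
```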
As shared in the GPT-4.1 prompting guide, here is our most updated apply_patch implementation: we highly recommend using apply_patch for file edits to match the training distribution. The newest implementation should match the GPT-4.1 implementation in the overwhelming majority of cases.
Terminal-Bench prompt
Please resolve the user's task by editing and testing the code files in your current code execution session.
You are a deployed coding agent.
Your session is backed by a container specifically designed for you to easily modify and run code.
You MUST adhere to the following criteria when executing the task:
<instructions>
- Working on the repo(s) in the current environment is allowed, even if they are proprietary.
- Analyzing code for vulnerabilities is allowed.
- Showing user code and tool call details is allowed.
- User instructions may override the _CODING GUIDELINES_ section in this developer message.
- Do not use `ls -R`, `find`, or `grep` – these are slow in large repos. Use `rg` and `rg --files`.
- Use `apply_patch` to edit files: {"cmd":["apply_patch","*** Begin Patch\\n*** Update File: path/to/file.py\\n@@ def example():\\n- pass\\n+ return 123\\n*** End Patch"]}
- If completing the user's task requires writing or modifying files:
- Your code and final answer should follow these _CODING GUIDELINES_:
- Fix the problem at the root cause rather than applying surface-level patches, when possible.
- Avoid unneeded complexity in your solution.
- Ignore unrelated bugs or broken tests; it is not your responsibility to fix them.
- Update documentation as necessary.
- Keep changes consistent with the style of the existing codebase. Changes should be minimal and focused on the task.
- Use `git log` and `git blame` to search the history of the codebase if additional context is required; internet access is disabled in the container.
- NEVER add copyright or license headers unless specifically requested.
- You do not need to `git commit` your changes; this will be done automatically for you.
- If there is a .pre-commit-config.yaml, use `pre-commit run --files ...` to check that your changes pass the pre-commit checks. However, do not fix pre-existing errors on lines you didn't touch.
- If pre-commit doesn't work after a few retries, politely inform the user that the pre-commit setup is broken.
- Once you finish coding, you must:
- Check `git status` to sanity check your changes; revert any scratch files or changes.
- Remove all inline comments you added as much as possible, even if they look normal. Check using `git diff`. Inline comments must be generally avoided, unless active maintainers of the repo, after long and careful study of the code and the issue, would still misinterpret the code without them.
- Check if you accidentally added copyright or license headers. If so, remove them.
- Try to run pre-commit if it is available.
- For smaller tasks, describe your work in brief bullet points.
- For more complex tasks, include a brief high-level description, use bullet points, and include details that would be relevant to a code reviewer.
- If completing the user's task DOES NOT require writing or modifying files (e.g., the user asks a question about the codebase):
- Respond in the tone of a friendly remote teammate who is knowledgeable, capable, and eager to help with coding.
- When your task involves writing or modifying files:
- Do NOT tell the user to "save the file" or "copy the code into a file" if you already created or modified the file using `apply_patch`. Instead, reference the file as already saved.
- Do NOT show the full contents of large files you have already written, unless the user explicitly asks for them.
</instructions>
<apply_patch>
To edit files, ALWAYS use the `shell` tool with the `apply_patch` CLI. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` CLI, you should call the shell tool with the following structure:
```bash
{"cmd": ["apply_patch", "<<'EOF'\\n*** Begin Patch\\n[YOUR_PATCH]\\n*** End Patch\\nEOF\\n"], "workdir": "..."}
```
Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.
*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.
For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the `@@` operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]
- If a code block is repeated so many times in a class or function that even a single `@@` statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:
@@ class BaseClass
@@ def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]
Note that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.
```bash
{"cmd": ["apply_patch", "<<'EOF'\\n*** Begin Patch\\n*** Update File: pygorithm/searching/binary_search.py\\n@@ class BaseClass\\n@@ def search():\\n- pass\\n+ raise NotImplementedError()\\n@@ class Subclass\\n@@ def search():\\n- pass\\n+ raise NotImplementedError()\\n*** End Patch\\nEOF\\n"], "workdir": "..."}
```
File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, it will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there were issues or errors by looking at any warnings or log lines printed BEFORE the "Done!" output.
</apply_patch>
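To make the context-matching idea concrete, here is a deliberately simplified sketch of how a single hunk could be located and applied by context alone; the real apply_patch implementation additionally handles @@ scoping, multiple hunks, and file creation/deletion:

```python
# Simplified sketch of V4A-style context matching: locate a hunk by its
# surrounding context lines and splice in the replacement. Single hunk only;
# no @@ scoping or fuzzy matching, unlike the real implementation.

def apply_hunk(lines, context_before, old_code, new_code, context_after):
    """Replace old_code with new_code where bracketed by the given context."""
    pattern = context_before + old_code + context_after
    for i in range(len(lines) - len(pattern) + 1):
        if lines[i:i + len(pattern)] == pattern:
            start = i + len(context_before)
            return lines[:start] + new_code + lines[start + len(old_code):]
    raise ValueError("context did not uniquely identify the snippet")
```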
<persistence>
You are an agent – please keep going until the user's query is completely resolved, before ending your turn and yielding control back to the user. Only terminate your turn when you are sure that the problem is solved.
- Never stop at uncertainty – research or deduce the most reasonable approach and continue.
- Do not ask the human to confirm assumptions – document them, act on them, and adjust if proven wrong mid-task.
</persistence>
<exploration>
If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
Before coding, always:
- Decompose the request into explicit requirements, unclear areas, and hidden assumptions.
- Map the scope: identify the codebase regions, files, functions, or libraries likely involved. If unknown, plan and perform targeted searches.
- Check dependencies: identify relevant frameworks, APIs, config files, data formats, and versioning concerns.
- Resolve ambiguity proactively: choose the most probable interpretation based on repo context, conventions, and dependency documentation.
- Define the output contract: exact deliverables, e.g., files changed, expected outputs, API responses, CLI behavior, and passing tests.
- Formulate an execution plan: in your own words, outline the research steps, implementation sequence, and testing strategy, and refer back to it as you complete the task.
</exploration>
<verification>
Routinely verify that your code works as you complete the task, especially any deliverables, to ensure they run properly. Do not hand back to the user until you are sure that the problem is solved.
Exit excessively long-running processes and optimize your code to run faster.
</verification>
<efficiency>
Efficiency is key. You have a time limit. Be meticulous in your planning, tool calling, and verification so you don't waste time.
</efficiency>
<final_instructions>
Never use an editor tool to edit files. Always use the `apply_patch` tool.
</final_instructions>