Providers and Models
fast-agent has native support for OpenAI Responses and Chat Completions, Anthropic Messages, Google GenAI and Amazon Bedrock APIs.
OpenAI Codex users can use their subscription with fast-agent, using their existing installation or logging in with fast-agent auth codexplan.
Chat Completions models are also available via Microsoft Azure, and supported Anthropic models are available on Google Vertex.
Local models with llama.cpp are directly supported, with automatic configuration and connection with the Responses API.
Selecting a Model
Model Picker and Defaults
In interactive mode, with no model specified or default configured, fast-agent shows a model selector on startup, highlighting available models.
Using Presets
The quickest way to get started is to use the convenience presets for popular models, for example:
fast-agent --model opus # Use the most recent opus model
fast-agent --model codexplan # Use the latest supported Codex Subscription Model
Use fast-agent model presets to see the current shortcuts.
Model Strings and Configuration
Models in fast-agent are specified with a model string:
The query string allows configuration of provider, model, and sampling parameters.
Custom models and configurations can be defined using Model Overlays.
- Providers and Models lists provider configuration and authentication details.
- Models Reference lists generated model capabilities such as structured outputs, reasoning, verbosity, and supported input modalities.
Provider families
Start with the native providers for common use, or use additional providers for hosted OpenAI-compatible APIs, routers, and local endpoints.
| Provider family | Start with | Main features |
|---|---|---|
| OpenAI Responses | gpt55, gpt54, gpt52, gpt-5-mini, codex |
GPT-5 class models, reasoning, text verbosity, structured outputs, web_search, SSE/WebSocket transports, service tiers, connectors |
| Anthropic | sonnet, opus, opus48, opus47, haiku |
Claude 4.x, prompt caching, adaptive reasoning/effort, structured outputs, web_search, web_fetch, long context, task budget where supported |
gemini, gemini3, gemini3.1, gemini3flash |
Gemini native API, structured outputs, thinking controls, text/image/PDF/audio/video input, YouTube links through media attachments | |
| xAI / Grok | grok, grok4, grok-4.3 |
Grok models, reasoning controls, web_search, x_search, SSE/WebSocket transports |
| Hugging Face | kimi, kimi26instant, deepseek-hf, glm, minimax |
Hugging Face Inference Providers routing, curated aliases, and HF MCP authentication |
| Additional providers | deepseek, qwen-turbo, gpt-oss |
Groq, DeepSeek, Aliyun, OpenRouter, Open Responses, TensorZero, and generic OpenAI-compatible endpoints |
OpenAI Responses
Use the responses provider for GPT-5 class OpenAI models.
fast-agent --model "responses.gpt-5.5?reasoning=medium"
fast-agent --model "responses.gpt-5.5?web_search=on"
fast-agent --model "responses.gpt-5.5?verbosity=high&transport=ws"
fast-agent --model "responses.gpt-5.5?service_tier=fast"
Useful query parameters:
reasoning=none|minimal|low|medium|high|xhighdepending on modelverbosity=low|medium|highweb_search=on|offtransport=sse|ws|autoservice_tier=fast|flexwhere supported
Use the openai provider for Chat Completions-style models such as openai.gpt-4.1.
Anthropic
Anthropic support includes Claude-specific reasoning, caching, web tools, and structured-output selection.
fast-agent --model sonnet
fast-agent --model "sonnet?reasoning=4096"
fast-agent --model "opus?reasoning=auto"
fast-agent --model "opus?reasoning=xhigh"
fast-agent --model "opus?web_search=on&web_fetch=on"
fast-agent --model "opus?task_budget=128k"
Useful query parameters and config:
reasoning=auto|low|medium|high|max|offon adaptive-thinking modelsreasoning=xhighon Opus models that advertise it, such asopus,opus48, andopus47reasoning=<tokens>on older budget-thinking models, for examplereasoning=4096web_search=on|offweb_fetch=on|offtask_budget=20k|128k|offwhere supportedanthropic.cache_mode: auto|prompt|offanthropic.cache_ttl: 5m|1h
opus currently resolves to claude-opus-4-8; use opus47 or opus46 when you need to pin an
older Opus generation. Claude Opus 4.7+ uses adaptive reasoning rather than fixed thinking budgets:
reasoning=auto lets the model choose, effort levels tune depth and token spend, and task_budget
sets a model-visible budget for a whole agentic loop. task_budget is separate from max_tokens,
which remains the enforced per-response ceiling.
Structured outputs default to JSON schema on models that support Anthropic's structured-output
feature. Older models fall back to the legacy tool_use flow.
Use the native Google provider for Gemini models.
fast-agent --model gemini
fast-agent --model "gemini3?reasoning=auto"
fast-agent --model "google.gemini-3.1-pro-preview?reasoning=high"
Google models support structured outputs and multimodal inputs. Current fast-agent model metadata advertises text, image, PDF, audio, and video tokenization for Gemini models. YouTube links can be attached as media links when using a model that supports video input.
Useful query parameters:
reasoning=auto|minimal|low|medium|high|offstructured=json- sampling controls such as
temperature,top_p, andtop_kwhere applicable
xAI
Use the xAI provider for Grok models.
fast-agent --model grok
fast-agent --model "xai.grok-4.3?web_search=on"
fast-agent --model "xai.grok-4.3?x_search=on"
Hugging Face
Use the Hugging Face provider for Hugging Face Inference Providers routing and curated aliases.
fast-agent --model kimi
fast-agent --model kimi26instant
fast-agent --model "hf.moonshotai/Kimi-K2.6:novita?reasoning=on"
Additional providers
Use Additional Providers for hosted OpenAI-compatible APIs, routers, and local endpoints such as Groq, DeepSeek, Aliyun, OpenRouter, Open Responses, TensorZero, and generic endpoints.
fast-agent --model deepseek
fast-agent --model qwen-turbo
fast-agent --model groq.openai/gpt-oss-120b
fast-agent --model openrouter.google/gemini-2.5-pro-exp-03-25:free
fast-agent --model generic.llama3.2:latest
That page keeps the long-tail reference in one place, including config keys, API key environment variables, default endpoints, and provider-specific notes.
Model string format
Model strings follow this format:
- provider: the LLM provider, for example
responses,anthropic,google,xai,hf,azure,openrouter,generic, ortensorzero - model_name: the model or deployment name
- query parameters: provider/model-specific overrides such as
reasoning,structured,context,transport,service_tier,temperature(tempalias),web_search,web_fetch,x_search, andtask_budget
Examples:
responses.gpt-5.5?reasoning=mediumresponses.gpt-5.5?web_search=onsonnet?reasoning=4096opus?web_search=on&web_fetch=ongemini3?reasoning=autoxai.grok-4.3?x_search=onkimi26instanthf.moonshotai/Kimi-K2.6:novita?reasoning=onazure.my-deploymentgeneric.llama3.2:latestopenrouter.google/gemini-2.5-pro-exp-03-25:freetensorzero.my_tensorzero_function
Precedence
Model specifications follow this precedence order, highest to lowest:
- Explicitly set in agent decorators
- Command-line arguments with
--model - Default model in
fast-agent.yaml FAST_AGENT_MODELenvironment variable- System default (
gpt-5.4-mini?reasoning=low)
Reasoning
You can also set reasoning directly in the model string query. This is especially useful for provider-specific reasoning modes:
responses.gpt-5.5?reasoning=mediumsonnet?reasoning=4096(budget tokens)opus?reasoning=auto(adaptive default)opus?reasoning=xhigh&task_budget=128k(adaptive Opus + task budget)gemini3?reasoning=highxai.grok-4.3?reasoning=none
Reasoning, Verbosity and Task Budget settings are also available from the /model command, or by using F6 or F7 keys.
Temperature and sampling
You can set sampling temperature directly in the model string query:
responses.gpt-5.5?temperature=0.2openai.gpt-4.1?temp=0.7hf.moonshotai/Kimi-K2.6:novita?temperature=1.0&top_p=0.95
If temperature is omitted, fast-agent does not send a temperature parameter.
Only explicit values (for example via ?temperature= / ?temp= or request
params/config) are forwarded.
Model presets and model references
For convenience, popular models have built-in model presets such as codex or sonnet.
These are documented on the LLM Providers page.
You can also create local model overlays. These are environment-local named model entries that
bundle endpoint settings, auth, request defaults, and local metadata under a short token such as
qwen-local. See Model Overlays.
You can also define your own namespaced model references in fast-agent.yaml and
reference them with exact tokens like $system.fast.
If a configured model reference cannot be resolved, fast-agent logs a warning and automatically falls back to the next lower-precedence model source.
Default configuration
You can set a default model for your application in your fast-agent.yaml:
History saving
You can save the conversation history to a file by sending a ***SAVE_HISTORY <filename> message. This can then be reviewed, edited, loaded, or served with the prompt-server or replayed with the playback model.
File Format / MCP Serialization
If the filetype is json, fast-agent saves a {"messages": [...]} JSON container. It can contain either MCP PromptMessage objects (legacy) or PromptMessageExtended objects (preserves tool calls, channels, etc). fast_agent.load_prompt and prompt-server will load either the text or JSON format directly.
This can be helpful when developing applications to:
- Save a conversation for editing
- Set up in-context learning
- Produce realistic test scenarios to exercise edge conditions etc. with the Playback model