Author(s): Michał Czarnecki
Originally published on Towards AI.
Hi! In this part we move from experiments and prototyping to the real world – production implementations.
Because the truth is: building a working notebook or proof of concept is just the beginning. The real challenges begin when the application must support hundreds or thousands of users, operate reliably 24/7, and still stay within budget.
Let's start with the first foundation: a model-agnostic approach.
Model-agnostic from day one
Many teams building AI applications quickly lock themselves into a single vendor – just OpenAI or just Anthropic. It's understandable: it's faster to pick one API and focus. But in the long run it is a huge risk. If a provider increases prices, goes out of business, or changes license terms, the entire application may come to a halt.
That's why it's worth building a model-agnostic gateway layer from the very beginning.
In practice, this means that your code does not communicate directly with one specific model. Instead, it invokes an abstraction:
- “give me a chat LLM,” or
- “give me an embedding generator”
Only the gateway decides whether, under the hood, it should call GPT-5, Claude 4.5 Sonnet, or a local LLaMA running on your own infrastructure.
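In code, that abstraction can be as thin as a single factory function. Below is a minimal sketch; the get_chat_model helper and the provider names are illustrative, not part of any library:
from langchain_openai import ChatOpenAI

def get_chat_model(provider: str = "openai", model_name: str = "gpt-4o-mini"):
    """Return a chat model without the caller knowing which vendor sits behind it."""
    if provider == "openai":
        return ChatOpenAI(model=model_name, temperature=0)
    # elif provider == "anthropic":
    #     return ChatAnthropic(model=model_name, temperature=0)
    # elif provider == "local":
    #     return ChatOllama(model=model_name)  # e.g. a local LLaMA
    raise ValueError(f"Unknown provider: {provider}")

# Swapping vendors later is a one-line change at the call site:
llm = get_chat_model("openai", "gpt-4o-mini")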
API gateway + routing + fallback
The second foundation is API Gateway.
Imagine you expose a simple endpoint such as POST /v1/chat where users send requests. In a header, e.g. X-Model, the client specifies which model to use.
The gateway can run multiple models in parallel and can also implement fallback logic: if the primary model fails to respond within a certain time, the gateway automatically switches to a backup model, such as an open-source model running locally.
This pattern not only improves reliability, but also opens the door to experimentation.
You can send 1% of your traffic to a new model and see how it performs compared to the previous one without changing the entire system.
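A simple version of this fallback can already be sketched in LangChain with with_fallbacks; the model choices and timeout below are only examples:
from langchain_openai import ChatOpenAI

primary = ChatOpenAI(model="gpt-4o-mini", temperature=0, timeout=10)
backup = ChatOpenAI(model="gpt-4o", temperature=0)  # could just as well be a locally hosted model

# If the primary call fails (timeout, rate limit, outage), the backup is tried automatically.
model_with_fallback = primary.with_fallbacks([backup])

answer = model_with_fallback.invoke("Summarize the benefits of an API gateway in one sentence.")
print(answer.content)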
Cost monitoring and control
The third foundation – often neglected – is cost monitoring and control.
In a prototype, you just say “it works”. In production you will be asked more difficult questions:
- How much does it cost per day?
- What is our hallucination rate?
- How often do we reject results?
This is where tools like LangSmith help – but even a simple internal logging system can work.
We measure latency (because users don't want to wait 30 seconds), we measure cost, and we measure quality – for example: how many responses were rejected by guardrails or grading.
We can also set very simple but effective alerts:
- if daily cost exceeds $50 → send notification,
- if the average response time exceeds 5 seconds → trigger another alert.
This gives you real insight into what is happening inside the system.
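A minimal in-house version of such monitoring could look like the sketch below; the thresholds and the send_notification helper are illustrative assumptions:
import time

DAILY_COST_LIMIT_USD = 50.0
LATENCY_LIMIT_S = 5.0
daily_cost_usd = 0.0
latencies = []

def send_notification(message: str):
    # Replace with a Slack / e-mail / PagerDuty integration in a real system.
    print(f"[ALERT] {message}")

def log_call(cost_usd: float, latency_s: float):
    global daily_cost_usd
    daily_cost_usd += cost_usd
    latencies.append(latency_s)
    if daily_cost_usd > DAILY_COST_LIMIT_USD:
        send_notification(f"Daily cost exceeded: ${daily_cost_usd:.2f}")
    if sum(latencies) / len(latencies) > LATENCY_LIMIT_S:
        send_notification("Average response time exceeds 5 seconds")

# Usage: wrap every LLM call.
start = time.time()
# response = chain.invoke(...)  # your actual call
log_call(cost_usd=0.002, latency_s=time.time() - start)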
These three elements – a model-agnostic gateway, an API gateway, and monitoring – are not “nice to have.” They are the foundations. If you take them seriously, your application will not only work in production, but will also remain resilient to changes in the market and technology.
Now let's move on to the code.
Install the libraries and load the environment variables
!pip install -U langchain langchain-openai langgraph fastapi uvicorn
from dotenv import load_dotenv
load_dotenv()
Human in the loop
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command

@tool
def risky_operation(secret: str) -> str:
    """Perform a risky operation that requires manual approval."""
    return f"Executed risky operation with: {secret}"

tools = [risky_operation]
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
hitl = HumanInTheLoopMiddleware(
    interrupt_on={
        "risky_operation": {"allowed_decisions": ["approve", "edit", "reject"]}
    },
    description_prefix="Manual approval required for risky operation:",
)
checkpointer = MemorySaver()
agent = create_agent(
    model=model,
    tools=tools,
    middleware=[hitl],
    checkpointer=checkpointer,
    debug=True,
)
config = {"configurable": {"thread_id": "hitl-demo-1"}}
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Please run the risky operation with secret code $%45654@."}]},
    config=config,
)
Output:
(values) {'messages': (HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73'))}
(updates) {'model': {'messages': (AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=({'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}), usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}))}}
(values) {'messages': (HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73'), AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=({'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}), usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}))}
(updates) {'__interrupt__': (Interrupt(value={'action_requests': ({'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'description': "Manual approval required for risky operation:nnTool: risky_operationnArgs: {'secret': '$%45654@'}"}), 'review_configs': ({'action_name': 'risky_operation', 'allowed_decisions': ('approve', 'edit', 'reject')})}, id='a3abdfe342bd7c8be8b1b586ee9f8815'),)}
Interrupt handling:
if "__interrupt__" in result:
print("Interrupt detected!")
decisions = ({"type": "approve"})result = agent.invoke(
Command(resume={"decisions": decisions}),
config=config,
)
Output:
(values) {'messages': (HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73'), AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=({'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}), usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}), ToolMessage(content='Executed risky operation with: $%45654@', name='risky_operation', id='13109032-38fb-4d94-920c-90026acc41f3', tool_call_id='call_dK786IhVaO3Z4VssPOI1cM6y'))}
Model-agnostic API gateway
To run the sample code below with a model-agnostic API gateway:
1. Place the code below in a file named app.py:
# app.py
from fastapi import FastAPI, Header
from pydantic import BaseModel
from langchain_core.runnables import RunnableLambda
from langchain_core.messages import AIMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    provider: str
    model: str
    answer: str
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{message}"),
])
def build_model(x_model: str):
    """
    x_model format:
    - 'openai:gpt-4o-mini'
    """
    if ":" in x_model:
        provider, model_name = x_model.split(":", 1)
    else:
        provider, model_name = "openai", x_model
    provider = provider.lower().strip()
    if provider == "openai":
        return provider, model_name, ChatOpenAI(model=model_name, temperature=0)
    # if provider == "anthropic":  # support for another LLM API provider
    #     return provider, model_name, ChatAnthropic(model=model_name, temperature=0)
    def _unknown(prompt_value):
        # Fallback for unrecognized providers: echo the user message back.
        return AIMessage(content=f"(unknown provider) Echo: {prompt_value.to_messages()[-1].content}")
    return "unknown", x_model, RunnableLambda(_unknown)
app = FastAPI(title="Model-Agnostic LangChain Gateway")

@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(
    req: ChatRequest,
    x_model: str = Header(default="openai:gpt-4o-mini", alias="X-Model"),
):
    provider, model_name, model = build_model(x_model)
    chain = prompt | model | StrOutputParser()
    answer: str = chain.invoke({"message": req.message})
    return ChatResponse(provider=provider, model=model_name, answer=answer)
2. Start the server:
uvicorn app:app --reload
3. Send a request:
curl -X POST 'http://127.0.0.1:8000/chat' \
  -H 'Content-Type: application/json' \
  -H 'X-Model: openai:gpt-5-mini' \
  -d '{"message":"List 3 advantages of Python."}'

curl -X POST 'http://127.0.0.1:8000/chat' \
  -H 'Content-Type: application/json' \
  -H 'X-Model: openai:gpt-4o-mini' \
  -d '{"message":"List 3 advantages of Python."}'
The future of GenAI
This brings us to the second part of this episode: the future of GenAI.
What will this industry look like in the next few years? No one has a crystal ball – but some trends are already very clear.
Trend No. 1: Multimodality
Models such as GPT-5 or Claude 4.5 can already analyze images, audio and video. This will soon be standard.
When you build apps, you need to assume that users won't just send text. They will send screenshots, photos of documents, audio recordings. Your architecture must be ready for this.
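As a minimal sketch of what that can look like with a vision-capable model (the image URL is a placeholder):
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(model="gpt-4o-mini")  # any vision-capable chat model

# A single message can mix text and image blocks.
message = HumanMessage(content=[
    {"type": "text", "text": "What document is shown in this photo?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/scan.jpg"}},
])

response = model.invoke([message])
print(response.content)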
Trend #2: Agentic workflows
Classic APIs and linear workflows are not enough when the process is complex and dynamic.
Instead of hard-coding the conditions in traditional code, we will declare agent state graphs – Researcher, Critic, Expert – and let the system iterate based on state and quality signals.
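A minimal LangGraph sketch of such a loop could look like this; the node logic is stubbed out, and in a real system each node would call a model:
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    draft: str
    quality: float

def researcher(state: State) -> dict:
    return {"draft": state.get("draft", "") + " new findings"}

def critic(state: State) -> dict:
    return {"quality": min(1.0, state.get("quality", 0.0) + 0.4)}

def good_enough(state: State) -> str:
    return "done" if state["quality"] >= 0.8 else "revise"

graph = StateGraph(State)
graph.add_node("researcher", researcher)
graph.add_node("critic", critic)
graph.add_edge(START, "researcher")
graph.add_edge("researcher", "critic")
graph.add_conditional_edges("critic", good_enough, {"revise": "researcher", "done": END})

workflow = graph.compile()
result = workflow.invoke({"draft": "", "quality": 0.0})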
With these trends in mind, we can prepare our applications for the next generation of even more efficient AI models.
That's all for this chapter on the model-agnostic pattern, the LLM API gateway, and future AI trends.
See next chapter
See previous chapter
See the full code for this article in the GitHub repository
Published via Towards AI