Deploying an AI agent in a company: what I wish I'd known beforehand

A few months ago, I started integrating AI agents into real professional contexts. Not to experiment on the side, but to answer genuine business needs. And I quickly realized that building an agent that works in a demo is one thing - keeping it running in production is another story entirely.

This article is an honest field report. Not an exhaustive list, but rather a distillation of what I wish I'd had on hand when I started. I'm still learning, and this text will likely evolve over time.

What is an AI agent, concretely?

An AI agent is a program capable of reasoning autonomously to accomplish a task. Where a simple call to a language model is limited to a single exchange - one message in, one response out - an agent can break an objective down into several steps, make decisions along the way, and act accordingly.

What makes all of this possible are tools. A tool is a function the agent can decide to call on its own when it needs to. Concretely, that can be a web search, a database query, sending an email, a call to an external API, or even running a piece of code.

The language model acts as the brain: it analyzes the user's message, decides which tools to use, interprets their results, and determines whether the task is complete or whether it needs to keep going. This cycle - reason, act, observe, reason again - is what fundamentally distinguishes an agent from a simple chatbot.

It's a powerful architecture. But it quickly raises practical questions: how do you make sure the agent selects the right tools? How do you keep it from going off in every direction? How do you monitor its behavior in production? That's exactly what I cover next.

Framework, aggregator, or direct API: how to choose?

The choice of technical tooling is far from trivial. It shapes development speed, maintainability, and the agent's long-term robustness.

A framework like LangChain lets you get started fast thanks to ready-made abstractions - context management, tools, memory. But it introduces a hidden complexity that can become a serious drag in production: frequent breaking changes, opaque debugging, over-engineering for simple cases.

An aggregator like OpenRouter offers welcome flexibility by giving access to dozens of models through a single unified API. It's ideal for comparing models or switching easily from one to another. The trade-off: an extra dependency, and sometimes delayed access to the latest features.

A provider's direct API guarantees full control, optimal performance, and long-term stability. But it requires more upfront investment to build your own orchestration layer.

In practice, many teams follow a natural path: a framework to explore and prototype, then a direct API once the requirements stabilize. The aggregator only comes into play if multi-model flexibility becomes a real business need - and not a theoretical precaution.

Mastering tool selection

While building my agent, I quickly ran into a classic pitfall: the agent systematically selected too many tools from a single user message. The result was skyrocketing token consumption, slower responses, and a climbing API bill.

I identified three approaches to finely control this mechanism, each with its own trade-offs.

Building your own selection mechanism

You can take full control by using regular expressions or classification rules to analyze the incoming message and match it to a predefined group of tools.

The benefits are real: you avoid an extra call to the model, which translates into less latency and significant token savings. But this approach is rigid by nature. It works well on specialized agents with a narrow scope, and quickly shows its limits as soon as the agent becomes more versatile. A rule doesn't capture nuance.

Letting the AI choose its own tools

This is the approach frameworks like LangChain or LlamaIndex offer by default: delegating selection to the model itself. The major advantage is contextualization - the model can lean on the conversation history to make a more informed choice. It understands the intent, not just the words. The trade-off is a slightly higher cost and a somewhat longer response time. Nothing prohibitive on a modest agent, but it counts at scale.

Moving to a multi-agent architecture

When the number of tools piles up and the business logic grows more complex, the two previous approaches show their limits. That's when you seriously consider setting up sub-agents.

The principle: an orchestrator agent receives the message, assesses the overall intent, and delegates to specialized sub-agents - each with its own restricted set of tools. You get a form of separation of concerns, familiar to any developer. This architecture is the most flexible and the most scalable. But it's also the most expensive to run and to maintain. It's justified when the agent's complexity genuinely demands it - not before.

Observability: the thing we always neglect at first

Building an agent that works in a demo is one thing. Making sure it behaves correctly in production, day after day, is another. That's where observability comes in - often neglected at first, always missed when it's absent.

An AI agent is a black box by nature: chained LLM calls, dynamically selected tools, contexts that pile up. Without instrumentation, diagnosing abnormal behavior is like looking for a needle in a haystack.

What you need to be able to measure

Timing performance. Every step of the pipeline should be timed: the model call, a tool's execution, context retrieval. It's the only way to spot bottlenecks and understand why a response takes 8 seconds instead of 2.

Token consumption. This is both a cost and a reliability concern. You need to watch that consumption stays within the limits set by the API to avoid rate limiting errors, but also to detect drift - a context growing abnormally, a tool generating disproportionate responses. Without that visibility, the nasty surprises land at the end of the month on the bill.

The relevance of selected tools. Did the agent choose the right tools for the request? Did it select useless tools that bloated the query? Tracing these decisions lets you detect problematic patterns and refine the system prompt accordingly.

Adherence to the prompt's instructions. The model can sometimes drift - ignoring format constraints, replying in the wrong language, stepping outside its defined scope. Regularly assessing the alignment between the instructions and the outputs produced is essential, especially after every change to the system prompt.

Hallucinations. This is probably the most critical risk. The agent can produce factually false information, invent tool results, or assert things with unjustified confidence. This type of error is silent - no exception raised, no warning signal - and can have serious consequences depending on the use case.

The tools to address this

Server-side logs are the unavoidable starting point. Well structured - JSON with a timestamp, session identifier, step, duration, tokens consumed - they form the basis of any investigation. The goal is to be able to faithfully reconstruct the complete trace of an interaction from the logs alone.

An automated AI-based flagging system goes further. The idea: have a second model analyze the agent's outputs - or the same one with a dedicated evaluation prompt - to automatically detect anomalies. Potential hallucinations, instruction violations, responses inconsistent with the context. It's particularly useful for monitoring a high volume of interactions without systematic manual review.

Specialized tracking solutions like LangSmith offer a dedicated interface to visualize execution traces, compare runs, evaluate performance over time, and share test sets. This kind of tool quickly becomes indispensable as soon as the agent grows in complexity.

These three levels are complementary rather than substitutable. Logs capture everything, automated flagging detects the abnormal, and tracking gives the big picture. An agent without observability is an agent you can't really maintain - you endure its behavior instead of steering it.

Prompt engineering: framing your agent without smothering it

If the tools define what the agent can do, the prompt defines what it should do - and how. It's the agent's founding document, the one you tweak constantly and where every word counts.

The fundamentals of the system prompt

Tone and personality. An agent that responds in a cold, technical way won't suit a consumer-facing application. Conversely, an agent that's too casual can lack credibility in a professional context. Explicitly defining the expected register prevents drift and gives consistency to every interaction.

Role and mission. The agent must know precisely who it is and what's expected of it. A clear description of its scope anchors its behavior and reduces the risk of off-topic responses.

Presenting the available tools. Even though the framework technically handles exposing the tools to the model, it's helpful to mention them explicitly in the prompt with a brief description of their use. It helps the model better contextualize when and why to use them.

The current date and time. This is a detail that makes a real difference in practice. Language models have no awareness of the present moment - their knowledge stops at their training date. Dynamically injecting the date and time into the system prompt on every request lets the agent place itself correctly in time. It's indispensable as soon as it has to handle scheduling tasks, reminders, or anything involving a notion of a calendar.

Security guardrails

An agent exposed to real users must be protected - against misuse, attempts to manipulate the prompt, or simply unintended drift.

A few explicit rules in the prompt are often enough to cover the essentials: specifying that the agent must strictly limit itself to its mission, that it must not respond to requests outside its scope, and that it must rigorously follow the instructions it receives. These are simple instructions, but their absence is quickly felt in production.

These rules can be complemented by rate limiting at the application level, to prevent a user from soliciting the agent abusively. The two levels of protection are complementary: one frames the AI's behavior, the other frames its use.

Using AI to write and refine your prompt

This is probably the most underrated piece of advice: what better than the AI itself to write or improve its own prompt? By simply describing the agent's goal, its context, and its constraints, you can ask the model to propose a first structured version - often better than a first hand-written draft.

And even if you'd rather write it yourself, submitting the prompt to a review by the AI is a valuable reflex. It can point out ambiguities, contradictory instructions, or edge cases you hadn't anticipated.

Finding the right balance

This is the trap you fall into easily: in trying to control everything, you end up over-constraining the agent. Too many rules, too many restrictions, too many special cases to handle - and the agent loses fluidity, sometimes even effectiveness on its basic tasks.

A good prompt is one that frames without shackling. It gives a clear direction, reasonable limits, and then leaves the model enough latitude to adapt to unforeseen situations. Finding that balance takes iteration - and it's often by observing the agent's real behavior in production that you understand where to tighten or loosen the constraints.

To conclude (provisionally)

Deploying an AI agent in a company is work that doesn't stop at the first deployment. It's a living product, one that demands monitoring, prompt iteration, and real thought about the architecture as needs evolve.

The mistakes I've described here, I've made them all. Some cost me time, others money. A few simply taught me how not to do things.

This field report is a work in progress. I'll keep enriching it as I move forward on these topics.

This article will be updated as my experiments continue. If you have questions or feedback on this approach, reach out via the contact page.