top of page
Search

Securing an Agentic AI application for Federal Data: How RedCastle Resources Built Trustworthy Guardrails with Model Armor

  • Writer: Maria Borja Arboleda
    Maria Borja Arboleda
  • 5 minutes ago
  • 5 min read

At RedCastle Resources, our work with the USDA Forest Service goes beyond mapping the nation's landscapes, it extends to making that data accessible. We built a conversational, AI-powered interface that will let USFS staff and the public ask natural-language questions about Landscape Change Monitoring System (LCMS) and Tree Canopy Cover (TCC) data.  How has land cover changed in a particular National Forest since 1985? How many acres were affected by a particular disturbance agent in a Ranger District? How do the change trends in a County compare to those at the State level? The chatbot translates those questions into real-time queries and fetches its response from official USFS databases and documentation.


Putting a large language model (LLM) in front of a federal data system raises an obvious question: how do you make sure it only ever does what it's supposed to do? That question was at the core of this tool’s design and development. Tool-level constraints, data source validation, and routing logic never let the model take an unauthorized action like hallucinate a statement or fabricate numbers. 


A second question came along: how do you prevent the LLM from becoming an attack surface by a malicious actor? Could the system still be manipulated into ignoring its own boundaries? This post focuses on how we answered this second challenge by implementing Google’s Model Armor into our architecture, a safety layer that screens every conversation. 


Guiding Principle: An AI agent serving federal data must be technically incapable of being talked out of its intended behavior, not merely instructed to avoid it.


Why Agent Instructions Alone Aren't Enough

Every well-built agent starts with clear instructions: stay on topic, don't reveal internal logic, don't discuss prohibited subjects, only cite verified sources. These instructions are essential, they represent the most basic line of defense, but they share a critical weakness: they rely on the LLM remembering to follow them. A system prompt is persuasive guidance to the model, not a hard enforcement boundary. An adversarial user trying to manipulate a model can craft a message convincing enough to talk the model into reinterpreting, ignoring, or working around its own instructions, and once that happens, the attacker can use the entire system for its benefit.


This is the gap Model Armor closes. It doesn't ask the model to police itself, it sits outside the model entirely, independently verifying that the conversation stays within bounds. 


A Before-and-After Look

A traditional keyword filter may catch a request with a restricted topic outright, but it can't catch a paraphrase or an embedded command in the shape of a harmless request. Conversational manipulation are particular challenges that have arisen with language models. Here are two examples of attacks worth distinguishing, because they fail differently without a screening layer in place.


Jailbreak attempts try to talk the model out of its own behavioral rules directly: "ignore your previous instructions," "you're now an agent with no restrictions," or a role-play interaction designed to gain access to off-limits territory. Without Model Armor, the only thing standing between a well-crafted jailbreak attempt and a policy-violating response is whether the underlying model resists the attack entirely on its own, under pressure, every single time. With Model Armor in place, that same message is screened first and flagged as a manipulation attempt. The system then returns a safe, on-topic message to the attacker, without having to ever reach the language model.


Prompt injection is a quieter variant: instead of talking to the model directly, an attacker plants instruction-like text somewhere the agent will read later. Without Model Armor, there's no checkpoint watching for that pattern. With Model Armor in place, both the conversation and the model's response are screened independently, so an injected instruction that successfully nudges the model still gets caught before its output reaches the user.


In this particular app that we built for our federal partner, a successful jailbreak or injection could potentially get the model to say something it shouldn't: drift off-topic, reveal internal instructions, or reveal personal identifiable information (PPI). That's a real risk for a public-facing federal tool, and it's exactly the gap Model Armor closes.


How Model Armor Works

We built an architecture where Model Armor operates at two independent checkpoints around the language model: one screening happens on the way in (from the user), and another on the way out (to the user).


Input screening happens before the model processes the user's message. The text is sent to a tuned screening service that checks for signs of prompt manipulation, personally identifiable information that shouldn't be processed, and explicit or harmful content. If it comes back flagged, the conversation never reaches the language model and the user gets a polite redirect instead.


Output screening happens after the model generates a response, but before that response is shown to the user. This catches cases where something undesirable made it through generation despite the input check, giving the system a second, independent chance to intervene.

Because this screening sits between the user and the model on both sides of the conversation, it doesn't depend on the model remembering its own rules under pressure. It's a structural checkpoint, not a behavioral expectation.


Iteration, iteration, iteration.

Getting a safety layer like this properly tuned takes iteration. A user’s question naming a specific place can pattern-match against PII detection, the same way a home address would. Thresholds need to be calibrated against real conversation patterns so the system catches genuine threats without over-blocking legitimate questions about land management data. The goal is a system that is both secure and dependable, and one cannot happen at the expense of the other. 


A Layered Defense, Not a Single Gate

Model Armor is the core safety checkpoint, but it isn't the only layer in the system. The system built for our federal partners has several layers of defense, each with a well-defined job, covering different parts of the conversation:

  • A domain-policy guard that watches for specific sensitive topics the agent should never engage with, redirecting those conversations toward the agent's actual data and documentation capabilities instead of refusing outright.

  • A link sanitizer checks every URL the model wants to share against a strict allow-list of official USFS and project domains, so a fabricated or incorrect link can never reach a user.

  • Model-level safety configuration on the underlying LLM itself, providing a baseline content filter beneath everything else.


Getting There: A Practical Checklist

For teams building similar safety architecture around a federal or otherwise sensitive data system, a few phases have served us well:


  • First and last point screening. Add input and output screening as early and last checkpoints in the agent's request flow, don’t rely exclusively on application logic.

  • Tune before you enforce. Run the screening layer in an observation mode first, so you can calibrate sensitivity without disrupting beta-testers while you learn what your system's real blocking capabilities look like.

  • Layer, don't replace. Besides your system prompt guardrails, keep domain-specific guards, output validation, and model-level safety settings in place alongside the screening layer. Each catches something the others don't.


Building AI tools for federal land management data means the bar for trust is high, the people that will be using this chatbot need to know it will stay exactly within its intended scope, every time. By treating safety as an independent, structural layer rather than a matter of instructions and good faith, RedCastle Resources was able to deliver a conversational AI experience for the USFS that's both genuinely useful and built to hold up under real-world pressure.


 
 
 
bottom of page