If you are building anything called “agentic GeoAI,” your architecture is one of four things. Deterministic machine learning on imagery. Retrieval-augmented generation over an image archive. A language model with tool calls into your pipelines. A chain of language model agents. They look similar in marketing copy. They behave differently in what they carry, and what they drop, on the way to a buyer’s decision.

Earth observation has spent over a decade building uncertainty estimation into deterministic ML. Sentinel-2’s S2-RUTv1 publishes per-pixel uncertainty at the source. Sen2Cor’s cloud-omission rate averages 37.4% in validated scenes and reaches 87% in difficult ones. Marsocci and Scardapane’s 2024 evaluation across eight foundation models, eleven areas of interest, and seven tasks found that no foundation model has stable out-of-distribution calibration. The numbers exist. The methods to report them exist. Yet no commercial agentic GeoAI product publicly documented as of May 2026 offers a path for those numbers to reach the buyer’s decision.

Language models are trained to maximise next-token likelihood given context. They are not trained to propagate the uncertainty of upstream computations. When the synthesis layer writes “the deforestation rate is 4.2%” as the conclusion of a tool call or a retrieval, the number is a token prediction conditioned on the prior tokens. The confidence interval that came with it does not change the next-token decision.

Four architectures, and why we have these four

“Agentic” predates LLMs by several decades. In classical AI it names any system that perceives an environment, selects actions, and pursues goals. Reinforcement learning agents are agentic. Bayesian sensor fusion with message-passing between specialised models is agentic. Classical multi-agent systems built on planning languages are agentic. None of these require a language model. Some have been deployed in defence applications since the 2000s. Almost none reached commercial civilian Earth observation.

The current market category, agentic AI as marketed since 2023, converged on a much narrower architectural choice. Four architectures dominate the commercial deployments visible in May 2026.

Deterministic machine learning on imagery. A trained model produces a labelled output from satellite input. Change detection, segmentation, classification, regression. Sentinel-2 scenes go in, vegetation indices come out. Most of what works at production scale in EO today sits here.

Retrieval-augmented generation over an image archive. A query produces vector-similarity matches from an image embedding store, and an LLM summarises the matches in natural language. Sometimes called search-and-summarise when sold honestly.

A language model with tool calls into pipelines. The LLM coordinates. It decides which tool to invoke, parses each tool’s structured output, and writes a prose synthesis. ReAct, OpenAI function calling, MCP, LangChain agents, Anthropic Agent Skills. The market default for “agentic” since 2024.

A chain of language model agents. Multiple LLM-driven agents coordinate over multi-step workflows, each potentially calling its own tools. Multi-agent in the sense most vendors mean it now: LLM coordinators all the way down.

Other architectures deliver agentic functionality without LLMs. Reinforcement learning agents over EO observations. Bayesian belief networks for multi-sensor fusion. Active-learning loops where the system selects which scenes to label next. Classical planning agents. Each can perceive, decide, and act without an LLM in the coordinator slot. Some handle uncertainty natively in ways the LLM does not. None dominates the current commercial market.

The reason we have the four architectures we do, rather than the older alternatives, is cost. Building an LLM-coordinator agent costs one to two orders of magnitude less than building an RL agent or a Bayesian fusion system. You write prompts and define tools instead of designing reward functions, training environments, simulation infrastructure, and Bayesian priors. A small team ships in weeks instead of a research lab in years.

Cost was not the only change. ChatGPT made natural-language interfaces universal, so buyers who never wanted to learn EO now expect to ask questions in English. Satellite data volumes rose roughly tenfold between 2020 and 2025, through Planet’s daily revisit, new SAR constellations, and hyperspectral sensors, and the human-analyst-as-agent model started breaking. Vendors needed automation regardless of architecture. Defence procurement (NGA, DoD, NATO MoDs) began calling for AI-augmented analysis after 2022, with civilian climate-finance buyers following suit, giving vendors an incentive that was not there before.

None of these forcing functions made the LLM the right architecture for problems requiring calibrated uncertainty propagation. They made it an available architecture.

When agentic was expensive, you only built it for problems where the alternatives were demonstrably worse. The engineering cost forced discipline about which problems deserved an agentic architecture. When agentic became cheap, founders built it everywhere, including problems where the LLM is structurally wrong for what the buyer needs. The decade of uncertainty work in deterministic ML does not follow the architecture up the stack: calibration validation, conformal prediction wrappers, Bayesian propagation, and foundation-model out-of-distribution (OOD) evaluation all stay behind.

What each architecture carries and drops

Each of the four architectures handles upstream uncertainty differently.

Figure: comparison of four agentic GeoAI architectures (deterministic ML on imagery, RAG over an image archive, LLM with tool calls, chain of LLM agents). Each panel shows what the architecture produces, with uncertainty in orange, and what reaches the buyer without it. Caption: What each architecture produces. What the buyer gets.

Deterministic ML on imagery carries uncertainty when it is designed in. Per-pixel uncertainty exists for major government missions. Sentinel-2’s S2-RUTv1 (Gorroño et al. 2018) publishes per-pixel radiometric uncertainty at the L1 product level. Landsat-8 OLI is calibrated against NIST traceability at approximately 2.5%. MODIS reflectance is reported with a ±2% uncertainty. The well-handled cases publish the numbers. Sen2Cor’s cloud-omission rate averages 37.4% in validated scenes and reaches 87% in difficult ones (Coluzzi et al. 2018). Aerosol optical thickness can exceed 160% relative error. Absolute geolocation drifts 11 to 14 metres at the 95th percentile. None of this is hidden. The path from these numbers to the buyer’s decision is missing. Marsocci and Scardapane’s 2024 evaluation (arXiv 2409.08744) ran around 500,000 linear probes across eight foundation models, eleven areas of interest, and seven tasks. No foundation model showed stable out-of-distribution calibration. The architecture carries the work and the next layer up usually does not.

Open-data government missions are not the same as commercial imagery on this question. The numbers above come from ESA, USGS, and NASA programmes that publish their calibration work and uncertainty budgets in peer-reviewed literature. Commercial imagery is often a different story. Maxar’s high-resolution optical products, Planet’s SkySat and PlanetScope, BlackSky’s Gen-3, ICEYE, Capella Space, Umbra, and Satellogic do not provide the equivalent of S2-RUTv1’s per-pixel uncertainty product to downstream users. None publishes granular per-pixel uncertainty at scale. Commercial small-sat SAR has the thinnest public calibration literature of all.

Traditional Cal/Val (instrumented test sites, ground-truth campaigns, peer-reviewed methodology) takes months to years and incurs substantial costs. Government missions have decades of budget for it. Commercial constellations launch fast, iterate fast, and serve customers who pay for fresh imagery and revisit cadence, not calibration certificates. Commercial vendors removed the work from the workflow because the buyer was not asking for it. It is a sensible choice. It is also a loss, especially in the agentic applications of EO. If your imagery is commercial, the uncertainty that the deterministic-ML layer can carry depends on what the imagery vendor produced.

RAG over an image archive drops the retrieval question. Vector-similarity matches in an embedding space are not well-calibrated to semantic correctness. The retriever’s similarity score answers “is this image embedding close to the query embedding,” not “is this the right scene to answer the question.” When an LLM summarises retrieved tiles in natural language, the similarity scores do not enter the prose. The summary speaks as if the retrieved evidence was the right evidence. Image-RAG inherits a problem text-RAG has documented since 2023: hallucinated synthesis over retrieved context, where the LLM generates fluent prose that references the right kind of source but synthesises content the source does not support. In imagery, the failure mode is worse because semantic correctness in remote sensing depends on temporal context, sensor type, and acquisition geometry, none of which are well-represented in current imagery embeddings.
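The point about similarity scores never reaching the prose can be made concrete with a minimal sketch. It assumes toy three-dimensional embeddings, hypothetical scene IDs, and an arbitrary `min_score` threshold; none of these come from a real product. The sketch keeps the score attached to each match and returns an empty list when nothing clears the threshold. The score remains a geometric distance, not a probability that the scene answers the question.

```python
import math
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    scene_id: str   # hypothetical scene identifier
    score: float    # cosine similarity, NOT a probability of correctness


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: List[float], archive: Dict[str, List[float]],
             k: int = 2, min_score: float = 0.8) -> List[Match]:
    """Return the top-k matches that clear min_score, with scores attached.

    An empty list means "no good match" and should be reported as such,
    rather than summarising whatever came back.
    """
    ranked = sorted(
        (Match(scene_id, cosine(query, emb)) for scene_id, emb in archive.items()),
        key=lambda m: m.score,
        reverse=True,
    )
    return [m for m in ranked[:k] if m.score >= min_score]


# Toy embeddings for illustration; real image embeddings are far larger.
archive = {
    "scene_A": [1.0, 0.0, 0.0],
    "scene_B": [0.6, 0.8, 0.0],
    "scene_C": [0.0, 0.0, 1.0],
}

good = retrieve([1.0, 0.0, 0.0], archive)                 # only scene_A clears 0.8
none = retrieve([0.0, 1.0, 0.0], archive, min_score=0.9)  # nothing clears 0.9
```

A summariser that receives `none` has nothing to summarise, which is the graceful "no good match" path. One that receives `good` can at least quote the score next to the scene ID.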

A language model with tool calls flattens structured tool uncertainty. The tool returns a structured output. A value, a confidence interval, a probability score, a model checkpoint. The LLM ingests it, decides what to write, and produces prose. The prose is a sequence of token predictions that maximise likelihood given the training data. KalshiBench (Bartlett et al., December 2025) reports expected calibration errors of 0.120 for Claude Opus 4.5 and 0.395 for GPT-5.2-XHigh, compared with human superforecasters at 0.03 to 0.05. Reasoning-enhanced models score worse on calibration than non-reasoning models at comparable accuracy. The Agent UQ survey (arXiv 2602.05073, February 2026) finds that standard uncertainty-quantification methods perform nearly at random on tool-use benchmarks. The synthesis layer does not preserve what the tool returned.
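A small sketch shows what the flattening looks like and one way to detect it. The `ToolResult` fields, the checkpoint name, and the `interval_survived` check are all hypothetical, and the numbers reuse the deforestation example from this piece. The idea is to build the sentence from the structured fields and then test whether a synthesis still contains the interval bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """Hypothetical structured output of a deterministic EO tool."""
    name: str
    value: float       # point estimate, in percent
    ci_low: float      # lower interval bound, in percent
    ci_high: float     # upper interval bound, in percent
    ci_level: float    # e.g. 0.95
    checkpoint: str    # model version that produced the estimate


def render(result: ToolResult) -> str:
    """Build the sentence from the fields themselves, not from free-form prose."""
    return (
        f"The {result.name} is {result.value:.1f}% "
        f"({result.ci_level:.0%} CI {result.ci_low:.1f} to {result.ci_high:.1f}%, "
        f"model {result.checkpoint})."
    )


def interval_survived(result: ToolResult, prose: str) -> bool:
    """Crude string check that both interval bounds still appear in a synthesis."""
    return f"{result.ci_low:.1f}" in prose and f"{result.ci_high:.1f}" in prose


result = ToolResult("deforestation rate", 4.2, 2.8, 6.5, 0.95, "v1.3-hypothetical")

templated = render(result)
flattened = "The deforestation rate is 4.2%."  # the kind of sentence a default synthesis writes
```

The string check is deliberately naive. It catches a dropped interval but not a misquoted or misattributed one, so it is a floor for validation, not a calibration test.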

A chain of language model agents compounds the loss. Each step’s intrinsic uncertainty is hard enough to handle. Extrinsic uncertainty, inherited from earlier steps, dominates in agentic chains and goes untracked by default. UProp (Liu et al., June 2025, arXiv 2506.17419) decomposes uncertainty into intrinsic and extrinsic components and shows the extrinsic component grows with chain length. SAUP (ACL 2025) recovers up to 20% AUROC improvement when step-wise uncertainty is propagated explicitly; standard methods that do not propagate get close-to-random performance on τ²-bench. OpenAI Agents, LangChain, Anthropic Agent Skills, and the rest do not propagate calibrated uncertainty through their multi-step workflows, though the methods exist.
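The compounding can be illustrated with a toy calculation. This is not UProp or SAUP. It is a deliberately simple sketch that assumes each step has a known probability of being right and that steps fail independently, which real agent chains generally do not satisfy. The function names are illustrative.

```python
import math
from typing import List, Optional


def chain_reliability(step_confidences: List[float]) -> float:
    """Probability that every step is right, ASSUMING independent steps."""
    return math.prod(step_confidences)


def first_weak_step(step_confidences: List[float], floor: float) -> Optional[int]:
    """Index of the first step at which cumulative reliability falls below floor."""
    running = 1.0
    for i, p in enumerate(step_confidences):
        running *= p
        if running < floor:
            return i
    return None


three_steps = chain_reliability([0.95] * 3)     # about 0.857
ten_steps = chain_reliability([0.95] * 10)      # about 0.599
checkpoint_at = first_weak_step([0.95] * 10, floor=0.8)   # zero-based index 4
```

Even with each step at 95%, a ten-step chain under this assumption is right end to end roughly 60% of the time. The step index returned by `first_weak_step` is one candidate location for an intermediate checkpoint.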

The architecture that handles uncertainty natively (deterministic ML) handles it at the bottom of the stack and loses it on the way up to the buyer. The architectures that the LLM revolution made cheap (RAG, tool-calling, multi-agent) do not handle uncertainty natively at all. They borrow credibility from the layer below and erase the work the field already did.

What the LLM is doing

This is just my observation, and it could be wrong. In fact, I hope it is. Based on the conversations I have had with EO operators, the AI literature I have read, the way the agentic-AI category has been pitched in 2025-2026, and the texture of working with these systems daily (including in the research and drafting of this piece), the part that seems missing is mechanistic. People know LLMs predict the next token. They have not always worked through what that means for uncertainty propagation, for the difference between training and post-training cost functions, or for the gap between what an LLM agent does by default and what it could be engineered to do. If those distinctions are common in rooms I am not in, this section is redundant. Yet I do not think they are.

The LLM passes through three stages stacked over time. Only the first two have a cost function.

Pretraining maximises the likelihood of the next token given prior tokens, over a corpus of human-written text. Every time the model guesses the wrong next token, the weights get nudged toward making the right one more likely. The bulk of what is in the model’s weights comes from this stage: language, facts, patterns of reasoning, code syntax, and how arguments are structured.

Post-training changes the cost function. The model generates responses, raters (or a reward model trained on human preferences) score them, and the model is updated to increase the probability of high-scoring responses. The optimisation target shifts from “predict the next token in the corpus” to “produce outputs that score well on this reward signal.” Anthropic uses constitutional methods (RLAIF, AI feedback against a written constitution). OpenAI uses RLHF. Both approaches do the same structural thing. This is where “be helpful,” “don’t make things up,” “refuse harmful requests” get shaped in. It is also where “be confident-sounding” gets reinforced, because raters reward confident answers more than hedged ones, and the model learns this.

Inference has no cost function. The weights are frozen. When the model is wrapped in an agent loop, no gradient signal flows back from “did the action produce a good outcome in the buyer’s environment” into the parameters. The agent loop is sampling from a trained distribution.

None of these three trains the model on whether the buyer’s decision worked out. For an EO use case, the LLM has been optimised to produce text that scores well with raters, sampled at inference time without further feedback from the environment.

A second mechanism compounds this. Default behaviour, what the model produces with no system prompt, no examples, no engineered constraints, is the centre of mass of its training distribution. For a generic helpful AI assistant, that centre of mass is balanced both-sides framing, hedging, polite preambles, tidy bullet lists, optimistic tone, and confident-sounding prose. Engineering behaviour means shifting the distribution by changing the input. System prompts, few-shot examples, structured context, conditional skills. What you get out is what you condition on (it works the same for people).

For most commercial agentic GeoAI products, the engineering done is enough to make the demo work. It is not enough to shift the distribution toward calibrated uncertainty propagation, because no one is conditioning on examples of properly propagated uncertainty in EO outputs. Those examples are rare in the training corpus and have not surfaced in the post-training reward signal. The default sits where the training data sits, which is confident-prose-with-numbers.

The model also resolves ambiguity by default. When given partial information, the most likely completion is one that names the pattern, proposes the framework, produces closure. This is structurally what a next-token predictor does. High-probability continuations are coherent ones, and coherence requires resolution. Holding ambiguity is statistically uncomfortable for the LLM. For an EO buyer whose decision depends on holding ambiguity until evidence resolves it, the LLM is pulling the wrong way. The agent will collapse the question before the buyer is ready to.

These three things, the cost-function stack, the default-as-distribution-centre, and the resolve-ambiguity-by-default tendency, explain why the LLM behaves the way it does.

What the data infrastructure would have to provide

If the LLM does not propagate uncertainty, and the buyer’s decision depends on calibrated uncertainty, the engineering has to live somewhere. Either in the agent’s input layer (what it conditions on) or in the buyer’s data infrastructure (what the agent’s output integrates into). The four conditions below are what that engineering looks like for EO. They are the specifications a procurement template would require for an agentic GeoAI product to plausibly carry uncertainty through to a decision.

Provenance access on outputs. Every output is traceable to its source scenes, processing versions, and model checkpoints. The buyer can ask, “Where did this number come from?” and reconstruct the answer. STAC catalogues solve part of this at the data layer. Microsoft’s Planetary Computer Pro has STAC, COG, Zarr, GeoCatalog, Entra ID. The data layer is there. The translation layer is missing: when the agent writes a synthesis, it does not surface the STAC item IDs that informed it. The buyer cannot trace back from output to source. The data layer carries provenance, and the agent layer drops it.
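As a minimal sketch of what the missing translation layer could look like, the record below attaches source identifiers to every synthesis and refuses to release one without them. The class names, field names, and the item IDs are hypothetical. This is not the STAC specification or any vendor's schema, only an illustration of keeping the IDs attached to the output.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record attached to one agent output."""
    stac_item_ids: Tuple[str, ...]   # source scenes, as STAC item IDs
    processing_version: str          # processing baseline used upstream
    model_checkpoint: str            # model that produced the number


@dataclass(frozen=True)
class AgentOutput:
    text: str
    provenance: Provenance


def emit(text: str, provenance: Provenance) -> AgentOutput:
    """Refuse to release a synthesis that cannot be traced to any scene."""
    if not provenance.stac_item_ids:
        raise ValueError("no source scenes recorded for this output")
    return AgentOutput(text, provenance)


def where_from(output: AgentOutput) -> str:
    """Answer "where did this number come from?" from the record itself."""
    p = output.provenance
    scenes = ", ".join(p.stac_item_ids)
    return (f"scenes [{scenes}]; processing {p.processing_version}; "
            f"model {p.model_checkpoint}")


# All identifiers below are made up for illustration.
prov = Provenance(
    stac_item_ids=("S2_hypothetical_item_001", "S2_hypothetical_item_002"),
    processing_version="baseline-hypothetical",
    model_checkpoint="v1.3-hypothetical",
)
out = emit("The deforestation rate is 4.2%.", prov)
trace = where_from(out)
```

The design choice is that provenance travels with the output object rather than living in a separate log, so the buyer's question can be answered from the output alone.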

Uncertainty as first-class output. The agent produces calibrated intervals, distributions, or probability scores, not just point estimates. The model for what this looks like is Sentinel-2’s S2-RUTv1: per-pixel radiometric uncertainty available alongside the radiometric value. For an agent layer, the equivalent would read: “The deforestation rate is 4.2% with a 95% confidence interval of 2.8 to 6.5% based on retrievals from scenes [list], processed using model [version], evaluated against regional baselines [list].” Tomorrow.io’s resilience platform blog describes the move “from blind trust to total confidence.” The structural test inverts that. From total confidence to calibrated trust. The metric is whether expressed confidence matches empirical accuracy. KalshiBench is what this measurement looks like for general LLMs. No equivalent has been built for agentic GeoAI.
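The metric named above, whether expressed confidence matches empirical accuracy, is usually summarised as expected calibration error (ECE). The sketch below is a plain equal-width-bin ECE. The bin count and the sample data are assumptions for illustration, not results from KalshiBench or any GeoAI system.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               outcomes: List[bool],
                               n_bins: int = 10) -> float:
    """Bin predictions by stated confidence, then average the gap between
    mean confidence and observed accuracy, weighted by bin size."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Made-up data: ten claims stated at 90% confidence, six of them right.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)

# Made-up data: ten claims stated at 80% confidence, eight of them right.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
```

Here `overconfident` is about 0.30 and `calibrated` is about 0. Running this for an agent requires ground truth for each claim, which is the expensive part and the reason the benchmark does not yet exist for this category.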

Auditable orchestration. A buyer or auditor can replay an agent’s tool calls, reasoning steps, and intermediate outputs to reconstruct a decision after the fact. NV5 GeoAgent claims “every action is recorded, including the parameters used and the analytical steps performed, creating a transparent and reproducible record.” That bar is the right bar. What current agent frameworks log is API calls and final outputs, not the LLM’s intermediate reasoning, and rerunning the same chain produces different outputs unless the model is seeded and tools are deterministic. Reproducibility in agentic GeoAI is closer to “we have the receipts of what we asked the system” than “we can reconstruct what the system did.” Real auditable orchestration would require structured logs of every tool call, every retrieved object ID, every intermediate model output, every choice the agent made between alternatives, and the prompt and context window at each step.
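A sketch of the structured log described above, under stated assumptions: every step (prompt, tool call, tool result, choice between alternatives) is appended as a JSON-serialisable entry, and each entry's hash covers the previous one so later edits are detectable. The `AuditLog` class, the step types, and the payload fields are hypothetical. The hashing uses Python's standard `hashlib` and `json` modules.

```python
import hashlib
import json


class AuditLog:
    """Append-only, hash-chained record of agent steps (a sketch, not a product)."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        # sort_keys makes the serialisation, and so the hash, deterministic
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, step_type: str, payload: dict) -> str:
        """Append one step and return its hash."""
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"index": len(self.entries), "type": step_type,
                "payload": payload, "prev_hash": prev_hash}
        digest = self._digest(body)
        self.entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash and check that the chain is unbroken."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.record("prompt", {"text": "Estimate forest loss for the hypothetical AOI"})
log.record("tool_call", {"tool": "change_detection", "params": {"threshold": 0.5}})
log.record("tool_result", {"value": 4.2, "ci": [2.8, 6.5]})
intact = log.verify()
```

A log like this gives receipts that are tamper-evident. It does not by itself make the run reproducible, which still depends on seeding and deterministic tools.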

Out-of-distribution evaluation. The vendor publishes evaluations on conditions that the EO+ML chain has not been trained on. Marsocci and Scardapane’s 2024 evaluation across eleven areas of interest is a partial template. The agent-layer equivalent would extend it: performance on cloud-affected scenes from regions underrepresented in training, on sensor types the model was not pretrained on, and on atmospheric conditions outside the training distribution. GeoBenchX is one current benchmark, but it does not measure calibration. The published research is six months old and not yet a standard procurement requirement.
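One simple form such a published evaluation could take is interval coverage reported per stratum: for each region, sensor, or cloud condition, how often did the ground truth fall inside the stated interval? The sketch below uses made-up records and hypothetical stratum labels. It does not reproduce any published benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float, float]   # (stratum, ci_low, ci_high, ground_truth)


def coverage_by_stratum(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of ground-truth values inside the stated interval, per stratum."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for stratum, low, high, truth in records:
        counts[stratum] += 1
        if low <= truth <= high:
            hits[stratum] += 1
    return {s: hits[s] / counts[s] for s in counts}


# Made-up evaluation records for illustration only.
records = [
    ("in_distribution", 2.0, 4.0, 3.1),
    ("in_distribution", 1.5, 3.5, 2.0),
    ("in_distribution", 3.0, 5.0, 4.9),
    ("in_distribution", 0.5, 2.5, 1.0),
    ("cloudy_underrepresented", 2.0, 4.0, 5.2),
    ("cloudy_underrepresented", 1.5, 3.5, 3.0),
    ("cloudy_underrepresented", 3.0, 5.0, 6.1),
    ("cloudy_underrepresented", 0.5, 2.5, 2.4),
]

coverage = coverage_by_stratum(records)
```

If the intervals are nominally 95% and a stratum covers half its ground truth, as the made-up cloudy stratum does here, the stated uncertainty does not hold out of distribution. That is the number a buyer would want published.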

Figure: two columns of data infrastructure conditions. Supply side: provenance access, uncertainty as output, auditable orchestration, OOD evaluation. Demand side: identity resolution, decision instrumentation, audit trails, engineering capacity. Caption: What you supply. What they demand. Both sides have to close.

The four conditions above describe what your data infrastructure has to provide as the founder building the agent. That is the supply side. The other side is your buyer’s data infrastructure. Identity resolution that connects your outputs to their decision objects: loan book entries, insured asset IDs, registered project boundaries, customer locations. Decision instrumentation that lets their pricing engines, capital models, and registry workflows ingest probabilistic outputs rather than rejecting anything that is not a point estimate. Audit trails that survive on their side, not just yours. Engineering capacity to maintain all of this in-house rather than outsource it to the vendor whose substrate the buyer is trying to evaluate. Across institutional finance under climate disclosure, parametric insurance, and carbon MRV, none of these is standard on the demand side either. The supply-side gap and the demand-side gap are mirror images of each other. Both have to close for an agentic GeoAI product to deliver a decision the buyer can use.
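Decision instrumentation that ingests an interval rather than a point estimate can be sketched in a few lines. The example is a hypothetical parametric trigger with three outcomes instead of two. The thresholds are invented, and the interval reuses the 2.8 to 6.5% example from earlier in this piece.

```python
from enum import Enum


class Decision(Enum):
    TRIGGER = "trigger"
    NO_TRIGGER = "no_trigger"
    UNDETERMINED = "undetermined"   # the interval straddles the threshold


def parametric_trigger(ci_low: float, ci_high: float, threshold: float) -> Decision:
    """Decide on the whole interval, not on the point estimate alone."""
    if ci_low > ci_high:
        raise ValueError("interval bounds are reversed")
    if ci_low >= threshold:
        return Decision.TRIGGER        # even the lower bound clears the threshold
    if ci_high < threshold:
        return Decision.NO_TRIGGER     # even the upper bound falls short
    return Decision.UNDETERMINED       # hold the question until more evidence arrives


# Interval from the running example; thresholds are hypothetical.
straddles = parametric_trigger(2.8, 6.5, threshold=5.0)
clear_yes = parametric_trigger(2.8, 6.5, threshold=2.0)
clear_no = parametric_trigger(2.8, 6.5, threshold=7.0)
```

A pricing engine that only accepts the point estimate of 4.2% would read the first case as a clean "no trigger" against a 5.0% threshold. The interval says the question is still open.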

These conditions are the structural specifications EO has been building toward for a decade in deterministic ML, and the buyer infrastructure those outputs have to land in. The change required to ship them in agentic products is engineering. The research is done. It has not happened because the buyers who need it have not yet demanded it in procurement, and the vendors who could ship it have not yet had to.

Where each architecture fits and where it does not

If you are building agentic GeoAI, the structural question is whether your architecture fits the problem you are selling into. Some problems the architectures handle. Some they do not. The mirror question is whether your marketing claim matches your architecture.

Deterministic ML on imagery fits when the task is narrow and well-defined (change detection, classification, segmentation), the input sensor is (somewhat) calibrated (Sentinel-2, Landsat, MODIS), the training distribution matches deployment, and the buyer needs a calibrated point estimate or probability. Most of what the EO sector ships at production scale is here. This architecture carries the work the field has done. If your problem fits, the engineering is mature and the path to a calibrated decision is mostly already built.

It does not fit when the task is multi-source synthesis across heterogeneous evidence, when the deployment region is out of distribution, when the buyer needs natural-language reasoning over the outputs, or when edge cases dominate the use case. Bolting an LLM on top to handle the synthesis changes the architecture.

RAG over an image archive fits when the corpus is closed and bounded, the query is a search pattern (find scenes matching X), source citation makes sense to the buyer, and the failure mode of “no good match” can be handled gracefully. Closed regulatory corpora work here. EuclidHL’s work on planning documents is the cleanest example: structured zoning text, ArcGIS layers, and an LLM that cites both. It works because of the corpus and the citations. Image-RAG over open archives fails the same test. The retriever finds images that look similar to a computer, not images that answer the question. When retrieval is wrong, nothing downstream catches it.

It does not fit when the corpus is open, when synthesis across non-similar sources is required, when calibrated uncertainty is needed in the output, or when the domain is shifting underneath the embeddings.

A language model with tool calls fits when the work is around well-validated deterministic ML rather than inside it. Document processing, briefing generation from structured inputs, query routing, analyst-augmenting workflows where the decision still happens elsewhere. If the LLM is the orchestrator and the deterministic ML is doing the decision-relevant work, the architecture carries. The LLM coordinates the workflow; the buyer’s decision rests on the deterministic outputs.

It does not fit when the buyer’s decision rests on the LLM’s synthesis itself. Pricing, capital allocation, parametric trigger design, regulatory disclosure of method change, registry methodology approval. These need calibrated uncertainty propagated to the output. The LLM does not propagate it. Wrapping the deterministic outputs in a confident prose synthesis erases the work the deterministic outputs did to carry uncertainty in the first place.

A chain of language model agents fits when each step is well-defined, each agent specialises, and intermediate checkpoints catch failure before it compounds. Most current commercial deployments at this tier are demo-grade. The published research on uncertainty propagation in chains (UProp, SAUP) shows the methods that would change this; the products that ship them are still announcements.

It does not fit when each step’s uncertainty compounds without checkpoints, when the engineering budget is small relative to the problem, or when the buyer’s decision depends on reproducibility across runs. The agent loop is not deterministic. Reruns produce different outputs unless seeded. For a regulator asking “what was the decision basis at time T,” that is a problem.
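The seeding point can be shown with a toy sampler. This is not an LLM. It draws labels from a fixed, made-up distribution using Python's standard `random` module, to illustrate why two runs of a sampling loop agree only when the random state is pinned. Whether a given hosted model honours a seed in the same way is an assumption to verify with the provider.

```python
import random
from typing import Dict, List, Optional


def sample_run(token_probs: Dict[str, float], steps: int,
               seed: Optional[int] = None) -> List[str]:
    """Draw `steps` tokens from a fixed distribution.

    With seed=None the generator is seeded from system entropy, so two runs
    will usually differ. With a fixed seed the sequence is repeatable.
    """
    rng = random.Random(seed)
    tokens = list(token_probs)
    weights = list(token_probs.values())
    return rng.choices(tokens, weights=weights, k=steps)


# Toy next-token distribution; the labels and weights are made up.
probs = {"increase": 0.5, "decrease": 0.3, "no change": 0.2}

run_a = sample_run(probs, steps=20, seed=7)
run_b = sample_run(probs, steps=20, seed=7)   # identical to run_a
unseeded = sample_run(probs, steps=20)        # may differ from run to run
```

For a regulator's question about the decision basis at time T, a pinned seed would need to be stored alongside the model version, prompts, and tool outputs for a replay to mean anything.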

Your marketing claim describes the pitch. Your architecture describes what the pitch can deliver. If the pitch outpaces the architecture, the gap is where the credibility-borrowing pattern this piece describes takes hold. If the pitch matches the architecture, the engineering is mature, the buyer’s data infrastructure is at least partially in place, and the four conditions on each side are something you can meet. The category of agentic GeoAI is not one thing. Where you sit inside it determines whether you are doing decision-grade work or describing it.

What we know in May 2026, and what would change it

This piece is dated. The agentic GeoAI category converged in late 2025. Operational behavioural change in regulated buyers takes longer than that to surface in mandatory disclosures. Fiscal-year-2026 climate reports, insurance regulators’ annual filings, and carbon registry decisions all publish in early 2027. The earliest window in which “agentic GeoAI changed this decision” could appear in primary sources is roughly twelve months from now.

As of May 2026, no commercial agentic GeoAI product publicly documents a buyer decision driven specifically by the LLM layer. The decision changes that have surfaced in EO trace back to deterministic ML, not to the LLM orchestration above it. The gap between the pitch and the decision is the four conditions.

By early 2027, the buyers in your sector will be applying those four conditions at procurement. The founders who have closed the gap will be visible. So will the founders who have not.

In May 2027, I will check the disclosures and publish what I find.

Sources

Earth observation calibration and uncertainty

LLM calibration and agentic uncertainty