Treasury AI: What to Build, Why It Matters, and Why Now¶
Author: Emma Sjöström | Date: 2026-03-03
Executive Summary¶
AI spend is shifting from software budgets into labour budgets: a 50x expansion in addressable value. But in treasury, the opportunity isn't primarily about labour. It's about capital. Treasury teams influence decisions worth tens or hundreds of millions. Improving those decisions by even a small margin generates impact that dwarfs the cost of the function itself.
Getting there requires more than putting AI on top of a database. It requires three things: correctness (is the answer right?), structural safety (is it allowed?), and contextual alignment (is it right for this company?). Building these into a system that treasurers would actually trust is hard, and it's where the durable competitive advantage lies.
We're still in the early, foundational phase of AI adoption. What we build now becomes the substrate for everything that follows. The execution approach here should be workflow-led: start with the highest-value treasury workflows, run them against what we have, and let the failures define what to build next. Every gap we close, every correction we capture, every piece of institutional knowledge we structure compounds over time, and becomes progressively harder for a competitor to replicate.
The near-term focus: start with the highest-value workflows, build for the gaps they expose, and accumulate customer-specific intelligence as fast as possible.
1. The Shift¶
There's a growing consensus across the tech landscape that AI's economic impact will dwarf previous software transitions, not because it replaces software, but because it replaces (or augments) the work itself.
The math is simple and repeated everywhere. Sequoia Capital, one of the most influential venture firms in technology, calls it "Service-as-a-Software" in their 2025 AI Ascent keynote: AI products evolving from tools to copilots to autopilots, shifting from software budgets into labour budgets. The starting market, they argue, is "at least an order of magnitude bigger" than the global software market was at the start of the cloud transition. Andreessen Horowitz (a16z), the venture capital firm, describes "AI Inside" vertical SaaS as turning labour into software, expanding previously "small" markets by 2–10x via automating labour-heavy processes that software alone couldn't touch.
Jensen Huang, CEO of NVIDIA, put it plainly: "this is the first time in technology where companies are displacing labour budget, not creating a tool." NFX, an early-stage venture firm known for their work on network effects, frames the same shift in concrete numbers: the best SaaS charges ~$1,000/person/year while that person earns $100,000/year. SaaS competes for ~$1 trillion in global software spend. Global labour represents $50–60 trillion. That's a 50x expansion in addressable value.
But treasury is not just a labour story.
In treasury, teams are typically small. An enterprise TMS might cost $100K–1M per year. The treasury team operating it might cost $300K–2M. Those numbers matter, but they're not the leverage point.
The leverage is in capital.
1.1. The broader AI thesis is a labour story. Treasury is a capital story.¶
Treasury teams influence decisions that move tens or hundreds of millions: how much cash to hold versus invest, how aggressively to hedge FX exposure, how to structure intercompany funding, how to respond to liquidity shocks, which scenarios to plan against.
Improving those decisions by even a small margin can generate impact that dwarfs the entire cost of the function.
That's the real shift.
It begins with automating the mechanical parts of treasury work: reconciliation, data gathering, report building, and routine analysis. But automation alone isn't the opportunity. The real opportunity is in improving the quality and speed of capital allocation decisions. Grounding them in complete, accurate data. Surfacing exposures and opportunities that would otherwise go unseen. Shaping them to how the organisation actually makes decisions and operates in practice.
When treasurers make better decisions, faster, the function turns into a durable source of competitive advantage.
This is where the labour framing breaks down. It is tempting to see this as a labour cost story. It isn't. It is a capital allocation story.
Free a treasurer from mechanical work and give them the right context at the right time. The upside is not cost efficiency. It is higher returns on capital.
2. What It Takes to Get It Right¶
So the question becomes: what does it actually take to improve capital allocation decisions in treasury?
The instinctive answer (and the one most of the industry is converging on right now) is to put AI on top of existing data and hope that it figures out the rest. To be specific about what that means in practice: an LLM that can generate SQL queries against your database schema, or query a semantic layer of pre-defined metrics and business logic, and return results in natural language. This is what most "AI-powered analytics" features amount to today. It's genuinely useful: it removes a bottleneck between the person with the question and the data. Building a good semantic layer (structured metric definitions, relationship mappings, business rules) is real work that meaningfully improves accuracy. And it's often the right first thing to build, because it teaches you what users actually want to ask and where the system falls short.
But as a long-term product thesis, it's a feature, not a company. The pattern is commoditising fast; the models are available to everyone, and getting a basic version working is straightforward. Incumbents have architectural debt to work through, but they also have something startups don't: years of customer data already in their systems. Everyone will have some version of "AI-on-data" very soon.
The harder question is whether it'll actually work. Kyriba, one of the largest TMS providers, launched TAI, their AI "agent" for treasury. One of our customers tried it and lost trust after a few attempts. He asked it "how many payments do we have from X customer?" It replied with 10. He ran a report (in Kyriba's own system) and found 12. When he re-asked, TAI corrected itself: "Oh yeah, you're right... there are 12." He called it a dumb analyst.
The data was right there, in the same system. The issue wasn't data access, and probably not even model capability; it was the absence of structured logic between the data and the question: what counts as a "payment from X customer," how to handle edge cases, which transactions to include or exclude. That's not something a model upgrade fixes. It requires building an entirely different layer, and that's where a big opportunity lies. Incumbents have the data and the distribution, but they're working against decades of architectural debt. Startups building this layer from scratch, with AI-native architectures, have a real head start.
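To make the missing layer concrete, here is a minimal sketch of the kind of structured definition that was absent in that exchange: a metric definition that pins down what counts as a "payment from customer X" before any query is generated. All field names, filters, and rules below are hypothetical, chosen only to illustrate the shape of the layer, not how any particular vendor implements it.

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str
    base_table: str                                   # where the raw rows live
    filters: dict                                     # business rules: what to include
    exclusions: list = field(default_factory=list)    # edge cases to leave out
    notes: str = ""                                   # human-readable rationale, kept for auditability

# Hypothetical definition: what "payments from customer X" means in this company.
payments_from_customer = MetricDefinition(
    name="payment_count_by_customer",
    base_table="payments",
    filters={"status": ["settled", "pending_settlement"], "direction": "inbound"},
    exclusions=["reversal", "intercompany", "test"],
    notes="Counts settled and in-flight inbound payments; reversals are excluded "
          "so the number matches the standard AR report.",
)

def count_payments(rows: list[dict], metric: MetricDefinition, customer: str) -> int:
    """Apply the metric's inclusion/exclusion rules deterministically."""
    return sum(
        1 for r in rows
        if r["customer"] == customer
        and r["status"] in metric.filters["status"]
        and r["direction"] == metric.filters["direction"]
        and r["type"] not in metric.exclusions
    )
```

Whether a definition like this lives in a semantic layer, a well-modelled schema, or somewhere else is an open question; the point is that "how many payments from X?" resolves to an explicit, reviewable rule rather than whatever the model happens to infer.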
But the window won't stay open forever.
2.1. From wrong answers to trustworthy ones¶
So what's actually missing? What's the difference between an AI that confidently returns the wrong number and one a treasurer would trust enough to act on?
I think it comes down to three things:
1. Correctness: Is the answer right?¶
In treasury, the output of the system must be correct. Not plausible. Not statistically likely, unless it is explicitly a forecast.
If a treasurer asks, "Why was last week's forecast off?", the explanation cannot be a model taking a guess. It must resolve to the actual driver: a settlement that landed early, payroll timing that shifted, a one-off adjustment applied in a subsidiary.
If someone asks, "How much can we safely invest?", the number returned must reflect real constraints: liquidity policies, restricted accounts, near-term obligations, and current confidence in the cash forecast.
Large language models generate probabilistic outputs. Treasury decisions require structured, rule-bound reasoning. The answer must be numerically accurate, traceable to the actual driver, and defensible under scrutiny.
In high-stakes finance, answers shape capital allocation and risk posture. The tolerance for "probably right" is effectively zero.
2. Structural Safety: Is it allowed?¶
The treasury domain should not be thought of as just a dataset. It is a governed financial system.
Liquidity policies, facility limits, entity structures, and settlement rules are not soft guidelines. They are enforceable boundaries on how capital can move.
A system operating in this domain must encode those boundaries explicitly. Constraints cannot be left for a model to discover or approximate. They must be defined, versioned, auditable, and enforced.
Governance is not an add-on. It is part of the intelligence. Who approved this liquidity buffer? When was it changed? Under which policy version was this recommendation generated? Without traceability and control, even a highly capable model cannot produce answers that a treasurer, CFO, or auditor would confidently rely on.
At the same time, treasury is a tightly coupled system. Entity structures determine how cash moves. Policies affect investment decisions. Settlement timing influences forecast variance. Intercompany relationships create dependencies across entities. A change in one area propagates through the system.
Structural safety means modelling both the constraints and their interactions.
Capability without structure produces confident wrong answers. Structural safety comes from enforcing boundaries and modelling the full system in which they operate.
3. Contextual Alignment: Is it right for this company?¶
No two companies operate treasury the same way. Why was a buffer increased last quarter? Why is this counterparty treated differently? Why does the CFO prefer to pre-fund ahead of acquisitions? Why does this entity consistently deviate from forecast in Q4?
That knowledge lives in people's heads, inboxes, and spreadsheets.
No model gets this right by default. Treasury is nuanced and customer-specific. What matters is how well the system closes that gap.
Every corrected categorisation, every forecast adjustment, every overridden recommendation is structured feedback about where the system's reasoning diverged from this company's financial reality. Logged and incorporated over time, that learning loop compounds.
The system becomes more aligned with how this company actually manages liquidity, not because the underlying model becomes more intelligent, but because the accumulated context becomes richer.
Over time, that contextual alignment makes the system not just more accurate, but harder to replace. Not because switching is painful, but because what you would lose is irreplaceable.
This framing aligns with a piece from B Capital, a global venture firm, arguing that the model isn't the moat; the value is in the execution environment around it: workflow integration, production learning loops, organizational context. The model is a commodity. The accumulated context is not.
3. The Phases of Technology Adoption, and Where AI Actually Is¶
Every major technology goes through recognisable phases. Carlota Perez, the economist whose work on technological revolutions has become a standard lens for understanding these cycles, describes two great periods: an "installation period" where speculative capital floods in and new infrastructure gets built, followed by a "deployment period" where institutions catch up, regulation takes shape, and the technology becomes genuinely productive. We saw this with electricity. We saw it with the internet. We saw it with cloud computing.
Several prominent investors and firms are now mapping this pattern onto AI, each with slightly different language but converging on a similar structure.
Sequoia describes AI products evolving through three stages: tools, then copilots, then autopilots. The key distinction in their framing is that this progression represents a shift from software budgets into labour budgets: AI stops being a tool you pay for and starts being a worker you pay for.
a16z describes three waves of vertical SaaS: first cloud, then cloud plus fintech, then cloud plus fintech plus AI. Each wave expanded what software could capture: from workflow, to transactions, to labour itself. Their argument: AI turns what used to be pure service businesses into scalable software plays, and the addressable market expands dramatically because you're no longer limited to what software alone could automate.
NFX offers a framing that I find useful for what follows. They describe Wave 1 as Translation: digitising analog processes. Amazon was an online bookstore. TurboTax was a digital tax form. The work didn't change; the medium did. Wave 2 was Creation: building things that weren't possible in the analog world at all. Instagram didn't digitise photo albums; it created the creator economy. Shopify didn't digitise retail; it converted non-merchants into merchants, growing from $61B in facilitated sales in 2019 to $292B in 2024. These weren't better versions of what existed. They were entirely new categories. Wave 3 is the open question. NFX argues the category-defining products of this era "will look obvious in hindsight" but haven't been built yet.
Wave 1 was: analog process → digital process. Same work, new medium.
So what is "AI agent does what the knowledge worker does"?
It's: human labour → AI labour. Same work, different executor.
That's... translation. It's Wave 1 logic applied to people instead of paper. You're digitising the worker, not creating something new.
You can argue against this: that the speed, cost, and scale differences are so extreme they become qualitatively different. Email wasn't just "faster mail"; it enabled communication patterns that physically couldn't exist before. Similarly, an AI that can process 10,000 scenarios in the time a treasurer processes 3 might enable a fundamentally different kind of decision-making. That's fair.
But it's worth sitting with the possibility that replacing human labour with AI labour, even brilliantly, is still the translation phase. And if it is, two things follow.
First, translation phases are still enormously valuable. The digitisation of analog processes created trillions in value. If we're in the "digitisation of labour" phase, the value captured will be larger still (the Sequoia/NFX/a16z math). Being in a translation phase doesn't mean it isn't worth building in. It makes it foundational. What gets built in translation becomes the substrate for everything that follows.
Second, the truly disruptive products (the ones that will look obvious in hindsight) probably aren't the autonomous agents themselves. They are whatever becomes possible once that substrate is stable, trusted, and widely adopted, enabling humans and agents to truly work together. What that looks like is the harder question.
4. What Translation Actually Requires¶
Most of the conversation about AI replacing labour skips over what that actually requires. Even if we accept that replacing labour with AI is "just" translation, the actual execution of that translation in high-stakes, deterministic domains is extraordinarily difficult. It's not "point an LLM at a database and go."
I see three layers of capability that need to come together. I've taken some liberties defining them below. In practice, the boundaries between these layers are not clean; they overlap and blur, and an engineer implementing them will probably not experience them as neatly stacked. But as a way of framing the problem space and starting to identify where we build, where we buy, and where the boundaries of our product responsibility lie, I think the distinction is useful. This framing will likely evolve as we learn more and I welcome feedback.
One practical dimension worth flagging: model provider SDKs (like Claude SDK from Anthropic) sit primarily at the orchestration layer: they give you the agent loop, tool calling, and message management. Using them creates some coupling to that provider's API and underlying models, which is a trade-off worth being deliberate about. But the real value (the context engineering decisions and especially the knowledge infrastructure) should in theory be model- and vendor agnostic. That's ours to build regardless of which model and/or orchestration layer/SDK sits on top. In practice, though, the lines may not be that clean. Different models have different context window sizes, tool calling conventions, and reasoning behaviours, and design decisions made for one model may not transfer cleanly to another. Understanding where the real vendor lock-in risks are, and making deliberate choices about which couplings we accept and which we abstract away, is something we need to work through as we build.
Layer 1: Agent Orchestration (The "Harness")¶
"Orchestration governs the process"
The plumbing. Some call it the "agent harness," the infrastructure that wraps around a model to manage its lifecycle, context, and interactions with external systems. An AI agent that does real work doesn't make a single inference. It coordinates across multiple systems, executes multi-step workflows, handles errors, and maintains state across interactions. It needs to pull balances from APIs (or SQL queries), cross-reference against policy rules, check covenant thresholds, consider forecast outputs, and synthesize a recommendation, all while maintaining an audit trail of what it did and why.
This sounds straightforward until you try to make it reliable. The engineering challenges are surprisingly gnarly (a minimal sketch of the checkpointing pattern follows this list):
- Context window exhaustion. Even with expanding context windows, long-running agents run out of working memory fast. Hightouch (a data activation platform that built one of the more documented agent harnesses) solved this by building file buffering systems (agents write large query results to temporary files, keeping only pointers in context) and spawning dynamic subagents that handle complex subtasks in isolated threads and return only concise summaries. Like showing cleaned final work instead of scratch paper.
- Model reasoning degradation. LLMs are trained on short chat sessions. On sustained, multi-step tasks they tend to "give up" at local optima rather than push through to better answers. The fix: explicit planning-then-execution loops where the agent revises its own strategy based on what it discovers mid-task, rather than committing to a single plan upfront.
- State persistence across sessions. Agents lose all work context when sessions end. Every new session starts from zero unless you build external memory: progress files, structured artefacts, git-like checkpointing. The agent's memory has to live outside the agent.
- Error handling and recovery. When a tool call fails, or a banking API returns unexpected data, or a payment batch is cancelled mid-processing, the agent needs graceful degradation, not a crash. Incremental checkpointing after each completed unit prevents catastrophic context loss and enables recovery from mid-task failures.
- Entropy management. Over time, AI-generated outputs drift. Documentation becomes inconsistent, constraints get violated, data structures decay at boundaries. You need periodic monitoring (deterministic linters, structural tests, cleanup agents) to keep entropy in check.
- Knowing when to stop. Perhaps the hardest: designing the right human intervention points. Where does the agent decide autonomously and where does it pause for human judgment? Get this wrong in either direction and you have either an unreliable system or an expensive autocomplete.
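As a minimal sketch of the state-persistence and checkpointing patterns from the list above: the agent's progress lives outside the agent and is saved after every completed unit, so a mid-task failure loses one step, not the whole workflow. The step names, checkpoint file, and return values are hypothetical placeholders, not a prescribed design.

```python
import json
from pathlib import Path

# Hypothetical workflow steps; in practice these call banking APIs, policy checks,
# forecast services, etc., and return whatever the next step needs.
def pull_balances(state): return {"balances_pulled": True}
def check_policies(state): return {"policies_ok": True}
def draft_recommendation(state): return {"recommendation": "hold EUR 5M buffer"}

STEPS = [
    ("pull_balances", pull_balances),
    ("check_policies", check_policies),
    ("draft_recommendation", draft_recommendation),
]

CHECKPOINT = Path("workflow_state.json")   # the agent's memory lives outside the agent

def load_state() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"completed": []}

def save_state(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_workflow():
    state = load_state()                    # resume from wherever the last session ended
    for name, step in STEPS:
        if name in state["completed"]:
            continue                        # already done in a previous session
        try:
            state.update(step(state))       # execute the step
            state["completed"].append(name)
            save_state(state)               # checkpoint after each completed unit
        except Exception as exc:
            state.setdefault("errors", []).append({"step": name, "error": str(exc)})
            save_state(state)               # degrade gracefully, keep partial progress
            raise

if __name__ == "__main__":
    run_workflow()
```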
This is the layer getting the most attention right now: LangChain, CrewAI, AutoGen, Claude SDK, and dozens of others building agent frameworks. There's an argument that the harness itself is the product, the moat. I'm sceptical. Orchestration frameworks are generalising fast. The hard problems above are real, but they're engineering problems with known patterns. They don't stay hard forever. The whole point of the harness is to create a context that enables the LLM to make good decisions. The harness is the delivery mechanism. The context is the value. Which brings us to the next layer.
Layer 2: Context Engineering¶
"Context engineering optimizes the inputs"
The emerging discipline. If 2024 was about prompt engineering and 2025 about agentic AI, 2026 is shaping up to be the year of context engineering.
Andrej Karpathy, the former Tesla AI director and OpenAI researcher, describes it as "the delicate art and science of filling the context window with just the right information for the next step." Tobi Lutke, CEO of Shopify, puts it more practically: "the art of providing all the context for the task to be plausibly solvable by the LLM." Both explicitly prefer "context engineering" over "prompt engineering," because the bottleneck isn't how you phrase the question, it's what information the model has access to when it tries to answer.
Context engineering means structuring everything an agent needs (memory, tools, data, rules, constraints) to make intelligent, autonomous decisions reliably. Not writing better prompts, but designing the full environment in which an agent operates.
The practical techniques are becoming well-documented. Manus, an AI agent platform that has rebuilt their agent framework four times, describes six core techniques: optimising KV-cache hit rates (the difference between cached and uncached tokens is a 10x cost difference), masking rather than removing tools mid-conversation (to avoid breaking cache and confusing models), externalising memory through file systems (treating the file system as unlimited persistent context), manipulating attention by having agents rewrite their own todo lists (pushing goals into recent attention), preserving failures in context rather than hiding them (so the model can learn from what went wrong), and introducing controlled variation to prevent pattern entrenchment in repetitive tasks.
Anthropic, the company behind Claude, published their own engineering guide emphasising: just-in-time context retrieval (don't pre-load everything, let agents discover context progressively), compaction (summarise approaching context limits and restart with compressed summaries), sub-agent architectures (specialised agents with clean context windows returning condensed summaries), and a core design principle for tools: "if a human engineer can't definitively choose which tool to use in a given situation, an AI agent can't either."
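A minimal sketch of the compaction idea, assuming a crude word-count stand-in for token counting and a placeholder summarise() where a model call would produce a dense, factual summary of the older turns:

```python
MAX_CONTEXT_TOKENS = 8_000     # illustrative budget, not a real model limit
KEEP_RECENT_TURNS = 6          # recent turns stay verbatim

def estimate_tokens(messages: list[dict]) -> int:
    # Crude proxy: a real implementation uses the model's tokenizer.
    return sum(len(m["content"].split()) for m in messages)

def summarise(messages: list[dict]) -> str:
    # Placeholder for an LLM call that compresses decisions made, numbers
    # confirmed, and open questions from the older turns.
    return f"[summary of {len(messages)} earlier turns]"

def compact(messages: list[dict]) -> list[dict]:
    if estimate_tokens(messages) < MAX_CONTEXT_TOKENS:
        return messages
    old, recent = messages[:-KEEP_RECENT_TURNS], messages[-KEEP_RECENT_TURNS:]
    compressed = {"role": "system", "content": summarise(old)}
    return [compressed, *recent]   # restart with the compressed summary plus recent turns
```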
Cognizant, the global IT services company, has deployed 1,000 "context engineers" with a dedicated platform called ContextFabric. Their framing: context engineering "will help small language models become domain experts in industries such as healthcare and finance that have low tolerance for mistakes." That last part, "low tolerance for mistakes," is the key qualifier. In probabilistic domains, mediocre context engineering produces mediocre-but-usable outputs. In deterministic domains, mediocre context engineering produces wrong answers that cascade into real consequences.
An important distinction: context engineering as typically defined covers everything the model sees at inference time, which can include database query results, retrieved documents, tool outputs. But it doesn't include the database design, data pipelines, access control logic, or knowledge governance upstream of that. That upstream work (how you capture, structure, permission, and maintain the knowledge that context engineering draws from) is a different problem. Which brings us to the hardest layer.
Layer 3: The Knowledge Engineering Problem¶
"Knowledge engineering compounds the understanding"
If context engineering is about what tokens to put in the window at inference time, then the knowledge engineering problem is what sits upstream of that: how you capture, structure, permission, maintain, and govern the domain knowledge that makes context engineering actually valuable.
This is where most people stop short, and I think it's where the real product opportunity lives. Earlier I described three things that make intelligence trustworthy: correctness, structural safety, and contextual alignment. This layer is the engineering problem underneath all three: the infrastructure that makes them real rather than aspirational.
Several dimensions make this genuinely hard:
ACLs and scoping. What knowledge is company-wide (minimum cash policy), what's team-level (treasury operating procedures), what's personal (this treasurer's preferred liquidity buffer)? A policy rule visible to the CFO might be invisible to a junior analyst. An investment parameter set by the treasury director might not be something the AP team should see or modify. This is permissions infrastructure, not prompt design.
Knowledge lifecycle. Who validates a piece of institutional knowledge? When does it expire? What happens when a policy changes? Does the old version persist for audit, or get overwritten? If the CFO revises the liquidity buffer from 20% to 15%, the system needs to know both the current rule and that the old one existed (for explaining historical decisions). This is versioned, governed state management.
Structured vs unstructured knowledge. Some knowledge is clean and deterministic: covenant threshold = 2.5x, minimum cash balance = EUR 5M. Some is messy and contextual: "the CFO doesn't like us going below 20% even though the formal policy says 15%," or "Customer X always pays late in Q4 because of their fiscal year close." The system needs to handle both, and the messy kind is often the most valuable.
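Before the remaining dimensions, a minimal sketch of what the first three might look like as data rather than prose. The field names, roles, and values are hypothetical; a production version needs real access control, approval workflows, and audit storage. The point is the shape: old versions persist so historical decisions can still be explained, scope and visibility are data rather than prompt text, and the informal note travels alongside the formal rule.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class PolicyVersion:
    value: float
    effective_from: date
    approved_by: str
    superseded_on: Optional[date] = None     # old versions persist for audit

@dataclass
class PolicyRule:
    key: str                                 # e.g. "min_liquidity_buffer_pct"
    scope: str                               # "company" | "team" | "user"
    visible_to: set                          # roles allowed to read this rule
    versions: list = field(default_factory=list)
    informal_note: str = ""                  # the messy, unwritten kind of knowledge

    def current(self, as_of: date) -> Optional[PolicyVersion]:
        live = [v for v in self.versions
                if v.effective_from <= as_of
                and (v.superseded_on is None or v.superseded_on > as_of)]
        return max(live, key=lambda v: v.effective_from) if live else None

buffer = PolicyRule(
    key="min_liquidity_buffer_pct",
    scope="company",
    visible_to={"cfo", "treasury"},
    versions=[
        PolicyVersion(0.20, date(2024, 1, 1), "CFO", superseded_on=date(2025, 7, 1)),
        PolicyVersion(0.15, date(2025, 7, 1), "CFO"),
    ],
    informal_note="CFO still prefers not to go below 20% in practice.",
)
# buffer.current(date(2025, 3, 1)).value -> 0.20 (explains a historical decision)
# buffer.current(date(2026, 3, 1)).value -> 0.15 (current formal policy)
```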
Domain expertise as a product layer. This is where companies like Arculae (a startup building a proprietary knowledge marketplace, currently in closed alpha) are interesting. Their thesis: "as public data depletes and the open web becomes increasingly synthetic, the edge shifts to governed proprietary knowledge." They're applying this horizontally: domain experts publish knowledge, AI agents query it through retrieval-only, policy-gated access. The vertical play is different but directionally the same: you structure the domain knowledge for your customers, because they won't do it themselves. A treasurer isn't going to write a knowledge graph. But they will correct a wrong answer, approve a policy suggestion, or confirm a pattern, if the system makes it easy. Every correction is a data point. Every confirmation is a validation.
The maintenance argument. This might be the most underrated insight. NFX celebrates "custom autonomous software" and "software for one": anyone can build faster than ever. True. But building is cheap. Maintaining is expensive. Industry data suggests 20-30% annual maintenance costs on enterprise AI, with 60% of projects exceeding cost estimates by 30-50%. And that's for generic AI. Maintaining correctly, with governance, accuracy guarantees, and accountability in deterministic domains, is significantly more expensive. That's exactly where another part of the product opportunity lies: not only in building the custom thing, but in being the maintained, governed, compounding knowledge layer that makes every custom thing on top of it trustworthy.
The vision this points toward is something like hyper-personalised SaaS: the same product, radically different behaviour per customer, not through configuration screens but through accumulated institutional knowledge. Goldman Sachs CIO Marco Argenti said in January 2026: "Context is the new frontier." The system adapts per customer, per team, per user, because the knowledge layer underneath is different. But that knowledge layer requires real engineering: data pipelines, access controls, versioning, validation workflows, governance. It's not a prompt. It's infrastructure.
5. Testing the Thesis¶
Everything above rests on a claim: that the bottleneck isn't model capability, it's the structured knowledge underneath. Before building on that assumption, it's worth stress-testing it against the available research. Some of it supports the thesis strongly; some of it complicates it. Both matter.
The most direct test is text-to-SQL: give a model a database schema and a natural language question, and let it write the query. On clean benchmarks with straightforward schemas and simple tasks (Spider 1.0), the latest models now score 91-94%. That benchmark is essentially solved. But on benchmarks that reflect real enterprise complexity (messy schemas, multi-step reasoning, implicit business logic), the same class of models manages 10-25% (Spider 2.0, BIRD-Interact). The models didn't get worse. The tasks got harder in ways that raw model capability doesn't address.
The alternative is a semantic layer approach: structured metric definitions and business logic sit between the model and the raw data, so the model queries business concepts rather than raw tables. Snowflake's Cortex Analyst, which pairs an agentic system with a semantic model, achieves 90%+ on real-world enterprise queries. The pattern isn't new: back in 2023, dbt Labs showed that adding a semantic layer to GPT-4 jumped accuracy from 16.7% to 83%. What's notable is that three years of model improvements haven't made the structured layer unnecessary.
The pattern across both approaches points the same way: accuracy is determined by the quality of what sits between the raw data and the question, not by the model itself. The full benchmark data and analysis is in Appendix C. But the picture isn't as settled as that summary makes it sound.
Two points worth highlighting here. First, every major production deployment (Snowflake's Cortex Analyst, Databricks' AI/BI Genie, Google Cloud's recommended architecture) requires some form of structured business context. Raw text-to-SQL, as one industry analysis put it, is "a demo, not a product." But that's not the full picture. MotherDuck, an analytics database company, achieved 95% accuracy with text-to-SQL alone, no semantic layer, by testing against well-modelled schemas with descriptive column names and simple joins. Their argument: if the underlying data is clean enough, the model doesn't need an additional abstraction layer. The catch is that those conditions (simple joins, shallow query depth) are inherently simple tasks, and it's unclear whether clean modelling alone scales to the multi-step reasoning enterprise users actually need. But the insight matters: the quality of the data model itself is a variable, not just what you layer on top of it.
For a startup like ours, this is actually relevant. We have the opportunity to design our data model from scratch with AI-native querying in mind, rather than working against decades of accumulated schema debt. Whether that means we can lean more on text-to-SQL, or still need a semantic layer, or some combination, is something we need to figure out deliberately rather than assume.
But even with the right approach (whether text-to-SQL, a semantic layer, or both), treasury adds another layer of difficulty, because the knowledge that's missing isn't just generic metric definitions that a data team can write once. It's customer-specific: this company's minimum cash policy, that entity's restricted accounts, the unwritten rule that the CFO wants a 20% liquidity buffer even though the formal policy says 15%. A generic semantic layer (or a well-modelled schema) gets you from 25% to 90%. Customer-specific institutional knowledge is what gets you from 90% to "I'd trust this enough to act on it."
6. From Theory to Practice¶
The sections above make a case for where durable value will accrue in treasury in the age of AI: correctness, structural safety and contextual alignment. Building toward that likely requires orchestration infrastructure, context engineering, and deep knowledge engineering. That's a lot.
Some of the assumptions underneath all of this need testing, not just arguing or guessing. We don't yet know how far a well-designed data model gets us before we need additional layers. We don't know where the boundary falls between "solvable with better engineering" and "requires institutional knowledge the model can't infer." We don't know which parts of the orchestration and context layers we need to build ourselves versus what we can lean on existing frameworks for. The way to find out isn't to design the full architecture upfront, nor to defer these decisions "for later." It's to start with what we have, run real use cases against it, and let the failures tell us what's actually missing.
The only way to answer these questions is to build. Not build the full architecture, but build toward real use cases and learn from what breaks. Building is the research method.
The starting point should be the workflows that matter most to treasurers. Not picked at random, but driven by the jobs treasurers are already telling us they need to do: both the ones that keep operations running and the ones that drive the highest-value decisions. Ideally, we should already be striving to answer questions like "How much can we safely invest this month?", "Why was last week's forecast off?", and "Where are we exposed if this counterparty is late?" These aren't "prompts". They're multi-step financial workflows. Answering the first might require pulling balances across entities, adjusting for restricted accounts, applying minimum liquidity policies, incorporating near-term obligations, checking forecasts, and reflecting informal buffers or CFO preferences. Each step is a test of everything described in the sections above: is the required data accessible? Is it structured for deterministic reasoning? Are policies encoded and versioned? Can we produce a traceable explanation? Do we have the orchestration to coordinate across these steps reliably?
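To make "multi-step financial workflow" concrete, here is a minimal sketch of the first question decomposed into deterministic steps, each contributing to a traceable answer. Every figure, account name, threshold, and rule below is hypothetical; the real workflow would draw them from the data and policy layers described above.

```python
def safely_investable(balances, restricted_accounts, min_buffer_pct,
                      near_term_obligations, forecast_confidence_haircut):
    trace = []                                              # every step leaves an auditable line

    total_cash = sum(balances.values())
    trace.append(f"Total cash across entities: {total_cash:,.0f}")

    restricted = sum(balances[a] for a in restricted_accounts if a in balances)
    trace.append(f"Less restricted accounts: {restricted:,.0f}")

    available = total_cash - restricted
    buffer = available * min_buffer_pct
    trace.append(f"Less minimum liquidity buffer ({min_buffer_pct:.0%}): {buffer:,.0f}")

    obligations = sum(near_term_obligations)
    trace.append(f"Less near-term obligations: {obligations:,.0f}")

    investable = max(0.0, available - buffer - obligations)
    investable *= (1 - forecast_confidence_haircut)         # discount for forecast uncertainty
    trace.append(f"After forecast-confidence haircut ({forecast_confidence_haircut:.0%}): {investable:,.0f}")

    return investable, trace

amount, trace = safely_investable(
    balances={"HoldCo EUR": 12_000_000, "OpCo DE EUR": 4_000_000, "OpCo FR EUR": 3_000_000},
    restricted_accounts=["OpCo FR EUR"],
    min_buffer_pct=0.20,
    near_term_obligations=[2_500_000, 1_200_000],
    forecast_confidence_haircut=0.10,
)
# print("\n".join(trace)); print(f"Safely investable: {amount:,.0f}")
```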
Where the system fails, that failure defines what to build next. Not in an abstract way, but concretely: this step failed because we don't have consolidated cash pool positions, that step failed because there's no policy representation for minimum balances, this one broke because the explanation isn't traceable. Each failure points to a specific gap in a specific layer.
The most valuable input is the treasurer's own reasoning. When we interview treasury teams and ask them to walk through these decisions step by step, what emerges are structured reasoning blueprints: explicit accounts of how experienced treasurers approach the problem. We've already started. Giannis interviewed On's treasury team and produced a structured blueprint for their daily liquidity analysis: an 8-step workflow covering cash position assessment across entities and currencies, cash pool consolidation, overdraft detection with severity classification, a funding priority ladder (same-entity transfers before intercompany loans before FX trades), investment account treatment as backup liquidity, high-balance alerts for repatriation candidates, and capital control flags for restricted jurisdictions. It wasn't generated by an LLM. It was extracted from a treasurer's head. And it encodes exactly the kind of knowledge the sections above describe: deterministic rules (overdraft thresholds, severity criteria), domain structure (cash pool netting logic, the funding priority hierarchy), and institutional context (customer-specific minimum balances, informal preferences, regional considerations). The full blueprint is in Appendix A.
Running this blueprint against our current product immediately tells us what we know and what we don't. Can we pull consolidated cash pool positions? Can we apply the funding priority ladder automatically? Do we have the policy data structured and accessible? Can we produce a traceable explanation for each recommendation? Every "no" is a product requirement. Every "yes" is a foundation to build on. We don't need to guess which layers matter most or which architectural choices to make upfront. The workflows tell us.
The same logic applies at the product interaction level. When someone asks the system a question and it can't answer correctly, it reveals a gap: a missing policy threshold, a liquidity buffer rule that was never formalised, a pattern that exists in the treasurer's head but not in any system. Each gap tells you exactly what to build into the knowledge layer. And when the system hits a gap, could it ask the user for clarification and store that input in a structured way for next time? If so, each interaction doesn't just answer a question; it strengthens the knowledge layer. The interaction becomes a mechanism for accumulating institutional intelligence, not just executing a template.
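A minimal sketch of that capture loop, with hypothetical field names: the gap becomes a structured, attributable record rather than a lost conversation, and the stored answer is available as context the next time the same question pattern appears.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class Clarification:
    question: str       # what the system could not resolve
    answer: str         # what the user told it
    answered_by: str
    applies_to: str     # e.g. "entity:OpCo DE" or "company"
    captured_at: str

KNOWLEDGE_LOG = "clarifications.jsonl"   # stand-in for a governed knowledge store

def capture_clarification(question: str, answer: str, answered_by: str, applies_to: str):
    record = Clarification(question, answer, answered_by, applies_to,
                           datetime.now(timezone.utc).isoformat())
    with open(KNOWLEDGE_LOG, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record

capture_clarification(
    question="Which accounts count as restricted for OpCo DE?",
    answer="The escrow account and the rent deposit account.",
    answered_by="treasury_manager",
    applies_to="entity:OpCo DE",
)
```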
The more workflows we run, the more blueprints we extract, the more interactions we capture, the clearer the picture becomes. What started as open questions (do we need a semantic layer? how much orchestration do we build versus buy? where does institutional knowledge become the binding constraint?) gets answered not by analysis but by evidence. The substrate emerges from the pattern of failures and successes across concrete use cases, not from a grand architecture exercise.
The near-term priorities follow from this: depth over breadth. Start with the highest-value workflows, build for the gaps they expose, and let each iteration compound. Deterministic correctness over flashy autonomy. Structured knowledge capture over UI polish. Success is not "the agent answers." Success is "a treasurer would act on the answer."
And the sooner we start earning that trust, the sooner it begins to compound.
Sources & References¶
- Sequoia AI Ascent 2025 Keynote: "AI's Trillion-Dollar Opportunity"
- a16z: "AI Inside Opens New Markets for Vertical SaaS"
- NFX: "What Comes Next Is Bigger Than SaaS Ever Was"
- BVP: "Systems of Action"
- B Capital: "Where AI Value Will Be Built Next"
- Carlota Perez: "Technological Revolutions and Financial Capital"
- Andrej Karpathy on context engineering
- Tobi Lutke on context engineering
- Manus: "Context Engineering for AI Agents"
- Anthropic: "Effective Context Engineering for AI Agents"
- Cognizant: 1,000 Context Engineers deployment
- Martin Fowler: "Harness Engineering"
- Hightouch agent harness
- dbt Labs: "Semantic Layer as the Data Interface for LLMs"
- Snowflake Cortex Analyst: Text-to-SQL Accuracy
- Spider 2.0 Enterprise Benchmark
- BIRD Benchmark
- Arculae: Proprietary Knowledge Marketplace
- Google Cloud: "The Six Failures of Text-to-SQL"
- MotherDuck: "Your Data Model IS the Semantic Layer"
- Snowflake Cortex Analyst Documentation
- Databricks: "AI/BI Genie is Now Generally Available"
Appendix A: Treasury Liquidity Analysis Blueprint¶
The following is a structured reasoning blueprint extracted from a customer interview. It encodes the step-by-step workflow a treasury team follows for their daily liquidity analysis. This is the type of artefact referenced in Section 6.
Analysis Procedure¶
Step 0: Setup
- Determine the latest available balance date and the latest forecast start date. Never hardcode dates.
- Confirm with the user:
  - Analysis horizon: default 4 weeks, max 13 weeks.
  - Overdraft threshold: default is balance below zero. Some customers use target minimum balances per entity.
  - Cash pool treatment: default is consolidated (net pool position). Some users prefer individual account level.
Step 1: Current Cash Position (Today's Actuals)
Pull today's actual account balances, excluding investment accounts. Produce two views:
- Entity-currency view: Total balance by entity and currency. Count accounts and flag any with negative balances.
- Currency-level view: Total balance by currency across all entities. Count how many entities hold each currency.
What to look for:
- Any entity-currency with total balance below zero is currently in overdraft and needs immediate attention.
- Entities with zero balances are likely cash-pooled accounts where the balance sweeps to a header account; this is normal.
- Note the largest balances per currency as potential funding sources for later steps.
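A minimal illustrative sketch of this step (not part of the original blueprint; the account records and figures are hypothetical):

```python
from collections import defaultdict

accounts = [
    {"entity": "OpCo DE", "currency": "EUR", "balance": 1_200_000, "investment": False},
    {"entity": "OpCo DE", "currency": "USD", "balance": -45_000,   "investment": False},
    {"entity": "OpCo SE", "currency": "EUR", "balance": 0,         "investment": False},  # pooled, sweeps to header
    {"entity": "HoldCo",  "currency": "EUR", "balance": 9_500_000, "investment": True},   # excluded below
]

entity_ccy_view = defaultdict(float)   # total balance by entity and currency
currency_view = defaultdict(float)     # total balance by currency across entities
for a in accounts:
    if a["investment"]:
        continue                                        # exclude investment accounts
    entity_ccy_view[(a["entity"], a["currency"])] += a["balance"]
    currency_view[a["currency"]] += a["balance"]

overdrafts = {k: v for k, v in entity_ccy_view.items() if v < 0}
# overdrafts -> {("OpCo DE", "USD"): -45000.0} -> currently in overdraft, needs immediate attention
```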
Step 2: Map the Cash Pool Structure
Identify all cash pools: their names, member entities, and currencies. Record which pools exist and which entities participate in each. This is critical context because individual account overdrafts within a pool are offset by other pool members. Only the net pool position determines whether there is a real shortfall.
Step 3: Cash Pool Consolidated Forecast
For each cash pool, sum the ending balance across all member accounts for each forecast week within the horizon. Group by pool name, currency, and week number.
What to look for:
- Any pool where the net balance goes negative in any week. This is a real overdraft that needs funding.
- Pools with rapidly declining balances that may go negative beyond the horizon. Flag as a watch item.
- Note the week-over-week burn rate for declining pools.
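A minimal illustrative sketch of the pool netting and burn-rate check (not part of the original blueprint; the forecast rows are hypothetical member-account ending balances):

```python
from collections import defaultdict

forecast = [
    # (pool, currency, week, member-account ending balance)
    ("EUR Pool", "EUR", 1, 4_000_000), ("EUR Pool", "EUR", 1,  -800_000),
    ("EUR Pool", "EUR", 2, 1_100_000),
    ("EUR Pool", "EUR", 3,  -400_000), ("EUR Pool", "EUR", 4, -1_900_000),
    ("USD Pool", "USD", 1, 5_000_000), ("USD Pool", "USD", 2, 4_800_000),
]

pool_weeks = defaultdict(float)
for pool, ccy, week, ending in forecast:
    pool_weeks[(pool, ccy, week)] += ending       # net position: sum across member accounts

for (pool, ccy, week), net in sorted(pool_weeks.items()):
    if net < 0:
        print(f"{pool} ({ccy}) net negative in week {week}: {net:,.0f} -> real overdraft, needs funding")

# Week-over-week burn rate for a declining pool (watch item):
eur_weeks = [pool_weeks[("EUR Pool", "EUR", w)] for w in (1, 2, 3, 4)]
burn = [earlier - later for earlier, later in zip(eur_weeks, eur_weeks[1:])]
# burn -> [2_100_000, 1_500_000, 1_500_000] per week
```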
Step 4: Non-Pooled Account Forecast
Look at the forecast for accounts that are neither cash-pooled nor investment accounts, focusing on those with a negative ending balance.
What to look for:
- Material negative balances. Ignore sub-100 amounts in the local currency (likely bank fees or rounding).
- Focus on amounts above 10K in local currency terms.
- For each negative account, note the entity, currency, bank, and account code, as these are needed for funding recommendations.
Step 5: Currency-Level Aggregation
Sum all positions (pooled and non-pooled, excluding investments) by currency across all entities for each forecast week. For each currency-week, compute: total balance, sum of all negative account balances, sum of all positive account balances.
Interpretation:
- Total negative = the currency is globally short. Internal transfers cannot fix this. An FX conversion is required.
- Total positive but large negative subtotal = the currency has enough cash overall, but it is in the wrong places. Internal redistribution needed.
- Total barely positive and declining = watch closely. May go short soon.
- Total strongly positive = healthy. Can serve as a funding source for other currencies.
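A minimal illustrative sketch of these interpretation rules (not part of the original blueprint). The 1M thresholds are illustrative stand-ins for "large" and "barely positive", not values from the blueprint:

```python
def classify_currency(total: float, sum_negative: float, prior_total: float) -> str:
    if total < 0:
        return "SHORT: globally short, internal transfers cannot fix this, FX conversion required"
    if sum_negative < -1_000_000:
        return "REDISTRIBUTE: enough cash overall, but in the wrong places"
    if total < prior_total and total < 1_000_000:
        return "WATCH: barely positive and declining, may go short soon"
    return "HEALTHY: can serve as a funding source for other currencies"

# Hypothetical week for three currencies: (total, sum of negative balances, prior week total)
print(classify_currency(-2_300_000, -2_300_000, -1_100_000))   # SHORT
print(classify_currency( 4_000_000, -1_800_000,  4_200_000))   # REDISTRIBUTE
print(classify_currency(   600_000,    -40_000,    900_000))   # WATCH
```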
Step 6: Identify Funding Sources for Each Shortfall
For every shortfall identified in Steps 3-5, work through the funding priority ladder. Always recommend the simplest sufficient action.
- Priority 1: Same entity, same currency, different account — Check if the entity in overdraft has other accounts in the same currency with surplus. Recommend a bank-to-bank or internal book transfer. Simplest action with no intercompany or FX implications.
- Priority 2: Different entity, same currency — Search for other entities holding the same currency with large positive balances. Recommend an intercompany loan. Requires IC loan documentation and coordination with accounting.
- Priority 3: Different entity, different currency (FX trade) — If the currency is globally short, identify the currency with the largest available surplus. Calculate the amount to sell, recommend adding a 20-50% buffer if the trend shows continued weekly burn.
FX considerations:
- Check which entities hold the surplus currency and can execute the trade.
- Prefer entities that already bank with an institution that deals in both currencies.
- If the shortfall entity is in a capital control country, the trade may need to originate from the receiving side.
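A minimal illustrative sketch of the ladder (not part of the original blueprint; the figures are hypothetical, and the 30% buffer is one point in the 20-50% range suggested above):

```python
def recommend_funding(shortfall, same_entity_surplus, other_entity_surplus,
                      fx_surplus_currency, trending_down):
    # Always recommend the simplest sufficient action.
    if same_entity_surplus >= shortfall:
        return f"Priority 1: internal transfer of {shortfall:,.0f} within the entity"
    if other_entity_surplus >= shortfall:
        return f"Priority 2: intercompany loan of {shortfall:,.0f} (same currency)"
    fx_amount = shortfall * 1.3 if trending_down else shortfall   # add buffer if burn continues
    return (f"Priority 3: FX trade, sell {fx_amount:,.0f} equivalent of "
            f"{fx_surplus_currency} (currency is short group-wide)")

print(recommend_funding(
    shortfall=1_900_000,              # e.g. the pool that goes net negative in week 4
    same_entity_surplus=300_000,
    other_entity_surplus=2_600_000,
    fx_surplus_currency="USD",
    trending_down=True,
))
# -> Priority 2: intercompany loan of 1,900,000 (same currency)
```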
Step 7: Investment Accounts as Backup Sources
Pull investment balances for active positions. Record each instrument's name, type (time deposit, MMF, notice account), counterparty, currency, amount, and maturity date.
What to look for:
- Time deposits maturing within the analysis horizon provide natural liquidity without early termination costs.
- Maturities that coincide with a forecast shortfall may eliminate the need for a separate funding action.
- MMFs can typically be redeemed faster than time deposits.
- Do not recommend breaking a time deposit unless operational accounts are insufficient and the shortfall is material.
Step 8: High Balance Alerts
From the Step 1 results, flag entities with unusually high balances. These are candidates for:
- Repatriation: Moving excess cash to the parent entity or treasury center.
- Repayment: Paying down intercompany loans to reduce interest costs.
- Investment: Deploying idle cash into time deposits or MMFs.
Compare the current balance to the forecast trend. If a high balance is growing over the horizon, it is especially worth flagging. Pay special attention to capital control countries where repatriation is restricted.
Report Template¶
Structure the deliverable as a treasury morning brief:
- Executive Summary — 2-3 sentences: overall position health, count of overdraft risks, whether any currency is globally short, total operational cash in major currencies, total investment balances.
- Currency Health Dashboard — Table per currency with weekly balance and status (Strong / Stable / Declining / Tight / SHORT).
- Overdraft Alerts and Recommended Actions — One subsection per alert, ordered by severity (HIGH → MEDIUM → LOW), with specific funding instructions.
- High Balance Alerts (Repatriation Candidates) — Table: Entity, Currency, Balance, USD Equivalent, Notes.
- Time Deposit Maturity Schedule — Table: Maturity Date, Instrument, Amount, Counterparty, Rate.
- Action Summary (Priority Order) — One row per action: Priority, Description, Amount, Type, Deadline.
Alert Severity Classification¶
| Severity | Criteria |
|---|---|
| HIGH | A cash pool goes negative (affects multiple entities), OR a currency is globally short requiring FX |
| MEDIUM | A non-pooled account goes negative by more than 1M in local currency, but surplus exists within the same entity or same currency elsewhere |
| LOW | A non-pooled account goes negative by less than 1M, or surplus is readily available at the same entity and same bank |
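A minimal illustrative sketch of this classification (not part of the original blueprint; the inputs are hypothetical alert attributes):

```python
def classify_severity(pool_negative: bool, currency_globally_short: bool,
                      shortfall_local_ccy: float) -> str:
    if pool_negative or currency_globally_short:
        return "HIGH"     # affects multiple entities, or requires FX
    if shortfall_local_ccy > 1_000_000:
        return "MEDIUM"   # material, but surplus exists in the same entity or currency
    return "LOW"          # small, or surplus readily available at the same entity and bank

print(classify_severity(pool_negative=True,  currency_globally_short=False, shortfall_local_ccy=0))          # HIGH
print(classify_severity(pool_negative=False, currency_globally_short=False, shortfall_local_ccy=2_400_000))  # MEDIUM
print(classify_severity(pool_negative=False, currency_globally_short=False, shortfall_local_ccy=350_000))    # LOW
```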
Funding Action Complexity¶
| Type | Description | Complexity |
|---|---|---|
| Same-entity, same-bank transfer | Move funds between accounts at same bank within same entity | Trivial |
| Same-entity, different-bank transfer | Move funds between banks within same entity | Low |
| Intercompany loan (same currency) | Transfer between entities in same currency | Medium |
| FX conversion | Sell one currency to buy another | Medium |
| Intercompany loan + FX | Transfer between entities AND convert currency | High |
| Investment liquidation | Break a time deposit or redeem MMF early | High |
Capital Control Countries¶
| Country | Currency | Key Restrictions |
|---|---|---|
| China | CNY | SAFE approval for outbound, documentation heavy |
| Brazil | BRL | Central Bank reporting, IOF tax on FX |
| South Korea | KRW | Domestic free, cross-border needs BOK reporting |
| Vietnam | VND | State Bank approval for outbound, limited convertibility |
| Indonesia | IDR | Bank Indonesia reporting for large transfers |
| Kenya | KES | Central Bank approval for large outflows |
Appendix B: What Comes After Translation¶
If the current phase is translation (digitising human labour into AI labour) then the question is: what's the Wave 2 moment? What becomes possible that wasn't before?
I don't have this fully figured out. But the NFX Wave 2 examples share a pattern: they expanded the surface area of what was possible, not just the efficiency of what existed. Shopify didn't make existing merchants faster. It created millions of new merchants. Lyft didn't make taxis cheaper. It created rides that never would have happened.
Some candidates for what this looks like beyond translation:
- Cross-company intelligence that no individual company could build alone. "Companies with your profile and seasonality typically see X in Q4. You're deviating, here's why that matters." No human team, however large, could do this. It requires scale that only exists at the platform level.
- Real-time continuous operations instead of batch mode. Not "faster monthly reporting" but a fundamentally different operating cadence where every transaction triggers immediate downstream analysis, policy checks, and recommendations. Not a faster version of what existed. A different thing entirely.
- Domain intelligence as enterprise infrastructure. Not a tool for one function, but the substrate that makes every other business decision contextually aware. The procurement agent needs cash position before committing to a large order. The sales agent needs payment terms' impact on working capital. The intelligence layer every other agent connects to, not a tool for one team, but infrastructure for the enterprise.
- Governance as competitive advantage. Today, operational data is scattered, inconsistent, and hard to leverage. But if a system continuously structures and validates that data (transaction patterns, counterparty behaviour, payment timing, policy compliance) it creates a new class of enterprise intelligence that didn't exist before. That intelligence becomes leverage: negotiate better terms with banks because you have granular, validated data on your cash flows, payment reliability, and liquidity patterns. Better governance doesn't just reduce risk. It creates new types of competitive advantage for the companies that have it.
None of these are "do the same thing, but with AI." They're things that weren't possible before: new categories of activity, new market participants, new capabilities that only exist because the intelligence layer exists. The agents are the enabler, not the product. The product is what humans can now do, or what can now happen, that couldn't before.
This is the part I find most exciting and least figured out. The industry narrative is "AI agents will do your job." And they will, the mechanical parts. But the bigger opportunity might be in the things nobody is doing today, because the infrastructure to do them doesn't exist yet.
Appendix C: The Evidence in Detail¶
Text-to-SQL, Semantic Layers, and Benchmark Data
There are broadly two approaches to making AI query enterprise data. The first is text-to-SQL: the LLM receives a database schema and generates SQL directly. This is flexible: users can ask questions that were never pre-defined in a dashboard, and you don't need to anticipate every metric combination upfront. The second is a semantic layer approach: structured metric definitions, business logic, and relationship mappings sit between the model and the raw data, and the LLM queries business concepts rather than raw tables. Snowflake's Cortex Analyst and Databricks' AI/BI Genie both take this approach, requiring semantic models as the foundation for their AI analytics products.
Neither approach, on its own, solves the problem.
Text-to-SQL progress and limitations:
On Spider 1.0, the original academic benchmark with clean schemas and straightforward questions, the latest models (Claude Opus 4.5, GPT-5) now score 91-94%. That benchmark is essentially solved. But enterprise databases aren't clean schemas with straightforward questions:
- BIRD dev set (real-world messy schemas): GPT-4 scored ~57% in 2024. Claude Opus 4.5 reaches ~73% in 2026. Better, but far from trustworthy.
- Spider 2.0 (enterprise workflows, multi-step reasoning): GPT-4o manages 10.1%. o1-preview: 17.1%. Claude 3.7 with an agent framework: 25.4%.
- BIRD-Interact (conversational, interactive queries): o3-mini achieves 24.4%. Claude 3.7: 17.8%.
The failure modes are well-documented: join logic errors that silently produce wrong results rather than errors, hallucinated column names that cascade through complex queries, missed secondary conditions in aggregations. Google Cloud's engineering team describes these as the "six failures of text-to-SQL" and recommends multi-agent correction pipelines to catch them.
Semantic layers dramatically improve accuracy:
In late 2023, dbt Labs benchmarked GPT-4 on enterprise natural language questions: 16.7% accuracy against raw schemas, jumping to 83% with a properly defined semantic layer. Snowflake's Cortex Analyst, which pairs an agentic AI system with a semantic model, achieves 90%+ accuracy on real-world enterprise queries. Snowflake's own documentation acknowledges that "generic AI solutions often struggle with text-to-SQL conversions when given only a database schema"; the semantic model is what bridges that gap.
The MotherDuck counterpoint:
MotherDuck argues that "your data model IS the semantic layer." Testing 500 queries from the BIRD benchmark against frontier LLMs using only well-modelled schemas (descriptive column names, clean relationships, max 2-3 join hops), they achieved 95% accuracy without any additional semantic layer. The catch: those conditions (simple joins, descriptive names, shallow query depth) describe inherently simple tasks. Whether clean data modelling alone scales to enterprise complexity is an open question.
The pattern across all of this is the same: the structured knowledge between the raw data and the question is what determines accuracy. The approach matters less than the quality of what sits underneath.