Repos, Tokens & Context: It's not the size of the window, it's how you use it that counts
LLMs don’t lose their minds; they lose their tokens. And with them, their ability to count.
How much of the UK Repo market is cleared? I want to give LLMs another shot at getting it right.
The latest stop on my LLMs vs Excel vs Python journey took me deep into how context windows work in large language models. Understanding the context window, it turns out, is critical when working with LLMs and structured financial datasets.
Even frontier models like GPT-4 can lose coherence when overloaded with raw data. I found that LLMs work best as summarizers, not calculators. Here’s what happened when I tried to help the model out.
What do we want from an LLM?
This blog explores whether models from OpenAI, Anthropic, and Google can support niche capital markets use cases. Could we simply upload market data to an LLM and have it:
Understand what the data describes?
Identify the data source (e.g. UK SFTR from the DTCC APA)?
Respond to user demands?
Manipulate the data on our behalf?
To test that workflow, I used UK SFTR data from last week’s blog. The data is loosely based on gilt repo transactions, and it lets us explore how LLMs cope with different context window sizes and formats.
Context Windows
I quickly discovered that managing the “context window” of an LLM is a critical part of the user journey above. If we send data to ChatGPT:
There is a limit to the amount of data that an LLM can process at one time.
It is somewhat akin to the number of rows a person can read on a spreadsheet if it is displayed on screen.
LLM context windows are measured in “tokens” (not words, letters, phrases or bytes; see the FT explanation of gen AI and LLMs here).
In spreadsheet terms, a CSV row with 8 fields comes to roughly 30-40 tokens, depending on the exact data (see the sketch below).
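You can sanity-check the tokens-per-row figure yourself with OpenAI’s tiktoken library (the sample row below is made up, but in the spirit of the SFTR data):

```python
import tiktoken  # pip install tiktoken

# Encoder used by the GPT-4 family of models
enc = tiktoken.encoding_for_model("gpt-4")

# A made-up CSV row with 8 fields, in the spirit of the SFTR data
row = "2025-04-11,NonGB,GB,NonGB,REPO,Yes,1800000000000,60000"

print(f"{len(row.split(','))} fields -> {len(enc.encode(row))} tokens")
```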
One of the popular models in ChatGPT is called GPT-4. It has:
An 8,192-token context window.
Equating to analysis of ~180 rows of data with ~8 columns.
It is the default model when making API calls to OpenAI (mainly because it is cheap, not because it is the “best”).
Moving up to the newer GPT-4o offers:
A 128,000-token window.
Equating to 3,000+ rows of data.
It is a more expensive API call, which is why the models with smaller context windows are still offered. Progress don’t come for free, people!
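For the curious, the back-of-the-envelope arithmetic behind those row counts, assuming ~35 tokens per row and leaving some headroom for the prompt and the model’s reply:

```python
TOKENS_PER_ROW = 35        # rough figure for an 8-column CSV row
PROMPT_AND_REPLY = 2_000   # headroom for instructions plus the answer

for model, window in [("GPT-4", 8_192), ("GPT-4o", 128_000)]:
    rows = (window - PROMPT_AND_REPLY) // TOKENS_PER_ROW
    print(f"{model}: ~{rows:,} rows of data fit in the window")
```

That gives roughly 176 rows for GPT-4 and 3,600 for GPT-4o, in line with the estimates above.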
Fear and Loathing with ChatGPT
My first attempt with the SFTR data produced the most explicit LLM hallucination I’ve seen. I uploaded my SFTR data (see it for yourself here), shown below:
ChatGPT then confidently asserted:
Errr… no. The Raw File does NOT have columns for Asset Type, Collateral Country, Currency or Maturity Bucket. I would LOVE a file like that! ChatGPT be like:
Most users would lose trust at that point. I decided instead to manage the data carefully before passing it to the LLM.
Vibing with ChatGPT
On the positive side of LLMs (!), ChatGPT was able to take the SFTR data from last week, combine it with my old Streamlit script, and quickly produce a new SFTR dashboard for me:
The main point of the exercise was to send specific data into the ChatGPT context window (instead of uploading files). A dashboard plugin lets me select which columns and rows to send to an LLM based on the query and time period. But even then, I hit the GPT-4 8,000-token wall immediately.
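A minimal sketch of that “send only what the query needs” step (the openai and tiktoken calls are the standard ones; the file and column names are placeholders):

```python
import pandas as pd
import tiktoken
from openai import OpenAI  # pip install openai

df = pd.read_csv("uk_sftr.csv")  # placeholder file name

# Send only the columns and weeks relevant to the user's question
subset = df.loc[df["week"] >= "2025-04-01", ["week", "venue", "cleared", "lent"]]
payload = subset.to_csv(index=False)

# Guard against blowing the context window before calling the API
enc = tiktoken.encoding_for_model("gpt-4")
assert len(enc.encode(payload)) < 6_000, "payload too big for GPT-4's 8k window"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Summarise cleared vs uncleared repo volumes:\n" + payload}],
)
print(response.choices[0].message.content)
```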
Switching to GPT-4o gave me more room, but the output was just as bad:
It’s great to have a large context window… except when it means ChatGPT can spew out nonsense like this.
ChatGPT goes RAGing
Going back to our context window: if you ask an LLM to perform an operation on the data in its window, it generates tokens for each raw number as it explains what it is trying to do. It literally writes out “the sum of week one, week two, week three repo volumes is equal to…” and tries to keep all of this “in memory”. The LLMs then get overwhelmed (!) and lose focus and arithmetic precision as a result. How very human.
To avoid this, ChatGPT recommended that I try “Retrieval-Augmented Generation” (RAG). LlamaIndex and LangChain are RAG tools that parse data into documents, so that each row of a CSV is treated as a document. I effectively pass a series of long text strings to the LLM, such as:
“Week of 2025-04-11, Venue: NonGB, Reporting CP: GB, Other CP: NonGB, SFT: REPO, Cleared: Yes, Lent: €1.8tn, Txns: 60,000”
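In plain pandas, that row-to-document step looks roughly like this (LlamaIndex and LangChain wrap the same idea in Document objects; the file and column names here are my assumptions):

```python
import pandas as pd

df = pd.read_csv("uk_sftr.csv")  # placeholder file name

def row_to_document(row: pd.Series) -> str:
    """Turn one CSV row into the long text string handed to the LLM."""
    return (f"Week of {row['week']}, Venue: {row['venue']}, "
            f"Reporting CP: {row['reporting_cp']}, Other CP: {row['other_cp']}, "
            f"SFT: {row['sft_type']}, Cleared: {row['cleared']}, "
            f"Lent: {row['lent']}, Txns: {row['txns']}")

documents = [row_to_document(r) for _, r in df.iterrows()]
print(documents[0])
```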
My hope was that LLMs could reason over this structured, object-style text. They couldn't.
RAG made no difference (see below). The LLM still returned incorrect calculations, most likely due to poor chunking, a bloated context window, or both.
Multi-Select
I therefore concluded (with the help of ChatGPT) that RAG is powerful when:
You have unstructured or semi-structured data (e.g. clearing house Rulebooks).
You want GPT to synthesize across disparate info.
You don't know exactly how users will ask things.
In the case of SFTR data (and maybe all financial data?):
It is better to pre-group data across multiple dimensions (SFT type, venue and clearing status in the case of the UK SFTR data); see the sketch after this list.
This has benefits in precision (no hallucinations), speed and cost (the aggregation itself needs no LLM API calls).
The LLM can then be used to provide natural-language summaries of results or to help explain patterns or anomalies.
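A sketch of the pre-aggregation pattern (file and column names again assumed): pandas does the arithmetic deterministically, and only the tiny summary table ever enters the context window:

```python
import pandas as pd

df = pd.read_csv("uk_sftr.csv")  # placeholder file name

# Pre-group across the dimensions users actually query; pandas does the
# sums, so there is nothing for the LLM to miscount
summary = (df.groupby(["sft_type", "venue", "cleared"], as_index=False)
             .agg(total_lent=("lent", "sum"), total_txns=("txns", "sum")))

# Only this handful of rows goes to the LLM, purely for narration
prompt = ("Summarise the following UK SFTR aggregates in plain English:\n"
          + summary.to_string(index=False))
```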
The Result
Employing the concept of “multi-select”, I arrive at a rich Streamlit dashboard providing data manipulation over multiple axes (the wiring is sketched at the end of this section). It looks much nicer than an Excel pivot table, but it is essentially the same thing:
But scrolling down, I now have an LLM doing LLM-type things, and providing an accurate summary of the data:
There are still gotchas here: there is no need for the raw numbers, it is too lengthy, and GB venues are Great and British, not Government Bonds. But it is significantly better as a result of careful management of the context window.
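For anyone rebuilding the multi-select dashboard, the Streamlit wiring is roughly the following (st.multiselect is the real widget; the data file and column names are placeholders):

```python
import pandas as pd
import streamlit as st

df = pd.read_csv("uk_sftr.csv")  # placeholder file name

# One multi-select per axis, defaulting to everything
venues = st.multiselect("Venue", sorted(df["venue"].unique()),
                        default=sorted(df["venue"].unique()))
cleared = st.multiselect("Cleared", sorted(df["cleared"].unique()),
                         default=sorted(df["cleared"].unique()))

# Filter, then pre-aggregate; the LLM only ever narrates the result
filtered = df[df["venue"].isin(venues) & df["cleared"].isin(cleared)]
st.dataframe(filtered.groupby("week", as_index=False)["lent"].sum())
```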
In Summary
Context windows define the limits of what LLMs can reason about. Know your limits.
Even GPT-4 breaks down under too much structured input.
It’s not garbage in, garbage out — it’s overload in, nonsense out.
RAG struggles with clean, numerical data. Pre-aggregation works better.
LLMs shine as summarizers, not calculators.
Next time? I’ll be exploring whether Large Data Models can move beyond text and truly understand structured market data as objects.