Journal · 2026-04-26 · Engineering

Persistent memory
— with dates.

Four layers, each defending a different failure mode. The result is a private intelligence that can answer what was true on the 14th, and what has changed since.

The single most repeated complaint about AI assistants in 2026 is not that they are unintelligent. It is that they forget. They forget what you said on Monday by Tuesday morning. They forget which client is which. They forget the supplier you decided to drop in March, and they cheerfully recommend them again in October.

The fix is not a bigger context window. The fix is structural: a memory architecture that records, retrieves, relates, and time-stamps every fact the system has been given. OZRIC ships one. This is the technical write-up.

The shape of the problem

An operator’s memory has four properties at once. It is indexed — you know which folder the contract is in. It is semantic — you can recall what a conversation was about even if you cannot remember the exact words. It is relational — you know that Roman is the cofounder of Just Foster, that Just Foster is one of OZRIC AI’s clients, that OZRIC AI is your firm. And it is temporal — you know that the fact "Just Foster is parked" was true in October and is not true now.

No single data structure does all four. The naive AI memory pattern — throw everything into a vector store and hope — handles only the second. It misses the index, it misses the relations, and it cannot answer a question that has a time axis. The OZRIC stack adds three more layers, each defending the failure mode the layer below it has.

L0 — The markdown index

The cheapest, most underrated layer. Every fact that is canonically true right now is a one-line entry in a flat markdown index, with a link to the topic file that holds the detail. Project state, client identity, brand rules, contact details, locked decisions — the kind of fact that an experienced PA would know off the top of their head.

The index is human-written and human-edited. There is no embedding step. Retrieval is grep. The reason this layer comes first is that it answers the question "what is canonically true right now" in milliseconds, with zero hallucination surface. If the answer is in the index, the system never reaches for the more expensive layers below.

The cost of running the index is editorial discipline: every new project, every status change, every decision must land in the right one-line entry. The benefit is that the most-asked questions never touch a vector store, never invoke the model, never pay a token.
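
As a rough illustration of how little machinery this layer needs, here is a minimal sketch of a grep-style lookup over a flat index. The index entries, file paths, and the lookup function are hypothetical stand-ins, not the actual OZRIC index format.

```python
import re

# Hypothetical index content: one canonical fact per line, each linking
# to the topic file that holds the detail. Not the real OZRIC index.
INDEX_MD = """\
- Just Foster: client, Meta pipeline active -> [just-foster.md](topics/just-foster.md)
- Brand rules: locked -> [brand.md](topics/brand.md)
- Roman: cofounder of Just Foster -> [roman.md](topics/roman.md)
"""


def lookup(index_text: str, term: str) -> list[str]:
    """Return every index line that mentions the term, case-insensitively."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [line for line in index_text.splitlines() if pattern.search(line)]


hits = lookup(INDEX_MD, "just foster")
for line in hits:
    print(line)
```

No embedding and no model call is involved: the lookup is a plain text match, which is why a miss here is cheap and a hit is exact.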

L1 — Vector retrieval

The semantic layer. Everything written into the OZRIC AI knowledge wiki — topic files, project memos, technical documentation, banked research — is chunked, embedded, and stored in a local vector index.

The implementation is a local vector store with a local embedding model. Embeddings are 768-dimensional. The vector index runs entirely on the operator’s paired Mac. It never touches a remote API.

The retrieval path is straightforward: the user’s question is embedded with the same model, the top-k nearest chunks are pulled by cosine similarity, and the chunks are passed into the model as context. Latency is sub-second on a Mac mini. Cost is nil — embeddings are local, retrieval is a vector dot product, and the only token cost is the prompt that uses the retrieved context.

L1 answers questions of the shape "what does the system know about X?". It is the workhorse layer. Most everyday questions are answered by L0 plus L1 alone.
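
The retrieval path can be sketched in a few lines. This is an illustration only: the three-dimensional vectors below stand in for real 768-dimensional embeddings, the chunk texts are invented, and a production vector store would use an index rather than a linear scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors (assumption): real embeddings would come from the local model.
CHUNKS = [
    ("Just Foster onboarding memo", [0.9, 0.1, 0.0]),
    ("Brand colour rules", [0.0, 0.2, 0.9]),
    ("Just Foster Meta pipeline notes", [0.8, 0.3, 0.1]),
]

query = [1.0, 0.2, 0.0]  # stand-in for the embedded user question
results = top_k(query, CHUNKS, k=2)
print(results)
```

The returned chunk texts are what would be passed to the model as context.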

L2 — The knowledge graph

The relational layer. As each document is ingested, a parallel pass with a larger local model extracts named entities and relationships and writes them into a graph database. The graph stores the structural facts that vectors cannot represent — that Roman is the cofounder of Just Foster, that Just Foster is a client of OZRIC AI, that OZRIC AI operates the Meta pipeline for Just Foster.

The reason a graph is necessary is that vectors are good at similarity but bad at structure. A question such as "who are OZRIC AI’s UK fostering clients?" is a structural lookup of the same kind as "what is the capital of France?": the answer is a specific related entity, not the nearest passage, and cosine similarity has no way to express that. A graph traversal answers it cleanly: walk from the OZRIC AI node along the has-client edge, filter by vertical = fostering and country = UK, return the matching nodes.
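
The traversal described above can be sketched with an in-memory graph. The class names, the has_client edge label, and the second client are hypothetical, chosen only to show the walk-then-filter shape; the real layer sits on a graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    props: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, edge_type, target)

    def add_node(self, name: str, **props) -> None:
        self.nodes[name] = Node(name, props)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self.edges.append((source, edge_type, target))

    def neighbours(self, source: str, edge_type: str, **filters) -> list[str]:
        """Follow one typed edge from source, keeping targets whose properties match."""
        matches = []
        for src, etype, dst in self.edges:
            if src == source and etype == edge_type:
                props = self.nodes[dst].props
                if all(props.get(key) == value for key, value in filters.items()):
                    matches.append(dst)
        return matches


g = Graph()
g.add_node("OZRIC AI")
g.add_node("Just Foster", vertical="fostering", country="UK")
g.add_node("Example Retail Co", vertical="retail", country="UK")  # hypothetical client
g.add_edge("OZRIC AI", "has_client", "Just Foster")
g.add_edge("OZRIC AI", "has_client", "Example Retail Co")

uk_fostering = g.neighbours("OZRIC AI", "has_client", vertical="fostering", country="UK")
print(uk_fostering)
```

The filter is applied to node properties after the edge walk, which is the step a pure similarity search has no equivalent for.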

The graph also gives you the visualisation we ship inside OZRIC — the orb you see breathing on the homepage. It is not decoration. It is the live structure of the memory under the system, rendered. As of the most recent build:

3,735 nodes — entities the system has learned about across the wiki and project memory.
7,590 connections — typed relationships between those entities.
140 communities — clusters of densely related entities that the graph algorithm has identified as coherent topics.

The graph compresses the corpus by a factor of roughly fifty-seven when used as a retrieval surface for a model. That compression is the practical reason OZRIC can hold the entirety of an operator’s context without burning tokens.

L3 — The temporal fact engine

The hardest layer. The one that no chatbot replicates. The one that is the moat.

The previous three layers all describe state. They tell you what is currently in the index, what is currently in the wiki, what is currently in the graph. They cannot, by themselves, tell you what was true on the 14th of February. They cannot tell you what has been replaced since. They cannot tell you which decision superseded which.

The temporal fact engine adds a fourth dimension to the memory: time. Every fact written into the system carries valid_from and valid_to timestamps. Every status change — "deal status: parked" superseding "deal status: warm" — is a separate node in the temporal graph, linked to the previous version with a supersedes edge.
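
A minimal sketch of what one such write could look like, assuming day-level granularity and inclusive validity bounds. The Fact class, the supersede helper, and the dates are illustrative assumptions, not the engine's actual schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None       # None means the fact is still valid
    supersedes: Optional["Fact"] = None   # link back to the previous version


def supersede(current: Fact, new_value: str, on: date) -> Fact:
    """Close the current fact and return a new version linked to it.

    Assumption: bounds are inclusive, so the old fact ends the day before
    the new one starts and no date belongs to both versions.
    """
    current.valid_to = on - timedelta(days=1)
    return Fact(current.subject, current.attribute, new_value,
                valid_from=on, supersedes=current)


# Hypothetical dates, for illustration only.
warm = Fact("Just Foster", "deal status", "warm", valid_from=date(2025, 9, 1))
parked = supersede(warm, "parked", on=date(2025, 10, 14))
print(parked.value, "supersedes", parked.supersedes.value)
```

The old version is never overwritten or deleted: it is closed and kept, which is what makes later questions about the past answerable.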

The implementation is a temporal-aware graph store with extraction running locally on the paired Mac. The combination lets the system answer four shapes of question that the layers above cannot:

When did X change? Walk the supersession edges from the current node back to the first version.
What was true on date Y? Filter every fact in the graph by valid_from ≤ Y ≤ valid_to, treating a fact with no valid_to as still valid, and return the snapshot.
What has been replaced since date Y? Find every node whose valid_from is later than Y and that supersedes an earlier node, then return the lineage.
Why does the system believe X? Return the source episode that introduced the fact, with the source document and the date it was added.
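
The four question shapes above map onto four small functions. The sketch below runs over an in-memory list rather than a graph store, and the Fact fields, source paths, and dates are hypothetical; a fact with no valid_to is treated as still valid.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None       # None means still valid
    supersedes: Optional["Fact"] = None
    source: str = ""                      # document that introduced the fact


def history(current: Fact) -> list[Fact]:
    """When did X change? Walk supersedes edges back to the first version."""
    chain, node = [], current
    while node is not None:
        chain.append(node)
        node = node.supersedes
    return chain


def snapshot(facts: list[Fact], day: date) -> list[Fact]:
    """What was true on a date? Keep facts whose validity window covers it."""
    return [f for f in facts
            if f.valid_from <= day and (f.valid_to is None or day <= f.valid_to)]


def replaced_since(facts: list[Fact], day: date) -> list[tuple[Fact, Fact]]:
    """What has been replaced since a date? Return (old, new) version pairs."""
    return [(f.supersedes, f) for f in facts
            if f.supersedes is not None and f.valid_from > day]


def provenance(fact: Fact) -> tuple[str, date]:
    """Why does the system believe X? Return the source and the date it was added."""
    return fact.source, fact.valid_from


# Hypothetical data, for illustration only.
warm = Fact("Just Foster / deal status", "warm", date(2025, 9, 1),
            valid_to=date(2025, 10, 13), source="memos/kickoff.md")
parked = Fact("Just Foster / deal status", "parked", date(2025, 10, 14),
              supersedes=warm, source="memos/review.md")
facts = [warm, parked]

print([f.value for f in history(parked)])
print([f.value for f in snapshot(facts, date(2025, 9, 20))])
print([(old.value, new.value) for old, new in replaced_since(facts, date(2025, 10, 1))])
print(provenance(parked))
```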

That last property, provenance, is the one that matters most for a high-net-worth (HNW) operator. The system does not just answer; it shows its work. Every claim has a date, a source, and a lineage. There is no black-box recall.

How the four layers compose at query time

The query path is deliberately layered, cheapest-first:

Read the L0 markdown index. If the answer is canonical and timeless, return it.
If the question has a time axis (when, before, since, changed, replaced), route directly to L3 and run a temporal query.
Otherwise, run vector retrieval over L1 to fetch the semantically nearest chunks.
If the question is structural (who, which, related to), augment the L1 result with an L2 graph traversal so the answer carries the right relational shape.
Compose the prompt with whichever subset of layers contributed and pass it to the model. Cite the layer each claim came from in the answer.
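
The routing steps above can be sketched as a small function. The keyword lists come from the steps themselves; the function name, the index_hit flag (standing in for "L0 returned a canonical, timeless answer"), and the use of simple keyword matching are assumptions made for illustration, and a real router may classify questions differently.

```python
import re

# Cue words taken from the routing steps; keyword matching is an assumption.
TEMPORAL = re.compile(r"\b(when|before|since|changed|replaced)\b", re.IGNORECASE)
STRUCTURAL = re.compile(r"\b(who|which|related to)\b", re.IGNORECASE)


def route(question: str, index_hit: bool) -> list[str]:
    """Return the layers a question should consult, cheapest first."""
    if index_hit:
        return ["L0"]                 # canonical, timeless answer found in the index
    if TEMPORAL.search(question):
        return ["L3"]                 # time axis: go straight to the temporal engine
    layers = ["L1"]                   # default: semantic retrieval
    if STRUCTURAL.search(question):
        layers.append("L2")           # structural: add a graph traversal
    return layers


print(route("What are the brand rules?", index_hit=True))
print(route("What has changed since March?", index_hit=False))
print(route("Who are the UK fostering clients?", index_hit=False))
print(route("Summarise the onboarding memo", index_hit=False))
```

The returned list also gives the composer what it needs to cite the layer each claim came from.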

Most queries hit one or two layers. Few hit all four. The system is fast because of the routing, not because the layers are individually fast.

Why this is hard to build

Three reasons.

First, the temporal layer is not a feature you bolt on. It has to exist from day one. Once you have written six months of facts into a vector store with no valid_from timestamp, you cannot retrofit the timeline. We learned this the hard way; the L3 layer is the one we rebuilt last.

Second, every layer needs a different kind of write discipline. L0 is editorial. L1 is automatic on every wiki edit. L2 needs a larger model and a larger compute budget. L3 needs explicit episode-style writes whenever a status changes. A system that gets one of these wrong looks fine in week one and is unusable in month six. We have working daemons, hooks, and ledgers for each one. They are dull. They are also the reason the architecture compounds.

Third, every layer must run locally. Sending a temporal fact graph to a remote API is a non-starter for an HNW operator. The full memory stack runs on the operator’s paired Mac — the same Mac that hosts the rest of the OZRIC AI System. Zero data leaves the estate by default.

The moat, in one line

A chatbot that has a vector store can answer "what does Roman do?". OZRIC can answer the same question and add: "He’s the cofounder of Just Foster. That fact has been true since the 28th of October. Before then, the system had him recorded as a partner candidate only. Source: the project memo Romaan opened in week two of the deployment."

That is the difference between a memory and a search. It is also the difference between a chatbot and a private intelligence.

Where this lives, in your install

When OZRIC AI provisions an OZRIC estate for an operator, the full four-layer memory stack ships with it. The markdown index is bootstrapped from a structured intake interview. The vector layer is seeded from your existing documents. The graph layer extracts entities from those documents on first run. The temporal layer begins recording from day one of the install and grows from every approved decision the system observes after that. By month three, the system knows your business in the way a senior assistant who joined in January would.

The phone is the keyhole. The Mac is the estate. The four-layer memory is the reason the estate is worth having.

Romaan Sheikh is the founder and sole director of OZRIC AI, the UK studio behind OZRIC. The four-layer memory architecture runs entirely on the operator’s paired Mac. OZRIC AI LIMITED is registered at UK Companies House (17241218).
