Cognitive Foundations for AI · Part 4

Short-term Memory Beyond the Context Window

January 21, 2026 · 6 min read

Working memory holds and manipulates information for short periods, from seconds to minutes. It differs from long-term memory, which stores information indefinitely. When you remember a phone number just long enough to dial it, or hold driving directions in mind while navigating, you're using working memory. The Fidelius memory framework applies a similar architecture to manage conversation context (the contents of the model's context window) without overloading its processing capacity.

Baddeley and Hitch's Multi-component Model

In 1974, Alan Baddeley and Graham Hitch proposed a multi-component model of working memory in Psychology of Learning and Motivation. Rather than treating short-term memory as a single passive store, they described a system with distinct components:

Central executive: Directs attention, allocates resources, and coordinates the other components. It handles task-switching and suppresses distractions.
Phonological loop: Holds speech-based information through verbal rehearsal, like repeating a grocery list to yourself.
Visuospatial sketchpad: Manages visual and spatial information, such as mentally rotating an object.
Episodic buffer (added in 2000): Integrates information from the other subsystems with long-term memory and sensory input, binding them into coherent chunks.

Functional MRI studies reviewed in Psychological Bulletin have mapped these components to brain regions: the prefrontal cortex for the central executive, temporal lobes for the phonological loop, and parietal areas for visuospatial processing. Working memory capacity varies between individuals, typically holding 4-7 items, and can be affected by stress, aging, or training.

The model emphasizes active manipulation rather than passive storage. Solving a math problem, for example, requires holding numbers (phonological loop), visualizing the problem (visuospatial sketchpad), and accessing prior knowledge of arithmetic (episodic buffer), with the central executive coordinating the process.

Real-World Examples

Mental Arithmetic

Calculating a 20% tip on an $85.50 bill: your phonological loop holds the numbers, your visuospatial sketchpad helps you visualize the multiplication, and your episodic buffer pulls in your knowledge of percentages. The central executive keeps you focused despite background noise. Losing any component mid-calculation causes you to start over.

Navigation

Following directions before GPS was ubiquitous: you repeat the directions aloud (phonological loop) while building a mental map of the turns (visuospatial sketchpad). The episodic buffer connects this to memories of similar routes. The central executive switches your attention between the route and traffic.

Conversation

During a debate: you hold your opponent's argument (phonological loop), imagine counterexamples (visuospatial sketchpad), and draw on past discussions (episodic buffer). The central executive manages turn-taking and filters out irrelevant thoughts.

Working Memory in Fidelius

Fidelius uses four long-term networks for persistent storage: World (facts), Experience (personal events), Opinion (judgments), and Observation (patterns). For real-time processing, it adds working memory components:
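To make the four-network split concrete, here is a minimal sketch of how tagged long-term items could be represented. The class and method names (Network, MemoryItem, LongTermStore, by_network) are illustrative assumptions, not Fidelius's actual API, and the in-memory list stands in for real persistent storage.

```python
from dataclasses import dataclass, field
from enum import Enum


class Network(Enum):
    """The four long-term networks named in the text."""
    WORLD = "world"              # facts
    EXPERIENCE = "experience"    # personal events
    OPINION = "opinion"          # judgments
    OBSERVATION = "observation"  # patterns


@dataclass
class MemoryItem:
    network: Network
    text: str


@dataclass
class LongTermStore:
    """Hypothetical in-memory stand-in for persistent storage."""
    items: list = field(default_factory=list)

    def add(self, network: Network, text: str) -> None:
        self.items.append(MemoryItem(network, text))

    def by_network(self, network: Network) -> list:
        # Return the text of every item filed under the given network.
        return [item.text for item in self.items if item.network is network]


store = LongTermStore()
store.add(Network.OPINION, "User prefers rounding up tips.")
store.add(Network.EXPERIENCE, "User was frustrated by rush-hour traffic last month.")
print(store.by_network(Network.OPINION))
```

Tagging each item with its network lets later stages treat facts, events, judgments, and patterns differently without needing four separate stores.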

Active conversation context is the core working memory, holding the current session's state including user queries and recent responses.

Memory recall at inference is the episodic buffer equivalent. It retrieves relevant information from the long-term memory networks through a pipeline that combines semantic search, keyword matching, and graph traversal.
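One way to picture a pipeline that combines the three signals is a weighted score. The sketch below is an assumption about how such a fusion might look, not a description of Fidelius internals: bag-of-words cosine stands in for embedding-based semantic search, the graph step is a single hop over hypothetical links between memories, and the weights are arbitrary.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity, a toy stand-in for embedding similarity."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of distinct query words that appear in the memory text."""
    q = set(tokens(query))
    return len(q & set(tokens(text))) / len(q) if q else 0.0


def recall(query, memories, links, top_k=2, weights=(0.5, 0.3, 0.2)):
    """Rank memory ids by a weighted sum of the three signals.

    memories: dict of id -> text; links: dict of id -> set of neighbor ids.
    """
    w_sem, w_kw, w_graph = weights
    base = {
        mid: w_sem * cosine(query, text) + w_kw * keyword_overlap(query, text)
        for mid, text in memories.items()
    }
    # One-hop graph step: a memory gains a share of its best neighbor's score.
    scored = {
        mid: score + w_graph * max(
            (base[n] for n in links.get(mid, ()) if n in base), default=0.0
        )
        for mid, score in base.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


memories = {
    "tip": "User prefers rounding up the tip",
    "dinner": "User had dinner at an Italian restaurant",
    "commute": "User commutes on I-95",
}
links = {"tip": {"dinner"}, "dinner": {"tip"}}
top = recall("budget for dinner with tip", memories, links)
print(top)
```

The graph term lets a memory with no direct word overlap still surface when it is linked to something relevant, which is the role graph traversal plays alongside the two text-matching signals.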

Agent orchestration is the central executive. It controls what to retain, recall, or process, and enforces token limits to prevent overload.
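Enforcing a token limit can be sketched as a greedy selection over prioritized candidates. Everything here is an assumption made for illustration: the function names, the priority numbers, and especially the word-count proxy for tokens (a real system would count with the model's own tokenizer).

```python
def count_tokens(text):
    """Crude proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())


def fit_to_budget(candidates, budget):
    """Keep the highest-priority texts that fit within the token budget.

    candidates: list of (priority, text); higher priority is considered first.
    Returns the kept texts in their original order.
    """
    order = sorted(range(len(candidates)), key=lambda i: candidates[i][0], reverse=True)
    kept, used = set(), 0
    for i in order:
        cost = count_tokens(candidates[i][1])
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [candidates[i][1] for i in sorted(kept)]


candidates = [
    (3, "User: Help me budget for dinner"),
    (1, "Assistant: Earlier small talk about the weather and weekend plans"),
    (2, "Opinion: User prefers rounding up tips"),
]
context = fit_to_budget(candidates, budget=12)
print(context)
```

With a budget of 12 words, the current query and the stored opinion fit, and the low-priority small talk is dropped.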

Behavioral profile injection influences processing in a way analogous to how personality traits affect human cognition. A profile with high skepticism, for instance, might prioritize fact-checking.
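As a rough illustration of how a trait could change prioritization, the sketch below re-ranks recalled items using a skepticism value. The Profile class, the 0-to-1 trait scale, and the specific weighting formula are all hypothetical; the text only says that a skeptical profile might prioritize fact-checking.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    skepticism: float  # hypothetical scale: 0.0 (trusting) to 1.0 (highly skeptical)


def rerank(results, profile):
    """Re-rank recalled items according to the profile.

    results: list of (network, text, relevance). A skeptical profile
    up-weights World facts and down-weights Opinion items.
    """
    def adjusted(item):
        network, _, relevance = item
        if network == "world":
            return relevance * (1 + profile.skepticism)
        if network == "opinion":
            return relevance * (1 - 0.5 * profile.skepticism)
        return relevance

    return [text for _, text, _ in sorted(results, key=adjusted, reverse=True)]


results = [
    ("opinion", "Ethical frameworks drive innovation", 0.8),
    ("world", "TechCorp stock rose 15%", 0.6),
]
trusting = rerank(results, Profile(skepticism=0.0))
skeptical = rerank(results, Profile(skepticism=0.9))
print(trusting[0])
print(skeptical[0])
```

The same recalled items come back in a different order depending on the profile, so the opinion leads for a trusting profile and the fact leads for a skeptical one.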

Before generating a response, Fidelius assembles context through semantic searches across its networks, ranks the results, and interprets them according to the user's profile. After responding, it extracts new information for long-term storage, clearing the working memory for the next turn.
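The assemble, respond, extract, and clear cycle described above can be sketched as a small loop. This is a simplified illustration under stated assumptions: the Session class and its method names are invented here, recall is plain word overlap, ranking and profile interpretation are omitted, and the generate and extract callables are placeholders for the model call and the extraction step.

```python
class Session:
    """Toy turn loop; names and word-overlap recall are illustrative assumptions."""

    def __init__(self, long_term):
        self.long_term = list(long_term)  # stand-in for the long-term networks
        self.working = []                 # working memory for the current turn

    def assemble(self, query):
        # Recall any stored memory that shares at least one word with the query.
        words = set(query.lower().split())
        recalled = [m for m in self.long_term if words & set(m.lower().split())]
        self.working = [f"query: {query}"] + [f"memory: {m}" for m in recalled]
        return list(self.working)

    def turn(self, query, generate, extract):
        context = self.assemble(query)                    # 1. assemble context
        reply = generate(context)                         # 2. generate a response
        self.long_term.extend(extract(query, reply))      # 3. store new information
        self.working.clear()                              # 4. clear working memory
        return reply


session = Session(["user prefers rounding up tips"])
reply = session.turn(
    "what tips should I leave",
    generate=lambda ctx: f"used {len(ctx)} context lines",
    extract=lambda q, r: [f"asked: {q}"],
)
print(reply)
```

After the turn, the working list is empty again while the long-term list has grown by one item, which mirrors the separation between the two kinds of memory.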

AI Examples

Budget Calculation

User query: "Help me budget for an $85.50 dinner with 20% tip."

The active context holds the numbers and query. The recall mechanism retrieves a stored opinion: "User prefers rounding up tips." Agent orchestration performs the calculation while checking for relevant patterns.

Response: "Base: $85.50. Tip at 20%: $17.10. Total: $102.60. You usually round up, so an $18 tip would make it $103.50."
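The figures in that response can be checked with a few lines of decimal arithmetic. The function name tip_breakdown is a hypothetical helper, and rounding the tip up to the next whole dollar is an assumption about what "rounding up" means here.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_HALF_UP

CENT = Decimal("0.01")


def tip_breakdown(bill, rate):
    """Compute the exact tip and a rounded-up alternative using decimal arithmetic."""
    bill, rate = Decimal(bill), Decimal(rate)
    tip = (bill * rate).quantize(CENT, rounding=ROUND_HALF_UP)
    # Assumption: "rounding up" means rounding the tip up to the next whole dollar.
    rounded_tip = tip.to_integral_value(rounding=ROUND_CEILING)
    return {
        "tip": tip,
        "total": bill + tip,
        "rounded_tip": rounded_tip,
        "rounded_total": bill + rounded_tip,
    }


result = tip_breakdown("85.50", "0.20")
print(result["tip"], result["total"], result["rounded_tip"], result["rounded_total"])
```

Using Decimal rather than floats avoids binary rounding surprises in currency amounts; the result matches the $17.10 tip, $102.60 total, and $103.50 rounded-up total quoted in the response.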

Route Planning

User query: "Route me to the new office."

The active context holds the destination and current location. Recall integrates traffic data from the World network with past commute records from Experience. The behavioral profile (set to avoid stressful routes) influences the recommendation.

Response: "I-95 south, about 20 minutes in current traffic. You mentioned frustration with rush hour last month, so here's a back-road alternative if you have extra time."

Debate Preparation

User query: "Prepare counterarguments for an AI ethics debate with Alex."

The active context tracks the debate topic. Recall pulls opinions on AI ethics and observations about Alex's employer. Agent orchestration, acting as the central executive, filters the results for relevance.

Response: "TechCorp's merger raises antitrust questions; their 15% stock increase suggests growing market power. In your last conversation with Alex, you both agreed that ethical frameworks drive innovation. Build from there."

Practical Benefits

This design addresses several problems in conversational AI:

Token limits prevent context overload and keep response times reasonable
Retrieval from long-term storage reduces fabricated responses by grounding replies in stored information
Behavioral profiles make responses more relevant to individual users
Separating working memory from long-term storage allows multi-turn conversations without losing context

Resource Efficiency

The working memory architecture also reduces computational costs. Traditional AI systems often include entire conversation histories or large knowledge bases in every prompt, which increases token usage and latency. Fidelius instead enforces token limits on its active context and retrieves only relevant information from long-term storage through targeted searches. The result is lower token usage and faster response times without losing access to relevant history.
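A back-of-the-envelope sketch shows why a bounded context costs less over a long conversation. The numbers are made up for illustration, the function name is hypothetical, and the model is simplified: it assumes the bounded prompt is capped at the budget and ignores the tokens spent on recalled memories.

```python
def prompt_sizes(turn_tokens, budget):
    """For each turn, return (full_history_size, bounded_size) in tokens."""
    sizes, running = [], 0
    for t in turn_tokens:
        running += t
        # Full history grows without limit; the bounded context is capped.
        sizes.append((running, min(running, budget)))
    return sizes


# Illustrative input: five turns of 200 tokens each, with a 500-token budget.
sizes = prompt_sizes([200] * 5, budget=500)
full_total = sum(full for full, _ in sizes)
bounded_total = sum(bounded for _, bounded in sizes)
print(sizes)
print(full_total, bounded_total)
```

In this toy case the full-history approach sends 3000 prompt tokens across the five turns, while the capped approach sends 2100, and the gap widens with every additional turn.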
