
What happens when you give AI the steering wheel - and you just manage the road signs? That was the idea behind *hippo-llm-memory*: a practical experiment to find out whether today's large language models (LLMs) can help build something beyond my own reach if I delegate most of the intellectual and implementation work to them. I wasn't trying to code much myself. I wanted to know whether AI could do the research, generate the architecture, and implement a working prototype - while I merely provided direction.

Spoiler: it mostly worked. But not always, and not without some real lessons. 

Tools Involved

Before diving into the story, here is a quick overview of the tools I used and what role they played:

  • ChatGPT (DeepResearch mode): for deep investigations into neuroscience and LLM architectures, and for cross-reading both domains to derive memory-inspired algorithms.
  • ChatGPT (Thinking mode): for planning, breaking work into milestones, generating Codex tasks, and performing reviews against research ground truth and project plans.
  • Codex: for implementing the actual code tasks. Given a task description, Codex produced code suggestions that I reviewed and selected from.
  • Git & GitHub: for version control and as the main collaboration backbone. All generated artifacts, from research notes to code and evaluation reports, were preserved here. GitHub Actions were used for Continuous Integration (CI), running tests and linting automatically, especially useful since most code was generated with minimal manual checks.

This combination defined the development loop: research with ChatGPT, implementation by Codex, oversight through Git and CI, and minimal but critical human guidance in between.

Step One: Let the AI Think

The project began with a simple question: can I build a memory system for LLMs that's inspired by the human hippocampus? I had no background in neuroscience, and no plans to acquire one. So I did what any lazy AI enthusiast might: I delegated the research. Using a custom prompt I called DeepResearch, I asked ChatGPT to dig into hippocampal memory on a deep neurobiological level. It returned detailed notes on fast encoding, sparsity, pattern completion, replay consolidation, schema effects, and more. Did I read it all? No. But it wasn't for me - it was input for the next step.

In parallel, I used another DeepResearch prompt to explore LLM internals: attention, positional encodings, long-context strategies, memory augmentation tricks, and recent architectural variations. Again, this wasn't for human consumption. It was meant to be fed into ChatGPT itself.

Then came the interesting part. I instructed ChatGPT - again, via a metaprompt it helped design - to cross-read both documents and derive possible algorithms that connect hippocampal mechanisms to practical memory modules for LLMs. The result? Three promising concepts:

  • HEI-NW - Episodic memory with neuromodulated writes
  • SGC-RSS - Schema-guided consolidation into a semantic store
  • SMPD - Spatial maps with procedural distillation

All of this - from research to cross-domain mapping - was done in one day. Not by me, but by a well-structured, prompt-guided LLM.
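To make the first of these concepts a bit more concrete, here is a minimal sketch of what a neuromodulated episodic write could look like in code. It is my own illustration based on the one-line summary above - the class, the salience heuristic, and the thresholds are assumptions for this post, not the design ChatGPT derived and not the code that ended up in the repo.

```python
# Illustrative sketch only: a salience-gated episodic write, loosely in the
# spirit of HEI-NW ("episodic memory with neuromodulated writes").
# The names, the salience heuristic, and the thresholds are my own
# simplification, not the actual hippo-llm-memory implementation.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class EpisodicStore:
    """Keeps dense key vectors; writes only when a salience gate opens."""
    write_threshold: float = 0.6          # assumed hyperparameter
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def _novelty(self, key: np.ndarray) -> float:
        """1.0 for an empty store, otherwise 1 minus the best cosine similarity."""
        if not self.keys:
            return 1.0
        sims = [
            float(key @ k / (np.linalg.norm(key) * np.linalg.norm(k) + 1e-8))
            for k in self.keys
        ]
        return 1.0 - max(sims)

    def maybe_write(self, key: np.ndarray, value: str, reward: float = 0.0) -> bool:
        """Write only if novelty plus a reward-like signal crosses the gate.

        The 'reward' term stands in for a neuromodulatory signal
        (surprise, task success, user feedback) that boosts encoding.
        """
        salience = 0.7 * self._novelty(key) + 0.3 * reward
        if salience < self.write_threshold:
            return False                   # gate closed: nothing stored
        self.keys.append(key)
        self.values.append(value)
        return True

    def recall(self, query: np.ndarray):
        """Return the value whose key best matches the query (crude pattern completion)."""
        if not self.keys:
            return None
        sims = [float(query @ k) for k in self.keys]
        return self.values[int(np.argmax(sims))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    store = EpisodicStore()
    a, b = rng.normal(size=64), rng.normal(size=64)
    print(store.maybe_write(a, "first episode"))          # True: store is empty
    print(store.maybe_write(a + 0.01, "near duplicate"))  # False: too familiar
    print(store.maybe_write(b, "second episode", reward=0.5))
    print(store.recall(a))
```

The point of the gate is that not every token sequence is worth remembering: only novel or reward-tagged episodes get written, which mirrors the fast, selective encoding described in the hippocampus research notes.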

Step Two: Let the AI Build

With research-derived designs in hand, I moved to implementation. Again, the process was AI-centric, but at the beginning, it was also quite messy. Initially, I skipped structured planning. I simply asked ChatGPT to generate a list of Codex tasks based on the algorithms from the cross-read research documents and fed those tasks directly into Codex. Codex implemented them, one after the other. I did a quick visual check, accepted the code, and moved on. Then I asked ChatGPT to review the resulting implementation against the "ground truth" - the original research-based design - and generate follow-up Codex tasks to close the gaps. This process created an illusion of fast progress. New code kept appearing. Evaluation reports began to show up. Reviews got more and more positive.

But I started to feel uneasy. I had no clear sense of where we were in the overall development cycle. Were we almost done? Halfway? Was the code valuable or just verbose? What exactly had been implemented, and what still lived only in the plan or in my head?

At that point, I realized something critical was missing: project structure. So I paused and switched gears. With help from ChatGPT, I created a proper project plan. This became the anchor. Instead of asking ChatGPT to generate isolated tasks, I now worked milestone by milestone. For each one, ChatGPT broke down the work into smaller packages, generated Codex prompts for each, and reviewed completed code against the plan - not just the original research. While the loop of task generation, implementation, and review remained the same, it now had a clear trajectory. I could track what had been done, what was missing, and whether each part met its milestone criteria.

In short: progress became measurable. This change - from reactive generation to plan-driven development - marked a turning point. It didn't fix all problems (as later sections describe), but it gave the project a skeleton. Without it, everything risked collapsing under its own ambiguity.

When the Magic Wore Off

After a few milestones, the workflow began to break down. The first issue was looping. A typical pattern emerged:

  • Review reveals issue X.
  • Codex implements a fix for issue X.
  • A new review reveals issue X again, or its cousin.
  • Repeat.

After multiple days of cycling through the same feedback loop, I had to step in manually. Only by analyzing the underlying design flaw and reshaping the input plan could I break the cycle.

The second issue was code entropy. Because I let ChatGPT define Codex tasks and rarely pushed back, the implementation grew complex quickly. Some files exceeded 1500 lines. Functions were deeply nested, poorly structured, and impossible to reason about. Refactoring attempts failed, or at least were too costly to complete: the code was too tightly coupled and under-specified. In short, I had fallen into what might be called vibe coding: things felt productive, but structure was lacking.

Lessons Learned

Several core insights emerged from this first iteration:

  1. Design and architecture must be understood by humans. While the AI can assist understanding, it often fails to ask the right questions or challenge its own assumptions. That means *I* need to be able to ask the right questions to uncover flaws. If no one understands the architecture, no one can fix it when it breaks.
  2. AI knows how to write functional code, but lacks a sense for maintainability. Without constraints, it happily creates massive functions, adds layer upon layer of if-statements, and grows complexity exponentially. Without human intervention, you'll end up with something that works once - but can't evolve.
  3. Code must be visually reviewed, always. Even if the logic appears to be correct, warning signs like function length, nesting depth, and naming chaos are indicators of problems to come. Large functions must be refactored immediately, or the AI will bury you in complexity - until even it can't reason about the code anymore.
  4. Code quality guidance must be baked in from the start. Tests, naming conventions, file boundaries, and review prompts should not be afterthoughts. They are the guardrails that keep AI-generated code from collapsing under its own weight.

What Comes Next

The project isn't over. But it's starting over - with a stronger foundation:

1. Second Iteration, Same Research

The new cycle uses the same deep research results, but everything else is rethought. Design, architecture, and planning are approached with a more critical and structured eye.

2. More Human Involvement Early

Planning artifacts are developed in greater detail before any code is written. I don't just accept the algorithm proposals because they're "too complex to question" - instead, I ask ChatGPT to explain them clearly, question the structure, and defend design decisions. If it can't, we revise.

3. Version-Controlled Documentation

All design notes, research summaries, architecture decisions, and planning artifacts are preserved in the repo. This provides continuity not just for me, but for the AI tools that follow. LLMs have no memory - so we must give them one.

4. Improved Engineering Practices

From the beginning, the workflow now includes:

  • a project plan a human can understand
  • smaller, testable tasks
  • code coverage metrics
  • style constraints and complexity checks (a minimal sketch follows this list)
  • automation for milestone audits
  • clear separation of prototypes from production paths
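As an example of what "style constraints and complexity checks" can mean in practice, here is a minimal sketch of an automated guard that fails CI when functions grow too long or too deeply nested. The limits and the `src` path are assumptions made for this post; a real setup would more likely lean on established tools such as ruff, radon, or pytest-cov, but the idea is the same: make the warning signs from lesson 3 machine-checkable.

```python
# Illustrative sketch of a stdlib-only complexity guard for CI.
# Thresholds and the "src" path are assumptions for the example,
# not the project's actual tooling or configuration.
import ast
import sys
from pathlib import Path

MAX_FUNCTION_LINES = 60   # assumed limit
MAX_NESTING_DEPTH = 4     # assumed limit


def nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Deepest level of nested control flow inside a function body."""
    block_types = (ast.If, ast.For, ast.While, ast.With, ast.Try)
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, block_types) else depth
        deepest = max(deepest, nesting_depth(child, child_depth))
    return deepest


def check_file(path: Path) -> list:
    """Return one message per function that violates a limit."""
    problems = []
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        length = node.end_lineno - node.lineno + 1
        depth = nesting_depth(node)
        if length > MAX_FUNCTION_LINES:
            problems.append(f"{path}:{node.lineno} {node.name} is {length} lines long")
        if depth > MAX_NESTING_DEPTH:
            problems.append(f"{path}:{node.lineno} {node.name} nests {depth} levels deep")
    return problems


if __name__ == "__main__":
    failures = [msg for p in Path("src").rglob("*.py") for msg in check_file(p)]
    print("\n".join(failures))
    sys.exit(1 if failures else 0)   # a non-zero exit fails the CI job
```

Run as a CI step, a guard like this turns the "refactor immediately" rule into something the pipeline enforces instead of something I have to remember during review.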

The goal is still the same: build hippocampus-inspired memory modules for small LLMs. But the bigger mission is emerging: design a sustainable, auditable, human-AI collaboration process that actually works in practice. AI Software Engineering. There's more to come: how to map professional practices into an AI workflow, deep dives into each memory algorithm (after successful implementation and validation), reproducibility tricks, prompt engineering patterns, and more.

The repo is here: https://github.com/ArneDeutsch/hippo-llm-memory

If you're building something similar - or want to avoid falling into a vibe coding trap - I'd love to exchange ideas. The next generation of tools will be built with AI. Let's make sure we build them well.
