What happens when you give AI the steering wheel - and you just manage the road signs? That was the idea behind *hippo-llm-memory*: a practical experiment exploring whether today's large language models (LLMs) can build something beyond my own reach if I delegate most of the intellectual and implementation work to them. I didn't want to code much myself. I wanted to know whether AI could do the research, design the architecture, and implement a working prototype - while I merely provided direction.
Spoiler: it mostly worked. But not always, and not without some real lessons.
Before diving into the story, here is a quick overview of the tools I used and the roles they played:

- **ChatGPT** - deep research (via DeepResearch prompts), algorithm design, task breakdown, and code review
- **Codex** - implementation of the generated tasks, one at a time
- **Git + CI** - version control, oversight, and a durable record of every artifact
This combination defined the development loop: research with ChatGPT, implementation by Codex, oversight through Git and CI, and minimal but critical human guidance in between.
The project began with a simple question: can I build a memory system for LLMs that's inspired by the human hippocampus? I had no background in neuroscience, and no plans to acquire one. So I did what any lazy AI enthusiast might: I delegated the research. Using a custom prompt I called DeepResearch, I asked ChatGPT to dig into hippocampal memory at a deep neurobiological level. It returned detailed notes on fast encoding, sparsity, pattern completion, replay consolidation, schema effects, and more. Did I read it all? No. But it wasn't for me - it was input for the next step.

In parallel, I used another DeepResearch prompt to explore LLM internals: attention, positional encodings, long-context strategies, memory-augmentation tricks, and recent architectural variations. Again, this wasn't for human consumption. It was meant to be fed back into ChatGPT itself.
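To give a flavor of what mechanisms like fast encoding, sparse coding, and pattern completion mean in code, here is a toy sketch. It is my own illustration for this post - not the repo's actual implementation - but it captures the core idea: write an episode once, then recover it from a partial or noisy cue.

```python
# Toy episodic store: one-shot writes, sparse keys (k-winners-take-all),
# and pattern completion via nearest-neighbor lookup. Illustrative only.
import numpy as np

class EpisodicStore:
    def __init__(self, k: int = 8):
        self.k = k                       # active units per key (sparsity level)
        self.keys: list[np.ndarray] = []
        self.values: list[np.ndarray] = []

    def _sparsify(self, x: np.ndarray) -> np.ndarray:
        """Keep only the k strongest units; zero out the rest."""
        out = np.zeros_like(x)
        top = np.argsort(np.abs(x))[-self.k:]
        out[top] = x[top]
        return out

    def write(self, key: np.ndarray, value: np.ndarray) -> None:
        """Fast encoding: a single exposure stores the episode."""
        self.keys.append(self._sparsify(key))
        self.values.append(value)

    def complete(self, cue: np.ndarray) -> np.ndarray:
        """Pattern completion: return the episode whose key best matches the cue."""
        cue = self._sparsify(cue)
        sims = [float(cue @ k) / (np.linalg.norm(cue) * np.linalg.norm(k) + 1e-8)
                for k in self.keys]
        return self.values[int(np.argmax(sims))]

# One-shot write, then recall from a corrupted cue.
rng = np.random.default_rng(0)
store = EpisodicStore()
key, value = rng.normal(size=64), rng.normal(size=64)
store.write(key, value)
recalled = store.complete(key + 0.3 * rng.normal(size=64))  # noisy cue
assert np.allclose(recalled, value)
```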
Then came the interesting part. I instructed ChatGPT - again, via a metaprompt it helped design - to cross-read both documents and derive possible algorithms that connect hippocampal mechanisms to practical memory modules for LLMs. The result? Three promising concepts:
All of this - from research to cross-domain mapping - was done in one day. Not by me, but by a well-structured, prompt-guided LLM.
With research-derived designs in hand, I moved to implementation. Again, the process was AI-centric - but at first, it was also quite messy. I skipped structured planning entirely. I simply asked ChatGPT to generate a list of Codex tasks based on the algorithms from the cross-read research documents and fed those tasks directly into Codex. Codex implemented them, one after the other. I did a quick visual check, accepted the code, and moved on. Then I asked ChatGPT to review the resulting implementation against the "ground truth" - the original research-based design - and to generate follow-up Codex tasks to close the gaps.

This process created an illusion of fast progress. New code kept appearing. Evaluation reports began to show up. Reviews got more and more positive.
But I started to feel uneasy. I had no clear sense of where we were in the overall development cycle. Were we almost done? Halfway? Was the code valuable or just verbose? What exactly had been implemented, and what still lived only in the plan or in my head?
At that point, I realized something critical was missing: project structure. So I paused and switched gears. With help from ChatGPT, I created a proper project plan. This became the anchor. Instead of asking ChatGPT to generate isolated tasks, I now worked milestone by milestone. For each one, ChatGPT broke down the work into smaller packages, generated Codex prompts for each, and reviewed completed code against the plan - not just the original research. While the loop of task generation, implementation, and review remained the same, it now had a clear trajectory. I could track what had been done, what was missing, and whether each part met its milestone criteria.
In short: progress became measurable. This change - from reactive generation to plan-driven development - marked a turning point. It didn't fix all problems (as later sections describe), but it gave the project a skeleton. Without it, everything risked collapsing under its own ambiguity.
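To make the shift concrete, here is a conceptual sketch of the plan-driven loop. No such script actually existed - the loop ran by hand through ChatGPT and Codex - and every name below is illustrative, but it shows the structure that made progress measurable:

```python
# Conceptual sketch of the plan-driven loop (illustrative names throughout).
from dataclasses import dataclass, field

@dataclass
class WorkPackage:
    description: str
    codex_prompt: str
    done: bool = False

@dataclass
class Milestone:
    name: str
    criteria: list[str]                        # acceptance criteria, fixed up front
    packages: list[WorkPackage] = field(default_factory=list)

def implement_with_codex(prompt: str) -> str:
    """Stand-in for handing a prompt to Codex and getting code back."""
    return f"# code produced for: {prompt}"

def review_against_plan(code: str, criteria: list[str]) -> bool:
    """Stand-in for a ChatGPT review against the plan, not just the research."""
    return bool(code) and bool(criteria)

def run_milestone(m: Milestone) -> None:
    for wp in m.packages:
        code = implement_with_codex(wp.codex_prompt)
        wp.done = review_against_plan(code, m.criteria)
    done = sum(wp.done for wp in m.packages)
    print(f"{m.name}: {done}/{len(m.packages)} packages meet the criteria")

m = Milestone(
    name="M1: episodic store",
    criteria=["one-shot writes", "cue-based recall", "unit tests pass"],
    packages=[WorkPackage("sparse key encoding", "Implement k-WTA key encoding ..."),
              WorkPackage("recall path", "Implement cue-based retrieval ...")],
)
run_milestone(m)
```

The important design choice sits in `review_against_plan`: code is judged against the milestone's acceptance criteria, not just against the original research documents.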
After a few milestones, the workflow began to break down. The first issue was looping. A typical pattern emerged:

1. ChatGPT's review flagged a gap between the implementation and the design.
2. It generated a follow-up Codex task to close that gap.
3. Codex applied a fix, I accepted it - and the next review flagged the same gap again.
After multiple days of cycling through the same feedback loop, I had to step in manually. Only by analyzing the underlying design flaw and reshaping the input plan could I break the cycle.

The second issue was code entropy. Because I let ChatGPT define Codex tasks and rarely pushed back, the implementation grew complex quickly. Some files exceeded 1500 lines. Functions were deeply nested, poorly structured, and impossible to reason about. Refactoring attempts failed, or were too costly to complete: the code was too tightly coupled and under-specified. In short, I had fallen into what might be called vibe coding: things felt productive, but structure was lacking.
Several core insights emerged from this first iteration:

- **A plan is not optional.** Without milestones and acceptance criteria, AI-generated progress is unmeasurable - and largely illusory.
- **AI reviews drift positive.** Left unchecked, ChatGPT's reviews of Codex's output grew friendlier, not more accurate.
- **Unchallenged task generation breeds entropy.** Accepting every proposed task produced 1500-line files and tightly coupled, under-specified code.
- **Loops need a human circuit breaker.** Only manual analysis of the underlying design flaw could end the feedback cycle.
- **LLMs have no memory.** Anything not written down in the repo is lost between sessions - for me and for the tools.
The project isn't over. But it's starting over - with a stronger foundation:
The new cycle uses the same deep research results, but everything else is rethought. Design, architecture, and planning are approached with a more critical and structured eye.
Planning artifacts are developed in greater detail before any code is written. I don't just accept the algorithm proposals because they're "too complex to question" - instead, I ask ChatGPT to explain them clearly, question the structure, and defend design decisions. If it can't, we revise.
> All design notes, research summaries, architecture decisions, and planning artifacts are preserved in the repo. This provides continuity not just for me, but for the AI tools that follow. LLMs have no memory - so we must give them one.
From the beginning, the workflow now includes:

- a detailed project plan and design documents, written and challenged before any code
- milestone-by-milestone task breakdown, each with explicit acceptance criteria
- design reviews in which ChatGPT must explain and defend its proposals - and revise them when it can't
- every research note, design decision, and planning artifact versioned in the repo, with Git and CI as the oversight layer (a sketch of what such a gate can look like follows below)
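As one example of that oversight layer, a simple gate can refuse to proceed when a milestone's planning artifacts are missing. The script below is hypothetical - the file names and directory layout are my own placeholders, not the repo's actual structure - but it shows how cheap such a check is to automate:

```python
# Hypothetical CI gate: fail if a milestone directory lacks its planning
# artifacts. File names and layout are placeholders, not the repo's real ones.
import sys
from pathlib import Path

REQUIRED = ["plan.md", "design.md", "review.md"]  # assumed artifact names

def missing_artifacts(milestone_dir: str) -> list[str]:
    root = Path(milestone_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "milestones/m1"
    missing = missing_artifacts(target)
    if missing:
        print(f"{target}: missing {', '.join(missing)}")
        sys.exit(1)
    print(f"{target}: all planning artifacts present")
```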
The goal is still the same: build hippocampus-inspired memory modules for small LLMs. But a bigger mission is emerging: design a sustainable, auditable human-AI collaboration process that actually works in practice. AI Software Engineering. There's more to come: how to map professional practices into an AI workflow, deep dives into each memory algorithm (after successful implementation and validation), reproducibility tricks, prompt-engineering patterns, and more.
The repo is here: https://github.com/ArneDeutsch/hippo-llm-memory
If you're building something similar - or want to avoid falling into a vibe coding trap - I'd love to exchange ideas. The next generation of tools will be built with AI. Let's make sure we build them well.