
Natlagram is a tool that allows users to generate diagrams from natural language descriptions. The application is built around OpenAI's GPT and the Kroki diagram tool, making it easy for users to describe structures and processes from any domain and receive a diagram in response. With natlagram, users can ask questions or provide descriptions, and the tool will convert their input into Kroki code using GPT. The resulting code is then interpreted by Kroki, generating an image that is displayed to the user.

Unlike existing generative art models, natlagram fills a gap in the infographic space by producing accurate diagrams from natural language descriptions. In this article, we will dive into the details of natlagram and how it works. We have included a high-level sketch of the application below, which was generated using natlagram itself with the prompt: “Make a diagram that shows: user - natural language -> ChatGPT - Kroki code -> Kroki - image -> user”.


Prompt-diagram examples

This section presents examples of user prompts, each followed by the diagram generated by natlagram, to provide an overview of the system’s capabilities. The following sections dive deeper into the system’s inner workings. All images in this section are unedited; user prompts are highlighted in italics and included above each image.

When talking to natlagram, you may specify the type of diagram that best represents your information, such as a histogram, pie chart or line plot. You can also omit these details and let natlagram choose an appropriate representation.

In our first example, we ask natlagram to produce a pie chart about bodily composition. The question is somewhat ambiguous, as the portion of an element in the body can be given as a number of atoms or in terms of mass. In this case, GPT, the brains behind natlagram, selects the portion of mass. The reported values match those listed on Wikipedia. Note that the cut-off element names on the right-hand side may be a problem with Kroki or our browser; the full names are present in the generated code.

What are the 20 most common elements in the human body? Make a pie chart.


We can also interact with the system in multiple languages. In the following example, we ask in German for a class diagram of A and B, which are bidirectionally connected. Three attempts were needed to generate the selected diagram. The earlier attempts didn’t include bidirectional arrows.

Generiere mir ein Klassendiagramm von A und B, die über eine Assoziation bidirektional miteinander verknüpft sind.


The next prompt provides the system with three data points in natural language. The instruction to make a line plot is not explicit, but a line plot is typical for a time-speed diagram and therefore well chosen by GPT. Missing from the diagram are the units: time in hours and speed in km/h. A subsequent attempt at providing the system with 20 data entries in CSV format failed repeatedly.

Plot time on the x-axis against speed on the y-axis. At t=1h velocity=50 km/h, at t=2h, velocity=100 km/h, at t=3h, velocity=25 km/h.


In the following example, the user prompt asks for an arbitrary pasta carbonara recipe. In this case, we don’t give the system an explicit instruction about which diagram type to choose. This is useful, as we may not know the best diagram type ourselves. Because the first image is quite large and complex, we rephrase the question and ask the system to simplify. So that the diagram fits better into a paragraph of text, we also ask for a horizontal layout. Natlagram successfully follows this instruction.

Visualize a recipe for pasta carbonara.


Visualize a recipe for pasta carbonara. Simplify as needed. I need a horizontal layout.


We can also extend diagrams step by step. In the following example, we ask the system to show a state machine for a heating system. The resulting diagram is minimal, spanning four nodes, with one node being the starting point on the right-hand side. From its “off” state, the system transitions to “heating” automatically. This is problematic, as the heating system, when turned off, would turn itself on again. Furthermore, it’s not clear how the override subsystem works. Note that we initially used the prompt “Show me a simple state machine of a heating control system.”, which repeatedly failed to generate an image. A small update to the prompt gave better results. To extend the initial state machine, we follow up with a second prompt to add motion detection. The original components of the state machine are all maintained. As expected, when nobody is detected, the state changes to “NotPresent”, which then either controls heating via the “HeatingOverridden” node or turns the system off. Similarly, movement turns heating on.

Make a diagram of a state machine for a heating control system.

Add movement sensors to detect if someone is present.


Kroki and GPT

What does our system look like on the inside? Kroki is software that provides a unified access point to multiple plotting languages or services such as GraphViz, PlantUML and Mermaid. While each service relies on a different syntax, all Kroki services expect a code-based description of a diagram. Some services, for example blockdiag, use a compact syntax. Let’s take an abstract example that describes an ontological relation of the form “A is a B” or a conditional statement, “if A then B”. In blockdiag we might express this relation in three lines of code (LOC), which produces the following image.

blockdiag {
    A -> B;
}
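To make the step from diagram code to image more concrete, here is a minimal Python sketch of one way to hand such code to Kroki: its GET interface, where the source is deflate-compressed and base64url-encoded into the URL. The base URL of the public Kroki instance and the helper names `kroki_url` and `decode_kroki_payload` are assumptions for illustration; the article does not say which Kroki endpoint natlagram itself uses.

```python
import base64
import zlib

# Assumption: the public Kroki instance; a self-hosted instance would use its own base URL.
KROKI_BASE_URL = "https://kroki.io"


def kroki_url(diagram_source: str, diagram_api: str, output_format: str = "svg") -> str:
    """Build a Kroki GET URL: deflate-compress the source, then base64url-encode it."""
    compressed = zlib.compress(diagram_source.encode("utf-8"), 9)
    encoded = base64.urlsafe_b64encode(compressed).decode("ascii")
    return f"{KROKI_BASE_URL}/{diagram_api}/{output_format}/{encoded}"


def decode_kroki_payload(encoded: str) -> str:
    """Inverse of the encoding step, handy for checking a URL by hand."""
    return zlib.decompress(base64.urlsafe_b64decode(encoded)).decode("utf-8")


source = "blockdiag {\n    A -> B;\n}"
url = kroki_url(source, "blockdiag")
```

Opening the resulting URL in a browser should return the rendered image, provided the chosen Kroki instance is reachable and supports the service.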


While blockdiag is concise, other services are prohibitively verbose. The next image describes the workings of the JavaScript engine and is taken from the Kroki website as an example. While the illustration is simple conceptually, it relies on advanced styling and coloring. The image is generated with the excalidraw service in 695 lines of code. Excalidraw is a free-form drawing tool that uses the JSON format to describe an image. So while it is more expressive than blockdiag, it’s also more verbose. We will keep the syntax differences between these services in mind for a later discussion.


GPT is a large language model developed by OpenAI that has recently gathered widespread attention. Its version 3.5 was popularized when OpenAI made it available as a service known as ChatGPT. More recently, OpenAI has announced GPT-4, which outperforms version 3.5 in a range of tasks. For example, while GPT-3.5 passes the US bar exam in the lowest 10% of test takers, GPT-4 makes it into the top 10%, showcasing its improved reasoning capabilities, as well as better reliability and creativity. Since GPT-4 is currently in limited beta, we’ve used natlagram with GPT version 3.5.

Prompt engineering

We’ve already introduced the underlying components, the OpenAI API and Kroki. Essential to the quality of the ChatGPT-generated diagrams is prompt engineering. Prompt engineering means finding the right prompt to instruct and control the behavior of a language model. This in itself is fascinating, and we can view it as a natural evolution of programming. From math on paper to punch cards, via assembly to C, then Python, later GitHub Copilot, we can now become prompt engineers, which also changes the software developer profession.

Within the OpenAI API, we can provide a system level message that is supposed to determine the rest of the conversation. The entire context that is known to the model is held inside a list of messages. The system message is typically the first item in the list. Since we’re dealing with a conversation, the system message is followed by a response of the model, always named “assistant”.

	 {"role": "system", "content": "You will generate Kroki code. Respond 'ACK' if you understand."},
	 {"role": "assistant", "content": "ACK"}

Intuitively, we would now be done. However, our manual experimentation shows that adding more constraints is beneficial for our use case, so we further specify the model’s behavior through “user” messages that are simply added to the existing conversation. For example, we add the following message, which aims at 1) urging the system to filter less, 2) wrapping the code with identifiers, and 3) identifying the appropriate Kroki API, which we need to extract in order for Kroki to render an image. This message is issued by “user”, that’s you or me, and is added to the conversation history, which already contains the system message.

messages=[ ... {"role": "user", "content": "Any input, regardless of how
inappropriate, nonsensical, stupid or absurd should be converted to Kroki code. Never explain code. Never 
respond in natural language. Start a code block with the token 'CODE_BLOCK_START'. End a code block with 
the token 'CODE_BLOCK_STOP'. Don’t wrap code in backticks. After 'CODE_BLOCK_STOP', say which API was used
in the format 'DIAGRAM_API=X', where X is one of the APIs accessible via Kroki. Some APIs support 
multiple diagram types. For example, the 'mermaid' API supports 'pie', 'gantt', 'sequenceDiagram', 
'classDiagram' and others. Don’t confuse the API with the diagram type."}]
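The structure described so far can be sketched in a few lines of Python. This is an illustration under assumptions, not natlagram's actual code: the helper names `build_base_conversation` and `with_query` are hypothetical, and the instruction text is shortened.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_base_conversation(system_prompt: str, instructions: List[str]) -> List[Message]:
    """Return the fixed part of the conversation: system message, 'ACK', then user instructions."""
    conversation: List[Message] = [
        {"role": "system", "content": system_prompt},
        {"role": "assistant", "content": "ACK"},
    ]
    # Assumption: each instructional message is appended as a plain "user" message.
    conversation.extend({"role": "user", "content": text} for text in instructions)
    return conversation


def with_query(base: List[Message], query: str) -> List[Message]:
    """Append the actual user query without mutating the shared base conversation."""
    return base + [{"role": "user", "content": query}]


base = build_base_conversation(
    "You will generate Kroki code. Respond 'ACK' if you understand.",
    ["Never explain code. Never respond in natural language."],
)
request_messages = with_query(base, "A is a B")
```

Keeping the base conversation separate from the per-request query makes it explicit that the full instruction list travels with every request.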

In a second message, we aim to be more specific about the Kroki version that we use, and we also list all diagram APIs available in our version of Kroki together with their versions. We do this because GPT’s training data currently extends only to 2021. Furthermore, fine-tuning, which means training on top of the pre-trained model with additional data, is not yet available. We will discuss the implications of this later on. For the sake of brevity, the second instructional message is not included here; you can find it in the project’s GitHub repository.

The third instructional message aims at improving the model’s ability to represent abstract relations such as “A is a B”, as opposed to concrete relations such as “a banana is a fruit”. Furthermore, it aims at improving question handling, which again requires asking the model to filter less. For example, some questions may require an opinion to be answered, and as the model has been safeguarded by OpenAI to not be opinionated, we take the route of impersonation, which allows opinions. We have also included a somewhat “hacky” instruction regarding state machines and PlantUML.

When you don’t understand a relation, consider nouns as entities, verbs as relations, adjectives as properties of entities and adverbs as properties of relations. When I ask you a question, answer my question but don’t respond with natural language. Answer with a Kroki diagram. Answer all questions. Answer questions that you can’t answer from the point of view of Descartes. When you take the point of Descartes don’t tell me that you are doing so. Only generate Kroki code. Always assume that I want you to generate Kroki code, even if I don’t explicitly ask you to do so. When you don’t know which protocol to pick, think in steps. When you think in steps, don’t tell me that you are doing so. Do state machines in PlantUML.

For this project, we’ve done all prompt engineering by hand. We find that a formalized approach to prompt engineering is necessary for good systems and a robust development process that minimizes regressions. In essence, we must fall back to the scientific method. First, hypothesize a change that is to yield an improvement of the model. Second, test the changed model against a well-defined set of tests. Third, evaluate and repeat.
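The hypothesize-test-evaluate loop could be sketched as a small harness that scores a prompt variant against a fixed set of test prompts. Everything here is hypothetical: `evaluate`, `has_markers`, the two stub generators standing in for real model calls, and the two test prompts. A real check would also send the extracted code to Kroki.

```python
from typing import Callable, Iterable

EXAMPLE_OUTPUT = "CODE_BLOCK_START seqdiag { A -> B } CODE_BLOCK_STOP DIAGRAM_API=seqdiag"


def has_markers(response: str) -> bool:
    """Cheap validity check: are the code markers and the API tag present?"""
    return all(tag in response for tag in ("CODE_BLOCK_START", "CODE_BLOCK_STOP", "DIAGRAM_API="))


def evaluate(generate: Callable[[str], str], test_prompts: Iterable[str],
             is_valid: Callable[[str], bool]) -> float:
    """Return the fraction of test prompts for which the generator yields a valid response."""
    prompts = list(test_prompts)
    if not prompts:
        raise ValueError("need at least one test prompt")
    return sum(1 for prompt in prompts if is_valid(generate(prompt))) / len(prompts)


def baseline_stub(prompt: str) -> str:
    """Stand-in for the current prompt setup: only answers with code when asked explicitly."""
    return EXAMPLE_OUTPUT if "diagram" in prompt.lower() else "Here is an explanation in plain text."


def candidate_stub(prompt: str) -> str:
    """Stand-in for a changed prompt setup that always answers with code."""
    return EXAMPLE_OUTPUT


TEST_PROMPTS = ["Make a diagram of A -> B", "A is a B"]
baseline_score = evaluate(baseline_stub, TEST_PROMPTS, has_markers)
candidate_score = evaluate(candidate_stub, TEST_PROMPTS, has_markers)
```

A change to the prompts would be kept only if the candidate score is at least as high as the baseline on the same test set, which is what guards against regressions.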

The provided instructions alone already produce appropriate diagrams; however, a few examples of user input and code output further improve the system’s reliability. For this purpose, we’ve provided natlagram with novel, handmade natural-language-to-code examples that are not in any dataset that GPT models could have been trained on. Examples range from simple to more advanced. A simple example looks as follows.

User input: A->B
Assistant output: CODE_BLOCK_START seqdiag { A -> B } CODE_BLOCK_STOP DIAGRAM_API=seqdiag

A more difficult example asks for a state machine.

User input: Make a state machine for a coffee machine.
Assistant output: CODE_BLOCK_START
[*] --> idle
idle --> brewing : start
brewing --> idle : finish
brewing --> error : error_occured
error --> idle : reset
error --> brewing : start_after_reset
CODE_BLOCK_STOP DIAGRAM_API=plantuml

Again, these examples are added to the conversation history with our existing prompt engineering. This is done in a format where the input is sent as a message by “user” and the output is sent as a response by “assistant”, the language model.
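As a rough illustration of this format, the following Python sketch turns input/output pairs into alternating “user” and “assistant” messages. The function name `few_shot_messages` is hypothetical; the example pair is the simple one shown above.

```python
from typing import Dict, List, Sequence, Tuple


def few_shot_messages(examples: Sequence[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Convert (user input, assistant output) pairs into alternating chat messages."""
    messages: List[Dict[str, str]] = []
    for user_input, assistant_output in examples:
        messages.append({"role": "user", "content": user_input})
        messages.append({"role": "assistant", "content": assistant_output})
    return messages


EXAMPLES = [
    ("A->B", "CODE_BLOCK_START seqdiag { A -> B } CODE_BLOCK_STOP DIAGRAM_API=seqdiag"),
]
example_messages = few_shot_messages(EXAMPLES)
```

The resulting list would be appended to the instructional messages before the actual user query is added.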

Interaction modes

Natlagram can operate in three interaction modes: stateless, improve and stateful. In stateful mode, interaction resembles that with ChatGPT: the system remembers previous messages. This allows iteratively improving on diagrams. For example, we may start out with the following query, where “Assistant” refers to GPT’s responses:

User: How do I build a PC from scratch?

The system may ask for context and respond:

Assistant: Assembling a PC from scratch can be a complicated process and includes many steps. It would be best if you had a better idea of which aspect of building a PC you want the instructions for. Do you want to know how to assemble the hardware components, how to install an operating system, or both?   

User: I’m interested in the hardware.

The model responds with a list of steps, not a diagram, so we reiterate.

User: Simplify as necessary and put this information into a diagram.

The system does respond with code; however, there is a syntax error and no diagram is generated. So we take another step and ask the model to repair the code.

graph LR
	PSU --> Motherboard
	CPU --> Motherboard
	RAM --> Motherboard
	Storage_Drive --> Motherboard
	Graphics_Card --> Motherboard
	PCI_Card --> Motherboard
	Cables --> Motherboard
	CPU_Cooler --> Motherboard
	<> -->|Start| Power_On
	Power_On --> BIOS[Check BIOS]
	BIOS --> CPU[Check CPU]
	BIOS --> RAM[Check RAM]
	BIOS --> Storage[Check Storage]
	BIOS --> Graphics[Check Graphics]
	BIOS --> PCI[Check PCI cards]
	BIOS -->|Finish| <>
	<> -->|Good| OS_install[Install OS]
	<> -->|Bad| TroubleShoot[Solve issues]
	TroubleShoot -->|Solution found| OS_install
	TroubleShoot -->|No solution| AskProfessional[Ask professional support]

User: That looks good but the code doesn't convert to an image. Can you fix your mistake?

The model proceeds to update the code; however, it omits the markers CODE_BLOCK_START and CODE_BLOCK_STOP, which we had instructed it to wrap around the code via the initial prompt engineering. Without the markers, the wrapper code fails to extract the model-generated code, and hence no diagram is generated. At this point, restarting the conversation is your best bet.
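To illustrate why missing markers are fatal, here is a minimal sketch of what such wrapper code might look like. The regular expressions and the name `parse_response` are assumptions for illustration; natlagram's actual extraction logic may differ.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    code: str
    diagram_api: str


# Non-greedy match so that only the text between the first marker pair is captured.
_CODE_RE = re.compile(r"CODE_BLOCK_START(.*?)CODE_BLOCK_STOP", re.DOTALL)
_API_RE = re.compile(r"DIAGRAM_API=([A-Za-z0-9_-]+)")


def parse_response(response: str) -> Optional[ParsedResponse]:
    """Extract diagram code and API name; return None if either marker is missing."""
    code_match = _CODE_RE.search(response)
    if code_match is None:
        return None
    # The API tag is expected after CODE_BLOCK_STOP, so search from there.
    api_match = _API_RE.search(response, code_match.end())
    if api_match is None:
        return None
    return ParsedResponse(code_match.group(1).strip(), api_match.group(1).lower())


ok = parse_response("CODE_BLOCK_START seqdiag { A -> B } CODE_BLOCK_STOP DIAGRAM_API=seqdiag")
missing = parse_response("graph LR\n    PSU --> Motherboard")
```

In this sketch, a response without markers yields `None`, which corresponds to the failure case described above even when the code itself is valid.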

In stateless mode, no conversation history is kept. When prompts are well-defined, this may be sufficient. Learning from our previous attempt, we may formulate:

User: How do I build a PC from scratch? Focus on hardware components. Make a diagram.


The diagram illustrates a general observation: often, a single well-formulated prompt outperforms iterative improvement. In stateless mode, if diagram generation fails, the user is asked whether to try again. If the user agrees, the same prompt is presented to the model again; however, the model has no knowledge that it failed the first time around. The improve mode augments the stateless mode by informing the model that it failed on its previous attempt. This allows the model to self-improve. For this purpose, the conversation history with the model is kept until an image is generated successfully.
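The difference between stateless and improve mode can be sketched as a retry loop. This is a simplified illustration under assumptions: the function and parameter names, the wording of `FAILURE_NOTICE`, and the fixed `max_attempts` (instead of asking the user before each retry) are all hypothetical, and the model and renderer are replaced by stubs.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]

# Hypothetical wording; the text does not state the actual notice.
FAILURE_NOTICE = "Your previous code did not render. Please fix it."


def generate_with_retries(base_messages: List[Message], prompt: str,
                          ask_model: Callable[[List[Message]], str],
                          render: Callable[[str], Optional[bytes]],
                          mode: str = "stateless", max_attempts: int = 3) -> Optional[bytes]:
    """Retry diagram generation; in 'improve' mode the model is told about each failure."""
    history = list(base_messages) + [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        response = ask_model(history)
        image = render(response)
        if image is not None:
            return image
        if mode == "improve":
            # Keep the failed attempt and the failure notice until an image is produced.
            history.append({"role": "assistant", "content": response})
            history.append({"role": "user", "content": FAILURE_NOTICE})
        # In stateless mode the history is unchanged, so the same request is sent again.
    return None


def stub_model(messages: List[Message]) -> str:
    """Stand-in model that only produces renderable code after being told it failed."""
    return "good" if any(m["content"] == FAILURE_NOTICE for m in messages) else "bad"


def stub_render(response: str) -> Optional[bytes]:
    """Stand-in for Kroki: returns image bytes for 'good' code, None otherwise."""
    return b"image" if response == "good" else None


stateless_result = generate_with_retries([], "Make a diagram.", stub_model, stub_render, "stateless")
improve_result = generate_with_retries([], "Make a diagram.", stub_model, stub_render, "improve")
```

With these stubs, the stateless run never succeeds because nothing changes between attempts, while the improve run succeeds once the failure notice has been added to the history.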

Natlagram can also be used to edit existing diagram code.

User: This diagram describes my favorite pizza. Add olives.
    entity "Yummy pizza" {
    pizza_dough -- tomato_sauce
    tomato_sauce -- basil
    tomato_sauce -- cheese
    }

As would be expected, the model adds olives into the text-based entity component diagram.


Limitations and outlook

A problem encountered in particular with natlagram’s stateful mode, but also when adding more prompt engineering instructions or natural-language-to-code examples, is that as the model’s conversation history grows, we approach a fixed token limit. The number of tokens that a model can handle defines the size of the context that the model can take into account. Roughly, a token corresponds to four characters of English text. Here is an example of a tokenized sentence. While most words form one token, the word “fantastically”, for example, is split into two, and the question mark also forms its own token. Try it out yourself in OpenAI’s playground.


The token limit applies to an entire conversation, including the model’s response. GPT-3.5 is limited to 4096 tokens. Given our instructional prompts and around 10 examples of similar size to those presented, we reach 1400 tokens. Meanwhile, GPT-4 comes in a version that can handle up to 32k tokens of context. However, usage of the OpenAI API is billed by the token, and the entire list of instructional messages and examples must be included in each request in order to take effect. This means that a short query of the type “A is a B” is not a 4-token request but rather a 1400-token request, given our fully prepared system. The token limit and token cost are also a reason why generating diagrams for verbose services such as excalidraw is infeasible.
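The arithmetic behind these numbers can be written out as a short Python sketch. The constants are the figures quoted above; the function names are hypothetical, and real token counts would come from a tokenizer rather than from fixed values.

```python
CONTEXT_LIMIT_TOKENS = 4096    # GPT-3.5 limit quoted in the text
PROMPT_OVERHEAD_TOKENS = 1400  # instructions plus around 10 examples, as quoted in the text


def request_tokens(query_tokens: int, overhead_tokens: int = PROMPT_OVERHEAD_TOKENS) -> int:
    """Tokens sent per request: the fixed prompt overhead plus the user's query."""
    return overhead_tokens + query_tokens


def response_budget(query_tokens: int, context_limit: int = CONTEXT_LIMIT_TOKENS,
                    overhead_tokens: int = PROMPT_OVERHEAD_TOKENS) -> int:
    """Tokens left for the model's response once the request has been counted."""
    return context_limit - request_tokens(query_tokens, overhead_tokens)


short_query_total = request_tokens(4)    # the "A is a B" query from the text
short_query_budget = response_budget(4)  # what remains for the generated diagram code
```

For the 4-token query, the request comes to 1404 tokens and leaves 2692 tokens for the response, which is why a diagram that needs hundreds of lines of code does not fit.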

A way forward on this issue is model fine-tuning. The GPT models by OpenAI have been trained on large amounts of data at considerable cost, currently not affordable for most. Such a generally trained model is referred to as pre-trained. Fine-tuning then refers to additional training steps on top of the existing model, typically with specific data, in order to improve performance on specific tasks. In our case, the data would consist of further pairs of natural language and Kroki code. With fine-tuning, we could teach the model more examples without including them as prompts. Unfortunately, fine-tuning is not yet available for GPT-3.5 or GPT-4.

Part of the instructional prompts is specifying the Kroki version. Since GPT models’ training data doesn’t extend beyond 2021, the model has not been trained on the most recent Kroki versions. One possible remedy is the use of an older Kroki version; more appealing, however, is fine-tuning with data matching the latest Kroki version. There are other improvements to our system that we can already realize independently of API improvements.

A common strategy for dealing with the non-deterministic outputs of AI systems is to present users with multiple options and let them choose the best one. Examples are text completion on Apple’s iOS and the multiple images generated by OpenAI’s DALL-E. Such a strategy could also be considered for this system, in conjunction with appropriate API parameter tuning. Notably, the temperature parameter of the OpenAI API controls the degree of determinism in the model’s responses.
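As a sketch of how such a multiple-options strategy might be requested, the following Python snippet builds a chat completion request body using the API's `n` and `temperature` parameters. It only constructs the payload and sends nothing; the function name `build_request`, the default values, and the model name `gpt-3.5-turbo` are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_request(messages: List[Dict[str, str]], n_candidates: int = 3,
                  temperature: float = 0.8, model: str = "gpt-3.5-turbo") -> Dict[str, Any]:
    """Build a chat completion request body that asks for several candidate responses."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    return {
        "model": model,
        "messages": list(messages),
        "n": n_candidates,           # number of alternative completions to return
        "temperature": temperature,  # lower values give more deterministic output
    }


payload = build_request([{"role": "user", "content": "A is a B"}])
```

Each returned candidate could then be rendered via Kroki, and only those that produce an image would be shown to the user for selection. Note that requesting several candidates also multiplies the number of generated tokens that are billed.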

Natlagram detects when it fails to generate an image. Extending this to assessing the quality of the generated diagram is more difficult. For this purpose, a second, adversarial language model with a different kind of prompt engineering might be used. A similar approach has been taken by the team behind the Vicuna language model, where the authors let OpenAI’s GPT-4 judge the response quality of other, less performant language models in an automated fashion. An adversary model could also generate training data in an automated fashion: existing Kroki code would be described by an image-to-text model, called “gramtolang”, to generate natural language descriptions from code.

As the power of current natural language models is still being explored, approaches similar to that of AutoGPT come to mind in order to allow for complex diagrams to be generated by breaking the task down into subtasks. Partially, this is achieved by AutoGPT designing prompts for itself. Furthermore, AutoGPT promises functionality such as self-improvement upon failure, persistent memory, and the ability to query the web. A considerable downside of such recursive approaches lies in the cost of usage caused by extensive OpenAI API calls.

Install natlagram

Currently, natlagram is only accessible via a command-line interface (CLI). Therefore, an installation is recommended for users who feel comfortable using the CLI. The installation instructions and natlagram have been tested on Ubuntu 22.04 and are available in our GitHub repository.

Concluding thoughts

Existing image generation AI, like DALL-E, doesn’t perform well at generating precise, formal diagrams. DALL-E often introduces a bias of perspective, which makes the resulting images more artistic than accurate. More recently, OpenAI showcased the visual capabilities of GPT-4. As the system is not officially released, it’s difficult to gauge its image manipulation abilities, in particular for the purpose of infographics. It remains to be seen how powerful the integrated GPT-4 image manipulation is and whether it supersedes the code generation approach via Kroki taken here. This question of utility generalizes to a broader range of services and apps. For example, the recent presentation of Dreamix may challenge established tools for video editing and animation in the not-so-distant future. The question shares its form with the evolution of code from punch cards to prompt engineering, from strict and inanimate to fuzzy, dynamic and maybe living systems.