There is a Better Framework for Working with LLMs than Prompt Engineering
And Context is All You Need.
It's called Context Engineering - or at least that is what I call it.
It is my attempt at a conceptual framework for reducing hallucinations, increasing the reliability and predictability of Large Language Model (LLM) output, and practically aligning models in production environments with user intent, rather than just user instruction.
My intent is merely to share my observations and understanding of how to best work with these models based on my experiences. I make no claim that these concepts are applicable in the Grand AI Alignment debate; rather, I am setting forth a framework and toolkit to practically get the model to do what you want with less trial and error and overall less token spend. The goals are to:
Save money on API calls: Reduce your API costs by cutting the iterations and reprompts needed to get your desired output, and by reducing overall token count (token limits matter too - thanks, OpenAI).
Increased output reliability: Fewer hallucinations and less overall model misalignment with your intent.
Get the output you want with less iteration → a faster method than "I'll know it when I see it": One of the challenges of working with LLMs is that users may not know the exact output they want, but they have a general sense of the direction they want the model to go in. I'll provide a framework to get to "I see it" faster.
The other side of my motivation for publishing is that I see a gap in the current AI discourse.
I have not seen a meaningfully useful set of principles for thinking about how to deploy and productize LLMs safely and effectively - to create useful, value-creating user experiences - that is accessible to readers not already deep in the space.
From my perspective, there are two camps that occupy 80% of the overall discourse:
Deep End Technical Discourse: arXiv papers, the Alignment debate / effective accelerationists vs. doomers, etc.
Watered Down Discourse: "What is ChatGPT?", "10 prompts that x your y" and the general "AI is going to replace these jobs first", etc.
There is a huge disconnect between these two, and very few people who can translate camp one to camp two and vice versa. I think this disconnect is bad for everyone who wants to see AI safely used to create value and make life better for the average person.
There is a lot of room between camp one and camp two, and only a relative handful of people playing in that space.
The most useful document I've seen that you could consider a "meaningfully useful set of principles" is the OpenAI Cookbook, which I think still falls into camp one (it is a great doc, by the way - check it out if you haven't already).
I believe the more people who have a conceptual understanding of how to work with LLMs, the better. The best way to ensure that LLMs live up to the bright possibilities we want, and are deployed safely and effectively, is to increase broad understanding and improve overall reliability (aka fewer hallucinations).
I don't think this framework is 100% the answer to this problem, but I think it could help people in camp #1 translate the work they're doing for people who would normally consume camp #2 content, while providing higher fidelity information.
I also honestly want to know if this framing maps for anyone else, so, reader - you are invited to critique! Feedback is appreciated.
Before jumping into the framework and the solutions it can provide - we need to focus on what problems we are solving.
What are We Solving?
I believe there are two problems inherent to working with Large Language Models (LLMs):
The Intent to Output Gap Problem
This problem is made up of two parts:
Part 1 of this Problem: User intent does not always fully translate into the user's instruction to the model.
There is often a gap between the user's intent and their input/instruction. This is the first part of the Intent to Output Problem.
User Intent →/ User Input
We can call it the Intent to Input Problem.
Part 2 of this Problem: LLM Output is not always what the user desires, even if their instructions truly matched their intent.
There is a gap between the LLM output generated from the user input (the prompt on top of the current context window) and the LLM output the user actually wanted.
LLM Output based on User Input →/ User’s Desired LLM Output
We can call it the Desired vs Actual Output Problem.
The Intent to Output Gap Problem:
(Intent →/ Input) + (Actual Output →/ Desired Output)
The Context Constraint Problem
This is just a straightforward resource constraint problem.
There is a limited toolkit for addressing the above problem. Regardless of model size or context window, there will always be a limit on the number of tokens you can pass in at once. So the question is: what context do we need to pass into the context window to address the Gap Problem and still leave room for iteration?
And Context Engineering Totally Solves these Problems?
No. It presents a framework for thinking about these problems at both a conceptual and practical level, a model of why they occur, and potential ways to solve them in practice.
I don’t have a grand unified theory. I just want to share what I think works right now based on my current understanding and I have found Prompt Engineering to be a poor framing for these problems.
Understanding the Problem Further. Why can’t we use Prompt Engineering here?
Simply put, I don't think this is Prompt Engineering. Both Context Engineering and Prompt Engineering play important roles when interacting with these models, but they approach the challenge from different perspectives.
Prompt Engineering focuses only on the direct input into the model. In terms of the problems described above, it addresses only half of the Context Constraint Problem. It does not go far enough to consider the broader background these LLMs are often deployed in.
This is where Context Engineering comes in.
Where Prompt Engineering only considers the Context Constraint Problem, Context Engineering considers the Context Constraint Problem holistically, alongside the Intent to Output Gap Problem. So you don't have to scroll up:
The Intent-Instruction Gap: This gap exists when there's a discrepancy between what the user says (their instruction) and what they actually want (their intent). Context Engineering seeks to minimize this gap, ensuring that the LLM understands not just the user's words, but their underlying intention.
The Input-Output Gap: This gap emerges when there's a difference between what the user puts into the model (their input) and what they expect in return (the model's output). Context Engineering works to align the model's output more closely with the user's intent, even if the specifics of the user's input don't change (we can use context to do this).
So, while Prompt Engineering is about crafting the right questions, Context Engineering is about making sure we understand the true intent behind those questions and that the model responds in a way that is more closely aligned with that intent.
What does Context Engineering mean in Practice? You already do it on a daily basis.
The Problem: You can only pass n tokens into an LLM at a time. What do we pass in?
Even fully acknowledging ever-expanding context windows (100k tokens, 1M tokens, 10 million tokens - it doesn't matter, this will always be true), engineers, designers and business leaders will always have to make tradeoffs about what goes in the immediate context window. Even as the window grows, I think it is a safe bet there will always be tradeoffs to make, since we'll just find more reasons to stuff more information into the window to address more use cases → just general hedonic adaptation.
The types of questions we need to ask in context engineering are centered around providing the right context using as few tokens as possible:
Should we include the full API doc or just a summarized version? Will the quality of the code meaningfully drop if it doesn't have the context from the docs? Does it need it at all?
Will giving the LLM our company wide KPIs help it write better project-specific KPIs for this project? What about our project timeline - can it do without? How good of a project scope will it write?
Does this LLM agent need to know what type of email this is? Does it need to see the thread that came before this one? Is this a reply to an existing thread or is it a new draft? How would this impact what it says as a response?
This is Context Engineering. And you already do it on a daily basis!
Yes, you engage in Context Engineering every day - yes you, my most deeply appreciated reader - you engage in Context Engineering every day when you work with other people; you just probably don't think of it as "engineering context", or you do and are just 10 steps ahead of me.
For all those example questions above, swap in your favorite colleague instead. You've almost certainly asked yourself:
“Does Khaled (your most beloved coworker who is just an all around awesome dude) have the context on this project? He isn’t directly on it…so we might need to get him up to speed. I’ll send him all of the relevant docs and see if he has any questions.”
Because you already intuitively know what information he needs, there is never a need to deeply consider which docs you're going to give Khaled, or in what detail, beyond a basic review of what information exists to give him - Khaled doesn't have a limited 8k-token context window like GPT-4. You just give him the docs, and if he has questions, he asks you.
We are used to asking and answering questions about context intuitively and iteratively (question and answer), not explicitly and in as few attempts as possible because of a limited context window.
But now that we are working with context-limited reasoning engines, a framing I find the most helpful for thinking about these types of models, we do have to think about context hyper-explicitly.
The quality and nature of the context we provide an LLM can cause meaningful shifts in the quality, reliability and relevancy of the output; and if we are using such a model in a production system, that directly translates to cost associated with API calls, the value of our product, and the quality of our user experience.
Here’s an example. Help me solve my problem please. Let’s say I ask you:
I am going on a trip. What should I pack?
You would probably respond:
“It depends, but maybe some clothes, toiletries and your laptop.”
Whatever - right? It's a vague question. You can't really provide a meaningful response because there isn't much to go on.
Ok now what if I said:
I am flying from NYC to LA to spend time with family. It is currently May 2023. The weather has been cold recently in LA but the forecast for the general time of year is in the 70s and 80s. I grew up in LA and have some clothes there but not much now. I plan to be back for several weeks and need to know what I should pack. What should I pack?
This is more helpful, right? You could take a stab and say, "you should pack enough clothes for while you're there, half warm-weather and half cold-weather."
But would the usefulness and relevancy of your response be impacted if we cut that context down?
I am flying from NYC to LA. The weather has been cold recently in LA but the forecast for the general time of year is in the 70s and 80s. What should I pack?
You could give me pretty much the same response in regard to relevancy and usefulness.
This is, of course, a silly toy example, but if we both had limited context windows like LLMs, that would have saved us both about 50 tokens.
Scale this up to saving 5,000 tokens in a prompt template that you're going to use in your product 10,000 times a day, and I think you start to see how deeply considering context makes a difference at scale.
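To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. The templates, call volume and per-token price are all placeholder assumptions (tiktoken is just OpenAI's open-source tokenizer for counting tokens the way the model does); at the example numbers above - 5,000 tokens saved per call, 10,000 calls a day, roughly $0.03 per 1k input tokens - the savings work out to around $1,500 a day.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Placeholder templates -- in practice these would be your real prompt templates.
verbose_template = "<full API docs pasted here>\n\nNow write the integration code."
trimmed_template = "<hand-distilled context block here>\n\nNow write the integration code."

tokens_saved_per_call = len(enc.encode(verbose_template)) - len(enc.encode(trimmed_template))

calls_per_day = 10_000             # assumed product volume from the example above
price_per_1k_input_tokens = 0.03   # illustrative GPT-4 input price; check current pricing

daily_savings = tokens_saved_per_call / 1000 * price_per_1k_input_tokens * calls_per_day
print(f"{tokens_saved_per_call} tokens saved per call -> ${daily_savings:,.2f} saved per day")
```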
When we try to get language models to do something, we can't assume they will have the context in their training data, represented in the model's latent space in the way we'd want, to answer our specific question.
Finding the minimum necessary context an LLM needs to achieve our goal, and determining what exactly our goal is, is context engineering.
Understanding Context Itself
For our purposes and in the realm of language models, context refers to the depth of understanding or background knowledge required for a model to generate a suitable response/output. We can also say a suitable output is one that is:
relevant and free from weakly falsifiable hallucinations (more on this later)
addresses the user’s intent
matches the user’s desired detail level
Beyond solving for these three factors, I have not found it useful to attempt to quantify context.
If someone asked you how much context a person would need to give a relevant, detailed answer that addresses the intent behind your question, the best you could muster would be "it depends on the scenario", because different scenarios require different background information and different levels of detail. This applies to working with context for LLMs - at least for now.
I think it is important to acknowledge this is a squishy problem with a squishy answer. Attempting to quantify context in a strict mathematical sense may make sense some day, but because we don't have a solid understanding of how neural networks semantically represent concepts in their latent space (the black box problem), I think the best we can do is qualitatively measure and model this problem.
How to Qualify Context: Context Sensitivity
What would a person need to know to “have the context”?
My first job ever was working at a family friend's one-person law practice. He was semi-retired but still did some municipal law consulting. When I first started, I had no idea what was going on.
I had zero context on who his major clients were, what they expected, or the jobs I needed to do (mostly replying to emails and revising documents based on his written notes). Over time, by gathering information, learning by doing and iterating, I developed the context necessary to do my job with little instruction. But when I first started, the attorney had to ask himself, "what context does Jacob need to get up to speed and do his job?"
Every time we work with someone else for the first time, we have to explicitly or implicitly ask, "do they have the context?"
When we are working with other people, we are oftentimes depending on them to do some sort of knowledge work.
We can think of LLMs as reasoning engines that we want to do knowledge work for us. So if the desired output is relatively the same, the necessary inputs should be relatively the same. It comes down to the proper context.
This is the background to a basic principle I call Context Sensitivity.
Context Sensitivity is a qualitative measure of an LLM paired with the scenario it is being asked to solve or otherwise address. It measures how much the quality of the LLM output would vary depending on changes in the provided context.
We can imagine the model and scenario we are solving as the input variables for context sensitivity.
(Model : Scenario)
Now when we go to solve a problem with an LLM it can be represented as something like:
(Model : Scenario) * Context → Output
Some Model:Scenario pairs have extremely low Context Sensitivity, meaning that there is little to no need for additional context to address the scenario. An example of this would be asking GPT-4 “what is 2+2?”
(GPT-4 : Arithmetic[what is 2+2?]) * Context() = “2 + 2 equals 4.”
Using this example, if we provide some basic context, we can confidently say there will be no qualitative gain in the model output. We are going to get our desired output essentially every time regardless of the context we provide. For example:
(GPT-4 : Arithmetic[what is 2+2?]) * Context(“Here is how you add numbers together → 1+1 = 2”)
Giving “here is how you add numbers together” will not make GPT-4 better at arithmetic and will not qualitatively boost the output. We know GPT-4 already knows how to do arithmetic, and therefore:
(GPT-4 : Arithmetic[what is 2+2?]) * Context(None)
Will get you the same as:
(GPT-4 : Arithmetic[what is 2+2?]) * Context(“Here is how you add numbers together: 1+1 = 2”)
With fewer tokens.
We are, of course, ignoring a lot in this example - if you tell the model "redefine 2 to equal 3" as the context, you will of course get a different answer - but that is beyond the scope of this model for now.
So we've defined low Context Sensitivity - what does high Context Sensitivity look like?
An example of high Context Sensitivity would be asking GPT-4 to help you write a script using a Python package published after 2021 (GPT-4's training data cutoff). We will use Langchain as our example package.
(GPT-4 : Code[Write a chatbot script in Py using Langchain]) * Context(None)
This is going to get you a worthless output. GPT-4 will object, saying it doesn't know what that package is - because it doesn't have the context!
But if you told it what Langchain is, and gave it some code snippets and docs in the context, the quality of the output would increase dramatically.
(GPT-4 : Code[Write a chatbot script in Py using Langchain]) * Context(Langchain is…here are code snippet examples…) = a somewhat useful answer
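As a rough sketch of what that difference looks like in code - using the 2023-era `openai` Python client, with the model name, doc excerpt and helper function all as placeholder assumptions - the only change between the two calls below is whether a context block of Langchain material is prepended:

```python
import openai  # pip install openai (pre-1.0 client style)

openai.api_key = "sk-..."  # your API key

# Placeholder context block: in practice, real doc excerpts and working snippets.
LANGCHAIN_CONTEXT = """Langchain is a Python framework for building LLM applications.
Example:
    from langchain.chat_models import ChatOpenAI
    llm = ChatOpenAI(model_name="gpt-4")
...plus whatever doc excerpts and code snippets the task actually needs."""

def ask(task, context=None):
    """(Model : Scenario) * Context -> Output, as a single chat completion call."""
    messages = []
    if context:
        messages.append({"role": "system", "content": "Use this context:\n" + context})
    messages.append({"role": "user", "content": task})
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

task = "Write a chatbot script in Python using Langchain."
print(ask(task))                             # Context(None): objects or guesses
print(ask(task, context=LANGCHAIN_CONTEXT))  # Context(docs): a somewhat useful answer
```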
Behind these examples is a simple set of core questions:
What can we assume the model knows?
What is the delta between what the model currently knows and what it needs to know to create our desired output?
What type of output do we want? What do we, the user, consider a "suitable response"?
(fun fact - the study of Knowledge is called Epistemology)
LLM Context Sensitivity is highly variable from scenario to scenario - for the same reasons it would be variable for a person. The novel context needed depends on the preexisting knowledge already in place. This is a fairly straightforward problem to solve when working with a person because:
You can ask them questions to establish a baseline
They can ask you questions! They can tell you they don’t understand!
My bet is that current LLMs cannot actually develop intention behind the questions they ask. Intention would require some experience of self, which LLMs do not have. They just ask questions because those are probabilistically the "correct" tokens to output based on their training data (although agents are starting to address this problem in very interesting ways).
The person (hopefully) doesn’t make shit up that looks plausible but is actually wrong (a hallucination)
The Context Sensitivity Spectrum
As we saw above, Context Sensitivity varies from scenario to scenario. To better conceptualize it, we can model Context Sensitivity as a continuous spectrum.
At one end of the spectrum, we have low-context problem sets, which require minimal prior knowledge or understanding. The language model can provide appropriate responses based on the general knowledge it has been trained on.
We explored this type of scenario with the arithmetic example above. A different example of a low-context situation is asking "please write me a poem in the style of Shakespeare." The model - or a person - does not need further context to start solving the problem in a reasonable and useful manner. This is a low-context-sensitivity prompt because it is:
Bounded: the scenario is bounded, with a clear desired outcome (a poem in a specific style), and makes few to no assumptions about what specific knowledge the model has.
At the other end of the spectrum, we have high-context prompts that demand a more in-depth understanding of the context. To respond accurately and meaningfully to these prompts, the model needs specific, relevant data that you cannot assume is represented in its latent space in a manner useful for your specific question.
An example of a high-context prompt is asking "how can I make $100k a month?" That question requires a lot of background to even begin to know where to start. We can broadly characterize it as:
Unbounded: the underlying scenario and question have no definite start and end. The prompt contains several inherent, sometimes layered, assumptions.
Multivariant: because we cannot determine where the problem starts and stops, there are innumerable variables to consider (this is also a big part of why these prompts produce hallucinations).
The possible variables in such a situation force the model to stack assumptions on assumptions just to begin answering.
A useful response to our example would vary widely depending on whether the user asking already has a business with $99,999 MRR or is making $100k a year as an employee.
The background information is more important than the question you are asking if you want to get a useful answer and avoid annoyingly useless hallucinations.
Understanding Hallucinations
Hallucinations, in terms of language models, refer to instances when the model generates outputs that don't align with reality or verifiable facts. It could be due to a lack of accurate training data or incorrect sampling biases. Essentially, the model is "making shit up" or "hallucinating." Hallucinations are reciprocally related to context.
How to Qualify Hallucinations: Falsifiability
So what characteristic of hallucinations should we measure? Falsifiability - "the capacity for some proposition, statement, theory or hypothesis to be proven wrong" - is what matters most when considering hallucinations.
We need to consider falsifiability before all other properties because nothing else matters if we cannot even begin to verify a statement - like an LLM's response to "how do I make $100k a month?"
When you have a statement that you cannot prove wrong, or otherwise test, it is worthless. The core of logic and reasoning - the basis of the scientific method - is the ability to test and verify knowledge. For there to be fact, there must be a process that converts raw information into knowledge by falsifying statements and sorting false statements from true ones.
In the context of LLMs, statements that we cannot falsify are actually worse than useless - they are an active hindrance to getting the answer, or other desired output, we want. They are pure noise when we are looking for signal.
To better conceptualize falsifiability in the context of LLMs, let's create a way to measure it.
The Hallucination Falsifiability Spectrum: Weak Falsifiability → Strong Falsifiability
The hallucination falsifiability spectrum was conceptualized to qualify the reliability of outputs produced by Large Language Models (LLMs). On this spectrum, an output's position is determined by how easy it is to falsify its correctness and how much it relies on inferred or "imagined" context (hallucination).
At one end of the spectrum, we have outputs with strong falsifiability. These are outputs that can be practically and easily disproven. An example is the statement "2+2=5."
On the other end of the spectrum, we have "weak falsifiability". In this zone, the model's outputs are harder to verify because they're based on high-context scenarios or abstract concepts that require the model to make numerous assumptions or fill in gaps with invented context. The output is a kind of educated guess, or "hallucination". An example would be asking the model "how do I make $100k/month?" It can generate a response based on its training data, but the specifics of the answer are largely inferred or created by the model, making it difficult to objectively validate the output.
The goal with context engineering is to move as many interactions as possible towards strong falsifiability, minimizing assumptions and hallucinations, thus improving the reliability and usefulness of the model's outputs.
Practical Implications - What does this mean?
Hallucinations are a feature, not a bug.
We want strong falsifiability because it offers reliability. Why? If you can reliably falsify a statement, you can separate outputs that are directionally correct (i.e., closer to solving your problem / your desired output) from ones that are not. This allows for improvement by iteration, with loose analogies to the gradient descent used in model training. Here is what that looks like in practice.
Let's say we are brainstorming solutions to a complex software engineering problem, and to uncover non-obvious solutions we have our top_p and temperature cranked all the way up. Having these variables at their max is going to cause a lot of hallucinations, regardless of type.
For the sake of argument, let's say this LLM hallucinates 90% of the time, but the other 10% of outputs are absolute breakthroughs - novel, viable solutions to our software engineering problem. We need our hallucinations to be strongly falsifiable so we can separate the 90% from the 10%. Otherwise it all just looks like noise.
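A minimal sketch of that workflow, assuming the same pre-1.0 `openai` client and a hypothetical `run_tests` function standing in for whatever concrete falsifier your problem actually has (a test suite, a type checker, a benchmark): crank the sampling parameters, generate many candidates, and keep only the ones that survive the check.

```python
import openai

def run_tests(candidate_code):
    """Hypothetical falsifier: run your real test suite or sandbox here."""
    raise NotImplementedError

def brainstorm(problem, n=20):
    # High temperature / top_p -> maximum creativity, and lots of hallucinations.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": problem}],
        temperature=2.0,
        top_p=1.0,
        n=n,  # ask for n candidate completions in one call
    )
    return [choice.message.content for choice in response.choices]

candidates = brainstorm("Propose a Python implementation of <hard problem>.")

# The falsifiable check is what separates the 10% of breakthroughs from the 90% of noise.
keepers = [c for c in candidates if run_tests(c)]
```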
This is what I mean by hallucinations are a feature not a bug.
In order to produce novel, creative and useful outputs, you have to engage in the iterative creative process - the same way the human brain engages in convergent and divergent thinking. We just need to ensure we are working with strongly falsifiable hallucinations so we can separate signal from noise.
A Unified Model for Improving LLM Output Via Iteration
Context Defines the Framework
The level of context a language model has - whether low or high - establishes the potential framework for its output. In other words, the richness of the context can influence the detail, accuracy, and relevancy of the model's output. This can be likened to giving an artist a blank canvas (low context) versus a half-finished painting (high context) to work on.
Hallucinations are Influenced by Context
The level of context also impacts the likelihood and nature of hallucinations. In low-context scenarios, the model may generate outputs that veer towards weak falsifiability - broad statements that are challenging to verify. On the other hand, in high-context scenarios, the model has more information to pull from, and any hallucinations might be of strong falsifiability - specific claims that can be readily checked.
Context and Hallucinations Influence Each Other
Context and hallucinations can influence each other. The level of context can shape the falsifiability of hallucinations, and inversely, the type and extent of hallucinations can illuminate the context richness. For instance, frequent strongly falsifiable hallucinations in the model's output might indicate a high-context situation, as it's pulling from a rich tapestry of information, even if inaccurately at times.
Low-context situations presented to LLMs create strongly falsifiable outputs. You can ask "what is 2+2?" confident that you will get a useful answer, because you are not requiring the model to make assumptions about what is being asked and then make further assumptions about how to answer.
High-context situations presented to LLMs, without additionally engineered context, create weakly falsifiable outputs. If you present "how do I make $100k/month?" as a standalone prompt, you are really asking the model to:
Make multiple assumptions about a multivariant scenario - broader financial situation, education level, skillset, personal history, etc. - basically, construct a unique world model.
Make a second layer of assumptions about how to answer the assumptions it just made - hypothesize solutions based on that assumed world model.
Finally, attempt to synthesize a coherent, logical solution - at which point it is two degrees removed from baseline reality.
We asked the reasoning engine to reason through a multivariant, unbounded hypothetical and then hypothesize solutions to this unbounded problem space.
OF COURSE IT IS GOING TO HALLUCINATE - THAT'S WHAT YOU IMPLICITLY TOLD IT TO DO.
In order to not implicitly ask it to make context up, we have to build a fence around the problem we are asking.
We have to reduce the number of variables in our problem statement, and we have to give it background information to use as part of its world model so it doesn't have to make as many assumptions.
This is where we get into the pragmatic tools context engineering provides us to better align our productized LLMs to do what we actually want, not just what we instruct them to do.
Practical Constraints of Context Engineering - What are the fundamental actions we can take?
The basic ways we can make sure the model has the context it needs are below. Think of these as the laws of physics of context engineering: every tool we build on top of this framework has to work within these rules.
Include the minimum necessary context in the context window - or in the training data: Explicitly put the necessary information (context) into the model's context window or its training data, and hope it gets represented in the latent space the way we'd like for our specific problem. The model benefits from context that's crucial to the task at hand. An example of this is supplying API docs for programming tasks. Something I like to do is use multiple context windows (see the sketch after this list). In context window A: dump in the API docs I know the model will need for proper context, then iteratively generate code snippets, document summaries and overall instructions on how to use the API - I call this a "context block." This helps reduce the overall tokens needed to carry the context into the next window. In a new context window, context window B: use the newly generated context block and iteratively write the end code.
Include information in the context window on how to get more context: The model has to be given information, in the context window, so it understands explicitly that it can retrieve information and roughly what that retrievable information contains. This is where agents and vector DB retrieval come in.
Break Down Complex Problems Until Preexisting Context is Sufficient: To minimize the need for extensive context, we simplify complex problems into smaller, manageable segments. This step-by-step approach benefits both the user and the model.
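Here is a rough sketch of the two-window "context block" workflow from the first option, again using the pre-1.0 `openai` client; the doc text and instructions are placeholders, and in my actual workflow the distillation step is iterative rather than a single call.

```python
import openai

RAW_API_DOCS = "<full API docs pasted here -- too long to re-send on every request>"

# Context window A: distill the raw docs into a compact, reusable "context block".
distilled = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Summarize these API docs into a short reference with 2-3 code "
                   "snippets and usage instructions, optimized for minimal tokens:\n\n"
                   + RAW_API_DOCS,
    }],
)
context_block = distilled.choices[0].message.content

# Context window B: a fresh conversation that only ever sees the distilled block.
draft = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "API reference:\n" + context_block},
        {"role": "user", "content": "Using only this reference, write the integration script."},
    ],
)
print(draft.choices[0].message.content)
```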
Determining what option to select for any given use case is hard. There is a lot of unseen complexity. Generally using all three is the best answer.
Understanding the tradeoffs, limitations and implementation methods of each one is difficult; it requires iteration, testing and, most of all, the ability to synthesize world models on the fly to decide what context the model needs at any given time to accomplish the immediate task.
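And as a small illustration of the second option - giving the model a way to pull in more context - here is an in-memory miniature of what a vector DB retrieval step does. The documents, question and embedding model are placeholder assumptions; a real system would swap the plain list for an actual vector store.

```python
import numpy as np
import openai

DOCS = [
    "Refund policy: customers can request refunds within 30 days of purchase...",
    "Shipping: orders ship within 2 business days via standard carriers...",
    "API rate limits: 60 requests per minute per key...",
]

def embed(texts):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])

doc_vectors = embed(DOCS)

def retrieve(question, k=1):
    """Return the k docs most relevant to the question (cosine similarity)."""
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [DOCS[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do I have to return an item?"
context = "\n".join(retrieve(question))
answer = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```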
Using our Unified Model & Practical Constraints to Build Tools for our Context Toolkit
The more context you give a system the more falsifiable the output.
The more falsifiable the output, the more reliable the LLM - not because it never makes mistakes, but because the mistakes are easier to spot and account for with corrective measures. It is better to have a reliably 85%-accurate system with a deviation of 5% than a system that sits at 95% but with a deviation of 30%. You can work with 5%. 30% is untenable.
By giving more context and guardrailing potential outputs, you are:
reducing the assumptions the model needs to make
narrowing the scope of possible outputs
Increasing resolution of output
This moves us up the context spectrum, and therefore we are also moving up the falsifiability spectrum.
By moving up the falsifiability spectrum, we make it easier to sort verifiably good responses from verifiably bad ones, which makes it easier to iterate to our desired output with fewer total tokens used and less time spent.
You can think of this as controlling for outside variables in a scientific study and the necessary statistics to ensure you are only testing for your specific desired variable.
I think this is fairly intuitive for anyone who has spent any meaningful amount of time using an LLM, but I haven't seen anyone frame it as a function of context. The OpenAI Cookbook mentions "think step by step" 15 times and "prompt" 63 times, but context is mentioned only 3 times. Why? Not sure. But I think framing hallucinations as a direct function of context makes the most sense based on my experience.
A Formal Definition of Context Engineering
Well here’s what GPT-4 printed out after I dumped in all of my notes:
Context engineering is an advanced approach to managing and manipulating large language models (LLMs). It prioritizes understanding and controlling the broader context within which these models operate rather than focusing solely on inputs, known as prompts. This field involves a deep grasp of how LLMs generate knowledge, while concurrently developing techniques to guide their outputs, thereby reducing undesirable hallucinations and leveraging them when desirable. Through context engineering, we aim to enhance the accuracy, consistency, and value of the responses generated by these AI models.
And a more refined one in a different GPT-4 context window:
Context Engineering is the practice of providing a systematic framework for AI systems, enabling them to comprehend and navigate the nuances and complexities of their operational environment, user intentions, and relational dynamics between concepts. The aim is to enhance the system's ability to generate outputs that are closely aligned with human intent, adapt to evolving circumstances, and respond effectively to a wide array of scenarios.
In the course of writing this piece, I’ve come to this definition:
Context Engineering is a practical framework for aligning human intent with productized LLM output.
Please remember the use of productized there - I make no claims that any of this has any relevance to the grand Alignment Debate.
Maybe a new field of thought? Maybe worthless ramblings. Either way, I thought it was cool when GPT-4 synthesized “applied epistemology.”
Context Engineering straddles the boundary between philosophy and practical computation, acting as a kind of 'applied epistemology' for the realm of machine learning. It sets up a philosophical framework for thinking about what can be known from language models and provides concrete principles that can be integrated into machine learning methodologies.
Things that I would like to Explore More and Expand on but made me feel like my brain was on fire:
If/how context engineering makes sense in model pretraining, training and fine-tuning. The farthest I've gotten in training my own models so far (it's on my list, I swear) is watching Let's build GPT: from scratch, in code, spelled out, and playing around very briefly with some HF models - so no idea if this maps, or whether these ideas are expressed somewhere more technical that I just haven't seen yet.
I think hallucinations are the key to one day developing a deep understanding of how LLM neural nets actually work.
Context engineering in the context of agents.
The connection and similarities between the creative process, divergent and convergent thinking, and working with LLMs. I think framing LLM knowledge work in terms of divergent and convergent thinking adds clear guardrails for what action to prompt next at each phase of a chat to get to your desired outcome(s).
This idea:
Step 3: Context and Hallucinations Influence Each Other
Lastly, it's important to recognize that context and hallucinations can influence each other. The level of context can shape the falsifiability of hallucinations, and inversely, the type and extent of hallucinations can illuminate the context richness. For instance, frequent strongly falsifiable hallucinations in the model's output might indicate a high-context situation, as it's pulling from a rich tapestry of information, even if inaccurately at times.
What’s Next? More Context Engineering.
Over the coming weeks and months I'll be publishing:
specific examples of how I use context engineering in product development, writing and as a general accelerant to my creative process.
Further thoughts on hallucinations, the different types of hallucinations, and how to work with each one. My thinking here can be broadly characterized as "make hallucinations work for your goal rather than fight them head-on", which runs counter to the current "use this xyz super prompt to enumerate everything it should do so it doesn't hallucinate" approach - an approach I don't think produces optimal outcomes, since that brute-force prompting kneecaps the model's potential.
An end to end example of my current programming process that uses a lot of context engineering to write most of the code for me. I’ve found some really cool tools that I’ve combined together so I have a tool for every level of context throughout the process.
And now to force myself to call my shot:
I personally grow the most when I throw myself off a cliff and make myself figure it out. I am going to make some predictions so I can make myself keep working towards verifying their validity.
Hallucinations are at the core of productizing LLMs and they are a feature not a bug.
The long-term bottleneck for creating deeply useful, transformational LLM-driven products will not be compute or base-model limitations, but context. Designing systems that are capable of assembling context on the fly will be what allows general application of LLMs and specific agentic LLM products. GPUs are hard to come by now, but that problem has decades of inertia behind solving it - there is no precedent for how to productize a reasoning engine like GPT-4; these things have never existed before.
While the initial iterations of agentic LLM apps have made everyone think they are worthless, I think agent-driven LLM apps will be the most powerful and value-creating use case of LLMs (I realize this is probably not that hot of a take).
For now if you want to read some ChatGPT chat’s I’ve used to iterate and experiment with this idea feel free:
Context Engineering Verbal Brainstorming (Pro tip: go for a walk and brainstorm with the ChatGPT iOS app using the voice input)
In Conclusion:
AI is wild.
Thanks for reading and if you have any feedback I’d love to hear it!