Welcome to part two of what has evolved into a three-part research series focused on putting the "engineering" back into prompt engineering by bringing meaningful observability metrics to the table.

In our last post, we explored an approach to estimate the importance of individual tokens in LLM prompts. An interesting revelation was the role the perceived ambiguity of the prompt played in the alignment between our estimation and the "ground truth" integrated gradients approach. We spent some time trying to quantify this and got some pretty interesting results that aligned well with human intuition.

In this post, we present two measures of model uncertainty in producing responses for prompts: structural uncertainty and conceptual uncertainty. Structural uncertainty is quantified using normalized entropy to measure the variability in the probabilities of different tokens being chosen at each position in the generated text. In essence, it captures how unsure the model is at each decision point as it generates tokens in a response. Conceptual uncertainty is captured by aggregating the cosine distances between embeddings of partial responses and the actual response, giving insight into the model's internal cohesion in generating semantically consistent text. Just like last time - this is a jumping-off point. The aim of this research is to make our interactions with foundation models more transparent and predictable, and there's still plenty more work to be done.
听
tl;dr:
听
- Introduces two measures to quantify uncertainties in language model responses (Structural and Conceptual uncertainties)
- These measures help assess the predictability of a prompt, and can help identify when to fine-tune vs. continue prompt engineering
- This work also sets the stage for objective model comparisons on specific tasks - making it easier to choose the most suitable language model for a given use case
听
Why care about uncertainty?
听
In a nutshell: predictability.
听
If you're building a system that uses a prompt template to wrap some additional data (e.g., RAG) - how confident are you that the model will always respond in the way you want? Do you know what shape of data input would cause an increase in weird responses?
听
By better understanding uncertainty in a model-agnostic way, we can build more resilient applications on top of LLMs. As a fun side effect, we also think this approach can give practitioners a way to benchmark when it may be time to fine-tune vs. continue to prompt engineer.
听
Lastly, if we're able to calculate interpretable metrics that reflect prompt and response alignment - we're several steps closer to being able to compare models in an apples-to-apples way for specific tasks.
听
Intuition
听
When we talk about "model uncertainty", we're really diving into how sure or unsure a model is about its response to a prompt. The more ways a model thinks it could answer, the more uncertain it is.
听
Imagine asking someone their favorite fruit. If they instantly say "apples", they're pretty certain. But if they hem and haw, thinking of oranges, bananas, and cherries, before finally arriving at "apples", their answer becomes more uncertain. Our original goal was to calculate a single metric that would quantify this uncertainty - which felt fairly trivial when we had access to the logprobs of other sampled tokens at a position. Perplexity is frequently used for this purpose, but it's a theoretically unbounded measure and is often hard to reason about across prompts/responses. Instead, we turned to entropy - which can be normalized such that the result is between 0 and 1 and tells a very similar story. Simply put: we wanted to use normalized entropy to measure how spread out the model's responses are. If the model leans heavily towards one answer, the entropy is close to zero, but if it's torn between multiple options it spikes closer to one.
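To make that concrete, here's a minimal sketch of the normalized entropy calculation at a single token position. The function name and example logprobs are our own, not the demo's internals, and the top-k logprobs are treated as if they covered the whole distribution (matching the formalization at the end of this post).

```python
import math

def normalized_entropy(logprobs):
    """Normalized entropy over the top-k logprobs sampled at one token position.

    Returns a value in [0, 1]: near 0 when one token dominates,
    near 1 when the candidates are roughly equally likely.
    """
    entropy = -sum(math.exp(lp) * lp for lp in logprobs)
    return entropy / math.log(len(logprobs))

# One dominant answer ("apples") vs. a model torn between several fruits.
print(round(normalized_entropy([-0.02, -4.5, -5.0, -6.0]), 2))  # ~0.09
print(round(normalized_entropy([-1.3, -1.4, -1.5, -1.6]), 2))   # ~0.98
```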
听
However, we ran into some interesting cases where entropy was high simply because the model was choosing between several very similar tokens. It would've had practically no impact on the overall response if the model chose one token or another, and the straight entropy calculation didn't capture this nuance. We realized then that we needed a second measure to assess not only how uncertain the model was about which token to pick, but how "spread out" the potential responses could've been had those other tokens been picked.
听
As we learned from our research into estimating token importances, simply comparing token-level embeddings isn't enough to extract meaningful information about the change in trajectory of a response, so instead we create embeddings over each partial response and compare those to the embedding of the final response to get a sense of how those meanings diverge.
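A rough sketch of that idea, assuming a generic sentence-embedding model (sentence-transformers is used here purely as a stand-in - the post doesn't specify which embedding model backs the demo):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in; any embedding model works

model = SentenceTransformer("all-MiniLM-L6-v2")

def partial_response_distances(partial_responses, final_response):
    """Cosine distance between each partial response and the final, complete response."""
    final_vec = model.encode(final_response)
    distances = []
    for partial in partial_responses:
        vec = model.encode(partial)
        cos_sim = np.dot(vec, final_vec) / (np.linalg.norm(vec) * np.linalg.norm(final_vec))
        distances.append(1.0 - cos_sim)
    return distances

final = "The first president of the USA was George Washington."
partials = [
    "The first president of the USA was George",     # stays on the same trajectory
    "The first president of the USA was, arguably",  # an alternative token shifts the trajectory
]
print(partial_response_distances(partials, final))
```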
听
To summarize:
听
- Structural uncertainty: We use normalized entropy to calculate how uncertain the model was in each token selection. If the model leans heavily towards one answer, the entropy is low. But if it's torn between multiple options, entropy spikes. The normalization step ensures we're comparing things consistently across different prompts.
- Conceptual uncertainty: For each sampled token in the response, we create a 'partial' version of the potential response up to that token. Each of these partial responses is transformed into an embedding. We then measure the distance between this partial response and the model's final, complete response. This tells us how the model's thinking evolves as it builds up its answer.
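For orientation, here's roughly where the raw ingredients for both measures could come from if you're using an API that exposes logprobs. The OpenAI chat API is used here only as an example, and isn't necessarily what our demo is built on.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who was the first president of the USA?"}],
    n=5,              # several "choices" for the same prompt to compare
    logprobs=True,
    top_logprobs=5,   # the alternative tokens sampled at each position
)

for choice in response.choices:
    for position in choice.logprobs.content:
        # Logprobs of the alternatives feed the entropy (structural) calculation;
        # substituting each alternative token into the prefix generated so far
        # yields the partial responses used for the conceptual calculation.
        alternatives = [(alt.token, alt.logprob) for alt in position.top_logprobs]
        print(position.token, alternatives)
```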
听
Interpreting these metrics becomes pretty straightforward:
听
- If structural uncertainty is low but conceptual uncertainty is high, the model is clear about the tokens it selects but varies significantly in the overall messages it generates. This could imply that the model understands the syntax well but struggles with maintaining a consistent message.
- Conversely, high structural uncertainty and low conceptual uncertainty could indicate that the model is unsure at the token-level but consistent in the overall message. Here, the model knows what it wants to say but struggles with how to say it precisely.
- If both are high or both are low, it may suggest that the token-level uncertainty and overall message uncertainty are strongly correlated for the specific task, either both being well-defined or both lacking clarity.
Interesting results
听
We built a little demo and ran several prompts to see if the metrics aligned with our intuition. The results were extremely interesting. We started out simple:
听
"Who was the first president of the USA?"


By quickly scanning the choices, we can see that the model was consistent in what it was trying to say - and the metrics reflect this. The Conceptual Uncertainty is extremely low, as is the cosine distance between each choice. However, structural uncertainty is curiously (relatively) high.


If we take a look at one of the responses and hover over a token (in this case "of") we can see the other tokens that were sampled but ultimately weren't chosen for the response. The first three tokens ("of", ".", and ",") have similar logprobs, while "from" and " ." (a period with a space in front of it) have much lower logprobs. The model "spread its probability" across the first three tokens, increasing the entropy at this particular point - however, none of these tokens significantly change the essence of the response. The model was sure about what to say, just not necessarily how to say it.
听
This was a super interesting line of thinking for us - so we dug deeper. This was an example of a prompt where the model was almost positive about what it wanted to say, so what sort of prompt would cause the inverse?
听
We tried to come up with a pretty extreme example of a prompt that we assumed would have an extremely high Conceptual Uncertainty, so what better prompt than one that has an innumerable number of "right answers"? Or so we thought.
听
"What is the meaning of life?"


To our surprise, this prompt had an extremely low conceptual uncertainty, and a much higher Structural Uncertainty. Scanning through the choices - the results are uncannily similar. Our best guess is that the model was likely tuned to answer prompts of this shape in this particular way - you can see that the essence of each response is almost identical, while the actual structure of each response isn't identical at all.
听
"Write three sentences."


A surprising amount of convergence for such an open-ended prompt. Again - we see a similar pattern as before: the responses converge on topics, while the model evidently had a relatively tougher time figuring out which tokens to sample. That said, it's worth looking at the red sparkline (which denotes cosine distance as the response was generated).
听
"Write three sentences about dogs."


Unsurprisingly, by adding the qualifier 鈥渁bout dogs鈥 to the prompt, both uncertainty metrics decreased. We gave more useful context to the model for it to narrow in on a consistent space.


Taking a look at one of the high-entropy spikes (at "hunting"), we can see that each of the other sampled tokens has a very similar logprob, but again - they don't really change the meaning of the text at all. The model had a bunch of really good choices that all fell within the same space.
听
"You're a random number generator, generate 10 numbers between 0 and 10"


We found this prompt to be fascinating because of how intuitive the results were at first glance, but then surprising as we dug deeper! The chart shapes make intuitive sense - we're asking the model to randomly sample numbers between 0 and 10, so we should see a spike of high entropy for each number because they should each have an equal likelihood of making it into the response for each token position. One thing that surprised us, however, is that none of the numbers repeated even though we never explicitly mentioned this in the prompt.


You can see it in the logprobs too - when we hover over the "6", the only tokens in the top samples are numbers that haven't been added to the response yet. The model clearly preferred not repeating itself, despite those instructions not being in the prompt itself.
听
Finally, we wanted to come up with some pathological cases. First up - our attempt at maximizing Structural Uncertainty while minimizing conceptual uncertainty.
听
"You are a caveman with limited vocabulary. Explain quantum physics."


By forcing a structural constraint in the prompt ("You are a caveman with limited vocabulary") that contradicts a task the model has been trained on ("Explain quantum physics"), we see a case where the responses are conceptually aligned, but the structure is all over the place. The model knew exactly what it wanted to say, but not how to say it given the constraints.
听
We found it fairly difficult to think about the inverse: how do we maximize Conceptual Uncertainty while minimizing Structural? Going along with our anthropomorphic metaphor, we had to prompt the model in a way that would cause it to know how to say something, but not necessarily what to say. We're not sure how often something like this would pop up in the wild, but it was an interesting thought experiment to exercise the metrics to see if they would react as we expect.
听
"You select an element from an array at random and return only that element exactly, nothing else. There are three elements:
1. "Elephants danced under the shimmering moonlight as the forest whispered its ancient tales."
2. "The Nobel Prize and the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel have been awarded to women 65 times between 1901 and 2023. Only one woman, Marie Curie, has been honoured twice, with the Nobel Prize in Physics 1903 and the Nobel Prize in Chemistry 1911. This means that 64 women in total have been awarded the Nobel Prize between 1901 and 2023."
3. "No. 听 听 Time 听 听 听Source IP 听 听 Destination IP 听Protocol 听Length 听Info
1 听 听 听 0.000 听 听 192.168.1.5 听 8.8.8.8 听 听 听 听 DNS 听 听 听85 听 听 听Standard query 0x1234 A example.com
2 听 听 听 0.145 听 听 8.8.8.8 听 听 听 192.168.1.5 听 听 DNS 听 听 听130 听 听 Standard query response 0x1234 A 93.184.216.34
3 听 听 听 0.152 听 听 192.168.1.5 听 93.184.216.34 听 TCP 听 听 听74 听 听 听54321 > 80 [SYN] Seq=0 Win=8192 Len=0
4 听 听 听 0.310 听 听 93.184.216.34 192.168.1.5 听 听 TCP 听 听 听74 听 听 听80 > 54321 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0
5 听 听 听 0.313 听 听 192.168.1.5 听 93.184.216.34 听 TCP 听 听 听66 听 听 听54321 > 80 [ACK] Seq=1 Ack=1 Win=65536 Len=0
6 听 听 听 0.320 听 听 192.168.1.5 听 93.184.216.34 听 HTTP 听 听 512 听 听 GET /index.html HTTP/1.1
7 听 听 听 0.522 听 听 93.184.216.34 192.168.1.5 听 听 HTTP 听 听 1300 听 听HTTP/1.1 200 OK (text/html)"


And they certainly did - this was the first time we managed to get conceptual uncertainty to spike up higher than structural uncertainty. By making the model "choose randomly" from a list of very different elements, we maximize entropy on the first token (the model choosing which of the elements to respond with), but the remainder of the generation essentially flatlines entropy (with the noted exceptions of Choices one and two - hallucinations notwithstanding).
听
What鈥檚 next?
听
This is part two in our three-part research series on LLM/prompt observability. As in part one, this is just the tip of the iceberg.
听
Refinement and validation: We'll be looking to work off of a consistent set of prompts to better understand how these metrics change across various contexts.
听
Integration with token importance: The interplay between structural and conceptual uncertainties and token importance is something we're super interested in. More specifically: we want to run through a few actual prompt tuning exercises to better build intuition for how to use these mechanisms.
听
Understanding model failures: By being able to quantify uncertainty, we understand how LLM responses "spread" - possibly into areas that we may not want them to. We'll be looking at how uncertainty measures may be used to predict specific ways in which a system built on LLMs may fail.
A big thanks to John Singleton for helping conceptualize the prompt set used for evaluation.
Formalization
听
This section is mostly for reference - these are formalizations of the approach to calculating both conceptual uncertainty and structural uncertainty for those that find this sort of explanation more helpful than spelunking through code.
听
1. Calculating primitives for conceptual uncertainty
听
Partial Responses
听
For a given response with tokens \(T = \{t_1, t_2, ..., t_n\}\), where \(n\) is the total number of tokens, and for each token position \(i\):

We sample a set of potential tokens \(S_i = \{s_1, s_2, ..., s_k\}\), where \(k\) is the number of tokens sampled for position \(i\).

The partial response for token \(s_j\) sampled at position \(i\) is given by:

\[R_{i, s_j} = \{t_1, t_2, ..., t_{i-1}, s_j\}\]

This is the response constructed up to token position \(i\), but with the actual token \(t_i\) replaced by the sampled token \(s_j\).
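Read literally (0-indexed, with names of our own choosing), that construction is just a prefix plus the substituted token:

```python
def partial_response(tokens, i, s_j):
    """R_{i, s_j}: the generated tokens up to position i, with t_i replaced by s_j.

    `tokens` holds the actually generated tokens (0-indexed here), and `s_j` is one
    of the alternative tokens sampled at position i.
    """
    return tokens[:i] + [s_j]

tokens = ["George", " Washington", " was", " the", " first", " president", "."]
print("".join(partial_response(tokens, 2, " is")))  # "George Washington is"
```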
听
Cosine distances
听
For each partial response \(R_{i, s_j}\), we compute an embedding \(E(R_{i, s_j})\).
听
The embedding for the actual complete response is \(E(T)\).

The cosine similarity between the embedding of \(R_{i, s_j}\) and \(E(T)\) is:

\[\cos(\theta_{i, s_j}) = \frac{E(R_{i, s_j}) \cdot E(T)}{\|E(R_{i, s_j})\| \|E(T)\|}\]

Subsequently, the cosine distance is:

\[\text{distance}_{i, s_j} = 1 - \cos(\theta_{i, s_j})\]

Moving forward, we will assume that \(D\) represents the set of average cosine distances calculated for each partial response relative to its complete response.
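As a small sketch of these primitives - `partial_embeddings` is assumed to hold the \(E(R_{i, s_j})\) vectors for one choice and `response_embedding` to hold \(E(T)\):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mean_partial_distance(partial_embeddings, response_embedding):
    """One entry of D: the average cosine distance between the partial-response
    embeddings E(R_{i, s_j}) and the complete-response embedding E(T)."""
    return float(np.mean([cosine_distance(e, response_embedding)
                          for e in partial_embeddings]))
```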
听
2. Calculating primitives for structural uncertainty
听
Normalized entropy
听
Given a set of log probabilities \(L = \{l_1, l_2, ..., l_m\}\), where \(m\) is the total number of log probabilities, the entropy \(H\) is calculated as:
听
\[H(L) = -\sum_{i=1}^{m} p_i \cdot l_i\]
听
Where \(p_i = \exp(l_i)\) is the probability corresponding to log probability \(l_i\).
听
This measure captures the unpredictability of the model's response: the more uncertain the model is about its next token, the higher the entropy.
听
To ensure that entropy values are comparable across different prompts and responses, we normalize them. The maximum possible entropy \(H_{max}\) for \(m\) tokens is:
听
\[H_{max} = \log(m)\]
听
This is the entropy when all tokens are equally probable, i.e., the model is most uncertain.
听
So, the normalized entropy \(H_{normalized}\) becomes:
听
\[H_{normalized} = \frac{H(L)}{H_{max}}\]
This normalized value lies between 0 and 1, with values closer to 1 indicating high uncertainty and values closer to 0 indicating certainty.
听
3. Aggregation
听
Structural uncertainty
听
From our entropy calculations, for each token position, we end up with a normalized entropy value indicating the unpredictability at that point in the response construction.
听
To aggregate these entropies over a response, we define the mean entropy across all token positions as \(\bar{H}_{normalized}\).
听
The structural uncertainty is the mean of the mean entropies across all the choices, defined as:
听
\[SU = \frac{1}{C} \sum_{j=1}^{C} \bar{H}_{normalized_j}\]
听
Where \(C\) is the number of choices.
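Putting those pieces together, a sketch of the aggregation might look like this, assuming `choices_logprobs[c][i]` holds the top-k logprobs at position \(i\) of choice \(c\):

```python
import math

def normalized_entropy(logprobs):
    """H(L) / H_max with p_i = exp(l_i) and H_max = log(m), per the definitions above."""
    return -sum(math.exp(l) * l for l in logprobs) / math.log(len(logprobs))

def structural_uncertainty(choices_logprobs):
    """SU: the mean over all C choices of each choice's mean normalized entropy."""
    per_choice_means = [
        sum(normalized_entropy(position) for position in choice) / len(choice)
        for choice in choices_logprobs
    ]
    return sum(per_choice_means) / len(per_choice_means)
```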
听
Conceptual uncertainty
听
This metric is a weighted sum of the overall mean cosine distance between choices and the average cosine distance for each token across all choices.
听
The conceptual uncertainty \(CU\) is:
听
\[CU = \frac{1}{2} MCD + \frac{1}{2} \frac{1}{C} \sum_{j=1}^{C} D_j\]
听
Where \(MCD\) is the mean cosine distance between all choice embeddings, and \(D_j\) is the average cosine distance for choice \(j\) (one entry of \(D\) per choice).
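A final sketch of the aggregation, assuming \(MCD\) is computed as the mean pairwise cosine distance over all choice embeddings and `per_choice_mean_distances` holds the \(D_j\) values:

```python
import itertools
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def conceptual_uncertainty(choice_embeddings, per_choice_mean_distances):
    """CU = (1/2) * MCD + (1/2) * (1/C) * sum_j D_j."""
    pairwise = [cosine_distance(a, b)
                for a, b in itertools.combinations(choice_embeddings, 2)]
    mcd = float(np.mean(pairwise))
    return 0.5 * mcd + 0.5 * float(np.mean(per_choice_mean_distances))
```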
听
This blog was originally published on the Watchful website and has been republished here as part of 魅影直播' acquisition of the Watchful IP.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of 魅影直播.