Extra Metrics
The metrics on this page require extra information from Step 1.
Extra information: Metrics
Expected Answer: Context Precision, Context Recall, Context Entity Recall, Answer Similarity, Answer Correctness
Response Time: Latency
LLM Calls: LLM Calls
Conversation: User Frustration
Context Precision
Context Precision measures whether the relevant items in the retrieved context are ranked at the top. Ideally, all relevant chunks should appear at the highest ranks. Higher scores mean better precision.
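A minimal sketch of this rank-weighted idea, assuming relevance judgments (e.g. from an LLM judge) are already available as 0/1 flags per retrieved chunk; the score averages precision@k over the ranks that hold relevant chunks:

```python
def context_precision(relevance):
    """relevance: 0/1 flags, one per retrieved chunk in rank order,
    marking whether that chunk is relevant to the question."""
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / hits if hits else 0.0

# relevant chunks ranked first score higher than the same chunks ranked last
print(context_precision([1, 1, 0]))  # 1.0
print(context_precision([0, 0, 1]))  # ~0.33
```

With the single relevant chunk ranked last, only one in three of the top ranks is useful, so the score drops even though the same information was retrieved.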
Context Recall
Context Recall measures how well the retrieved context matches the expected answer. Higher scores mean better performance.
To calculate this, each point in the expected answer is checked to see if it can be linked to the retrieved context. Ideally, all points in the expected answer should match the retrieved context.
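The attribution check above can be sketched as follows; real implementations use an LLM judge to decide whether a statement is supported, so the word-overlap test here is only a toy stand-in:

```python
def context_recall(expected_statements, context):
    """Fraction of expected-answer statements supported by the context.
    Toy attribution: a statement counts as supported when all of its
    words appear in the retrieved context."""
    ctx_words = set(context.lower().split())
    supported = [
        all(w in ctx_words for w in s.lower().split())
        for s in expected_statements
    ]
    return sum(supported) / len(expected_statements)

statements = ["paris is the capital", "france uses the euro"]
context = "paris is the capital of france"
print(context_recall(statements, context))  # 0.5
```

Only the first statement can be linked to the retrieved context, so recall is one out of two.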
Context Entity Recall
This metric measures how well the retrieved context includes the important entities from the expected answers. It compares the number of matching entities in both the expected answers and the retrieved context to the total number of entities in the expected answers.
In simple terms, it shows what fraction of the important entities the retriever found. This makes it a useful check on the retrieval system in use cases where specific entities matter.
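The ratio itself is a straightforward set operation; the sketch below assumes entity extraction (NER or an LLM) has already produced the two entity sets:

```python
def context_entity_recall(expected_entities, context_entities):
    """Entities common to expected answer and retrieved context,
    divided by the total entities in the expected answer."""
    expected = set(expected_entities)
    if not expected:
        return 0.0
    return len(expected & set(context_entities)) / len(expected)

print(context_entity_recall(
    {"Eiffel Tower", "Paris", "1889"},
    {"Paris", "Eiffel Tower", "Seine"},
))  # ~0.67 — two of the three expected entities were retrieved
```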
Answer Similarity
Answer Similarity measures how similar the meaning of the generated answer is to the expected answer. This is scored from 0 to 1, with higher scores indicating better alignment.
Assessing this similarity helps determine the quality of the generated answer. A cross-encoder model is used to calculate the similarity score.
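As a simplified stand-in for the cross-encoder (which scores a sentence pair directly), the sketch below compares two embedding vectors with cosine similarity and rescales the result to the 0-to-1 range; producing the embeddings themselves is assumed to happen elsewhere:

```python
import math

def similarity_score(a, b):
    """Cosine similarity of two equal-length embedding vectors,
    rescaled from [-1, 1] to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return (dot / (na * nb) + 1) / 2

# identical vectors score 1.0; orthogonal (unrelated) ones score 0.5
print(similarity_score([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(similarity_score([1.0, 0.0], [0.0, 1.0]))  # 0.5
```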
Answer Correctness
Answer Correctness measures how accurate the generated answer is compared to the expected answer. Scores range from 0 to 1, with higher scores meaning the answer is more accurate.
It looks at two main factors: how similar the meanings of the answers are and how factually correct they are. These factors are combined using a weighted system to create the correctness score.
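The weighted combination can be sketched as below; the two inputs are assumed to be computed separately (semantic similarity as above, factual correctness as an F1 over true/false statements), and the 0.75/0.25 split is an illustrative default, not a fixed specification:

```python
def answer_correctness(semantic_similarity, factual_f1, weight=0.75):
    """Blend factual correctness and semantic similarity into one
    0-to-1 correctness score, weighted toward factuality."""
    return weight * factual_f1 + (1 - weight) * semantic_similarity

print(answer_correctness(semantic_similarity=0.8, factual_f1=0.6))  # 0.65
```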
Latency
Latency measures whether the time it takes for the LLM to provide an answer stays under a specified number of seconds.
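A minimal pass/fail check of this kind, timing an arbitrary callable against a threshold (the function and threshold here are placeholders):

```python
import time

def within_latency(fn, threshold_seconds):
    """Run fn and report (result, passed), where passed says whether
    the call finished within the threshold."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed <= threshold_seconds

result, ok = within_latency(lambda: "answer", threshold_seconds=2.0)
print(ok)  # True for this trivial call
```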
LLM Calls
LLM Calls refers to how many times the system needs to call the LLM before it gives the final result.
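Counting calls is typically done with a thin wrapper around the client; the wrapped callable below is a stand-in for a real LLM client:

```python
class CountingClient:
    """Wraps an LLM callable and counts how many times it is invoked
    before the pipeline produces its final result."""
    def __init__(self, llm):
        self.llm = llm
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return self.llm(prompt)

client = CountingClient(lambda prompt: f"echo: {prompt}")
client("rewrite the query")
client("answer with context")
print(client.calls)  # 2
```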
User Frustration
Administrators need to know if a user is feeling frustrated during their interaction. User frustration evaluation can be applied to a single exchange or to the whole conversation.