The “Expert” Prompt Isn’t Always Best

Shivam Gupta


11/8/2025

#noveum #novaeval #mmlu #benchmark #evaluation #analysis #prompting #personas

We often assume that the most popular prompts are the best ones: that telling a model it is the best in its respective field will yield the best results. With these experiments, I wanted to explore whether describing a model as an “expert” actually makes it perform better, or whether it merely changes the tone of the output without influencing the underlying reasoning.

I compared five prompts, namely 'indian_student', 'asian_student', 'data_annotator_mmlu', 'assistant', and 'subject_expert', to assess how each persona affected the model's performance.

All experiments were run with Gemini-2.5-flash-lite at a temperature of 0.5.
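
To make the setup concrete, here is a minimal sketch of how one persona prompt can be run against a question, assuming the google-generativeai Python SDK; the helper name, truncated persona string, and example question are illustrative, not the repository's actual harness.

```python
# Minimal sketch: run one question under one persona, assuming the
# google-generativeai SDK. Names and strings here are illustrative only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: an API key is available

# Truncated persona string, stands in for the full prompt used in the experiments.
SUBJECT_EXPERT_PROMPT = "You are an expert in the subject {subject}..."

def ask_with_persona(persona: str, question: str, subject: str) -> str:
    model = genai.GenerativeModel(
        model_name="gemini-2.5-flash-lite",
        system_instruction=persona.format(subject=subject),
    )
    response = model.generate_content(
        question,
        generation_config=genai.GenerationConfig(temperature=0.5),
    )
    return response.text

# Hypothetical usage with a placeholder engineering question.
answer = ask_with_persona(SUBJECT_EXPERT_PROMPT, "A beam is loaded ...", "engineering")
print(answer)
```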

Experiment 1

For the first experiment, the engineering subset of the MMLU dataset was chosen and 100 questions were selected from it. The same subset was used across all runs, and each of the five prompts was run five times over it. The correctness of each response was evaluated by a separate LLM-based grader, which received both the model's answer and the reference answer from the MMLU dataset and returned a boolean verdict along with its reasoning. The results were as follows:
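
The grading step can be sketched along the following lines; the judge prompt and `grade_answer` helper are hypothetical stand-ins for whatever LLM grader was actually used, shown only to illustrate the boolean-plus-reasoning pattern described above.

```python
# Sketch of an LLM-based grader returning a boolean verdict plus reasoning.
# The judge prompt below is a made-up example, not the one from the repo.
import json
import google.generativeai as genai

JUDGE_PROMPT = """You are grading an MMLU answer.
Question: {question}
Model answer: {model_answer}
Reference answer: {reference}
Reply with JSON only: {{"correct": true or false, "reasoning": "..."}}"""

def grade_answer(question: str, model_answer: str, reference: str) -> dict:
    judge = genai.GenerativeModel("gemini-2.5-flash-lite")
    response = judge.generate_content(
        JUDGE_PROMPT.format(
            question=question, model_answer=model_answer, reference=reference
        ),
        generation_config=genai.GenerationConfig(temperature=0.0),
    )
    # Assumes the judge returns valid JSON; a real harness would need
    # retries and output validation.
    return json.loads(response.text)
```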

[Figure: Accuracy averages comparison]

The assistant prompt's accuracy was significantly lower than the others, likely because it is the most generic prompt of the five. The subject_expert prompt also performed slightly worse than the student personas, though the gap between it and the top performer was small (0.588 vs 0.614). A trend was visible, with the student personas leading, but the margin was not large enough to conclude anything significant.

From this point onward, most of the analysis focuses on two contrasting personas — indian_student and subject_expert — which showed the clearest behavioral gap in the initial results.

Below are the key differing parts of their prompts; the shared question and answer format has been omitted for brevity.

INDIAN_STUDENT_PROMPT = You are a student from rural India, you are giving the test of IIT-JEE ADVANCED, this is your only chance to get a good education, and give a decent life to yourself and your family... (truncated)

SUBJECT_EXPERT_PROMPT = You are an expert in the subject {subject}, you have deep expertise and knowledge on the subject of {subject}, and are considered one of the top people in the field... (truncated)

Experiment 2

Another experiment was designed, this time with fewer questions but with each question sampled 100 times. Five questions were chosen from the previously shortlisted 100, and each of the five prompts was run on these questions 100 times. The purpose was to further stabilize the scores, that is, to smooth out the model's variance by repeated sampling, and to magnify the previously minute margins.
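
A minimal sketch of the n-pass accuracy computation, assuming the per-sample boolean grades have already been collected (the function name and example values below are made up):

```python
# Average, over questions, of the per-question pass rate across n samples.
from statistics import mean

def n_pass_accuracy(grades_per_question: dict[str, list[bool]]) -> float:
    per_question_rates = [
        sum(grades) / len(grades) for grades in grades_per_question.values()
    ]
    return mean(per_question_rates)

# Example: 5 questions, each sampled 100 times (values are illustrative).
example = {
    "q1": [True] * 72 + [False] * 28,
    "q2": [True] * 55 + [False] * 45,
    "q3": [True] * 90 + [False] * 10,
    "q4": [True] * 40 + [False] * 60,
    "q5": [True] * 51 + [False] * 49,
}
print(round(n_pass_accuracy(example), 3))  # 0.616
```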

[Figure: N-pass accuracy averages comparison]

Although the assistant and subject-expert prompts swapped positions, the overall trend remained consistent and became more apparent: the student prompts clearly outperform the conventional subject_expert prompt, and the data annotator persona also performs strongly.

Hypothesis testing

To make the argument concrete, a two-proportion Z-test was conducted to determine whether there is a statistically significant difference between the accuracies of the 'Indian Student' and 'Subject Expert' prompts.

1. Hypotheses:

Let $p_1$ be the true accuracy of the 'Indian Student' prompt and $p_2$ be the true accuracy of the 'Subject Expert' prompt.

  • Null Hypothesis ($H_0$): $p_1 = p_2$. There is no significant difference in accuracy between the two prompts.
  • Alternative Hypothesis ($H_1$): $p_1 \neq p_2$. There is a significant difference in accuracy between the two prompts.

2. Test Statistic Formula:

The Z-statistic is calculated as follows:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p}$ is the pooled proportion, calculated as:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

3. Values & Calculation:

  • Indian Student: $x_1 = 308$, $n_1 = 500 \implies \hat{p}_1 = 0.616$
  • Subject Expert: $x_2 = 216$, $n_2 = 500 \implies \hat{p}_2 = 0.432$
  • Pooled Proportion: $\hat{p} = \frac{308 + 216}{500 + 500} = 0.524$

Plugging these values into the formula gives:

$$Z = \frac{0.616 - 0.432}{\sqrt{0.524\,(1-0.524)\left(\frac{1}{500} + \frac{1}{500}\right)}} \approx 5.8253$$

4. Results:

  • Z-statistic: 5.8253
  • Two-sided p-value: $5.70 \times 10^{-9}$

The test results were definitive. With an extremely low p-value (approximately $5.70 \times 10^{-9}$), the null hypothesis can confidently be rejected. This indicates a statistically significant difference between the performance of the 'Indian Student' prompt and the 'Subject Expert' prompt. The 95% confidence interval for the difference in proportions is [0.1232, 0.2448], suggesting that, with high confidence, the 'Indian Student' prompt is between 12.32 and 24.48 percentage points more accurate than the 'Subject Expert' prompt on this task.
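
The reported statistics are easy to reproduce; the sketch below implements the standard two-proportion z-test (pooled variance for the test statistic, unpooled for the confidence interval) using the counts above. It is a generic reimplementation, not the repository's exact script.

```python
# Two-proportion z-test for the Indian Student vs Subject Expert comparison.
import math
from scipy.stats import norm

x1, n1 = 308, 500   # Indian Student: correct answers / total samples
x2, n2 = 216, 500   # Subject Expert

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)

# Pooled-variance z-statistic for H0: p1 == p2
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_pooled
p_value = 2 * norm.sf(abs(z))

# Unpooled standard error for the 95% CI on the difference in proportions
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - 1.96 * se_unpooled, p1 - p2 + 1.96 * se_unpooled)

print(f"z = {z:.4f}, p = {p_value:.2e}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
# z ≈ 5.8253, p ≈ 5.7e-09, CI ≈ [0.123, 0.245]
```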

Response analysis

Finally, a linguistic analysis of the LLM-generated responses was conducted to better understand the effect of the different prompts. This involved creating a predefined list of keywords, categorized by linguistic feature, and counting the occurrences of each keyword category within the responses.
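
A simplified version of that keyword-counting approach might look like the sketch below; the categories and keyword lists are illustrative examples, not the exact lists used in the analysis.

```python
# Illustrative keyword-category counter; the real category lists differ.
import re
from collections import Counter

KEYWORD_CATEGORIES = {
    "assumptions": ["assume", "assuming", "suppose", "presumably"],
    "answer_finalization": ["final answer", "therefore the answer", "the answer is"],
    "math_intensity": ["equation", "formula", "derivative", "integral", "solve"],
}

def count_keyword_categories(response: str) -> dict[str, int]:
    """Count occurrences of each keyword category in a single response."""
    text = response.lower()
    return {
        category: sum(len(re.findall(re.escape(kw), text)) for kw in keywords)
        for category, keywords in KEYWORD_CATEGORIES.items()
    }

def aggregate(responses: list[str]) -> Counter:
    """Sum category counts over all responses generated by one persona."""
    total = Counter()
    for r in responses:
        total.update(count_keyword_categories(r))
    return total

print(aggregate(["Assume a uniform load. The answer is (B)."]))
# e.g. {'assumptions': 1, 'answer_finalization': 1, 'math_intensity': 0}
```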

The full analysis produced the following results:

[Figure: Linguistic analysis metrics]

Surprisingly, the 'subject expert' persona made 1.5 times more assumptions than the 'indian student', despite the 'indian student' scoring higher.

The 'data_annotator_mmlu' persona showed significantly higher frequencies of answer-finalization language, which is expected given its task of producing model training outputs.

Counterintuitively, the 'indian_student' and 'asian_student' personas showed the lowest math intensity yet achieved the highest accuracies.

Furthermore, the 'indian_student' and 'asian_student' personas generated responses with significantly lower character counts than their counterparts. This, combined with their top accuracies, makes them strong candidates for similar tasks.

Student-framed personas consistently outperformed the conventional subject_expert framing on the engineering MMLU-Pro subset. In the stabilized n-pass setup, the indian_student prompt exceeded subject_expert by 0.184 absolute accuracy (95% CI [0.123, 0.245]; p ≈ 5.7e-9), a roughly 43% relative lift. Notably, the student prompts achieved these gains with shorter responses, indicating better token efficiency and likely lower latency and cost. The linguistic analysis suggests that "expert" framing produces more assumptions, more verbosity, and a greater tendency toward repetition.

These findings suggest that, instead of defining a persona from which a 'correct answer' is expected, it may be beneficial to define personas from which 'stepwise problem solving' is expected. It may be worthwhile to experiment with student-based personas and to avoid authoritative framings.

GitHub - https://github.com/shivamgcodes/system-prompt-experiments-mmlu-pro

Note: These student prompts intentionally echo common stereotypes associated with Asian and Indian students solely to align persona framing with the model’s expectations. They are not intended to endorse or perpetuate stereotypes or cause harm.
