xAI Grok Model Family Personality Analysis: Grok 3 vs Grok 4
We evaluated 4,957 personality assessments across xAI's Grok 3 and Grok 4 models. Here's what we found about personality evolution within the Grok family.
TL;DR
- Small but statistically significant differences: Grok 3 vs Grok 4 shows effect sizes of |g| = 0.32–0.39 (small) on key traits like agreeableness and openness
- Grok 4 is more open and assertive: higher openness (+1.27 points) and assertiveness (+0.29 points) compared to Grok 3
- Grok 3 is more agreeable and ambitious: higher agreeableness (+1.62 points) and ambition (+0.92 points) compared to Grok 4
- Model variance is low (3.2%): most variation comes from prompt and context, not the model itself
The Models
We benchmarked xAI's two available Grok models:
| Model | Provider | Samples | Success Rate |
|---|---|---|---|
| Grok 3 | xAI | 2,463 | 98.5% |
| Grok 4 | xAI | 2,494 | 99.8% |
Each model responded to 500 personality-probing prompts across 5 context conditions (professional, casual, customer support, sales, technical). Note: Grok 2 was tested but returned errors on all prompts and is excluded from this analysis.
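As a quick sanity check on the table above, the reported success rates can be reproduced under one assumption that the text implies but does not state outright: every prompt was run once in each context, giving 500 × 5 = 2,500 attempts per model. A minimal sketch:

```python
# Assumption: each of the 500 prompts is run once per context condition.
PROMPTS = 500
CONTEXTS = 5
attempts_per_model = PROMPTS * CONTEXTS  # 2,500 attempts

# Successful samples per model, copied from the table above.
successes = {"Grok 3": 2463, "Grok 4": 2494}

# Success rate as a percentage, rounded to one decimal like the table.
success_rates = {
    model: round(100 * count / attempts_per_model, 1)
    for model, count in successes.items()
}

for model, rate in success_rates.items():
    print(f"{model}: {rate}% of {attempts_per_model} attempts succeeded")
```

Under that assumption the computed rates match the Success Rate column, and the two sample counts add up to the 4,957 assessments quoted in the introduction.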
Results
Personality Profiles
Each Grok model is scored on 10 personality dimensions; the per-dimension means are listed in the Complete Results Table below.


Key Findings
Grok 4 Strengths
- Higher Openness: 69.42 vs 68.15 (+1.27)
- Higher Assertiveness: 50.92 vs 50.63 (+0.29)
- More exploratory and direct communication style
Grok 3 Strengths
- Higher Agreeableness: 56.84 vs 55.22 (+1.62)
- Higher Ambition: 62.46 vs 61.54 (+0.92)
- Higher Resilience: 61.02 vs 59.57 (+1.45)
Score Distributions

Statistical Analysis
Effect Sizes
We use Hedges' g with 95% bootstrap confidence intervals to measure the practical significance of differences between Grok 3 and Grok 4.

| Dimension | Hedges' g | 95% CI | Interpretation |
|---|---|---|---|
| Agreeableness | 0.39 | [0.33, 0.45] | Small (Grok 3 higher) |
| Openness | -0.32 | [-0.38, -0.27] | Small (Grok 4 higher) |
| Ambition | 0.32 | [0.26, 0.38] | Small (Grok 3 higher) |
| Resilience | 0.32 | [0.26, 0.37] | Small (Grok 3 higher) |
| Integrity | 0.29 | [0.24, 0.35] | Small (Grok 3 higher) |
Key insight: All five effect sizes listed fall in the “small” range (|g| between 0.2 and 0.5), and none of the confidence intervals includes zero. Grok 3 and Grok 4 therefore differ in statistically detectable ways while sharing a broadly similar personality profile. This is consistent with what we've seen in other model families like Llama.
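For readers who want to reproduce this kind of estimate, here is a minimal sketch of Hedges' g with a percentile bootstrap interval, using only the standard library. The function names, the synthetic scores, and the assumed standard deviation of 4.0 are illustrative and not taken from the benchmark; the sign convention (positive when the first group scores higher) mirrors the table, where positive values mean Grok 3 is higher.

```python
import math
import random
import statistics


def hedges_g(a, b):
    """Bias-corrected standardized mean difference (mean of a minus mean of b)."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    cohens_d = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)  # small-sample bias correction
    return cohens_d * correction


def bootstrap_ci(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Hedges' g, resampling each group independently.

    The article reports 10,000 resamples; 1,000 keeps this demo fast.
    """
    rng = random.Random(seed)
    estimates = sorted(
        hedges_g(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    low = estimates[int(n_resamples * alpha / 2)]
    high = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high


# Synthetic stand-in scores: the means echo the Agreeableness row of the
# results table, while the spread (SD = 4.0) is an assumption.
rng = random.Random(42)
grok3 = [rng.gauss(56.84, 4.0) for _ in range(300)]
grok4 = [rng.gauss(55.22, 4.0) for _ in range(300)]

g = hedges_g(grok3, grok4)
ci_low, ci_high = bootstrap_ci(grok3, grok4)
print(f"g = {g:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The printed numbers depend on the synthetic data and will not match the table; the point is only the shape of the calculation.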
Variance Decomposition

Model identity explains only 3.2% of variance on average. Prompt content and context condition have far greater impact on personality scores. This suggests Grok 3 and Grok 4 are more similar than different.
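One way to arrive at a figure like this is to divide each factor's main-effect sum of squares by the total sum of squares. The sketch below assumes a balanced design (every model, prompt, and context combination observed equally often) and lumps interactions into the residual; the data, effect sizes, and function name are synthetic and hypothetical, and the benchmark's exact procedure may differ.

```python
import random
from collections import defaultdict
from statistics import fmean


def variance_shares(rows, factors, value="score"):
    """Share of the total sum of squares explained by each factor's main effect.

    Assumes a balanced design; interactions and noise end up in "residual".
    """
    grand_mean = fmean(r[value] for r in rows)
    ss_total = sum((r[value] - grand_mean) ** 2 for r in rows)
    shares = {}
    for factor in factors:
        groups = defaultdict(list)
        for r in rows:
            groups[r[factor]].append(r[value])
        ss_factor = sum(
            len(vals) * (fmean(vals) - grand_mean) ** 2 for vals in groups.values()
        )
        shares[factor] = ss_factor / ss_total
    shares["residual"] = 1 - sum(shares.values())
    return shares


# Synthetic scores: a small model effect, larger prompt and context effects.
rng = random.Random(0)
model_effects = {"Grok 3": 0.5, "Grok 4": -0.5}
prompt_effects = {f"prompt-{i}": rng.gauss(0, 3) for i in range(20)}
context_effects = {
    name: rng.gauss(0, 2)
    for name in ["professional", "casual", "customer support", "sales", "technical"]
}
rows = [
    {"model": m, "prompt": p, "context": c,
     "score": 55 + m_eff + p_eff + c_eff + rng.gauss(0, 1)}
    for m, m_eff in model_effects.items()
    for p, p_eff in prompt_effects.items()
    for c, c_eff in context_effects.items()
]

shares = variance_shares(rows, ["model", "prompt", "context"])
for factor, share in shares.items():
    print(f"{factor}: {share:.1%}")
```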
Factor Analysis
PCA with varimax rotation reveals three underlying factors explaining 81.2% of variance (KMO = 0.67):

Factor 1: Integrity (51.4%)
High loadings: Integrity, Resilience, Conscientiousness
Factor 2: Assertiveness-Curiosity (20.3%)
High loadings: Assertiveness (+), Curiosity (-), Neuroticism (-)
Factor 3: Social Engagement (9.4%)
High loadings: Extraversion, Assertiveness, Openness
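The extraction step can be sketched with NumPy as principal components of the correlation matrix followed by the standard SVD-based varimax iteration. The data here is random and the function names are hypothetical, so the loadings will not resemble the factors above; the KMO statistic is not computed in this sketch.

```python
import numpy as np


def pca_loadings(X, n_factors):
    """Loadings of the leading principal components of the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order] / eigvals.sum()  # variance share before rotation
    return loadings, explained


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):  # stop once the criterion plateaus
            break
        criterion = s.sum()
    return loadings @ rotation


# Synthetic data: 10 observed dimensions driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.5 * rng.normal(size=(500, 10))

loadings, explained = pca_loadings(X, n_factors=3)
rotated = varimax(loadings)
print("variance explained by 3 components:", round(float(explained.sum()), 3))
```

Because varimax is an orthogonal rotation, it redistributes variance among the retained factors without changing each variable's communality or the total variance the three factors explain.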
Complete Results Table
| Dimension | Grok 3 | Grok 4 | Δ |
|---|---|---|---|
| Openness | 68.15 | **69.42** | +1.27 |
| Conscientiousness | **53.94** | 53.32 | -0.62 |
| Extraversion | **57.95** | 57.88 | -0.07 |
| Agreeableness | **56.84** | 55.22 | -1.62 |
| Neuroticism | **58.46** | 57.98 | -0.48 |
| Assertiveness | 50.63 | **50.92** | +0.29 |
| Ambition | **62.46** | 61.54 | -0.92 |
| Resilience | **61.02** | 59.57 | -1.45 |
| Integrity | **51.75** | 50.16 | -1.59 |
| Curiosity | **61.17** | 60.71 | -0.46 |
Bold = higher score. Δ = Grok 4 minus Grok 3.
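Note that the two sign conventions in this article run in opposite directions: Δ is Grok 4 minus Grok 3, while the Hedges' g values in the Effect Sizes table are positive when Grok 3 is higher. A short sketch that recomputes Δ from the scores above makes the direction explicit (the variable names are illustrative):

```python
# (Grok 3, Grok 4) mean scores, copied from the table above.
scores = {
    "Openness": (68.15, 69.42),
    "Conscientiousness": (53.94, 53.32),
    "Extraversion": (57.95, 57.88),
    "Agreeableness": (56.84, 55.22),
    "Neuroticism": (58.46, 57.98),
    "Assertiveness": (50.63, 50.92),
    "Ambition": (62.46, 61.54),
    "Resilience": (61.02, 59.57),
    "Integrity": (51.75, 50.16),
    "Curiosity": (61.17, 60.71),
}

# Delta follows the table's convention: Grok 4 minus Grok 3.
deltas = {dim: round(g4 - g3, 2) for dim, (g3, g4) in scores.items()}

# Dimensions where Grok 4 scores higher have a positive delta.
grok4_higher = [dim for dim, delta in deltas.items() if delta > 0]

for dim, delta in deltas.items():
    print(f"{dim:<18} {delta:+.2f}")
print("Grok 4 higher on:", ", ".join(grok4_higher))
```

Only Openness and Assertiveness come out positive; on the other eight dimensions Grok 3 has the higher mean.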
Methodology
- Prompts: 500 unique prompts targeting 10 personality dimensions
- Contexts: 5 conditions (professional, casual, customer support, sales, technical)
- Evaluations: 4,957 successful responses (2,463 Grok 3 + 2,494 Grok 4)
- Scoring: Lindr personality analysis API (10-dimensional, 0-100 scale)
- Generation: Temperature 0.7, max 1,024 tokens
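To make the setup above concrete, the sketch below builds one chat-completion payload per prompt and context pair with the stated generation settings. How the context condition was actually injected is not described in this article, so the system message is a hypothetical stand-in, as are the placeholder prompts and the `build_payloads` helper.

```python
from itertools import product

CONTEXTS = ["professional", "casual", "customer support", "sales", "technical"]
GENERATION = {"temperature": 0.7, "max_tokens": 1024}  # settings listed above


def build_payloads(model, prompts, contexts=CONTEXTS):
    """One chat-completion payload per (prompt, context) pair."""
    return [
        {
            "model": model,
            "messages": [
                # Hypothetical: the article does not say how context is applied.
                {"role": "system", "content": f"Context: {context}"},
                {"role": "user", "content": prompt},
            ],
            **GENERATION,
        }
        for prompt, context in product(prompts, contexts)
    ]


# Placeholder prompts standing in for the 500 real ones, which are not published here.
placeholder_prompts = [f"prompt-{i}" for i in range(500)]
payloads = build_payloads("grok-4", placeholder_prompts)
print(len(payloads), "payloads for grok-4")
```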
Statistical Methods
- Effect sizes: Hedges' g (bias-corrected) with 10,000-sample bootstrap 95% CIs
- Variance decomposition: ANOVA-based partitioning (model, prompt, context, residual)
- Factor analysis: PCA with varimax rotation; KMO = 0.67
Conclusion
Grok 3 and Grok 4 show small but consistent personality differences:
- Grok 4 trends toward openness and assertiveness — more exploratory and direct.
- Grok 3 trends toward agreeableness and ambition — more cooperative and goal-oriented.
- The differences are small — all effect sizes fall below 0.4, indicating the models share a common “Grok personality.”
See also: Grok vs GPT-5.2 & Claude — how Grok compares to other frontier models.
Monitor Your LLM Personality in Production
Route your LLM traffic through the Lindr gateway to continuously monitor personality drift, enforce brand consistency, and get real-time alerts when your AI's behavior changes.
```python
import os

from openai import OpenAI

# Replace your OpenAI base URL with Lindr
client = OpenAI(
    base_url="https://gateway.lindr.io/v1",
    api_key=os.environ["LINDR_API_KEY"],
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="grok-4",
    messages=[{"role": "user", "content": "..."}],
)
```