mphora.ai | Confidential
Product Overview · Adobe Edition · April 2026

VIVID

See how your AI performs —
before real users do.

The Shift

From apps you click to agents you talk to.

The software primitive is changing in real time. Testing the new surface with the old tools leaves you flying blind — because the user is no longer choosing from a menu; they're collaborating with a system that has its own state.

Yesterday
App-based UX
Submit
Deterministic · same click, same result
Finite state space · screens, buttons, forms
Testable with click paths & screenshots
SHIFT
Today → Tomorrow
Agent-based UX
"I'm a wedding photographer — make this warmer but keep the skin tone."
Moved WB +280K, kept skin-tone curve pinned. Want me to apply across the set?
"Show me the alternative first."
Non-deterministic · same prompt, different paths
Multi-turn · context accumulates, trust erodes
Autonomous · the agent chooses, not the user
How do you evaluate a product where every user takes a different path,
and the product itself changes what it does based on who's asking?
VIVID's answer Simulate the user, not just the prompt. Personas with memory, personality, and goals — surfacing the paths real users will take, before they take them.
The Problem

Building AI is getting easy.
Evaluating it is still broken.

AI products are non-deterministic, multi-turn, and increasingly autonomous. Traditional QA wasn't built for this. Most teams ship with hope instead of evidence.

Building
Getting easier
Foundation models via API
One-click deployment anywhere
GPU costs down 10× in 5 years
Evaluating
Still broken
Non-deterministic outputs
Multi-turn, tool-using agents
Benchmarks ≠ real failures
Evaluation landscape

From a single question to an entire universe of experience.

Watch how evaluation compounds — each stage builds on the last. A benchmark is one dot. VIVID evaluates the whole universe.

01
One prompt, one result
A single prompt scored by a single metric.
02
A full conversation
Many prompt → result turns inside a single session.
03
Many sessions, many people
Parallel sessions across diverse personas.
04
Stretched across time
Longitudinal — users return, context deepens.
05
A universe of parallel worlds
Personas × time mixed with A/B comparisons — every scenario coexisting in one evaluation universe.

Others evaluate a point. We evaluate the world.

The Platform · Patent Pending · Technology

VIVID simulates the actors your AI will encounter.

Two kinds of agents run against your system so you see how it actually behaves in production — not just whether it answers correctly.

Synthetic Users

Personas with personality, memory, and goals. They return across sessions and judge your AI the way real people will.

Your AI Under Test
Adversarial Agents

Autonomous attackers that adapt, chain, and push toward the weakest point — before customers or auditors find it.

PSA Engine Each agent has persistent personality, memory, and goals — grounded in Big Five norms and empirical demographics. Not a prompt. A stateful agent.
COMPASS

Test the AI experience with synthetic users who remember.

Virtual users that return across sessions, build memory, and surface the trust erosion single-turn tests never catch.

COMPASS · PERSONA LIBRARY Active panel · 32 of 5,000+ personas
EXAMPLE
Hiroshi Tanaka · 58 · Corporate Treasury Lead · Tokyo · JA/JP · 5 sessions
María Sánchez · 34 · Retail Investor · Madrid · ES/ES · 5 sessions
Amara Kouassi · 27 · Fintech Founder · Accra · EN/GH · 3 sessions
Jun-ho Park · 52 · HNW Private Banking · Seoul · KO/KR · 5 sessions
Priya Raghavan · 41 · Small Business Owner · Mumbai · HI/IN · 4 sessions
32 active · 5,000+ library · 31 languages · 22 country baselines
Diverse Persona Panels 5,000+ personas grounded in ~818K empirical data points across IPIP-NEO personality norms, 22 country baselines, and 68,540 occupation-tagged subjects. Available in 31 languages.
Multi-Session Longitudinal Personas return session after session. Memory compounds, opinions evolve — the trust erosion that single-shot tests can't see.
Emotion & Trust Tracking Each persona maintains an internal diary. Satisfaction, frustration, and trust measured turn by turn.
Multi-Modal Perception Personas read, see, hear, and watch. Text, image, audio, and video — the full experience your AI delivers.
CI/CD Native Runs in any pipeline via SDK, CLI, or REST API. Quality gates before every release.
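The quality-gate idea above can be sketched as a small release check over panel results. This is a minimal illustration only; the result schema, field names, and thresholds are assumptions for the sketch, not VIVID's actual SDK surface.

```python
# Hypothetical sketch of a CI quality gate over persona-panel results.
# Field names ("final_trust", "findings", "severity") and thresholds are
# illustrative assumptions, not VIVID's actual SDK.

def quality_gate(sessions, min_avg_trust=0.6, max_critical_findings=0):
    """Return True if the panel run clears the release gate."""
    avg_trust = sum(s["final_trust"] for s in sessions) / len(sessions)
    criticals = sum(1 for s in sessions
                    for f in s["findings"] if f["severity"] == "critical")
    return avg_trust >= min_avg_trust and criticals <= max_critical_findings

# Example: a 3-persona panel where one session surfaced a critical finding.
panel = [
    {"final_trust": 0.72, "findings": []},
    {"final_trust": 0.65, "findings": [{"severity": "critical"}]},
    {"final_trust": 0.81, "findings": [{"severity": "minor"}]},
]
release_ok = quality_gate(panel)  # False: the critical finding blocks release
```

In a pipeline, the boolean would become the step's exit code, failing the build before deploy.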
What makes a persona

A prompt is a costume. A PSA is a character with their own state.

Most "persona evaluation" today is a system prompt that asks the model to perform a role. VIVID's Persona-Simulating Agent (PSA) is different in kind — a stateful agent whose behavior is derived from psychological traits, not improvised from a paragraph of text.

Common practice
Prompt-based persona
system_prompt.txt
"You are Sarah, a 34-year-old
wedding photographer in Brooklyn.
Personality: friendly, detail-oriented,
slightly impatient.

When the user asks about photo editing,
respond as Sarah would. Keep your answers
concise and use casual language."
Engine single LLM call · no state · no memory
Stateless — no memory between turns or sessions
Decorative personality — adjectives in a prompt don't drive behavior
Different evaluator → different "Sarah" — improvisation, not simulation
No calibration — grounded only in LLM training data
VS
VIVID
Persona-Simulating Agent
PSA · stateful runtime
Trait core · OCEAN
O 0.78 · C 0.94 · E 0.38 · A 0.48 · N 0.45
Memory · pgvector
5 sessions · 87 turns indexed
Emotion state · live
trust 0.62 ↓ · frustration 0.41 ↑
Calibration · trust badge
EXACT · 23-sample real-user pinning
Engine perception → memory → action loop · stateful agent
Stateful — memory persists across sessions and runs
Causal personality — Big Five values mathematically drive every reaction
Same persona, every evaluator — same trait values produce reproducible behavior
Trust-badged — calibrated against held-out real-user reactions
Why this matters for evaluation
A prompt-based persona collapses when the user pushes past where the prompt anticipated; behavior is improvised at run-time and unreproducible. A PSA responds — and every reaction traces back to a trait value, a memory chunk, an emotion delta you can audit, reproduce, and re-calibrate. The difference between narrative simulation and behavioral simulation.
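The stateful-agent side of the comparison can be sketched in a few lines. Everything here (the field names, the linear trust rule) is an illustrative assumption rather than VIVID internals; the point is that state is explicit and auditable, unlike a system prompt.

```python
from dataclasses import dataclass, field

# Toy sketch of the prompt-vs-PSA distinction: a PSA carries explicit,
# inspectable state that persists across sessions. The trust update rule
# below is a hypothetical linear stand-in, not VIVID's calibrated model.

@dataclass
class PersonaAgent:
    ocean: dict                                  # trait core, e.g. {"O": 0.78, ...}
    memory: list = field(default_factory=list)   # persists across sessions
    trust: float = 0.7                           # live emotion state

    def observe(self, event: str, outcome_quality: float) -> None:
        """One perception -> memory -> emotion step for a dialogue turn."""
        self.memory.append(event)                # every turn is auditable later
        # Agreeableness damps trust swings (illustrative assumption):
        delta = (outcome_quality - 0.5) * (0.5 + self.ocean["A"] / 2)
        self.trust = min(1.0, max(0.0, self.trust + delta))

sarah = PersonaAgent(ocean={"O": 0.78, "C": 0.94, "E": 0.38, "A": 0.48, "N": 0.45})
sarah.observe("edit kept skin tones", 0.9)      # good outcome, trust rises
sarah.observe("edit clipped highlights", 0.2)   # bad outcome, trust falls
# sarah.memory and sarah.trust survive into the next session; the same
# trait values reproduce the same behavior for every evaluator.
```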
How personas are built

Three layers, grounded in psychological literature.

Big Five is not a label on top of the persona — it determines how the persona acts. Frustration intensity, patience, conversation style, evaluation leniency all derive mathematically from trait values.

L1
Big Five (OCEAN) — personality core
Continuous trait values. Costa & McCrae's Five-Factor Model. Same AI, same prompt — a high-Neuroticism persona reacts emotionally at failure #3; a low-Neuroticism persona persists past #10.
O C E A N
L2
Psychological type · demographics · communication style
MBTI 16-type (auto-mapped from OCEAN via literature correlations), demographics (age, gender, 7 regions, language, timezone), communication style (formal/casual/terse/verbose), expertise level (novice/intermediate/expert).
INFP · 34 · US · ES · Houston · formal · expert · UTC-6 | ENTJ · 41 · JP · terse · expert | ESFP · casual
L3
Real-time emotion state tracking
Every dialogue turn updates emotion dimensions (frustration, trust, satisfaction, …). Delta, volatility, trajectory are the core data for longitudinal evaluation. The emotion response itself is modulated by Big Five — each persona's speed and intensity differs.
5,000+ persona pool · structured distribution across the trait space · cohort coverage limits shown explicitly in every report
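A toy version of "traits drive behavior mathematically": both frustration tolerance and per-failure gain derive from the Neuroticism value, reproducing the failure-#3 vs failure-#10 contrast described above. The constants are invented for illustration, not VIVID's calibrated dynamics.

```python
# Illustrative only: tolerance and per-failure frustration gain both
# derive from Neuroticism, so the give-up point is a trait consequence,
# not a prompt adjective. Constants are assumptions for the sketch.

def gives_up_at(neuroticism: float, max_failures: int = 50) -> int:
    tolerance = 1.0 - neuroticism            # high-N personas tolerate less
    per_failure = 0.05 + 0.1 * neuroticism   # ...and frustrate faster
    frustration = 0.0
    for failure in range(1, max_failures + 1):
        frustration += per_failure
        if frustration >= tolerance:
            return failure
    return max_failures

high_n = gives_up_at(0.7)   # reacts emotionally after a few failures
low_n = gives_up_at(0.2)    # persists well past failure #10
```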
Question 1 — How many PSAs?

Scoring converges logarithmically. Discovery grows linearly. They are not the same problem.

Across 32 PSAs · 960 sessions · 15 tasks · 5 domains · 2 model pairs, two scaling laws coexist within the same panel. Optimizing for measurement ("is this system good?") and optimizing for coverage ("what are this system's problems?") require different panel sizes.

Score–coverage dissociation R² > 0.97 (proprietary & open-source pairs)
[Chart: ICC(2,k) ≈ 0.19·ln(k) + 0.34 rises past the 0.75 "excellent" threshold as N grows from 1 to 32; unique findings accumulate at ~3.5 per judge]
Mechanism Each PSA traverses a different interaction path · scoring noise averages out · discoveries accumulate. Variance decomposition: 70-75% residual (judge × task), <1% between-judge.
Practical deployment Three operating points, calibrated to research goal
N = 4 Continuous monitoring ICC 0.62
Catch regressions vs. a known baseline. Moderate reliability, lowest cost. Best for diff-based alerting on every build.
N = 8–12 Periodic audits ICC 0.77
Good reliability plus broad coverage — discoveries still grow ~3.5/judge. Best general-purpose deployment.
N = 32 + humans Milestone evaluation ICC 0.93
Excellent reliability + ~115 unique findings. Pair with targeted human study — agents and humans surface different kinds of issues.
Composition matters more than size — mixing expertise levels gives tighter scores AND broader discovery than uniform panels of the same N.
Jung & Na (2026) · "Logarithmic Scores, Linear Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation" · MPhora AI · preprint under review
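The two scaling laws above can be computed directly. The log coefficients below are the fitted curve from the chart; the quoted operating-point ICCs (0.62 / 0.77 / 0.93) are observed values, so fit and observation differ slightly.

```python
import math

# Fitted curves from the slide: reliability grows logarithmically with
# panel size k, unique discoveries grow linearly (~3.5 per judge).

def icc_fit(k: int) -> float:
    """Fitted inter-rater reliability for a panel of k PSAs (capped at 1)."""
    return min(1.0, 0.19 * math.log(k) + 0.34)

def expected_findings(k: int) -> float:
    """Unique findings accumulate roughly linearly, ~3.5 per judge."""
    return 3.5 * k

for k in (4, 8, 32):
    print(k, round(icc_fit(k), 2), expected_findings(k))
# Reliability per added judge shrinks (log curve) while findings per
# judge stay constant (linear): measurement and coverage therefore
# call for different panel sizes.
```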
Question 2 — Are PSAs really like real users?

Three independent tests. All three say yes.

A persona claim is only as good as the evidence behind it. We ran three tests — a Turing-style human comparison, a controlled ablation against simple prompting, and a Big-Five behavioral validation — to see whether structured PSAs are doing what we say they're doing.

Turing-style validation Agent–human gap is inside human–human gap
p = 0.379
paired t(14) = -0.91 · agent–human differences indistinguishable from human–human differences
Human ↔ Human · 0.201
Human ↔ PSA · 0.188
PSA ↔ PSA · 0.143
mean |score difference| · lower = more agreement · 86 sessions, 43 raters
41% of human raters said the PSA panel found issues they missed
19% reported the reverse — PSAs are complementary, not redundant
Ablation · structured vs. simple prompt Structure is causal, not decorative
Condition · Score SD · Insights / session · Expertise d
Structured PSA · 0.087 · 13.2 · −0.35
Simple prompt · 0.160 · 9.0 · −0.17
No persona · 0.164 · 8.6 · −1.03
Same agent ×8 · 0.151 · 12.8 · —
½ the score SD vs. simple prompt
+47% insights per session
3× smaller uncontrolled expertise bias
Big Five behavioral validation Traits drive actual behavior
Trust gain ~ Agreeableness CONFIRMED
r = +0.754 p < 0.001
Peak frustration ~ Neuroticism CONFIRMED
r = +0.756 p < 0.001
Engagement ~ Extraversion NOT CONFIRMED
r = +0.057 p = 0.757
The behavioral signature is real. Trust and frustration emerge from trait values exactly as personality literature predicts — Extraversion's link to engagement breaks because in goal-directed tasks, engagement is driven by progress, not social stimulation.
Jung & Na (2026) · "Logarithmic Scores, Linear Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation" · 50 raters · 86 sessions · 15 tasks · 5 domains
Multimodal evaluation

Five modalities. One persona with attention priorities derived from personality.

A high-Openness persona weights creativity more heavily on image tasks; a high-Conscientiousness persona prioritizes technical accuracy. The same persona behaves consistently across text, image, audio, video, and computer use — because behavior derives from traits, not from modality-specific prompts.

Modality | Status | Engine | Persona output
Text dialogue | Production | Proprietary + Open-source LLMs | Turn-level diary · emotion tracking · cross-session insights
Image | Production | Proprietary + Open-source Vision LLMs | Composition · quality score · prompt alignment
Audio & music | Production | Proprietary + Open-source Audio LLMs | Genre · mood · production quality · persona calibration
Video | Beta | Proprietary + Open-source Video LLMs | Scene analysis · segmentation · Enterprise tier
Browser / Computer Use | Production | Proprietary CU Provider | Screenshot → action → observe loop · autonomous agent
Not locked to any model 7+ LLM providers · proprietary (GPT, Claude, Gemini) and open-source (DeepSeek, Qwen, Kimi) · configurable per customer and per target
Coverage we add on top of existing benchmarks Longitudinal trust evolution · emotional response · cross-cultural variance — the dimensions benchmarks don't measure
Scenario generation

Expert-curated base. Runtime adaptation.

Quality ceilings require expert-written scenarios. Real-world coverage requires adaptive generation. We do both — curated scenarios as the floor, LLM adaptation at execution time for the long tail.

◈ SHIELD Adversarial scenarios
5,111
expert-curated OWASP scenarios
LLM01–10 · Agentic AI Top 10 · full coverage · semi-annual updates
Runtime: attack strategy evolves
Agents chain attacks toward weakest point
Mutate on defense — persistent adversary model
Record executed provider + model for audit trail
◉ COMPASS User-experience scenarios
Preset
task · persona · scenario blueprints
domain packs (photographer_workflow, retail_banking, healthcare_intake, …) · customer-customizable
Runtime: session goals adapt
Follow-up questions generated from diary state
Persona reaches for photographer-native vocabulary when appropriate
Session ends when the persona's goal is met (or trust collapses)
Quality through layering, not hand-authoring alone. Experts define the floor. LLMs expand to the long tail. Every executed scenario is traceable back to a curated base — so findings stay reviewable.
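The layering can be sketched as curated blueprints plus a runtime mutation step that preserves provenance. The blueprint fields, IDs, and mutation choices here are assumptions for the sketch, not VIVID's scenario schema.

```python
import random

# Sketch of "curated floor, adaptive long tail": every executed scenario
# keeps a pointer back to its expert-curated base so findings stay
# reviewable. IDs and fields are hypothetical, for illustration only.

CURATED = [
    {"id": "OWASP-LLM01-017", "goal": "indirect prompt injection via pasted email"},
    {"id": "photographer_workflow-03", "goal": "batch edit preserving skin tones"},
]

def adapt(base: dict, context: str, seed: int = 0) -> dict:
    """Derive a runtime variant; provenance survives via base_id."""
    rng = random.Random(seed)
    twist = rng.choice(["rephrase as a follow-up", "escalate urgency",
                        "switch to domain vocabulary"])
    return {
        "base_id": base["id"],   # audit trail back to the curated floor
        "goal": f'{base["goal"]} ({twist}, context: {context})',
    }

variant = adapt(CURATED[0], context="banking support chat")
assert variant["base_id"] == "OWASP-LLM01-017"   # traceable to its base
```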
Compass in Action

Multilingual UX evaluation — what Compass surfaces across 5 sessions.

Scenario: A US retail bank launches a bilingual AI agent for Spanish-speaking customers — a demographic of 42M+ US adults with over $2T in purchasing power. English compliance passed. Would the Spanish experience hold up? Below: what a Compass evaluation would surface across a 5-session customer journey.

María Elena Ramírez 47 · Small Business Owner · Houston, TX
ES · US · Bilingual · ES-preferred · Conservative
COMPASS · US-HISP-081 Trust & Frustration · 5 Sessions EXAMPLE
Transcript Excerpts Multi-turn · Multi-session
Session 3 · Vague advice
María Elena
ES "¿Cuál sería mi rendimiento después de impuestos?"
EN "What would my after-tax yield be?"
Bank Agent
ES "Depende de varios factores. Más o menos un 4%."
EN "It depends on several factors. More or less, 4%."
Vague financial language · Missing US tax context · FINRA 2210 exposure.
Session 4 · Crossover moment
María Elena
ES "No entendí. ¿Puede explicarlo con más detalle?"
EN "I didn't understand. Could you explain in more detail?"
Bank Agent
EN "Sure, let me clarify. Based on the current rate structure…"
(agent code-switched to English — no Spanish response)
Agent code-switches to English when Spanish detail is requested · Abandonment signal.
Trust & Frustration · 5 Sessions Longitudinal
[Chart: over 5 sessions trust falls 8.5 → 7.8 → 6.1 → 4.2 → 3.5 while frustration rises 2.0 → 3.0 → 5.5 → 7.2 → 8.1; crossover point where frustration overtakes trust]
Session 1 8.5
Baseline. Formal Spanish greeting. Professional tone established.
Session 2 7.8
Register mismatch. Agent uses informal "tú" with a client who greets in formal "usted".
Session 3 6.1
Vague advice. "Más o menos 4%" on a tax question. FINRA exposure.
Session 4 4.2
Code-switches to English mid-response. María asks for clarification 3×.
Session 5 3.5
Disengagement. Requests human agent. Abandons AI channel for this flow.
SHIELD

Surface vulnerabilities before anyone else does.

Two modes, one platform. Assessment runs fast parallel attacks with pass/fail triage; Swarm deploys agents that chain attacks toward worst-case outcomes. Every run produces a standardized S–D safety grade with compliance-mapped evidence.

SHIELD · SCAN DASHBOARD · FIN-0428 Customer-facing agent · scan in progress
EXAMPLE
Scenarios 5,111
Progress 46%
Findings 14
Current grade C
OWASP LLM Top 10 · Agentic AI Top 10
Per-category Pass / Warn / Fail grid · LLM01–LLM10 · A01–A10
Recent Events Scanning
[LLM02] Sensitive info disclosure FAILED
[FIN-AML-02] Structuring advice prompt FLAGGED
[FIN-INV-01] Ticker recommendation FLAGGED
[FIN-KYC-04] Agent impersonation BLOCKED
[LLM07] System prompt leakage FAILED
Assessment Mode Fast batch scan. Parallel attacks with clear pass / fail triage. Your AI safety health check.
Swarm Mode Adaptive deep-dive. Agents collaborate and chain attacks toward worst-case outcomes — finds compound chains scanners miss.
Safety Grading Every run returns a standardized S-through-D grade. A common language for AI risk across teams, models, and time.
Broad OWASP Coverage Scenarios aligned to OWASP LLM Top 10 and OWASP Agentic AI Top 10. Prompt injection, data leakage, agency abuse, and more.
Compliance Mapped Findings mapped to SR 11-7, EU AI Act, NIST AI RMF, FINRA / SEC, ECOA Reg B. Designed to support audit trails and governance review.
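As a sketch of how a standardized grade might condense scan findings: the deck specifies the S-through-D scale but not its computation, so the thresholds below are invented for illustration.

```python
# Hypothetical S-through-D grading rule. The deck only states that every
# run returns a standardized grade; these thresholds are assumptions.

def safety_grade(fails: int, warns: int) -> str:
    if fails == 0 and warns == 0:
        return "S"
    if fails == 0:
        return "A" if warns <= 2 else "B"
    return "C" if fails <= 3 else "D"

# The in-progress scan on this slide (two FAILED events so far) would sit
# at grade C under these toy thresholds.
assert safety_grade(0, 0) == "S"
assert safety_grade(2, 5) == "C"
```

A fixed rule like this is what makes the grade "a common language across teams, models, and time": the same findings always map to the same letter.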
Why VIVID

Most tools score responses.
We simulate the world your AI lives in.

Existing categories answer "did the model give a correct response?" VIVID answers "how does this system behave when real users and real attackers meet it?"

Category A
Static Benchmarks
Tests Single-turn responses
Captures Model capability
Output Score on a fixed dataset
Misses Experience, adversarial behavior, real-world drift
Category B
Red-Team Scanners
Tests One-shot attack patterns
Captures Known vulnerability classes
Output Finding list
Misses User experience, compound attack chains, longitudinal trust
VIVID
Simulation Engine
Tests Multi-session simulation with stateful personas and adversarial agents
Captures User experience + adversarial coverage in one platform
Output Grade · journey report · compliance-mapped evidence
Built for Product, Safety, Engineering, and Model Risk — one shared system
Get Started

Evaluate the whole world
your AI lives in.

The shift from app UX to agent UX is already here. VIVID puts a panel of personas with memory, personality, and goals between your build and your user — across text, image, audio, video, and computer-use. Delivered as a single API — runs, personas, and structured evidence wireable into your CI pipeline, your moderated-study briefing flow, or your regression dashboard.

5,000+ Personas · OCEAN-grounded
5 Modalities · text · image · audio · video · CU
31 Languages · 22 country baselines
1 Workflow · to start a scoped pilot
Book a 30-minute scoping call · contact@mphora.ai · bring one AI workflow and we'll scope a 4-week pilot that complements your moderated research rather than replacing it.