Resources Science & validation
Evidence & methodology

We measure what we claim.
Under independent, blind conditions.

Persona Dynamics is built on six years of field research and validated through independent performance evaluation. This page sets out the evidence behind the platform: the studies, methodology, findings, and the ongoing validation infrastructure we maintain.

9 independent subject matter experts · blind conditions
6 years of industry-embedded field research
Independently validated · blind peer assessment

Six years of industry-embedded research

The platform's architecture is grounded in longitudinal PhD research examining AI in professional creative practice: three field studies conducted over six years in live commercial environments. Three consistent findings emerged across all contexts.

Finding 01

Transparency is non-negotiable

AI outputs without visible provenance were not trusted in professional contexts. Practitioners required traceable sources to justify decisions to clients and colleagues. Source transparency is not a feature: it is a prerequisite for legitimate use in accountable environments.

Finding 02

Human authority must be preserved

AI was adopted for divergent ideation and exploration, but resisted where professional accountability applied. Practitioners would not let AI make decisions in areas where they were accountable to clients, reinforcing the need for explicit decision-support framing rather than autonomous recommendation.

Finding 03

Project memory is critical

Teams lost context at handoffs, across tools and between sessions. The ability to maintain a continuous, queryable record of decisions and evidence, from brief through delivery, was identified as a consistent structural gap in existing workflows. This is the direct origin of the Digital Thread.

Three field studies informing platform architecture

Study
Context
Duration
Contribution to architecture
Specialist Creative Consultancy
London brand studio
4 designers + freelancers · 6 live commercial projects
4 months, 2019–20
Established grounding transparency as a core system requirement. AI outputs without traceable provenance were not used in client-facing work.
Global Branding Agency
International brand consultancy
60+ staff · bottom-up AI adoption study
2 years, 2021–23
Demonstrated that AI must remain advisory where professional accountability applies. Practitioners resisted AI decision-making in client-accountable contexts, establishing the human-in-the-loop design requirement.
Northumbria / Gateshead
University + local authority
MA students, PhD researchers, live local authority partner
4 weeks, 2025
Live MVP deployment. Confidence scoring, visible source passages and persistent project memory were actively used. Transparency features cited as critical to trust. No AI outputs treated as determinative without human review.

"Having the ability to be able to refer back to where the sources are coming from, how confident they are; that's helpful."

PhD Researcher
Citizen-Centred AI · Northumbria University deployment

"I feel more confident that this is more accurate than just ChatGPT making it up, precisely because you can see how the responses are grounded in the research."

MA Communication Design student
Northumbria University deployment

Measured against human domain experts. Under blind conditions.

50 real-world question–answer pairs assessed by 9 independent subject matter experts using a 10-item rubric under anonymised, blind conditions. System and human responses were presented together without attribution. 252 individual rubric evaluations in total.

Accuracy
Matched the human expert benchmark
Core claims assessed against specialist domain experts under blind conditions. AI and human responses were indistinguishable to independent assessors.
Actionability
Significantly exceeded human experts
Structured, evidence-linked outputs consistently produced specific, actionable recommendations at a substantially higher rate than human expert responses.
Critical failures
Higher rate than human experts
Critical failure flags were triggered more often for AI responses than for human experts, the evidence basis for confidence scoring and mandatory human review in every deployment.

Study design

1
Domain selection
A specialist design engineering domain was selected: a field where subject matter expertise is well-defined, verifiable and assessable by qualified practitioners.
2
Question–answer pairs
50 real-world technical questions were posed to both the Persona Dynamics system and to independent human domain experts working from the same source materials.
3
Blind assessment
System and human responses were anonymised and presented together in randomised order. 9 independent assessors scored each response using a 10-item Adaptive Precise Boolean rubric, without knowing which response came from which source.
4
Critical failure flags
The rubric included Critical Failure flags (dangerous error, gross factual error, incorrect standard, hallucinated content). These were triggered more often in AI responses than in human responses, a finding that directly reinforces the architectural requirement for confidence scoring, source transparency and human review.

Accuracy is necessary, not sufficient

The headline accuracy finding is encouraging: the system matched human domain expert performance on core claims under blind conditions. The more significant finding is the actionability gap: Persona Dynamics produced specific, actionable recommendations at a substantially higher rate than human experts.

In decision-support contexts, a technically accurate but non-actionable answer often has limited practical value. The platform's structured, evidence-linked output format consistently produced responses that practitioners could act on.

The critical failure rate finding (higher for AI than humans) is equally important. It does not undermine the accuracy result: it reinforces the design principle that AI outputs require human review, and it was the evidence basis for the confidence scoring and source transparency features built into every response.

Design consequence
Every response in Persona Dynamics carries a confidence score and explorable source evidence precisely because accuracy alone is not a sufficient basis for professional decision-making.

Validation isn't a one-time study. It's infrastructure

A point-in-time evaluation tells you how a system performed on a particular day. What matters in production is ongoing quality assurance as knowledge bases evolve, models update and use cases expand.

Golden questions framework
In collaboration with the National Innovation Centre for Data, we are developing a structured regression framework: a co-defined set of benchmark question–answer pairs automatically re-run after any knowledge base or model update. Structured rubric comparison detects regression and triggers review before deployment. Summary results are visible to users, providing transparency into persona behaviour over time.
Live quality signals
Three continuous signals run in production: grounding star ratings on every response (indicating evidence coverage strength), user thumbs up/down feedback, and source panel open rates (validating active engagement with cited evidence). These provide a real-time picture of output quality across deployed knowledge bases and flag where additional data curation is needed.
Automated CI/CD testing
An automated Cypress test suite runs on every CI/CD workflow, validating core platform workflows, API endpoints, authentication flows and multi-tenant data isolation. Model version updates and RAG retrieval pipeline changes are tested in staging before live deployment to ensure functional stability is maintained across releases.

National Innovation Centre for Data

We are collaborating with the National Innovation Centre for Data on evaluation protocols covering response accuracy, tone assessment, guardrail effectiveness and multi-turn coherence, extending the point-in-time evaluation into a repeatable, structured framework.

The golden questions framework being developed through this collaboration is the first structured validation protocol for RAG-based persona systems. Once validated, we intend to publish the methodology openly, positioning this as a contribution to the emerging field of AI persona evaluation standards.

Response accuracy protocols
Structured rubric for factual correctness & coverage
Tone assessment
Consistency and appropriateness across response types
Multi-turn coherence
Consistency across extended facilitated workshop sessions
Guardrail effectiveness
Resistance to prompt injection & out-of-scope manipulation

We test how the system fails, not just how it succeeds

Understanding failure modes is as important as measuring accuracy. Persona Dynamics is subject to ongoing adversarial testing that directly shapes platform guardrails.

Adversarial prompt testing
Ongoing attempts to induce hallucination, provoke out-of-scope responses, extract cross-tenant data, and manipulate persona behaviour through multi-turn prompt injection escalation. Findings from this testing directly inform guardrail design and led to the implementation of grounding trajectory monitoring.
Multi-tenancy isolation testing
Deliberate cross-tenant extraction attempts test the boundary integrity of per-organisation siloed vector stores. Auth0 tenant-scoped authentication, API-level tenant enforcement, and ISO 27001 controls are all tested as a system, not in isolation.
Edge case & stress testing
A dedicated testing workstream stress-tests persona behaviour across boundary conditions (sparse evidence bases, contradictory source material, ambiguous queries, and unexpected input formats) to understand degradation patterns and set appropriate confidence thresholds.
Red-teaming at scale (planned)
Formal red-teaming targeting persona manipulation and guardrail bypass across broader deployment contexts, extending current adversarial testing to enterprise-scale scenarios. Scheduled to follow the completion of facilitator agent industry testing.

Assurance activity status

UK IPO IP Audit
Completed
Full audit of platform code, employment contracts and commercial agreements. Clean IP ownership confirmed. January 2026.
Independent performance evaluation
Completed
252 rubric evaluations, 9 subject matter experts, blind conditions. February 2026.
Adversarial & boundary testing
Ongoing
Prompt injection, cross-tenant extraction, hallucination induction. Led to grounding trajectory monitoring.
NICD evaluation framework
In progress
Golden questions regression framework with NICD. Started December 2025.
ISO 27001 / 42001 alignment
In progress
Aligning to ISO 27001 (information security), ISO/IEC 42001 (AI management), ISO/IEC 23894 (AI risk management). Formal certification not yet underway.
Penetration testing
Planned
External pen test of multi-tenancy isolation, authentication controls and API endpoints.
Structured red-teaming at scale
Planned
Formal red-teaming targeting persona manipulation and guardrail bypass across enterprise-scale deployments.

Grounded in peer-reviewed research and expert advisory

Research foundation

The platform's architecture is grounded in award-recognised PhD research examining AI in professional creative practice, work supported by Innovate UK BridgeAI feasibility funding with Hartree Centre support, and recognised with a D&AD Award and Deutsche Bank Design Award.

This research directly informed the three architectural requirements now validated through field deployment: source transparency, preserved human authority, and persistent project memory.

AI regulatory engagement

The platform has been assessed as minimal-risk under the EU AI Act, consistent with its advisory, human-in-the-loop design and the absence of autonomous decision-making in high-risk sectors.

An advisory board member with AI regulatory expertise monitors UK and EU regulatory development. We have been involved in feedback processes for BS ISO/IEC 42005 and BS EN ISO/IEC 23894.

Research, validation & advisory partners
National Innovation Centre for Data
Golden questions framework · evaluation protocols
Northumbria University
Live deployment · trust & explainability research
Newcastle Centre for AI Safety
AI safety research & standards advisory
Creative PEC
AI regulatory policy · UK AI Bill monitoring

Questions about methodology or deployment?

We're happy to share more detail on our evaluation approach, the evidence behind specific platform features, or how validation works in your deployment context.