Science & Validation | How Persona Dynamics measures what it claims

Field research foundation

Six years of industry-embedded research

The platform's architecture is grounded in longitudinal PhD research examining AI in professional creative practice: three field studies conducted over six years in live commercial environments. Three consistent findings emerged across all contexts.

Finding 01

Transparency is non-negotiable

AI outputs without visible provenance were not trusted in professional contexts. Practitioners required traceable sources to justify decisions to clients and colleagues. Source transparency is not a feature: it is a prerequisite for legitimate use in accountable environments.

Finding 02

Human authority must be preserved

AI was adopted for divergent ideation and exploration, but resisted where professional accountability applied. Practitioners would not let AI make decisions in areas where they were accountable to clients, reinforcing the need for explicit decision-support framing rather than autonomous recommendation.

Finding 03

Project memory is critical

Teams lost context at handoffs, across tools and between sessions. The ability to maintain a continuous, queryable record of decisions and evidence, from brief through delivery, was identified as a consistent structural gap in existing workflows. This is the direct origin of the Digital Thread.

Three field studies informing platform architecture

Study

Context

Duration

Contribution to architecture

Specialist Creative Consultancy
London brand studio

4 designers + freelancers · 6 live commercial projects

4 months, 2019–20

Established grounding transparency as a core system requirement. AI outputs without traceable provenance were not used in client-facing work.

Global Branding Agency
International brand consultancy

60+ staff · bottom-up AI adoption study

2 years, 2021–23

Demonstrated that AI must remain advisory where professional accountability applies. Practitioners resisted AI decision-making in client-accountable contexts, establishing the human-in-the-loop design requirement.

Northumbria / Gateshead
University + local authority

MA students, PhD researchers, live local authority partner

4 weeks, 2025

Live MVP deployment. Confidence scoring, visible source passages and persistent project memory were actively used. Transparency features cited as critical to trust. No AI outputs treated as determinative without human review.

"Having the ability to be able to refer back to where the sources are coming from, how confident they are; that's helpful."

PhD Researcher

Citizen-Centred AI · Northumbria University deployment

"I feel more confident that this is more accurate than just ChatGPT making it up, precisely because you can see how the responses are grounded in the research."

MA Communication Design student

Northumbria University deployment

Independent evaluation · February 2026

Measured against human domain experts. Under blind conditions.

50 real-world question–answer pairs assessed by 9 independent subject matter experts using a 10-item rubric under anonymised, blind conditions. System and human responses were presented together without attribution. 252 individual rubric evaluations in total.

Accuracy

Matched the human expert benchmark

Core claims assessed against specialist domain experts under blind conditions. AI and human responses were indistinguishable to independent assessors.

Actionability

Significantly exceeded human experts

Structured, evidence-linked outputs consistently produced specific, actionable recommendations at a substantially higher rate than human expert responses.

Critical failures

Higher rate than human experts

Critical failure flags were triggered more often for AI responses than for human experts, the evidence basis for confidence scoring and mandatory human review in every deployment.

Methodology

Study design

1

Domain selection

A specialist design engineering domain was selected: a field where subject matter expertise is well-defined, verifiable and assessable by qualified practitioners.

2

Question–answer pairs

50 real-world technical questions were posed to both the Persona Dynamics system and to independent human domain experts working from the same source materials.

3

Blind assessment

System and human responses were anonymised and presented together in randomised order. 9 independent assessors scored each response using a 10-item Adaptive Precise Boolean rubric, without knowing which response came from which source.

4

Critical failure flags

The rubric included Critical Failure flags (dangerous error, gross factual error, incorrect standard, hallucinated content). These were triggered more often in AI responses than in human responses, a finding that directly reinforces the architectural requirement for confidence scoring, source transparency and human review.

What the findings mean

Accuracy is necessary, not sufficient

The headline accuracy finding is encouraging: the system matched human domain expert performance on core claims under blind conditions. The more significant finding is the actionability gap: Persona Dynamics produced specific, actionable recommendations at a substantially higher rate than human experts.

In decision-support contexts, a technically accurate but non-actionable answer often has limited practical value. The platform's structured, evidence-linked output format consistently produced responses that practitioners could act on.

The critical failure rate finding (higher for AI than humans) is equally important. It does not undermine the accuracy result: it reinforces the design principle that AI outputs require human review, and it was the evidence basis for the confidence scoring and source transparency features built into every response.

Design consequence

Every response in Persona Dynamics carries a confidence score and explorable source evidence precisely because accuracy alone is not a sufficient basis for professional decision-making.

Continuous validation

Validation isn't a one-time study. It's infrastructure

A point-in-time evaluation tells you how a system performed on a particular day. What matters in production is ongoing quality assurance as knowledge bases evolve, models update and use cases expand.

Golden questions framework

In collaboration with the National Innovation Centre for Data, we are developing a structured regression framework: a co-defined set of benchmark question–answer pairs automatically re-run after any knowledge base or model update. Structured rubric comparison detects regression and triggers review before deployment. Summary results are visible to users, providing transparency into persona behaviour over time.

Live quality signals

Three continuous signals run in production: grounding star ratings on every response (indicating evidence coverage strength), user thumbs up/down feedback, and source panel open rates (validating active engagement with cited evidence). These provide a real-time picture of output quality across deployed knowledge bases and flag where additional data curation is needed.

Automated CI/CD testing

An automated Cypress test suite runs on every CI/CD workflow, validating core platform workflows, API endpoints, authentication flows and multi-tenant data isolation. Model version updates and RAG retrieval pipeline changes are tested in staging before live deployment to ensure functional stability is maintained across releases.

NICD collaboration · ongoing

National Innovation Centre for Data

We are collaborating with the National Innovation Centre for Data on evaluation protocols covering response accuracy, tone assessment, guardrail effectiveness and multi-turn coherence, extending the point-in-time evaluation into a repeatable, structured framework.

The golden questions framework being developed through this collaboration is the first structured validation protocol for RAG-based persona systems. Once validated, we intend to publish the methodology openly, positioning this as a contribution to the emerging field of AI persona evaluation standards.

Response accuracy protocols

Structured rubric for factual correctness & coverage

Tone assessment

Consistency and appropriateness across response types

Multi-turn coherence

Consistency across extended facilitated workshop sessions

Guardrail effectiveness

Resistance to prompt injection & out-of-scope manipulation

Safety & adversarial testing

We test how the system fails, not just how it succeeds

Understanding failure modes is as important as measuring accuracy. Persona Dynamics is subject to ongoing adversarial testing that directly shapes platform guardrails.

Adversarial prompt testing

Ongoing attempts to induce hallucination, provoke out-of-scope responses, extract cross-tenant data, and manipulate persona behaviour through multi-turn prompt injection escalation. Findings from this testing directly inform guardrail design and led to the implementation of grounding trajectory monitoring.

Multi-tenancy isolation testing

Deliberate cross-tenant extraction attempts test the boundary integrity of per-organisation siloed vector stores. Auth0 tenant-scoped authentication, API-level tenant enforcement, and ISO 27001 controls are all tested as a system, not in isolation.

Edge case & stress testing

A dedicated testing workstream stress-tests persona behaviour across boundary conditions (sparse evidence bases, contradictory source material, ambiguous queries, and unexpected input formats) to understand degradation patterns and set appropriate confidence thresholds.

Red-teaming at scale (planned)

Formal red-teaming targeting persona manipulation and guardrail bypass across broader deployment contexts, extending current adversarial testing to enterprise-scale scenarios. Scheduled to follow the completion of facilitator agent industry testing.

Assurance activity status

UK IPO IP Audit

Completed

Full audit of platform code, employment contracts and commercial agreements. Clean IP ownership confirmed. January 2026.

Independent performance evaluation

Completed

252 rubric evaluations, 9 subject matter experts, blind conditions. February 2026.

Adversarial & boundary testing

Ongoing

Prompt injection, cross-tenant extraction, hallucination induction. Led to grounding trajectory monitoring.

NICD evaluation framework

In progress

Golden questions regression framework with NICD. Started December 2025.

ISO 27001 / 42001 alignment

In progress

Aligning to ISO 27001 (information security), ISO/IEC 42001 (AI management), ISO/IEC 23894 (AI risk management). Formal certification not yet underway.

Penetration testing

Planned

External pen test of multi-tenancy isolation, authentication controls and API endpoints.

Structured red-teaming at scale

Planned

Formal red-teaming targeting persona manipulation and guardrail bypass across enterprise-scale deployments.

Academic & institutional context

Grounded in peer-reviewed research and expert advisory

Research foundation

The platform's architecture is grounded in award-recognised PhD research examining AI in professional creative practice, work supported by Innovate UK BridgeAI feasibility funding with Hartree Centre support, and recognised with a D&AD Award and Deutsche Bank Design Award.

This research directly informed the three architectural requirements now validated through field deployment: source transparency, preserved human authority, and persistent project memory.

AI regulatory engagement

The platform has been assessed as minimal-risk under the EU AI Act, consistent with its advisory, human-in-the-loop design and the absence of autonomous decision-making in high-risk sectors.

An advisory board member with AI regulatory expertise monitors UK and EU regulatory development. We have been involved in feedback processes for BS ISO/IEC 42005 and BS EN ISO/IEC 23894.

Research, validation & advisory partners

National Innovation Centre for Data

Golden questions framework · evaluation protocols

Northumbria University

Live deployment · trust & explainability research

Newcastle Centre for AI Safety

AI safety research & standards advisory

Creative PEC

AI regulatory policy · UK AI Bill monitoring

We measure what we claim.
Under independent, blind conditions.

Six years of industry-embedded research

Transparency is non-negotiable

Human authority must be preserved

Project memory is critical

Three field studies informing platform architecture

Measured against human domain experts. Under blind conditions.

Study design

Accuracy is necessary, not sufficient

Validation isn't a one-time study. It's infrastructure

National Innovation Centre for Data

We test how the system fails, not just how it succeeds

Assurance activity status

Grounded in peer-reviewed research and expert advisory

Questions about methodology or deployment?

We measure what we claim. Under independent, blind conditions.

Six years of industry-embedded research

Transparency is non-negotiable

Human authority must be preserved

Project memory is critical

Three field studies informing platform architecture

Measured against human domain experts. Under blind conditions.

Study design

Accuracy is necessary, not sufficient

Validation isn't a one-time study. It's infrastructure

National Innovation Centre for Data

We test how the system fails, not just how it succeeds

Assurance activity status

Grounded in peer-reviewed research and expert advisory

Questions about methodology or deployment?

We measure what we claim.
Under independent, blind conditions.