UX Pilot - AI Stakeholder Feedback

01 — Context

UX Pilot is an AI-powered design platform. Feedback was the missing piece.

UX Pilot lets designers generate high-fidelity screens, wireframes, and prototypes from natural language prompts. By 2026, the generation capabilities were strong. Users could go from idea to polished UI in minutes.

But generation was only half the loop. Designers kept asking the same question: Is this actually good enough to show stakeholders? The platform had no answer. Users would generate a design, export it to Figma, share it in a Slack message, and wait for async feedback - which often arrived only hours before a presentation, too late to act on.

The insight: UX Pilot's users weren't just trying to design faster. They were trying to ship with confidence. The bottleneck wasn't speed of creation - it was certainty that the design would survive contact with real stakeholders.

I was the product manager responsible for AI features on the Studio and Chat surfaces. When user research surfaced this pattern - designers running informal "sanity checks" with colleagues before presenting - I recognised it as a product gap, not a workflow quirk. The platform should close that loop itself.

My Role

AI Product Manager - UX Pilot

End-to-end ownership: user research, persona definition, prompt engineering specification, output structure design, acceptance criteria, and cross-surface rollout (Studio + Chat).

Platform Context

AI-native B2B SaaS design tool

Built for designers, product teams, and agencies. Core differentiator: AI that doesn't just generate - it understands design intent and provides intelligent feedback on outputs.

02 — Problem

Designers were flying blind into stakeholder reviews.

User interviews surfaced a consistent pattern. Designers using UX Pilot would create strong visual output, then spend significant time doing mental simulation: What will the CTO say? Will the client think this looks like their brand? Will the board ask about ROI?

This mental simulation was unstructured, inconsistent, and exhausting. Senior designers were better at it - they had years of stakeholder pattern-matching. Junior designers had almost no reference points. Both groups were doing it manually, every time.

01

The feedback gap in AI design tools

AI design tools had closed the generation gap: anyone could make something that looked good. But they hadn't touched the evaluation gap. No AI tool was simulating the human reactions that determine whether a design succeeds in the real world.

02

Generic UX critique was already solved - and wasn't the problem

Heuristic-based design review tools existed. "Your contrast ratio is 3.2:1" or "Consider adding alt text" - useful, but not the problem. Stakeholders don't critique heuristics. They say things like "This doesn't look like our brand" and "Where's the business case?" The gap was stakeholder simulation, not UX auditing.

03

The insight that shaped the brief

Every designer has encountered four archetypes in stakeholder reviews: the skeptical technical decision-maker, the confused first-time viewer, the opinionated client who compares everything to competitors, and the risk-averse executive. These personas are predictable. Their concerns are mappable. An LLM prompted to embody them could generate feedback that felt real - because it was modelling real human perspectives.

Design constraint: The feedback had to feel like actual stakeholder pushback - opinionated, sometimes uncomfortable, reflecting real tension - not polished product critique. Generic or "helpful" output would miss the point entirely.

03 — Persona Architecture

LLM Persona Design

Four personas. Each with a distinct voice, focus, and type of tension.

The personas aren't aesthetic choices - they're prompt engineering primitives. Each one maps to a real stakeholder archetype designers encounter, with a defined focus area, tone, and set of typical concerns that shape how the LLM generates output. I defined all four of these as part of the product spec.

🔵

Skeptical CTO

Focus: Scalability · Complexity · Build Cost

Is this over-engineered for what we need?

What's the system impact of this component?

Why is this necessary at all?

🟠

Confused First-Time User

Focus: Clarity · Usability · Cognitive Load

What am I supposed to do here?

Why is this showing up on this screen?

I don't understand this step at all

🔴

Demanding Client

Focus: Brand · Polish · Competitor Parity

This doesn't match the brand guidelines we signed off on

Their competitor does it cleaner

Where's the data point? Everything reads the same weight

🟣

Conservative Board Member

Focus: Risk · ROI · Predictability

What's the business value being created here?

Is this safe to ship to enterprise customers?

How does this affect our KPIs?

The fifth persona: user-defined

I included a custom persona option from the start. Users can define their own stakeholder - specifying focus, tone, and concerns - and the system generates feedback accordingly. This covered the long tail of stakeholder types that don't fit neatly into four archetypes: a niche industry regulator, a specific CEO's known preferences, an accessibility auditor. The four built-in personas handle 80% of cases. The custom option handles everyone else.

Prompt engineering principle: Each persona was specified not just as a description but as a behavioural constraint on the LLM - telling it what to optimise for, what language register to use, and critically, what not to generate. Persona outputs had to reflect real stakeholder tension, not polite product feedback.
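To make the principle concrete, here is a minimal sketch of a persona expressed as a behavioural constraint rather than a description. The PersonaSpec class, its field names, the build_system_prompt helper, and the register wording are all hypothetical illustrations, not UX Pilot's actual spec or prompts; only the persona name, focus areas, and concerns come from the persona cards above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaSpec:
    """Hypothetical persona spec; field names are illustrative only."""
    name: str
    focus: tuple[str, ...]             # what the persona optimises for
    register: str                      # language register / tone
    typical_concerns: tuple[str, ...]  # concerns that shape the output
    avoid: tuple[str, ...]             # what the model must not generate


def build_system_prompt(spec: PersonaSpec) -> str:
    """Render the spec as explicit constraints, including a negative list."""
    lines = [
        f"You are reviewing a design as: {spec.name}.",
        "Optimise for: " + ", ".join(spec.focus) + ".",
        f"Language register: {spec.register}.",
        "Concerns you typically raise:",
        *[f"- {concern}" for concern in spec.typical_concerns],
        "Do not generate:",
        *[f"- {item}" for item in spec.avoid],
    ]
    return "\n".join(lines)


SKEPTICAL_CTO = PersonaSpec(
    name="Skeptical CTO",
    focus=("Scalability", "Complexity", "Build Cost"),
    register="blunt, technical, impatient with hand-waving",  # assumed wording
    typical_concerns=(
        "Is this over-engineered for what we need?",
        "What's the system impact of this component?",
        "Why is this necessary at all?",
    ),
    avoid=("polite product feedback", "heuristic-based UX audit language"),
)

prompt = build_system_prompt(SKEPTICAL_CTO)
print(prompt)
```

The "avoid" field is the part that matters most here: it gives the spec a place to state what the persona should never sound like, which a plain description cannot do.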

Product Screenshot - Persona Selection UI

UX Pilot Stakeholder Feedback - persona selection panel showing Skeptical CTO, Confused User, Demanding Client, and Board Member options with Run Review CTA

The Stakeholder Feedback panel (renamed from the original "Red Team" spec label) - users select one or more personas and trigger the AI simulation with a single click. Demanding Client is checked; Board Member and Skeptical CTO are available.

04 — Feature Design

Surfaced where the design lives - on the canvas, in context.

The placement decision was important. I didn't want the feature to exist as a separate tab or modal that pulled users away from their work. Stakeholder feedback should appear where the design lives.

01

Entry point: Generate → Review & Audit → Red Team

I placed the feature inside the Generate menu, under "Review & Audit," to position it alongside other intelligent analysis tools. The keyboard shortcut (⌥Y) was specified to enable power users to invoke it mid-flow without breaking their momentum. The name "Red Team" in the spec was deliberately adversarial - to signal this wasn't a gentle review.

02

Right-panel output - contextual, dismissable, re-runnable

The output panel opens on the right, keeping the design canvas fully visible. This was a deliberate UX choice: the designer should be able to look at the design and the feedback simultaneously. The panel supports re-run (fresh simulation with same persona), export, and "Highlight All" - which attaches the feedback directly to canvas layers as a review overlay.

03

Multi-persona selection - parallel generation

Users can select multiple personas in a single run. I specified parallel generation so running three personas doesn't triple the wait time. Each persona's output appears as a separate panel state, switchable with a back/forward control. This was a key differentiator from Chat mode, which defaults to one persona at a time.

04

Canvas review layer - output attached to the design

Feedback attaches as a review layer on the canvas - not just a panel that disappears. Users can highlight specific issues, dismiss individual items, or re-run with updated designs. The goal was to make feedback a persistent part of the design artifact, not a transient notification.

05 — Output Structure

Four structured sections per persona. Designed for immediate action.

I specified the output format in detail - not as a UX guideline, but as a prompt engineering constraint. The LLM needed to know exactly what to generate for each section, and the UI needed to render it consistently across all personas.

Demanding Client - Feedback Output

5 Objections · 8 Questions · 3 Issues

Top Objections

3–5 items

Strong critique statements in persona voice. These are opinionated, sometimes uncomfortable - the things a demanding client says when they're not filtering. "This doesn't match the brand guidelines we signed off on last month." "The navigation is different - their competitor does it cleaner." Not suggestions: objections. The LLM was instructed to surface tension, not offer advice at this stage.

Questions They'll Ask

3–8 items

Realistic questions the persona would ask in a live review. "Can we see a version with our brand colours?" "How does this look on mobile?" "Did you see what [Competitor] shipped last week?" These arm the designer with the questions before they're asked in the room - so they can prepare answers or preemptively fix the design.

To Improve

Severity-ranked

Each issue is tagged Critical, Moderate, or Low, and includes the specific issue, a short explanation, and a clear actionable fix. The element count (e.g. "2 elements") links the issue to the actual canvas components. This is the only section that drives direct action - the "Fix X Issues" CTA at the bottom triggers design iteration. Critical issues are checked by default; lower severity items are opt-in.

Fix CTA

1 action

"Fix 3 Issues" - a single CTA that triggers AI-driven design iteration for the checked items. The issue count in the label reflects selected items, updating dynamically as users check/uncheck. This closes the loop: from feedback to improved design without leaving the feature.

Product Screenshot - Feedback Output Panel

UX Pilot Stakeholder Feedback - Demanding Client output panel showing Top Objections, Questions They'll Ask, and To Improve sections with Critical, Moderate, and Low severity items

The Demanding Client output - Top Objections, Questions They'll Ask, and the severity-ranked To Improve section. The "Fix 3 Issues" CTA at the bottom triggers AI design iteration for the selected items. Re-run, Export, and Highlight All controls appear at the top.
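The selection behaviour described in this section (severity ranking, Critical items checked by default, a CTA label that tracks the checked count) can be sketched as follows. The Issue class, the function names, the sample issue titles, and the singular "Issue" wording for a count of one are assumptions made for illustration, not the shipped implementation.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    severity: str          # "Critical", "Moderate", or "Low"
    checked: bool = False


SEVERITY_ORDER = {"Critical": 0, "Moderate": 1, "Low": 2}


def prepare_issues(issues):
    """Rank issues by severity and pre-check only the Critical ones."""
    ranked = sorted(issues, key=lambda issue: SEVERITY_ORDER[issue.severity])
    for issue in ranked:
        issue.checked = issue.severity == "Critical"
    return ranked


def fix_cta_label(issues):
    """Build the CTA label from the number of currently checked issues."""
    count = sum(1 for issue in issues if issue.checked)
    noun = "Issue" if count == 1 else "Issues"  # singular form is an assumption
    return f"Fix {count} {noun}"


# Invented sample issues, for illustration only.
issues = prepare_issues([
    Issue("Low contrast on secondary button", "Low"),
    Issue("Logo colour is off-brand", "Critical"),
    Issue("All headings read at the same weight", "Moderate"),
])
label_default = fix_cta_label(issues)   # only the Critical item is checked
issues[1].checked = True                # user opts in to the Moderate item
label_after = fix_cta_label(issues)
print(label_default, "->", label_after)
```

A real UI would presumably disable the button when nothing is checked; the sketch leaves that case out.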

06 — Chat Mode

The same simulation engine - lighter, faster, conversational.

I designed Chat mode as a distinct surface of the same capability, not an afterthought. The core logic is identical - same personas, same LLM behaviour - but the interaction model is different. Chat is faster, single-persona by default, and conversational in output format.

Design decision: Chat mode doesn't require screen selection. It auto-attaches to the most recently generated design and responds to natural language intent - "Review this as a demanding client" triggers the full simulation without any UI navigation. Speed and frictionlessness were the primary design goals for Chat mode.

Three trigger patterns - all intentional

Explicit Prompt

"Act as a CTO and critique this"

Intent detection: persona keyword + review intent. Runs immediately. No additional input needed. Most common trigger in production.

System Suggestion

Quick actions after generation

"Run Stakeholder Review" / "Test with Client" appears contextually after design generation. Reduces friction for users who want feedback but didn't think to ask.

Follow-up Trigger

"Will this pass client review?"

Softer signals that imply review intent. System suggests persona options before running - quick reply buttons, not a form.
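The two text-driven trigger patterns can be illustrated with a small routing sketch: an explicit persona plus review intent runs immediately, while softer signals lead to persona suggestions. Keyword matching is used here purely as a stand-in, since the actual intent detection method is not described; the keyword lists and the detect_trigger function are hypothetical. The System Suggestion pattern is UI-driven and is not covered.

```python
import re

# Hypothetical keyword lists; a production system would more likely
# rely on a model-based intent classifier than on string matching.
PERSONA_KEYWORDS = {
    "Skeptical CTO": ("cto",),
    "Confused First-Time User": ("first-time user", "new user"),
    "Demanding Client": ("client",),
    "Conservative Board Member": ("board member", "board"),
}
REVIEW_WORDS = ("review", "critique", "feedback", "act as")
SOFT_SIGNALS = ("will this pass", "good enough", "ready to show")


def _mentions(text, phrases):
    """True if any phrase appears in text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(p)}\b", text) for p in phrases)


def detect_trigger(message):
    """Return (action, persona) where action is 'run', 'suggest', or 'none'."""
    text = message.lower()
    if _mentions(text, SOFT_SIGNALS):
        return ("suggest", None)       # follow-up trigger: offer persona options
    if not _mentions(text, REVIEW_WORDS):
        return ("none", None)          # no review intent detected
    for persona, keywords in PERSONA_KEYWORDS.items():
        if _mentions(text, keywords):
            return ("run", persona)    # explicit prompt: run immediately
    return ("suggest", None)           # review intent, but no persona named


print(detect_trigger("Act as a CTO and critique this"))
print(detect_trigger("Will this pass client review?"))
```

Checking soft signals first is a deliberate choice in this sketch: "Will this pass client review?" names a persona, but the phrasing is a question about readiness, so it routes to suggestions rather than an immediate run.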

Chat vs Studio - designed as complements, not duplicates

Dimension	Chat Mode	Studio Mode
Speed	Fast, lightweight, conversational	Structured, detailed, panel-based
Personas	1 persona at a time (default)	Multi-persona parallel generation
Canvas integration	No canvas layer	Attached review layer on canvas
Output format	Conversational, prose + bullets	Structured UI sections with severity tags
Primary goal	Insight and awareness	Action and issue tracking
Follow-up	"Try another persona" / "Generate improved version"	"Fix X Issues" → AI iteration

07 — AI & Prompt Engineering

LLM Product Design

The hardest part wasn't the AI. It was defining what "real" feedback meant.

The LLM capability to generate text in different voices exists. The PM challenge was specifying the output precisely enough that "Demanding Client feedback" consistently felt like a demanding client - not a polite UX audit in different clothes.

What I specified in the prompt engineering brief

01

Persona mindset, not generic critique

The system prompt for each persona had to encode the worldview of the stakeholder, not just their topic focus. A skeptical CTO doesn't just ask about scalability - they actively look for over-engineering, implicit cost assumptions, and unnecessary complexity. I wrote the persona behaviours as PM specs: what they optimise for, what language patterns they use, what makes them uncomfortable. Engineering translated these into prompt constraints.

02

Structured JSON output → mapped to UI sections

The LLM output was specified as structured JSON, not free text. Objections array, questions array, issues array (with severity, element references, explanation, and fix fields). This was critical for UI consistency and for the "Fix X Issues" CTA to work correctly. I defined the schema; engineering built the parser and UI renderer.
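As a rough illustration of that schema, the sketch below parses one persona's output and rejects anything that does not match. The field names (objections, questions, issues, severity, elementRef, explanation, fix) follow the acceptance criteria in the spec; the Feedback class, the parse_feedback function, and the sample payload values are assumptions, not the parser engineering built.

```python
import json
from dataclasses import dataclass

SEVERITIES = ("Critical", "Moderate", "Low")
ISSUE_FIELDS = ("severity", "elementRef", "explanation", "fix")


@dataclass
class Feedback:
    objections: list
    questions: list
    issues: list


def parse_feedback(raw):
    """Parse one persona's output; raise ValueError if it breaks the schema."""
    # Free-text responses fail here: json.JSONDecodeError is a ValueError.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for key in ("objections", "questions", "issues"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or non-list field: {key}")
    for index, issue in enumerate(data["issues"]):
        if not isinstance(issue, dict):
            raise ValueError(f"issue {index} is not an object")
        missing = [name for name in ISSUE_FIELDS if not issue.get(name)]
        if missing:
            raise ValueError(f"issue {index} is missing {missing}")
        if issue["severity"] not in SEVERITIES:
            raise ValueError(f"issue {index} has an unknown severity")
    return Feedback(data["objections"], data["questions"], data["issues"])


# Invented sample payload, for illustration only.
sample = json.dumps({
    "objections": ["This doesn't match the brand guidelines we signed off on."],
    "questions": ["How does this look on mobile?"],
    "issues": [{
        "severity": "Critical",
        "elementRef": "header-logo",  # hypothetical element identifier
        "explanation": "Logo colour differs from the approved palette.",
        "fix": "Apply the approved brand colour to the logo.",
    }],
})
feedback = parse_feedback(sample)
print(len(feedback.issues), "issue(s) parsed")
```

Rejecting an issue with any missing field, rather than rendering a partial card, is what lets the "Fix X Issues" CTA assume every checked item has an element reference and a fix to act on.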

03

Tension and conflict - not polite feedback

Early test outputs were too constructive - the LLM defaulted to helpfulness. I added explicit instructions to surface realistic conflict: a demanding client who doesn't like the design, a CTO who questions the entire approach. The output needed to create the emotional experience of a difficult review - so the designer felt prepared, not coddled.

04

Input: screen image + prompt context

The system takes the selected screen image as the primary input, with the user's prompt context (if available) as supplementary signal. Multi-screen selection was supported from launch. For multi-persona runs, parallel generation ran each persona simultaneously - not sequentially - to keep total wait time acceptable.
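The difference between parallel and sequential generation can be shown with a short asyncio sketch. The generate_feedback coroutine is a stand-in that sleeps instead of calling any model, and the function names and the 0.1 second delay are assumptions chosen only to make the timing visible.

```python
import asyncio
import time


async def generate_feedback(persona, screen, delay=0.1):
    """Stand-in for a model call: sleeps instead of calling a real API."""
    await asyncio.sleep(delay)
    return {"persona": persona, "objections": [], "questions": [], "issues": []}


async def run_review(personas, screen):
    """Run every persona concurrently; gather() preserves input order."""
    return await asyncio.gather(
        *(generate_feedback(persona, screen) for persona in personas)
    )


personas = ["Skeptical CTO", "Demanding Client", "Conservative Board Member"]
start = time.perf_counter()
results = asyncio.run(run_review(personas, b"<screen image bytes>"))
elapsed = time.perf_counter() - start

# Three 0.1 s calls finish in roughly 0.1 s total, not 0.3 s.
print(f"{len(results)} personas in {elapsed:.2f}s")
```

With real model calls the total is bounded by the slowest persona rather than the sum of all three, which is the property the wait-time requirement depends on.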

PM ownership of AI output quality: I owned the qualitative bar for what "good" feedback looked like. This meant reviewing test outputs across all four personas, writing correction notes when outputs felt generic or too polite, and iterating the prompt spec until the output consistently passed the "does this feel like a real stakeholder?" test. Prompt quality is a product decision - not just an engineering task.

08 — Success Metrics

What success looked like - and how the feature exceeded it.

I defined the success criteria in the spec before a line of code was written. The feature launched, outperformed on every dimension, and became the most-used AI feature on the platform.

#1

Most-used AI feature post-launch, by session frequency and unique users

↑ Ret.

Users who ran Stakeholder Feedback showed measurably higher weekly return rates

3 types

Feedback types designed for immediate action - primary target: users act on at least 1 issue per run

The three success criteria I wrote in the spec

Criterion 1 - Clarity

Users understand feedback instantly

No interpretation required. Feedback is structured, specific, and written in plain language. A designer shouldn't need to decode what a persona means - the output is immediately actionable.

Criterion 2 - Authenticity

Feedback feels "real"

The qualitative bar: feedback should feel like actual stakeholder pushback, not polished UX advice. Tested by having designers read outputs and rate whether they'd heard something similar in a real review.

Criterion 3 - Actionability

At least 1 issue acted on per run

Every run should produce at least one improvement the designer can act on immediately. This was the behavioural definition of "useful" - not engagement metrics, but design iteration triggered by the feature.

Metrics I tracked

Type	Metric	Why it matters
Adoption	Feature activation rate - % of active users running at least one simulation per week	Breadth of adoption across the user base. If only power users use it, the feature has a discoverability problem, not a value problem.
Quality	Persona realism score - user-rated "felt like real feedback" on a 5-point scale	The core quality bar. If users don't believe the feedback is realistic, the preparation value disappears.
Action	Fix trigger rate - % of runs where "Fix X Issues" CTA is used	Closes the loop between feedback and iteration. High fix rate means the output was specific enough to act on.
Retention	Return rate - weekly active rate for users who used the feature vs. those who didn't	If Stakeholder Feedback improves design outcomes, users should return more. This connects feature engagement to platform stickiness.
Business	Conversion uplift - free-to-paid rate for users who use Stakeholder Feedback in trial period	High-value features that solve a real problem should accelerate upgrade decisions. Stakeholder Feedback was positioned as a premium capability.
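For reference, the first three metrics in the table reduce to simple ratios. The sketch below assumes a hypothetical event log in which each run records the user and whether the Fix CTA was used, already filtered to a single week; the record shape, function names, and sample values are invented for illustration and are not real product data.

```python
def fix_trigger_rate(runs):
    """Percentage of runs where the Fix CTA was used."""
    if not runs:
        return 0.0
    return 100 * sum(1 for run in runs if run["fix_clicked"]) / len(runs)


def activation_rate(active_users, runs):
    """Percentage of active users with at least one run in the period."""
    if not active_users:
        return 0.0
    users_with_run = {run["user"] for run in runs} & set(active_users)
    return 100 * len(users_with_run) / len(active_users)


def mean_realism(ratings):
    """Average 'felt like real feedback' rating on the 5-point scale."""
    return sum(ratings) / len(ratings) if ratings else 0.0


# Invented sample data for one week, for illustration only.
active_users = ["a", "b", "c", "d", "e"]
runs = [
    {"user": "a", "fix_clicked": True},
    {"user": "b", "fix_clicked": True},
    {"user": "a", "fix_clicked": False},
    {"user": "c", "fix_clicked": True},
]
print(fix_trigger_rate(runs), activation_rate(active_users, runs))
print(mean_realism([4, 5, 3, 4]))
```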

09 — Risks

What could go wrong - and how I designed against it.

High

Feedback too generic - LLM defaults to polite UX advice instead of stakeholder tension

The most likely failure mode. LLMs are trained to be helpful and constructive - the opposite of a skeptical CTO or demanding client. Mitigated by: explicit prompt constraints against heuristic-based language, persona behavioural specs that encode conflict, and iterative prompt refinement against a "does this feel real?" test before launch.

High

Output inconsistency across runs - same design getting different quality feedback

Non-deterministic LLM output means two runs on the same design could produce meaningfully different feedback. Mitigated by: a structured JSON output schema constraining response format, and temperature tuning to reduce variance on tone and structure while preserving variation in specific content.

High

Designers over-relying on AI feedback instead of real stakeholder validation

If designers treat AI simulation as a replacement for actual stakeholder feedback, the feature creates false confidence. Mitigated by: framing copy positioning the feature as "prepare for your review" not "replace your review," and output language that references the persona perspective rather than stating facts.

Medium

Persona feedback missing domain-specific context

A Skeptical CTO reviewing a fintech dashboard has very different concerns from one reviewing a consumer mobile app. Without domain context, feedback may be generically technical rather than specifically relevant. Mitigated by: prompt context input allowing users to provide industry and product type, and the custom persona option for highly specific stakeholder types.

Medium

Latency on multi-persona parallel runs degrading experience

Running four personas simultaneously against a high-resolution screen image creates real latency risk. Mitigated by: parallel generation architecture specification, progressive panel loading (show first persona output while others generate), and loading state design that manages perceived wait time.
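Progressive panel loading can be sketched with asyncio.as_completed, which hands back each persona's result as soon as it is ready instead of waiting for the slowest one. The generate_feedback stand-in, the stream_review function, the on_ready callback, and the per-persona delays are assumptions for illustration; no real model call is made.

```python
import asyncio


async def generate_feedback(persona, delay):
    """Stand-in for a model call with a persona-specific latency."""
    await asyncio.sleep(delay)
    return persona


async def stream_review(delays, on_ready):
    """Start all personas at once and render each one as it completes."""
    tasks = [
        asyncio.create_task(generate_feedback(persona, delay))
        for persona, delay in delays.items()
    ]
    for finished in asyncio.as_completed(tasks):
        on_ready(await finished)  # e.g. populate that persona's panel state


rendered = []
asyncio.run(stream_review(
    {
        "Skeptical CTO": 0.15,              # made-up latencies
        "Demanding Client": 0.05,
        "Conservative Board Member": 0.10,
    },
    rendered.append,
))
print(rendered)  # fastest persona first
```

The user sees the first panel after the fastest call returns, so perceived wait time tracks the quickest persona even though total time still tracks the slowest.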

Low

Canvas review layer creating visual clutter that obscures the design

If the review overlay makes the design difficult to see and evaluate, users will dismiss it immediately. Mitigated by: toggleable layer visibility, issue highlights as subtle indicators rather than large overlays, and the default panel-only view keeping the canvas clean unless the user explicitly activates canvas highlights.

10 — Lessons

What building this taught me about AI product design.

01

The hardest AI product problem is defining "good" - not building the capability

The LLM could simulate any persona from day one. What it couldn't do without precise specification was simulate them well. Defining what "Demanding Client feedback" should feel like - opinionated, branded, competitive - required deep thought about real stakeholder psychology. The most important PM contribution to this feature wasn't a roadmap or a PRD. It was a clear qualitative bar for what output quality meant, and the discipline to keep testing against it.

02

Structured output schema is a product decision, not just an engineering one

I could have left the output format as "natural language feedback in the persona voice." Instead I specified a JSON schema: objections array, questions array, issues with severity and element references. This decision determined what the UI could render, what the "Fix X Issues" CTA could trigger, and how consistently users could act on feedback. The output structure is where PM and prompt engineering meet - and it's entirely a product call.

03

Surface-specific design matters more than feature parity

The instinct was to build Studio mode and then port it to Chat. Instead, I designed Chat mode as its own surface from the start - same underlying capability, fundamentally different interaction model. Studio mode is about structured tracking. Chat mode is about fast insight. Treating them as the same feature on different surfaces would have made both worse. Treating them as distinct surfaces that share an engine made both better.

04

When AI is the product, framing determines trust

Identical output framed as "AI suggests you improve this" vs. "A demanding client would say this" produces different user behaviour. The second framing created more engagement - not because the content was different, but because it was relatable. Designers had mental models for demanding clients. The persona frame made abstract AI output feel grounded and real. Framing is a product decision with measurable impact on feature adoption.

05

Close the loop in the feature itself - don't hand off to another tool

The "Fix X Issues" CTA was the product decision I'm most proud of. It would have been easy to stop at feedback - show the issues, let users go figure out the fixes. Instead, I specified that the feature should trigger AI design iteration directly, keeping the entire feedback-to-improvement loop inside UX Pilot. Features that close their own loops retain users. Features that hand off to other tools create friction that compounds over time.

Functional Spec Excerpts

ID	As a…	I want…	So that…
US.1	Designer preparing for a client presentation	to simulate how a demanding client will react to my design before the meeting	I can anticipate objections and either fix the design or prepare a response in advance
US.2	Product designer working on a B2B SaaS dashboard	CTO-perspective feedback on my design's complexity and build implications	I can identify over-engineered elements before the technical review reveals them in front of stakeholders
US.3	Junior designer without much stakeholder experience	realistic examples of the questions a board member would ask about my design	I can prepare answers and feel confident walking into a review I haven't done many times before
US.4	Designer iterating quickly in Chat	to get stakeholder feedback by just typing "review this as a client" after generating a design	I can get simulation feedback without switching to Studio and manually selecting screens

#	Criteria	Owner	Verified by
AC.1	Each persona must generate output that reflects its defined mindset - not generic UX critique. Reviewers should be unable to swap persona labels without the content becoming inconsistent.	PM / AI	Qualitative review: PM and one designer each label outputs blind; persona identification accuracy target >90%
AC.2	Output must be structured JSON with defined schema: objections[], questions[], issues[] (severity, elementRef, explanation, fix). No free-text-only response accepted.	AI / BE	Automated schema validation on all outputs; parser error rate <1% in QA
AC.3	Multi-persona runs must use parallel generation. Total wait time for 3 personas must not exceed 1.5× single persona time.	BE / AI	Performance test: 3-persona run measured against single-persona baseline; ratio verified
AC.4	Every issue in the "To Improve" section must include severity tag, element reference, specific explanation, and actionable fix. Issues without all four fields are rejected.	AI / FE	QA: sample 20 runs across all 4 personas; verify 100% of issues have all required fields rendered
AC.5	Chat mode must detect review intent from natural language without requiring explicit commands. "Review this as a CTO", "Act as a board member", "Give stakeholder feedback" must all trigger the feature.	AI / FE	Intent detection test suite: 20 natural language variations per persona; target >95% correct trigger rate
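
Two of these criteria are numeric thresholds and can be expressed as simple checks. The sketch below covers AC.1 (blind persona identification accuracy above 90%) and AC.3 (three-persona wait time no more than 1.5× the single-persona time); the function names and sample numbers are hypothetical and are not results from the actual QA runs.

```python
def persona_id_accuracy(guesses, truth):
    """Share of outputs whose persona was identified correctly when blind."""
    correct = sum(1 for guess, actual in zip(guesses, truth) if guess == actual)
    return correct / len(truth)


def passes_ac1(guesses, truth):
    """AC.1: identification accuracy must be strictly above 90%."""
    return persona_id_accuracy(guesses, truth) > 0.90


def passes_ac3(single_persona_seconds, three_persona_seconds):
    """AC.3: a 3-persona run must take at most 1.5x the single-persona time."""
    return three_persona_seconds <= 1.5 * single_persona_seconds


# Invented sample: 20 blind-labelled outputs with one wrong guess.
truth = ["CTO", "User", "Client", "Board"] * 5
guesses = list(truth)
guesses[0] = "Board"

print(persona_id_accuracy(guesses, truth), passes_ac1(guesses, truth))
print(passes_ac3(2.0, 2.8), passes_ac3(2.0, 3.2))
```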
