01 — Context
UX Pilot is an AI-powered design platform. Feedback was the missing piece.
UX Pilot lets designers generate high-fidelity screens, wireframes, and prototypes from natural language prompts. By 2026, the generation capabilities were strong. Users could go from idea to polished UI in minutes.
But generation was only half the loop. Designers kept asking the same question: Is this actually good enough to show stakeholders? The platform had no answer. Users would generate a design, export it to Figma, share it in a Slack message, and wait for async feedback - often arriving hours before a presentation, too late to act on.
The insight: UX Pilot's users weren't just trying to design faster. They were trying to ship with confidence. The bottleneck wasn't speed of creation - it was certainty that the design would survive contact with real stakeholders.
I was the product manager responsible for AI features on the Studio and Chat surfaces. When user research surfaced this pattern - designers running informal "sanity checks" with colleagues before presenting - I recognised it as a product gap, not a workflow quirk. The platform should close that loop itself.
My Role
AI Product Manager - UX Pilot
End-to-end ownership: user research, persona definition, prompt engineering specification, output structure design, acceptance criteria, and cross-surface rollout (Studio + Chat).
Platform Context
AI-native B2B SaaS design tool
Built for designers, product teams, and agencies. Core differentiator: AI that doesn't just generate - it understands design intent and provides intelligent feedback on outputs.
02 — Problem
Designers were flying blind into stakeholder reviews.
User interviews surfaced a consistent pattern. Designers using UX Pilot would create strong visual output, then spend significant time doing mental simulation: What will the CTO say? Will the client think this looks like their brand? Will the board ask about ROI?
This mental simulation was unstructured, inconsistent, and exhausting. Senior designers were better at it - they had years of stakeholder pattern-matching. Junior designers had almost no reference points. Both groups were doing it manually, every time.
01
The feedback gap in AI design tools
AI design tools had closed the generation gap: anyone could make something that looked good. But they hadn't touched the evaluation gap. No AI tool was simulating the human reactions that determine whether a design succeeds in the real world.
02
Generic UX critique was already solved - and wasn't the problem
Heuristic-based design review tools existed. "Your contrast ratio is 3.2:1" or "Consider adding alt text" - useful, but not the problem. Stakeholders don't critique heuristics. They say things like "This doesn't look like our brand" and "Where's the business case?" The gap was stakeholder simulation, not UX auditing.
03
The insight that shaped the brief
Every designer has encountered four archetypes in stakeholder reviews: the skeptical technical decision-maker, the confused first-time viewer, the opinionated client who compares everything to competitors, and the risk-averse executive. These personas are predictable. Their concerns are mappable. An LLM prompted to embody them could generate feedback that felt real - because it was modelling real human perspectives.
Design constraint: The feedback had to feel like actual stakeholder pushback - opinionated, sometimes uncomfortable, reflecting real tension - not polished product critique. Generic or "helpful" output would miss the point entirely.
03 — Persona Architecture
LLM Persona Design
Four personas. Each with a distinct voice, focus, and type of tension.
The personas aren't aesthetic choices - they're prompt engineering primitives. Each one maps to a real stakeholder archetype designers encounter, with a defined focus area, tone, and set of typical concerns that shape how the LLM generates output. I defined the four built-in personas, plus a user-defined fifth, as part of the product spec.
🔵
Skeptical CTO
Focus: Scalability · Complexity · Build Cost
Is this over-engineered for what we need?
What's the system impact of this component?
Why is this necessary at all?
🟠
Confused First-Time User
Focus: Clarity · Usability · Cognitive Load
What am I supposed to do here?
Why is this showing up on this screen?
I don't understand this step at all
🔴
Demanding Client
Focus: Brand · Polish · Competitor Parity
This doesn't match the brand guidelines we signed off on
Their competitor does it cleaner
Where's the data point? Everything reads the same weight
🟣
Conservative Board Member
Focus: Risk · ROI · Predictability
What's the business value being created here?
Is this safe to ship to enterprise customers?
How does this affect our KPIs?
The fifth persona: user-defined
I included a custom persona option from the start. Users can define their own stakeholder - specifying focus, tone, and concerns - and the system generates feedback accordingly. This covered the long tail of stakeholder types that don't fit neatly into four archetypes: a niche industry regulator, a specific CEO's known preferences, an accessibility auditor. The four built-in personas handle 80% of cases. The custom option handles everyone else.
Prompt engineering principle: Each persona was specified not just as a description but as a behavioural constraint on the LLM - telling it what to optimise for, what language register to use, and critically, what not to generate. Persona outputs had to reflect real stakeholder tension, not polite product feedback.
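A minimal sketch of what "persona as behavioural constraint" could look like in code. The `PersonaSpec` fields, the `build_system_prompt` helper, the tone wording, and the prompt phrasing are all hypothetical illustrations, not the production spec; only the persona name, focus areas, and typical concerns come from the cards above. A user-defined persona would, under the same assumption, be just another `PersonaSpec` instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaSpec:
    """Hypothetical persona definition; field names are illustrative."""
    name: str
    focus: tuple[str, ...]             # what the persona optimises for
    tone: str                          # language register
    typical_concerns: tuple[str, ...]
    avoid: tuple[str, ...] = ()        # what the persona must NOT generate


def build_system_prompt(spec: PersonaSpec) -> str:
    """Render a persona spec as a behavioural constraint for the LLM."""
    lines = [
        f"You are reviewing a design as: {spec.name}.",
        f"Optimise for: {', '.join(spec.focus)}.",
        f"Language register: {spec.tone}.",
        "Typical concerns you raise:",
    ]
    lines.extend(f"- {concern}" for concern in spec.typical_concerns)
    if spec.avoid:
        lines.append("Do not generate:")
        lines.extend(f"- {item}" for item in spec.avoid)
    return "\n".join(lines)


SKEPTICAL_CTO = PersonaSpec(
    name="Skeptical CTO",
    focus=("Scalability", "Complexity", "Build Cost"),
    tone="blunt, technical, unimpressed",  # assumed wording
    typical_concerns=(
        "Is this over-engineered for what we need?",
        "What's the system impact of this component?",
        "Why is this necessary at all?",
    ),
    avoid=("polite product feedback", "heuristic-based UX audit language"),
)

prompt = build_system_prompt(SKEPTICAL_CTO)
print(prompt)
```

The `avoid` field is the part that matters most here: it makes the negative constraint an explicit, reviewable piece of the spec rather than something implied by the persona description.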
Product Screenshot - Persona Selection UI
The Stakeholder Feedback panel (renamed from the original "Red Team" spec label) - users select one or more personas and trigger the AI simulation with a single click. Demanding Client is checked; Board Member and Skeptical CTO are available.
04 — Feature Design
Surfaced where the design lives - on the canvas, in context.
The placement decision was important. I didn't want the feature to exist as a separate tab or modal that pulled users away from their work. Stakeholder feedback should appear where the design lives.
01
Entry point: Generate → Review & Audit → Red Team
I placed the feature inside the Generate menu, under "Review & Audit," to position it alongside other intelligent analysis tools. The keyboard shortcut (⌥Y) was specified to enable power users to invoke it mid-flow without breaking their momentum. The name "Red Team" in the spec was deliberately adversarial - to signal this wasn't a gentle review.
02
Right-panel output - contextual, dismissable, re-runnable
The output panel opens on the right, keeping the design canvas fully visible. This was a deliberate UX choice: the designer should be able to look at the design and the feedback simultaneously. The panel supports re-run (fresh simulation with same persona), export, and "Highlight All" - which attaches the feedback directly to canvas layers as a review overlay.
03
Multi-persona selection - parallel generation
Users can select multiple personas in a single run. I specified parallel generation so running three personas doesn't triple the wait time. Each persona's output appears as a separate panel state, switchable with a back/forward control. This was a key differentiator from Chat mode, which defaults to one persona at a time.
04
Canvas review layer - output attached to the design
Feedback attaches as a review layer on the canvas - not just a panel that disappears. Users can highlight specific issues, dismiss individual items, or re-run with updated designs. The goal was to make feedback a persistent part of the design artifact, not a transient notification.
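The parallel generation requirement described above can be sketched with `asyncio.gather`. This is an illustration under assumptions, not the production architecture: `generate_feedback` is a hypothetical stand-in for one LLM call, and the simulated delay replaces real network latency.

```python
import asyncio


async def generate_feedback(persona: str, screen_id: str) -> dict:
    """Hypothetical stand-in for one LLM call (network-bound in practice)."""
    await asyncio.sleep(0.05)  # simulated latency
    return {"persona": persona, "screen": screen_id, "objections": []}


async def run_personas(personas: list[str], screen_id: str) -> list[dict]:
    """Run all selected personas concurrently; results keep selection order."""
    calls = [generate_feedback(p, screen_id) for p in personas]
    return await asyncio.gather(*calls)


selected = ["Skeptical CTO", "Demanding Client", "Conservative Board Member"]
results = asyncio.run(run_personas(selected, "screen-1"))
for result in results:
    print(result["persona"])
```

Because the calls overlap, total wait is roughly that of the slowest persona rather than the sum of all three, and `gather` returns results in the order the personas were selected, which suits a back/forward panel control.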
05 — Output Structure
Four structured sections per persona. Designed for immediate action.
I specified the output format in detail - not as a UX guideline, but as a prompt engineering constraint. The LLM needed to know exactly what to generate for each section, and the UI needed to render it consistently across all personas.
Top Objections
Strong critique statements in persona voice. These are opinionated, sometimes uncomfortable - the things a demanding client says when they're not filtering. "This doesn't match the brand guidelines we signed off last month." "The navigation is different - their competitor does it cleaner." Not suggestions: objections. The LLM was instructed to surface tension, not offer advice at this stage.
Questions They'll Ask
3–8 items
Realistic questions the persona would ask in a live review. "Can we see a version with our brand colours?" "How does this look on mobile?" "Did you see what [Competitor] shipped last week?" These arm the designer with the questions before they're asked in the room - so they can prepare answers or preemptively fix the design.
To Improve
Severity-ranked
Each issue is tagged Critical, Moderate, or Low, with a specific issue, short explanation, and a clear actionable fix. The element count (e.g. "2 elements") links the issue to the actual canvas components. This is the only section that drives direct action - the "Fix X Issues" CTA at the bottom triggers design iteration. Critical issues are checked by default; lower severity items are opt-in.
"Fix 3 Issues" - a single CTA that triggers AI-driven design iteration for the checked items. The issue count in the label reflects selected items, updating dynamically as users check/uncheck. This closes the loop: from feedback to improved design without leaving the feature.
Product Screenshot - Feedback Output Panel
The Demanding Client output - Top Objections, Questions They'll Ask, and the severity-ranked To Improve section. The "Fix 3 Issues" CTA at the bottom triggers AI design iteration for the selected items. Re-run, Export, and Highlight All controls appear at the top.
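The selection behaviour in the To Improve section (severity ranking, Critical checked by default, a CTA count that tracks checked items) can be sketched as follows. The `SelectableIssue` type, the helper names, the singular "Issue" handling, and the issue titles are all made up for illustration.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"Critical": 0, "Moderate": 1, "Low": 2}


@dataclass
class SelectableIssue:
    """Hypothetical UI model for one To Improve item."""
    title: str
    severity: str          # "Critical", "Moderate", or "Low"
    selected: bool = False


def rank_and_preselect(issues):
    """Sort by severity and check Critical issues by default."""
    ranked = sorted(issues, key=lambda issue: SEVERITY_ORDER[issue.severity])
    for issue in ranked:
        issue.selected = issue.severity == "Critical"
    return ranked


def cta_label(issues):
    """Build the CTA label from the number of currently checked issues."""
    count = sum(issue.selected for issue in issues)
    noun = "Issue" if count == 1 else "Issues"  # assumed pluralisation rule
    return f"Fix {count} {noun}"


issues = rank_and_preselect([
    SelectableIssue("Body text and data share one visual weight", "Moderate"),
    SelectableIssue("Primary colour differs from brand guidelines", "Critical"),
    SelectableIssue("Icon style is inconsistent", "Low"),
])
label_after_defaults = cta_label(issues)
issues[1].selected = True  # user opts in to the Moderate issue
label_after_opt_in = cta_label(issues)
print(label_after_defaults, "->", label_after_opt_in)
```

How the CTA should behave with zero checked items (disabled, hidden, or relabelled) is not covered here and would need its own decision.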
06 — Chat Mode
The same simulation engine - lighter, faster, conversational.
I designed Chat mode as a distinct surface of the same capability, not an afterthought. The core logic is identical - same personas, same LLM behaviour - but the interaction model is different. Chat is faster, single-persona by default, and conversational in output format.
Design decision: Chat mode doesn't require screen selection. It auto-attaches to the most recently generated design and responds to natural language intent - "Review this as a demanding client" triggers the full simulation without any UI navigation. Speed and frictionlessness were the primary design goals for Chat mode.
Three trigger patterns - all intentional
Explicit Prompt
"Act as a CTO and critique this"
Intent detection: persona keyword + review intent. Runs immediately. No additional input needed. Most common trigger in production.
System Suggestion
Quick actions after generation
"Run Stakeholder Review" / "Test with Client" appears contextually after design generation. Reduces friction for users who want feedback but didn't think to ask.
Follow-up Trigger
"Will this pass client review?"
Softer signals that imply review intent. System suggests persona options before running - quick reply buttons, not a form.
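A rough sketch of how the explicit and follow-up triggers could be told apart. Everything here is an assumption for illustration: the keyword lists, the `classify` function, and especially the rule that a question is treated as a softer signal. A production system would more plausibly use an LLM or a trained classifier for intent. The System Suggestion trigger is UI-driven, so it is not modelled.

```python
import re

# Hypothetical keyword map; the real detection logic is not specified here.
PERSONA_KEYWORDS = {
    "Skeptical CTO": ("cto",),
    "Confused First-Time User": ("first-time user", "new user"),
    "Demanding Client": ("client",),
    "Conservative Board Member": ("board",),
}
REVIEW_WORDS = ("review", "critique", "feedback")


def find_persona(text: str):
    """Return the first built-in persona whose keyword appears as a word."""
    for name, keywords in PERSONA_KEYWORDS.items():
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return name
    return None


def classify(message: str):
    """Map a chat message to ('run', persona), ('suggest_personas', None),
    or ('none', None)."""
    text = message.lower().strip()
    persona = find_persona(text)
    mentions_review = any(word in text for word in REVIEW_WORDS)
    is_question = text.endswith("?")  # assumed proxy for a softer signal
    if persona and mentions_review and not is_question:
        return ("run", persona)              # explicit prompt: run now
    if mentions_review:
        return ("suggest_personas", None)    # follow-up: offer quick replies
    return ("none", None)


print(classify("Act as a CTO and critique this"))
print(classify("Will this pass client review?"))
```

Note that "Will this pass client review?" contains both a persona keyword and a review word, so plain keyword matching alone would run it immediately; the question check is what routes it to persona suggestions instead.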
Chat vs Studio - designed as complements, not duplicates
| Dimension | Chat Mode | Studio Mode |
| Speed | Fast, lightweight, conversational | Structured, detailed, panel-based |
| Personas | 1 persona at a time (default) | Multi-persona parallel generation |
| Canvas integration | No canvas layer | Attached review layer on canvas |
| Output format | Conversational, prose + bullets | Structured UI sections with severity tags |
| Primary goal | Insight and awareness | Action and issue tracking |
| Follow-up | "Try another persona" / "Generate improved version" | "Fix X Issues" → AI iteration |
07 — AI & Prompt Engineering
LLM Product Design
The hardest part wasn't the AI. It was defining what "real" feedback meant.
LLMs could already generate text in different voices. The PM challenge was specifying the output precisely enough that "Demanding Client feedback" consistently felt like a demanding client - not a polite UX audit in different clothes.
What I specified in the prompt engineering brief
01
Persona mindset, not generic critique
The system prompt for each persona had to encode the worldview of the stakeholder, not just their topic focus. A skeptical CTO doesn't just ask about scalability - they actively look for over-engineering, implicit cost assumptions, and unnecessary complexity. I wrote the persona behaviours as PM specs: what they optimise for, what language patterns they use, what makes them uncomfortable. Engineering translated these into prompt constraints.
02
Structured JSON output → mapped to UI sections
The LLM output was specified as structured JSON, not free text. Objections array, questions array, issues array (with severity, element references, explanation, and fix fields). This was critical for UI consistency and for the "Fix X Issues" CTA to work correctly. I defined the schema; engineering built the parser and UI renderer.
03
Tension and conflict - not polite feedback
Early test outputs were too constructive - the LLM defaulted to helpfulness. I added explicit instructions to surface realistic conflict: a demanding client who doesn't like the design, a CTO who questions the entire approach. The output needed to create the emotional experience of a difficult review - so the designer felt prepared, not coddled.
04
Input: screen image + prompt context
The system takes the selected screen image as the primary input, with the user's prompt context (if available) as supplementary signal. Multi-screen selection was supported from launch. For multi-persona runs, parallel generation ran each persona simultaneously - not sequentially - to keep total wait time acceptable.
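The structured output described in item 02 could be parsed and validated along these lines. This is a sketch under assumptions: the key names (`objections`, `questions`, `issues`, `severity`, `issue`, `explanation`, `fix`, `elements`) mirror the fields named above but are not the actual schema, and the sample payload is invented.

```python
import json
from dataclasses import dataclass

SEVERITIES = {"Critical", "Moderate", "Low"}


@dataclass
class Issue:
    severity: str
    issue: str
    explanation: str
    fix: str
    elements: list[str]      # references to canvas components


@dataclass
class PersonaFeedback:
    objections: list[str]
    questions: list[str]
    issues: list[Issue]


def parse_feedback(raw: str) -> PersonaFeedback:
    """Parse LLM output; raise if it does not match the expected shape."""
    data = json.loads(raw)
    issues = []
    for item in data["issues"]:
        if item["severity"] not in SEVERITIES:
            raise ValueError(f"unknown severity: {item['severity']!r}")
        issues.append(Issue(
            severity=item["severity"],
            issue=item["issue"],
            explanation=item["explanation"],
            fix=item["fix"],
            elements=list(item["elements"]),
        ))
    return PersonaFeedback(
        objections=list(data["objections"]),
        questions=list(data["questions"]),
        issues=issues,
    )


sample = json.dumps({
    "objections": ["This doesn't match the brand guidelines we signed off on"],
    "questions": ["How does this look on mobile?"],
    "issues": [{
        "severity": "Critical",
        "issue": "Off-brand primary colour",
        "explanation": "The button colour differs from the approved palette.",
        "fix": "Apply the approved brand colour to primary actions.",
        "elements": ["layer-12", "layer-31"],
    }],
})
feedback = parse_feedback(sample)
print(len(feedback.issues), "issue(s),", len(feedback.issues[0].elements), "elements")
```

Rejecting malformed output at the parser is what lets the UI render every persona consistently and lets the "Fix X Issues" CTA trust the element references it receives.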
PM ownership of AI output quality: I owned the qualitative bar for what "good" feedback looked like. This meant reviewing test outputs across all four personas, writing correction notes when outputs felt generic or too polite, and iterating the prompt spec until the output consistently passed the "does this feel like a real stakeholder?" test. Prompt quality is a product decision - not just an engineering task.
08 — Success Metrics
What success looked like - and how the feature exceeded it.
I defined the success criteria in the spec before a line of code was written. The feature launched, outperformed on every dimension, and became the most-used AI feature on the platform.
#1
Most-used AI feature post-launch, by session frequency and unique users
↑ Ret.
Users who ran Stakeholder Feedback showed measurably higher weekly return rates
3 types
Feedback types designed for immediate action. Primary target: users act on at least 1 issue per run.
The three success criteria I wrote in the spec
Criterion 1 - Clarity
Users understand feedback instantly
No interpretation required. Feedback is structured, specific, and written in plain language. A designer shouldn't need to decode what a persona means - the output is immediately actionable.
Criterion 2 - Authenticity
Feedback feels "real"
The qualitative bar: feedback should feel like actual stakeholder pushback, not polished UX advice. Tested by having designers read outputs and rate whether they'd heard something similar in a real review.
Criterion 3 - Actionability
At least 1 issue acted on per run
Every run should produce at least one improvement the designer can act on immediately. This was the behavioural definition of "useful" - not engagement metrics, but design iteration triggered by the feature.
Metrics I tracked
| Type | Metric | Why it matters |
| Adoption | Feature activation rate - % of active users running at least one simulation per week | Breadth of adoption across the user base. If only power users use it, the feature has a discoverability problem, not a value problem. |
| Quality | Persona realism score - user-rated "felt like real feedback" on a 5-point scale | The core quality bar. If users don't believe the feedback is realistic, the preparation value disappears. |
| Action | Fix trigger rate - % of runs where "Fix X Issues" CTA is used | Closes the loop between feedback and iteration. High fix rate means the output was specific enough to act on. |
| Retention | Return rate - weekly active rate for users who used the feature vs. those who didn't | If Stakeholder Feedback improves design outcomes, users should return more. This connects feature engagement to platform stickiness. |
| Business | Conversion uplift - free-to-paid rate for users who use Stakeholder Feedback in trial period | High-value features that solve a real problem should accelerate upgrade decisions. Stakeholder Feedback was positioned as a premium capability. |
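For concreteness, the first four metrics in the table reduce to simple ratios. The function names and event shapes below are hypothetical, and the numbers are toy values chosen to make the arithmetic easy to follow, not product data.

```python
def rate(numerator: float, denominator: float) -> float:
    """Safe ratio that returns 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def activation_rate(weekly_active: set[str], ran_simulation: set[str]) -> float:
    """Share of weekly active users who ran at least one simulation."""
    return rate(len(weekly_active & ran_simulation), len(weekly_active))


def fix_trigger_rate(runs: list[dict]) -> float:
    """Share of runs where the 'Fix X Issues' CTA was used."""
    return rate(sum(1 for run in runs if run["fix_used"]), len(runs))


def mean_realism(scores: list[int]) -> float:
    """Average 'felt like real feedback' rating on the 5-point scale."""
    return rate(sum(scores), len(scores))


def return_rate_gap(users_returned: int, users_total: int,
                    others_returned: int, others_total: int) -> float:
    """Weekly return rate of feature users minus that of non-users."""
    return rate(users_returned, users_total) - rate(others_returned, others_total)


# Toy inputs for illustration only.
activation = activation_rate({"u1", "u2", "u3", "u4"}, {"u1", "u3", "u9"})
fix_rate = fix_trigger_rate([{"fix_used": True}, {"fix_used": False},
                             {"fix_used": True}, {"fix_used": True}])
realism = mean_realism([4, 5, 3, 4])
gap = return_rate_gap(3, 4, 1, 4)
print(activation, fix_rate, realism, gap)
```

A raw return-rate gap like this shows correlation only; users who seek out feedback may already be more engaged, so a causal reading would need a holdout or experiment.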
09 — Risks
What could go wrong - and how I designed against it.
High
Feedback too generic - LLM defaults to polite UX advice instead of stakeholder tension
The most likely failure mode. LLMs are trained to be helpful and constructive - the opposite of a skeptical CTO or demanding client. Mitigated by: explicit prompt constraints against heuristic-based language, persona behavioural specs that encode conflict, and iterative prompt refinement against a "does this feel real?" test before launch.
High
Output inconsistency across runs - same design getting different quality feedback
Non-deterministic LLM output means two runs on the same design could produce meaningfully different feedback. Mitigated by: structured JSON output schema constraining response format, temperature tuning to reduce variance on tone and structure while preserving variation in specific content.
High
Designers over-relying on AI feedback instead of real stakeholder validation
If designers treat AI simulation as a replacement for actual stakeholder feedback, the feature creates false confidence. Mitigated by: framing copy positioning the feature as "prepare for your review" not "replace your review," and output language that references the persona perspective rather than stating facts.
Medium
Persona feedback missing domain-specific context
A Skeptical CTO reviewing a fintech dashboard has very different concerns from one reviewing a consumer mobile app. Without domain context, feedback may be generically technical rather than specifically relevant. Mitigated by: prompt context input allowing users to provide industry and product type, and the custom persona option for highly specific stakeholder types.
Medium
Latency on multi-persona parallel runs degrading experience
Running four personas simultaneously against a high-resolution screen image creates real latency risk. Mitigated by: parallel generation architecture specification, progressive panel loading (show first persona output while others generate), and loading state design that manages perceived wait time.
Low
Canvas review layer creating visual clutter that obscures the design
If the review overlay makes the design difficult to see and evaluate, users will dismiss it immediately. Mitigated by: toggleable layer visibility, issue highlights as subtle indicators rather than large overlays, and the default panel-only view keeping the canvas clean unless the user explicitly activates canvas highlights.
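The progressive panel loading mitigation listed under the latency risk above can be sketched with `asyncio.as_completed`, which yields each persona's result as soon as it is ready instead of waiting for the slowest one. `generate_feedback` and the delays are hypothetical stand-ins for real LLM calls.

```python
import asyncio


async def generate_feedback(persona: str, delay: float) -> str:
    """Hypothetical stand-in for one LLM call with a given latency."""
    await asyncio.sleep(delay)
    return persona


async def stream_results(latencies: dict[str, float]) -> list[str]:
    """Collect persona outputs in completion order, not selection order."""
    tasks = [
        asyncio.create_task(generate_feedback(persona, delay))
        for persona, delay in latencies.items()
    ]
    shown = []
    for finished in asyncio.as_completed(tasks):
        persona = await finished
        shown.append(persona)  # a UI would render this persona's panel here
    return shown


# Simulated latencies in seconds; the slowest persona is selected first.
order = asyncio.run(stream_results({
    "Skeptical CTO": 0.15,
    "Demanding Client": 0.01,
    "Conservative Board Member": 0.08,
}))
print(order)
```

The first panel appears after the fastest call returns, so perceived wait is set by the quickest persona while the rest continue loading behind it.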
10 — Lessons
What building this taught me about AI product design.
01
The hardest AI product problem is defining "good" - not building the capability
The LLM could simulate any persona from day one. What it couldn't do without precise specification was simulate them well. Defining what "Demanding Client feedback" should feel like - opinionated, branded, competitive - required deep thought about real stakeholder psychology. The most important PM contribution to this feature wasn't a roadmap or a PRD. It was a clear qualitative bar for what output quality meant, and the discipline to keep testing against it.
02
Structured output schema is a product decision, not just an engineering one
I could have left the output format as "natural language feedback in the persona voice." Instead I specified a JSON schema: objections array, questions array, issues with severity and element references. This decision determined what the UI could render, what the "Fix X Issues" CTA could trigger, and how consistently users could act on feedback. The output structure is where PM and prompt engineering meet - and it's entirely a product call.
03
Surface-specific design matters more than feature parity
The instinct was to build Studio mode and then port it to Chat. Instead, I designed Chat mode as its own surface from the start - same underlying capability, fundamentally different interaction model. Studio mode is about structured tracking. Chat mode is about fast insight. Treating them as the same feature on different surfaces would have made both worse. Treating them as distinct surfaces that share an engine made both better.
04
When AI is the product, framing determines trust
Identical output framed as "AI suggests you improve this" vs. "A demanding client would say this" produces different user behaviour. The second framing created more engagement - not because the content was different, but because it was relatable. Designers had mental models for demanding clients. The persona frame made abstract AI output feel grounded and real. Framing is a product decision with measurable impact on feature adoption.
05
Close the loop in the feature itself - don't hand off to another tool
The "Fix X Issues" CTA was the product decision I'm most proud of. It would have been easy to stop at feedback - show the issues, let users go figure out the fixes. Instead, I specified that the feature should trigger AI design iteration directly, keeping the entire feedback-to-improvement loop inside UX Pilot. Features that close their own loops retain users. Features that hand off to other tools create friction that compounds over time.