Bias Detection Prompt
AI Safety & Governance
The Prompt
The Logic
1. Mapping Bias Surfaces Prevents Tunnel Vision on “Training Data Only”
WHY IT WORKS: Many bias audits focus only on datasets and ignore other bias surfaces: prompts that encode stereotypes, UX that nudges certain groups, decision thresholds that hurt minorities, or outputs that frame groups negatively. Bias is a system property, not a dataset property. A bias surface map forces a comprehensive audit across the full pipeline (data → model → prompt → policy → interface → human review). This reduces missed risks and prevents teams from declaring “we checked fairness” while only checking one component. In practice, many production harms come from non-data surfaces (e.g., a prompt instructing “prioritize candidates from top schools” or UI hiding appeal options).
EXAMPLE: Resume screening tool: the dataset may be balanced, but the prompt tells the model to prefer “polished writing,” which disadvantages non-native speakers. Or the rule “reject if employment gap > 6 months” disproportionately affects caregivers and people with disabilities. A bias surface map would flag: prompt bias (style preference), rule bias (gap threshold), UX bias (no appeal for rejected candidates), and HITL bias (reviewers only see a summary that removes context). Audits that include surface mapping typically find 2–4× more actionable issues than data-only audits because they capture where bias is introduced during deployment.
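A bias surface map can be as simple as a structure that refuses to let a pipeline stage go unvisited. A minimal sketch, assuming the six stages named above; the findings and severity labels are illustrative:

```python
from dataclasses import dataclass, field

# The full pipeline from the section above; every stage must be audited.
STAGES = ["data", "model", "prompt", "policy", "interface", "human_review"]

@dataclass
class SurfaceFinding:
    stage: str
    description: str
    severity: str  # "high" | "med" | "low"

@dataclass
class BiasSurfaceMap:
    findings: list = field(default_factory=list)

    def add(self, stage, description, severity):
        assert stage in STAGES, f"unknown bias surface: {stage}"
        self.findings.append(SurfaceFinding(stage, description, severity))

    def uncovered_stages(self):
        """Stages with no recorded finding — the audit has not visited them yet."""
        seen = {f.stage for f in self.findings}
        return [s for s in STAGES if s not in seen]

audit = BiasSurfaceMap()
audit.add("prompt", "prefers 'polished writing'; penalizes non-native speakers", "high")
audit.add("policy", "reject if employment gap > 6 months; hits caregivers", "high")
print(audit.uncovered_stages())  # stages still missing from the audit
```

The point of `uncovered_stages` is governance, not cleverness: an audit report that cannot show a finding (even “no issue found”) for every stage is incomplete by construction.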
2. Harm Scenarios Translate Abstract Fairness Into Concrete Human Impact
WHY IT WORKS: Fairness metrics are necessary but not sufficient. Teams may optimize a metric while still causing real harm (e.g., equal opportunity met, but outputs contain humiliating stereotypes). Harm scenario analysis focuses on how bias manifests in real user journeys: who is affected, what they experience, and what downstream consequences occur. This broadens the audit from “numbers” to “outcomes” and reveals dignity harms, access harms, and procedural justice harms (lack of explanation, lack of appeal). In high-stakes settings, scenario-based testing is a core safety technique because it catches failures not captured by aggregate metrics.
EXAMPLE: Loan pre-qualification assistant: Harm scenario 1: a user with accented English is misclassified as “high risk” due to writing style; consequence: discouragement from applying, potential financial exclusion. Scenario 2: a user mentions disability-related income; model assumes instability and recommends denial; consequence: discriminatory guidance. Scenario 3: model suggests illegal reasons (“because you are older”); consequence: legal liability and user harm. By documenting 5-10 scenarios, you can create targeted tests and controls: neutral language rewrites, feature removal (writing style), confidence thresholds with human review, and an appeal flow. Teams that use scenario-based fairness audits typically reduce complaint rates faster because they address what users actually experience, not just what dashboards show.
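Documented scenarios only pay off if each one is tied to a control and a targeted test. A hedged sketch of a scenario registry (the field names and entries are illustrative, drawn from the loan example above):

```python
# Each harm scenario records who is affected, what they experience, the
# consequence, the planned control, and the test that will verify the control.
scenarios = [
    {"who": "applicant with accented English",
     "experience": "misclassified 'high risk' from writing style",
     "consequence": "discouragement, financial exclusion",
     "control": "remove style features; route to human review",
     "test": "counterfactual: native vs non-native phrasing"},
    {"who": "applicant citing disability-related income",
     "experience": "model assumes income instability",
     "consequence": "discriminatory denial guidance",
     "control": "neutral-language rewrite of guidance",
     "test": None},  # documented but not yet backed by a test
]

def untested(scenarios):
    """Scenarios whose controls have no targeted test — audit gaps."""
    return [s["who"] for s in scenarios if not s["test"]]

print(untested(scenarios))
```

Running `untested` before sign-off turns the 5–10 scenarios into a checklist rather than a narrative appendix.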
3. Counterfactual Testing Detects Disparate Treatment Even When Metrics Look Fine
WHY IT WORKS: Aggregate fairness metrics can hide individual-level discrimination. Counterfactual testing—holding all factors constant except a protected attribute—detects disparate treatment directly. It’s a practical method when protected attributes are not stored or are legally sensitive: you can simulate changes in names, pronouns, age references, disability mentions, or nationality while keeping qualifications identical. This reveals whether the system treats equivalent users differently. Counterfactual tests are also explainable to stakeholders (“same resume, different name → different outcome”), making them powerful for governance and remediation.
EXAMPLE: Create pairs: “John” vs. “Fatima,” “he/him” vs. “she/her,” “US-born” vs. “immigrant,” “native speaker” vs. “non-native,” “no disability” vs. “wheelchair user,” with identical qualifications. If outcomes differ, you have direct evidence of disparate treatment. In practice, teams often discover that “communication skills” scores drop 10–30% for non-native language variants or that “culture fit” ratings shift based on demographic cues. Counterfactual testing can reduce hidden discriminatory behavior by guiding prompt changes (remove “polish” preference), feature changes (remove name), and training data augmentation (include diverse writing styles). A robust audit includes at least 10 counterfactual cases per decision category and tracks pass/fail over time.
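The mechanics of a counterfactual harness are small: score paired inputs that differ only in a demographic cue and flag divergent outcomes. A minimal sketch, where `score` is a stand-in for your model and the toy implementation is deliberately biased to show what a failure looks like:

```python
# Run a scoring function on counterfactual pairs and collect pairs whose
# outcomes diverge beyond a tolerance — direct evidence of disparate treatment.
def counterfactual_failures(score, pairs, tolerance=0.0):
    failures = []
    for base, variant in pairs:
        a, b = score(base), score(variant)
        if abs(a - b) > tolerance:
            failures.append((base, variant, a, b))
    return failures

# Toy model that keys on a name token — exactly the behavior this test exposes.
def score(resume):
    return 0.9 if "John" in resume else 0.7

pairs = [("John — 5 yrs Python, MSc", "Fatima — 5 yrs Python, MSc")]
print(counterfactual_failures(score, pairs))  # one failing pair
```

In CI, the pass criterion is simply `counterfactual_failures(...) == []` across all pairs, which makes the governance story (“same resume, different name → same outcome”) mechanically checkable.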
4. Proxy Variable Audits Catch “Legal Discrimination by Indirection”
WHY IT WORKS: Many systems avoid explicit protected attributes but still discriminate through proxies: ZIP code ↔ race/income, education pedigree ↔ socioeconomic status, device type ↔ age/income, gaps in employment ↔ caregiving/disability, language proficiency ↔ nationality. Proxy audits identify features that encode sensitive attributes and recommend mitigation: drop features, reduce weight, bucketize, or add fairness constraints. Without proxy audits, teams can pass “we don’t use race” checks while still producing racially disparate outcomes—creating compliance and reputational risk.
EXAMPLE: Insurance pricing model uses “credit score” and “home ownership.” These correlate strongly with socioeconomic status and, in many contexts, race. A proxy audit flags them as high-risk proxies and recommends: use alternative risk indicators (driving history), apply fairness constraints, and monitor group outcomes. In NLP systems, writing style and grammar are proxies for education and language background; you can mitigate by focusing scoring on factual content and job-relevant evidence, not style. Proxy audits are often the difference between “good intentions” and real fairness outcomes.
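A proxy audit needs a labeled audit sample (sensitive attributes collected for testing, even if unused in production) and a measure of association between each feature and the attribute. A dependency-free sketch using a simple rate-difference statistic; the rows and feature names are illustrative:

```python
# Measure how strongly a binary feature separates sensitive groups on an
# audit sample. A difference near 0 means low proxy risk; large gaps flag
# the feature for mitigation (drop, bucketize, reweight, constrain).
def proxy_risk(rows, feature, sensitive):
    """Absolute difference in P(feature=1) between the two sensitive groups."""
    g1 = [r[feature] for r in rows if r[sensitive] == 1]
    g0 = [r[feature] for r in rows if r[sensitive] == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

rows = [
    {"zip_high_income": 1, "group": 0}, {"zip_high_income": 1, "group": 0},
    {"zip_high_income": 0, "group": 1}, {"zip_high_income": 1, "group": 1},
]
print(proxy_risk(rows, "zip_high_income", "group"))  # 0.5 on this toy sample
```

A fuller audit would use mutual information or a classifier's AUC for predicting the sensitive attribute from each feature, but the rate-difference version is enough to rank candidate proxies for review.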
5. Ranked Remediation Options Convert Findings Into an Action Plan
WHY IT WORKS: Bias audits often end with a list of problems but no prioritized fixes. A ranked remediation plan (impact vs. effort) turns audit results into execution. It also supports governance: leaders can decide what to ship now and what requires further work. Effective remediation isn’t always “retrain the model”—often it’s prompt changes, UX disclosures, adding human review on sensitive cases, or changing thresholds. Ranking also makes trade-offs explicit: improving fairness might reduce accuracy; leadership must sign off on acceptable trade-offs.
EXAMPLE: For a resume screening assistant, a ranked plan might be: (1) Immediate: remove “polished writing” criterion; add disclaimer and appeal link; require human review for borderline cases (confidence 0.70–0.85). (2) Short-term (2–4 weeks): add counterfactual test suite to CI; build fairness dashboard; adjust thresholds by subgroup (if legal). (3) Medium-term (1–2 months): collect representative data and fine-tune; retrain with diverse writing styles; add structured scoring rubric. This sequencing can reduce disparate outcomes quickly while longer-term fixes are built. Teams that prioritize fixes typically reduce fairness incident backlog by 50–70% within 60 days versus audits that produce unprioritized lists.
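The impact-vs-effort ranking above reduces to a sort key: highest impact first, lowest effort as the tie-breaker. A minimal sketch with illustrative scores on a 1–3 scale:

```python
# Rank remediation items: high impact first; among equal impact, low effort first.
fixes = [
    {"fix": "retrain with diverse writing styles",  "impact": 2, "effort": 3},
    {"fix": "remove 'polished writing' criterion",  "impact": 3, "effort": 1},
    {"fix": "counterfactual test suite in CI",      "impact": 3, "effort": 2},
]
ranked = sorted(fixes, key=lambda f: (-f["impact"], f["effort"]))
for f in ranked:
    print(f"{f['fix']}  (impact {f['impact']}, effort {f['effort']})")
```

Even this trivial ranking forces the team to score each finding, which is where the real governance conversation (and the trade-off sign-off) happens.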
6. Monitoring Makes Fairness Sustainable as Data and Users Change
WHY IT WORKS: Fairness is not static. Data drift, new user segments, policy changes, and model updates can reintroduce bias. Monitoring fairness metrics by slice (group, region, language, device) detects emerging problems early. Alert thresholds and incident runbooks ensure rapid response. Without monitoring, teams discover bias only after external complaints or audits—late and costly. With monitoring, you catch issues when the effect is small and reversible.
EXAMPLE: Set alerts: “If adverse impact ratio < 0.85 for 7 days → SEV-2, investigate; if < 0.80 → SEV-1, suspend automated decisions and route to human review.” Track: decision rates by group, false positive/negative rates by group (where labels exist), complaint rates by demographic proxies (language, region). In content moderation, monitor false positives on dialect and reclaimed slurs. Teams with monitoring often cut time-to-detection from months to days and reduce severity by containing issues early. This is governance as an operational system, not a one-time audit.
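The alert thresholds above map directly onto a small monitoring function. A sketch using the adverse impact ratio (minimum group selection rate over maximum, in the style of the four-fifths rule); the group names and rates are illustrative:

```python
# Adverse impact ratio (AIR) per monitoring window, mapped to the alert
# severities defined above: < 0.80 → SEV-1, < 0.85 → SEV-2, else OK.
def adverse_impact_ratio(selection_rates):
    """min group selection rate / max group selection rate."""
    rates = list(selection_rates.values())
    return min(rates) / max(rates)

def severity(air):
    if air < 0.80:
        return "SEV-1: suspend automated decisions, route to human review"
    if air < 0.85:
        return "SEV-2: investigate"
    return "OK"

rates = {"group_a": 0.30, "group_b": 0.25}
air = adverse_impact_ratio(rates)
print(round(air, 3), "->", severity(air))
```

In production this runs per slice (group × region × language) on a rolling window, with the “for 7 days” persistence condition from the runbook applied before paging anyone.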
Example Output Preview
Sample: Bias Audit of Resume Screening Prompt
Artifact: “Rank candidates by communication skills, leadership, culture fit, and professionalism. Prefer candidates with polished writing and top-tier universities.”
Top Risks: (1) “Culture fit” invites stereotyping (High). (2) “Polished writing” penalizes non-native speakers (High). (3) “Top-tier universities” proxies socioeconomic privilege (High). (4) Unclear appeal process (Med). (5) No monitoring (Med).
Metric Recommendations: adverse impact ratio ≥ 0.80; equal opportunity difference ≤ 0.05; calibration gap ≤ 0.03; subgroup pass-through rates tracked weekly.
Counterfactual Test Cases (10): Identical resume pairs differing only by: name (John/Fatima), pronouns (he/she), disability mention (none/wheelchair), nationality (US-born/immigrant), age cue (“graduated 1998/2018”), language variant (native/non-native grammar), caregiving gap (none/1-year gap), address (ZIP A/ZIP B), school pedigree (state school/Ivy), accent mention (none/“English as second language”).
Remediation Plan: remove “culture fit” and “polished writing”; replace with structured rubric; allow “communication” scoring only on job-relevant clarity; require human review when confidence 0.70–0.85; add appeal link; add fairness dashboard and CI tests.
Decision: Block release until prompt rewritten and counterfactual suite passes (0 failures across 10 tests).
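The metric targets in the sample above can be checked mechanically. A sketch of the equal opportunity difference (gap in true positive rate between the best- and worst-served groups, computed only over genuinely qualified candidates); the groups and predictions are illustrative:

```python
# Equal opportunity difference: max TPR minus min TPR across groups,
# where TPR is measured only on examples with positive labels (qualified).
def tpr(preds, labels):
    hits = [p for p, y in zip(preds, labels) if y == 1]
    return sum(hits) / len(hits)

def equal_opportunity_diff(groups):
    """groups: {name: (binary_predictions, binary_labels)}."""
    tprs = [tpr(p, y) for p, y in groups.values()]
    return max(tprs) - min(tprs)

groups = {
    "group_a": ([1, 1, 0, 1], [1, 1, 0, 1]),  # TPR = 1.0
    "group_b": ([1, 0, 0, 1], [1, 1, 0, 1]),  # TPR = 2/3
}
print(round(equal_opportunity_diff(groups), 3))  # well above the 0.05 target
```

Against the sample's target of ≤ 0.05, this toy system would fail the gate and block release until the gap closes.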
Prompt Chain Strategy
Step 1: Bias Audit & Risk Diagnosis
Prompt: Use the main Bias Detection Prompt on the artifact you want to evaluate.
Expected Output: A comprehensive bias audit with risks, scenarios, metrics, counterfactual tests, remediation plan, and ship/block decision criteria.
Step 2: Generate a Counterfactual Test Suite + Evaluation Harness
Prompt: “Generate 50 counterfactual test cases in JSON for this context. Then propose an evaluation harness: how to run the model on pairs, compare outputs, define pass/fail, and log failures. Include thresholds and a weekly reporting template.”
Expected Output: A repeatable test suite that can be run in CI/CD and as a recurring audit to catch regressions.
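One possible shape for a case in that JSON suite, so the harness can load, run, and report pairs uniformly (field names here are an assumption, not a fixed schema):

```python
import json

# Illustrative counterfactual test case: one varied dimension per case,
# identical qualifications, and an explicit machine-checkable pass rule.
case = {
    "id": "cf-001",
    "dimension": "name",
    "base":    {"resume": "John — 5 yrs Python, MSc"},
    "variant": {"resume": "Fatima — 5 yrs Python, MSc"},
    "pass_if": "abs(score_base - score_variant) <= 0.02",
}
print(json.dumps(case, indent=2))
```

Keeping one dimension per case (name, pronouns, disability mention, and so on) makes failures attributable: the report can say which cue caused the divergence, not just that a pair failed.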
Step 3: Build a Fairness Monitoring Dashboard & Incident Runbook
Prompt: “Design a fairness monitoring dashboard for this system: metrics, slices, alert thresholds, investigation steps, and incident response playbook. Include example SQL queries and a weekly executive summary format.”
Expected Output: An operational governance package that sustains fairness over time, not just pre-launch.
Human-in-the-Loop Refinements
Collect 100–300 Human-Labeled “Disputed Cases” for Calibration
Most bias issues surface in ambiguous cases. Collect a set of disputed examples and have multiple reviewers label them independently. Use these labels to calibrate confidence thresholds and reduce subjective bias. Technique: measure inter-rater agreement; if kappa < 0.6, your rubric is unclear and needs revision.
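The kappa check referenced above is a few lines for the two-reviewer, binary-label case. A self-contained sketch (the reviewer labels are illustrative):

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# reviewers would reach by chance given their individual label rates.
def cohens_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                   # each reviewer's positive rate
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)              # chance agreement
    return (po - pe) / (1 - pe)

r1 = [1, 1, 0, 1, 0, 0, 1, 0]
r2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(r1, r2))  # 0.5 — below the 0.6 bar, so revise the rubric
```

For more than two reviewers or more than two labels, Fleiss' kappa or Krippendorff's alpha are the usual generalizations; library implementations exist (e.g., `sklearn.metrics.cohen_kappa_score` for the pairwise case).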
Run “Bias Bug Bounties” With Diverse Reviewers
Invite diverse internal teams or external testers to find biased outputs. Incentivize reporting. Technique: provide categories (stereotyping, exclusion, disrespect, disparate treatment) and require reproduction steps. This catches issues your team won’t anticipate.
Use Structured Rubrics Instead of Free-Form “Quality Scores”
Replace subjective labels like “professionalism” with measurable criteria. Technique: define rubrics with 1–5 scales and examples, and limit the model to rubric-based scoring. This reduces stereotyping and inconsistency.
Implement “Sensitive Attribute Masking” During Evaluation
Test whether outcomes change when names, pronouns, or demographic cues are masked. Technique: evaluate both masked and unmasked versions to identify proxy reliance and reduce it.
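A crude masking pass is enough to start: replace names and gendered pronouns with neutral tokens, then score masked and unmasked versions and compare. A sketch, with the caveat that real maskers need broader name lists and grammatical repair (the name list here is an assumption for illustration):

```python
import re

# Replace known names with a placeholder and gendered subject pronouns with
# "they", so demographic cues cannot drive the score. Deliberately crude:
# it ignores verb agreement and possessives, which a production masker fixes.
def mask(text, names=("John", "Fatima")):
    for n in names:
        text = text.replace(n, "[NAME]")
    return re.sub(r"\b(he|she)\b", "they", text, flags=re.IGNORECASE)

print(mask("Fatima said she will lead the migration"))
# → "[NAME] said they will lead the migration"
```

If masked and unmasked scores diverge, the system is leaning on the masked cues (or their residue), which is exactly the proxy reliance the evaluation is meant to expose.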
Require Executive Sign-Off for Fairness Trade-offs
If improving fairness reduces accuracy or increases cost, require documented approval. Technique: create a one-page “Fairness Trade-off Memo” capturing options and implications.
Review User Journeys for Procedural Justice
Fair systems must be explainable and appealable. Technique: ensure users can request human review and receive a reason code. This reduces perceived unfairness even when outcomes are correct.